It Is About Time We Start Data Mining Mobile Phones

May 28, 2013

One of the main areas where companies are failing to collect data is mobile phones. Interestingly enough, Technology Review has this article to offer the informed reader: “Released: A Trove Of Cell Phone Data-Mining Research.” Cell phone data offers a plethora of opportunity, one that is only starting to be tapped to its full potential. It is not just the more developed countries that can use the data; developing countries could benefit as well. It has been noted that cell phones could be used to redesign transportation networks and even reveal some eye-opening situations in epidemiology.

There is a worldwide endeavor to understand the ramifications of cell phone data:

“Ahead of a conference on the topic that starts Wednesday at MIT, a mother lode of research has been made public about how to use this data. For the past year, researchers around the world responded to a challenge dubbed Data for Development, in which the telecom giant Orange released 2.5 billion records from five million cell-phone users in Ivory Coast. A compendium of this work is the D4D book, holding all 850 pages of the submissions. The larger conference, called NetMob (now in its third year), also features papers based on cell phone data from other regions, described in this book of abstracts.”

Before you get too excited, take note that privacy remains an important concern. No one has found a reasonable way to disassociate users from their cell phone data. It may only be a matter of time before someone does; until then, we can revel in the possibilities.

Whitney Grace, May 28, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Harnessing The Power Of Raw Public Data

May 28, 2013

The Internet allows multiple data streams to converge and release their data to end users, but very few people know how to use public data effectively, much less how to find it. There is a solution, reports TechCrunch in the article “Enigma Makes Unearthing And Sifting Through Public Data A Breeze.” Enigma is a New York startup with Hicham Oudghiri, Marc Dacosta, and CEO Jeremy Bronfmann on the team. The company’s software pulls data from over 100,000 public data sources and pools it into easy-to-read tables.

“That’s all very neat, but how does Enigma do it? The data itself comes from a host of places, but most of Enigma’s government data was obtained by issuing a Freedom of Information Act request to the U.S. General Services Administration for all the top level .gov domains. From there the team uses crawlers to download all the databases it can find, and algorithmically finds connections between all those data points to create a sort of public knowledge graph. Whenever you search for a term on Enigma, Enigma actually searches around that term to figure out and display whatever applicable data sets it can find.”
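The crawl-and-link approach described in the quote amounts to a value-based join across datasets. Here is a toy Python sketch of the idea (the dataset names and fields are invented for illustration, not Enigma's actual schema): index every value, then "search around" a term to surface every row, from any dataset, that mentions it.

```python
from collections import defaultdict

# Toy records standing in for two crawled public datasets.
contracts = [
    {"vendor": "Acme Corp", "agency": "GSA", "amount": 120000},
    {"vendor": "Globex", "agency": "DOT", "amount": 95000},
]
registrations = [
    {"company": "Acme Corp", "state": "NY"},
    {"company": "Initech", "state": "CA"},
]

def build_index(datasets):
    """Map each value to the (dataset, row) pairs that mention it."""
    index = defaultdict(list)
    for name, rows in datasets.items():
        for i, row in enumerate(rows):
            for value in row.values():
                index[value].append((name, i))
    return index

def search_around(term, index):
    """Return every row, from any dataset, that mentions the term."""
    return index.get(term, [])

index = build_index({"contracts": contracts, "registrations": registrations})
# Both datasets mention "Acme Corp", so a single query links them.
print(search_around("Acme Corp", index))
```

At Enigma's scale the connections are found algorithmically rather than by exact string match, but the principle, shared values stitching otherwise unrelated tables into a knowledge graph, is the same.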

Enigma should be seen more as an infrastructure search solution, and the company heads believe it could become an integral part of the Internet within five years. As a tool, it has many benefits for researchers, and it has already made partnerships with the New York Times, Capital IQ, S&P Capital, Gerson Lehrman Group, and the Harvard Business School. The startup is an enterprise offering at the moment, but there are possible plans for a free version in the future. Enigma pulls all its data from public resources, but it must comply with the laws and regulations that come with the information. Enigma wants to play by the rules, and by playing within the bounds it hopes to become an indispensable tool.

Whitney Grace, May 28, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Going Beyond ETL

May 25, 2013

Traditional data warehousing, or as it is often called, Extract, Transform and Load (ETL), has constituted an important enterprise software category. Now these capabilities are being built into products solving other data needs. The article “Talend Ships Version 5.3 of Data Integration Platform” discusses Talend’s new platform.

Talend specializes in data integration and offers an open source distribution of its platform. Beyond ETL, it features Master Data Management, Data Quality, Business Process Integration, and Enterprise Service Bus.

The invariable question in this area is: what about Hadoop?

“Talend version 5.3 now features a graphical mapper for building Apache Pig data transformation scripts visually (rather than having to code the data flows in the component’s language, “Pig Latin”), thus making an important Hadoop stack component a bit more analyst-friendly. Talend 5.3 can also generate native Java MapReduce code, which allows data transformations to run right on the Hadoop cluster, avoiding burdensome data movement, and making use of general purpose SQL and import/export tools like Hive and Sqoop unnecessary.”
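The MapReduce pattern that Talend compiles to native Java can be illustrated with a minimal in-memory Python sketch (a generic word-count transformation, not Talend's generated code): a map phase emits key-value pairs, and a reduce phase folds each key's values together.

```python
from collections import defaultdict
from itertools import chain

def map_phase(records, mapper):
    # Apply the mapper to every record, yielding (key, value) pairs.
    return chain.from_iterable(mapper(r) for r in records)

def reduce_phase(pairs, reducer):
    # Group values by key, then fold each group with the reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}

# The classic transformation: count words across input lines.
lines = ["pig latin", "java map reduce", "pig"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(pairs, sum)
print(counts["pig"])  # 2
```

The appeal of generating this as native Java is that the transformation runs where the data already lives, on the Hadoop cluster, instead of shuttling records through a separate ETL engine.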

The rest of the article mentions that Talend has beefed up its connectors, adding support for Couchbase, CouchDB, and Neo4j. The company now offers connectivity to databases across all four major NoSQL categories.

Megan Feil, May 25, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

LucidWorks Raises $10 Million in Capital

May 23, 2013

LucidWorks continues to raise capital, helping the company build and support open source software that empowers organizations to manage their multi-structured data. VentureBeat covers this latest round of venture capital in the story, “LucidWorks Pulls in $10M to Turn Open Source Data Into ‘Business Gold.’”

The article states:

“‘Big data’ startup LucidWorks has raised $10 million to help enterprise companies ‘turn multistructured data into business gold’ . . . According to a form filed with the SEC, existing investors Shasta Ventures, Granite Ventures, and Walden International contributed to this third round of funding. It brings LucidWorks’ total capital raised to $26 million.”

The company employs one-fourth of the committers on the Apache Lucene/Solr project, upon which their LucidWorks Search and LucidWorks Big Data offerings are built. Big customers include AT&T, Elsevier, Cisco, Nike, Sears, and Ford, among others. The company is truly doing well, and this additional capital will help improve their scope and reach. Their support offerings set them apart from the pack, and their investment in open source is sincere, sponsoring multiple training and development events across the country. If they stay on this path, good things will continue to happen to LucidWorks.

Emily Rae Aldridge, May 23, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Phone Data Value And What Companies Are Doing With It

May 23, 2013

Smartphones are an extension of a person’s life, and they record it with every use. Smithsonian Magazine takes a look at how phone companies are tracking and using the data from phones in “What Phone Companies Are Doing With All That Data From Your Phone.” Verizon Wireless is aware of the phone-data goldmine and has added a new division called Precision Market Insights, and Telefonica is adding a new business unit, Telefonica Dynamic Insights, to do the same thing. Phone data is being used for market, medical, and social science research. The biggest usage is tracking how people move in real time. The data collected is supposed to remain anonymous, but that is not happening.

People can be tracked:

“But a study published in Scientific Reports in March found that even data made anonymous may not be so anonymous after all. A team of researchers from Louvain University in Belgium, Harvard and M.I.T. found that by using data from 15 months of phone use by 1.5 million people, together with a similar dataset from Foursquare, they could identify about 95 percent of the cell phone users with just four data points and 50 percent of them with just two data points. A data point is an individual’s approximate whereabouts at the approximate time they’re using their cell phone.”

People’s travel and cell phone patterns are repetitive and unique, making it easy to narrow results down to an individual user. Anonymity is a hard thing to achieve with a smartphone. To confuse the data, a person could carry two mobile phones, but does that increase the fun or increase the risk?
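The study's finding rests on the "unicity" of mobility traces: how often a handful of (place, time) points matches exactly one user. A minimal Python sketch of that check, with invented traces, looks like this:

```python
def is_unique(points, traces, owner):
    """True if `owner` is the only user whose trace contains all the points."""
    matches = [u for u, trace in traces.items() if points <= trace]
    return matches == [owner]

# Toy traces: user -> set of (cell tower, hour) observations.
traces = {
    "alice": {("tower_a", 9), ("tower_b", 12), ("tower_c", 18)},
    "bob":   {("tower_a", 9), ("tower_b", 12), ("tower_d", 18)},
}

# Two points shared by both users do not identify anyone...
print(is_unique({("tower_a", 9), ("tower_b", 12)}, traces, "alice"))  # False
# ...but one more distinctive point pins the trace to a single user.
print(is_unique({("tower_a", 9), ("tower_c", 18)}, traces, "alice"))  # True
```

With millions of users and months of data, the researchers found that a mere four such points were distinctive enough to isolate about 95 percent of people.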

Whitney Grace, May 23, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Potentially New Web Page Data Mining Tool

May 22, 2013

Extracting content from a Web page can be a maddening process, requiring specialized scripts and time spent coding them. Taking a look at available tools, Softpedia touts “FMiner Pro 7.05.” FMiner Pro is advertised as a reliable application that lets users easily handle Web content without scripts. The software can pull data from any page type, including pages behind HTTPS, plugins, and JavaScript, and can even extract complete data structures.

After the data is extracted much can be done with it:

“Extracted results can be saved to csv, Excel(xls), SQLite, Access, SQL Server, MySQL, PostgreSQL, and can specify the database fields’ types and attributes(eg, UNIQUE can avoid duplication of the extracted data). According to the setting, program can build, rebuild or load the database structure, and save the data to an existing database. Professional edition support incremental extraction, clear extraction and schedule extraction.”
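The UNIQUE-constraint deduplication mentioned in the quote is standard SQL; here is a minimal sqlite3 sketch in Python (a generic illustration, not FMiner's own output format) showing how re-extracted pages are kept from piling up as duplicates:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (url TEXT UNIQUE, title TEXT)")

# Simulated extraction results; the second row repeats a URL.
rows = [
    ("http://example.com/a", "Page A"),
    ("http://example.com/a", "Page A (re-crawled)"),
    ("http://example.com/b", "Page B"),
]

for row in rows:
    # OR IGNORE silently skips rows that would violate the
    # UNIQUE constraint, so duplicates never enter the table.
    conn.execute("INSERT OR IGNORE INTO items VALUES (?, ?)", row)

count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(count)  # 2
```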

FMiner Pro is available as a free fifteen-day trial, so prospective users can see how well it performs. After a look at the specs, FMiner Pro seems worth a shot. It could save coders hours of script writing, and organizing Web content is a tedious job no one likes to do. Having a program do it is far preferable.

Whitney Grace, May 22, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Generational Generation of Digital Information–a Breakdown

May 21, 2013

In the article “Digital Footprints Broken Down by Generation” on Bit Rebels, there are some very interesting facts laid out about the amount of data generated every day by humans. According to the article, citing “science,” the Internet in all its encompassing hugeness weighs no more than the average strawberry, but the data, printed out, could cross the US and China fifteen times. To discover your data footprint and get a label (such as super-user), you can visit Cisco’s website, What is Your Digital Footprint. The article also states,

“What I found to be the most staggering stat is that from the beginning of time until 2003, humans generated 5 billion gigabytes worth of data. Today, we generate that much data every two days. In a year from now, we will generate that much data every ten minutes. What will it be like in 10 years from now? Doesn’t it seem like at some point it would get full? Science is full of mystery and wonder.”
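The quoted figures imply a startling growth rate; a quick back-of-the-envelope check in Python makes the jump concrete:

```python
# 5 billion gigabytes generated every two days today,
# versus every ten minutes "a year from now" (per the article).
gigabytes = 5e9
minutes_per_two_days = 2 * 24 * 60   # 2,880 minutes

rate_today = gigabytes / minutes_per_two_days   # GB per minute now
rate_future = gigabytes / 10                    # GB per minute claimed

speedup = rate_future / rate_today
print(round(speedup))  # 288
```

In other words, the claim amounts to a 288-fold acceleration in a single year, which puts the article's "doesn't it seem like at some point it would get full?" quip in perspective.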

For a breakdown of generational usage of everything from TV to smartphones to desktop computers, examine the handy infographic by Wikibon. It predicts that soon we will generate 5 billion gigabytes of data every ten minutes. Some of this data is private, and at ArnoldIT you can learn from a team of professionals how to avoid leaving a digital footprint that opens you up to risk and embarrassment.

Chelsea Kerwin, May 21, 2013

If you are interested in gourmet food and spirits, read Gourmet De Ville.

A Fresh Look at Big Data

May 8, 2013

Next week I am doing an invited talk in London. My subject is search and Big Data. I will be digging into this notion in this month’s Honk newsletter and adding some business intelligence related comments at an Information Today conference in New York later this month. (I have chopped the number of talks I am giving this year because, at my age, air travel and the number of 20-somethings at certain programs make me jumpy.)

I want to highlight one point in my upcoming London talk: namely, the financial challenge companies face when they embrace Big Data and then want to search both the information in the system and the Big Data system’s outputs.

Here are the simplified curves:

[Image: the simplified curves]

Notice that precision and recall have not improved significantly over the last 30 years. I anticipate that many search vendors will tell me that their systems deliver excellent precision and recall. I am not convinced. The data which I have reviewed show that over a period of 10 years most systems hit the 80 to 85 percent precision and recall level for content which is about a topic. Content collections composed of scientific, technical, and medical information, where the terminology is reasonably constrained, can do better. I have seen scores above 90 percent. However, for general collections, precision and recall have not been improving relative to the advances in other disciplines; for example, converting structured data outputs to fancy graphics.
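As a refresher, precision and recall are simple set ratios over retrieved and relevant documents; a small Python sketch with made-up document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved docs that are relevant.
    Recall: fraction of relevant docs that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# A system returns four documents; three documents are truly relevant.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5"}
p, r = precision_recall(retrieved, relevant)
print(p, r)  # 0.5 precision, ~0.67 recall
```

An "80 to 85 percent" system gets both of these numbers into that range simultaneously, which is exactly the plateau the curves above depict.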

Read more

Hadoop Heats Up with New Startups

May 8, 2013

While there is some controversy over whether Hadoop is the only tool necessary to mine opportunities from big data, Hadoop and big-data insights seem to be synonymous, according to a recent Datamation article. It gives us the rundown on “Seven Hot Hadoop Startups that Will Tame Big Data.”

According to the article, the current Hadoop ecosystem market is worth around $77 million. With growth, the value is projected to reach $813 million by 2016. The article notes that Hadoop has not yet proven completely effective in the enterprise world; queries are still a weak point.

The article discusses seven startups that intend to see Hadoop through to maturity, like Alpine Data Labs. The following excerpt explains why Alpine Data is on the list:

“According to Alpine Data, part of the problem is that it’s much too difficult to get real insights out of Hadoop and other parallel platforms. Most companies don’t know what to do with massive datasets, and few have gotten any further with Hadoop than batch processing and basic querying. Alpine Data set out to simplify machine-learning methods and make them available on petabyte-scale datasets. Their tools make these methods available in a lightweight web application with a code-free, drag-and-drop interface.”

With the amount of attention on Hadoop over the years, Hadoop startups are now a commodity. A list featuring a selection of the new ones to watch is much appreciated. Check out the full and useful list of hot Hadoop startups.

Megan Feil, May 08, 2013

Sponsored by ArnoldIT.com, developer of Beyond Search

Inventive Graduate Student Builds Breakthrough Database

April 30, 2013

For some folks, deadlines can lead to innovation. One graduate student’s efforts to speed up his research have resulted in the inspired, high-speed parallel database MapD, we learn from DataInformed’s encouraging piece, “Fast Database Emerges from MIT Class, GPUs and Student’s Invention.” Todd Mostak’s in-a-pinch breakthrough could soon help others in business as well as academia.

The informative article contains too many specifics to cover here, but I suggest checking it out. It should be fascinating reading for anyone interested in data management. I personally think the use of graphics processors designed for gaming is a stroke of genius. Or maybe desperation (the two can be closely related). Reporter Ian B. Murphy tells us:

“While taking a class on databases at MIT, Mostak built a new parallel database, called MapD, that allows him to crunch complex spatial and GIS data in milliseconds, using off-the-shelf gaming graphical processing units (GPU) like a rack of mini supercomputers. Mostak reports performance gains upwards of 70 times faster than CPU-based systems. . . .

“‘I had the realization that this had the potential to be majorly disruptive,’ Mostak said. ‘There have been all these little research pieces about this algorithm or that algorithm on the GPU, but I thought, “Somebody needs to make an end-to-end system.” I was shocked that it really hadn’t been done.'”

Well, sometimes it takes someone from outside a field to see what seems obvious in retrospect. Mostak’s undergraduate experience was in economics, anthropology, and math, and he was in Harvard’s Middle Eastern Studies program when he was compelled to develop MapD. A database class at MIT gave him the knowledge he needed to build this tool, which he created to help with the tweet-heavy, Arab Spring-related thesis he was working on.

MIT’s Computer Science and Artificial Intelligence Lab has now snapped up the innovator. Though some questioned hiring someone with such a lean computer-science education, Lab director Sam Madden knows that Mostak’s unconventional background only means he has a unique point of view. The nascent computer scientist has already shown he has the talent to make it in this field.

Though Mostak says he still has work ahead to perfect his system, he does plan to share MapD as an open source project in the near future. Is he concerned about opening his work to the public? Nope; he states:

“If worse comes to worst, and somebody steals the idea, or nobody likes it, then I have a million other things I want to do too, in my head. I don’t think you can be scared. Life is too short.”

That it is. I suspect we will be hearing more from this creative thinker in the years to come.

Cynthia Murrell, April 30, 2013

Sponsored by ArnoldIT.com, developer of Augmentext
