Predictions on Big Data Miss the Real Big Trend
December 18, 2011
Athena the goddess of wisdom does not spend much time in Harrod’s Creek, Kentucky. I don’t think she’s ever visited. However, I know that she is not hanging out at some of the “real journalists’” haunts. I zipped through “Big Data in 2012: Five Predictions”. These are lists which are often assembled over a lunch time chat or a meeting with quite a few editorial issues on the agenda. At year’s end, the prediction lunch was a popular activity when I worked in New York City, which is different in mental zip from rural Kentucky.
The write up churns through some ideas that are evident when one skims blog posts or looks at the conference programs for “big data.” For example—are you sitting down?—the write up asserts: “Increased understanding of and demand for visualization.” There you go. I don’t know about you, but when I sit in on “intelligence” briefings in the government or business environment, I have been enjoying the sticky tarts of visualization for years. Nah, decades. Now visualization is a trend? Helpful, right?
Let me identify one trend which is, in my opinion, an actual big deal. Navigate to “The Maximal Information Coefficient.” You will see a link and a good summary of a statistical method which allows a person to process “big data” in order to determine if there are gems within. More important, the potential gems pop out of a list of correlations. Why is this important? Without MIC methods, the only way to “know” what may be useful within big data was to run the process. If you remember guys like Kolmogorov, the “we have to do it because it is already as small as it can be” issue is an annoying time consumer. To access the original paper, you will need to go to the AAAS and pay money.
The abstract for “Detecting Novel Associates in Large Data Sets by David N. Reshef1,2,3,*,†, Yakir A. Reshef, Hilary K. Finucane, Sharon R. Grossman, Gilean McVean, Peter Turnbaugh, Eric S. Lander, Michael Mitzenmacher, Pardis C. Sabet, Science, December 16, 2011 is:
Identifying interesting relationships between pairs of variables in large data sets is increasingly important. Here, we present a measure of dependence for two-variable relationships: the maximal information coefficient (MIC). MIC captures a wide range of associations both functional and not, and for functional relationships provides a score that roughly equals the coefficient of determination (R^2) of the data relative to the regression function. MIC belongs to a larger class of maximal information-based nonparametric exploration (MINE) statistics for identifying and classifying relationships. We apply MIC and MINE to data sets in global health, gene expression, major-league baseball, and the human gut microbiota and identify known and novel relationships.
Stating a very interesting although admittedly complex numerical recipe in a simple way is difficult, I think this paragraph from “The Maximal Information Coefficient” does a very good job:
The authors [Reshef et al] go on showing that that the MIC (which is based on “gridding” the correlation space at different resolutions, finding the grid partitioning with the largest mutual information at each resolution, normalizing the mutual information values, and choosing the maximum value among all considered resolutions as the MIC) fulfills this requirement, and works well when applied to several real world datasets. There is a MINE Website with more information and code on this algorithm, and a blog entry by Michael Mitzenmacher which might also link to more information on the paper in the future.
Another take on the MIC innovation appears in “Maximal Information Coefficient Teases Out Multiple Vast Data Sets”. Worth reading as well.
Forbes will definitely catch up with this trend in a few years. For now, methods such as MIC point the way to making “big data” a more practical part of decision making. Yep, a trend. Why? There’s a lot of talk about “big data” but most organizations lack the expertise and the computational know how to perform meaningful analyses. Similar methods are available from Digital Reasoning and the Google love child Recorded Future. Palantir is more into the make pictures world of analytics. For me, MIC and related methods are not just a trend; they are the harbinger of processes which make big data useful, not a public relations, marketing, or PowerPoint chunk of baloney. Honk.
Stephen E Arnold, December 18, 2011
Sponsored by Pandia.com, a company located where high school graduates actually can do math.