A Fresh Look at Big Data
May 8, 2013
Next week I am doing an invited talk in London. My subject is search and Big Data. I will be digging into this notion in this month’s Honk newsletter and adding some business intelligence related comments at an Information Today conference in New York later this month. (I have chopped the number of talks I am giving this year because at my age air travel and the number of 20-somethings at certain programs make me jumpy.)
I want to highlight one point in my upcoming London talk; namely, the financial challenge which companies face when they embrace Big Data and then want to search the information in the system and search the Big Data system’s outputs.
Here are the simplified curves:
Notice that precision and recall have not improved significantly over the last 30 years. I anticipate that many search vendors will tell me that their systems deliver excellent precision and recall. I am not convinced. The data which I have reviewed show that over a period of 10 years most systems hit the 80 to 85 percent precision and recall level for content which is about a topic. Content collections composed of scientific, technical, and medical information where the terminology is reasonably constrained can do better. I have seen scores above 90 percent. However, for general collections, precision and recall have not been improving relative to the advances in other disciplines; for example, converting structured data outputs to fancy graphics.
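For readers who want the two measures spelled out, here is a minimal sketch in Python. Precision is the share of retrieved documents that are relevant; recall is the share of relevant documents that were retrieved. The function name and the document identifiers are hypothetical, made up for illustration, and the numbers are not drawn from any system or test collection I have reviewed.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document ids the search system returned (hypothetical ids)
    relevant:  document ids a human judge marked as on topic
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # documents that are both returned and on topic

    # Guard against empty sets so the division never fails.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only: five documents returned, five judged relevant,
# four of them in common.
retrieved_docs = ["d1", "d2", "d3", "d4", "d5"]
relevant_docs = ["d1", "d2", "d3", "d4", "d6"]

p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case both values come out at 80 percent: one returned document is off topic, and one relevant document was missed. A real evaluation would average these scores over many queries and depends heavily on who judges relevance.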
I don’t want to squabble about precision and recall. The main point is that when an organization mashes Big Data with search, two curves must be considered. The first is the complexity curve. The idea is that search is a reasonably difficult system to implement in an effective manner. The addition of a Big Data system adds another complex task. When two complex tasks are undertaken at the same time, the costs go up.
Again I am not interested in identifying a particular Big Data or a particular search system. The main point is that when an organization rushes to embrace a blend of search and Big Data, costs are likely to be an issue.
There are four reasons:
First, the blend of search and Big Data is, in my opinion, new territory for most organizations. Search is “old news,” and yet the cost of implementing a search system can be difficult to control. Even the Google Search Appliance imposes a price tag when deployed across a large corpus. The costs of systems from enterprise giants like Dassault, Hewlett Packard, IBM, and Oracle can be substantial as well. Move these big company solutions in, and expect fees to flow out.
Second, the technologies for search and Big Data are coming at information from two different directions. Search has been predicated on the user crafting a query to unlock whatever is in the index. Big Data is assumed to be too voluminous for anyone to “know” what’s in the content. The Big Data system discovers, reveals, shows, or performs some other function to find out what’s in the content.
Third, the computational processes for whacking away at Big Data can be resource intensive. The assumption or assertion that a system can process “all” or “everything” may be unfounded. Attempts to crunch Big Data may fail due to resource constraints or budgets.
Fourth, the fancy math which is easy to reference can be quite difficult to implement in a hybridized system. The reason is that multiple calculations can exceed the computational capacity available for a task. Algorithms can be optimized, which yields some benefits, but data scientists want to apply more sophisticated processes. As a result, the demand for computational resources in some situations continues to grow. At the same time, the volume of data to be processed grows as well. The result is that decisions have to be made to deliver “good enough” results. Usable outputs are better than no outputs. This problem is not fully appreciated until decisions of consequence must be based on hybridized systems. The phrase “I didn’t understand that” signals the onset of understanding the difference between marketing promises and actual system deliverables.
How many of those in my audience will agree with me? Darned few. My hunch is that most of those attending the conference will be looking to use “Big Data” as a way to generate revenue. I also think that only a handful of people will appreciate the computational challenges which hybridized systems can present.
Stephen E Arnold, May 8, 2013