Swimming in a Hadoop Data Lake

July 8, 2014

I read an interview conducted by the consulting firm PWC. The interview appeared with the title “Making Hadoop Suitable for Enterprise Data Science.” The interview struck me as important for two reasons. The questioner and the interview subject introduce a number of buzzwords and business generalizations that will be bandied about in the near future. Second, the interview provides a glimpse of the fish with sharp teeth that swim in what seems to be a halcyon data lake. With Hadoop goodness replenishing the “data pond,” Big Data is a life sustaining force. That’s the theory.

The interview subject is Mike Lang, the CEO of Revelytix. (I am not familiar with Revelytix, and I don’t know how to pronounce the company’s name.) The interviewer is one of those tag teams that high end consulting firms deploy to generate “real” information. Big time consulting firms publish magazines, emulating the McKinsey Quarterly. The idea is that Big Ideas need to be explained so that MBAs can convert information into anxiety among prospects. The purpose of these bespoke business magazines is to close deals and highlight technologies that may be recommended to a consulting firm’s customers. Some quasi consulting firms borrow other people’s work. For an example of this short cut approach, see the IDC Schubmehl write up.

Several key buzzwords appear in the interview:

  • Nimble. Once data are in Hadoop, the Big Data software system, has to be quick and light in movement or action. Sounds very good, especially for folks dealing with Big Data. So with Hadoop one has to use “nimble analytics.” Also, sounds good. I am not sure what a “nimble analytic” is, but, hey, do not slow down generality machines with details, please.
  • Data lakes. These are “pools” of data from different sources. Once data is in a Hadoop “data lake”, every water or data molecule is the same. It’s just like chemistry sort of…maybe.
  • A dump. This is a mixed metaphor, but it seems that PWC wants me to put my heterogeneous data which is now like water molecules in a “dump”. Mixed metaphor is it not? Again. A mere detail. A data lake has dumps or a dump has data lakes. I am not sure which has what. Trivial and irrelevant, of course.
  • Data schema. To make data fit a schema with an old fashioned system like Oracle, it takes time. With a data lake and a dump, someone smashes up data and shapes it. Here’s the magic: “They might choose one table and spend quite a bit of time understanding and cleaning up that table and getting the data into a shape that can be used in their tool. They might do that across three different files in HDFS [Hadoop Distributed File System]. But, they clean it as they’re developing their model, they shape it, and at the very end both the model and the schema come together to produce the analytics.” Yep, magic.
  • Predictive analytics, not just old boring statistics. The idea is that with a “large scale data lake”, someone can make predictions. Here’s some color on predictive analytics: “This new generation of processing platforms focuses on analytics. That problem right there is an analytical problem, and it’s predictive in its nature. The tools to help with that are just now emerging. They will get much better about helping data scientists and other users. Metadata management capabilities in these highly distributed big data platforms will become crucial—not nice-to-have capabilities, but I-can’t-do-my-work-without-them capabilities. There’s a sea of data.”

My take is that PWC is going to bang the drum for Hadoop. Never mind that Hadoop may not be the Swiss Army knife that some folks want it to be. I don’t want to rain on the parade, but Hadoop requires some specialized skills. Fancy math requires more specialized skills. Interpretation of the outputs from data lakes and predictive systems requires even more specialized skills.

No problem as long as the money lake is sufficiently deep, broad, and full.

The search for a silver bullet continues. That’s what makes search and content processing so easy. Unfortunately the buzzwords may not deliver the type of results that inform decisions. Fill that money lake because it feeds the dump.

Stephen E Arnold, July 7, 2014

Comments

Comments are closed.

  • Archives

  • Recent Posts

  • Meta