The Data Management Mismatch

April 15, 2009

I used to play table tennis in tournaments. Because table tennis is not America’s game, I found myself in matches with folks from other countries. I recall one evening in FAR on the Chambana campus I faced a fit Chinese fellow. We decided to hit a few and then play a match. In about 10 seconds, I realized that fellow was a squash player, and he had zero chance against me. There are gross similarities between squash and table tennis, but the context of each game is very different.

That’s the problem with describing one thing (ping pong) in terms of another (squash, mainland China style). The words look similar, and to the naive, they may mean the same thing.

Now the data management mismatch. You can read a summary of a “controversial” report that pits the aging Codd database against the somewhat more modern MapReduce system. I describe the differences in my 2005 study The Google Legacy, and I won’t repeat them here.

Eric Lai’s “Researchers: Databases still beat Google’s MapReduce” provides a good summary of this alleged face-off. I am not going to dig into the entrails of this study nor the analysis by Mr. Lai. I do want to highlight this passage, which caught my attention:

Databases “were significantly faster and required less code to implement each task, but took longer to tune and load the data,” the researchers write. Database clusters were between 3.1 and 6.5 times faster on a “variety of analytic tasks.” MapReduce also requires developers to write features or perform tasks manually that can be done automatically by most SQL databases, they wrote.

The paragraph makes clear that, according to the wizards who ran this study, the Codd style database has some mileage left on its engine. I agree. In fact, I think some of the gurus at Google would agree as well.

What’s going on here is that the MapReduce system works really well for Google-scale, Google-type data operations for search and closely allied functions. When Googlers want to crunch on a result set, they fire up a Codd database, for example MySQL, and do their thing.

Codd style databases can jump through hoops as well. But the point is that MapReduce makes certain types of large dataset tasks somewhat less costly to kit out.
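
Here is a minimal sketch of the “less code” point in the quoted passage, not a reconstruction of anything the researchers ran. It counts visits per page twice: once with a SQL GROUP BY through Python’s built-in sqlite3 module, and once with hand-written map, shuffle, and reduce steps. The table, the sample rows, and every function name are my own illustrative assumptions. The toy runs in a single process, so it says nothing about speed or about the large dataset jobs where MapReduce earns its keep.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (visitor, page) records, invented for illustration.
VISITS = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/search"),
    ("carol", "/home"),
    ("bob", "/search"),
    ("alice", "/home"),
]


def count_with_sql(rows):
    """Declarative version: the database works out how to group and count."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (visitor TEXT, page TEXT)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    cursor = conn.execute("SELECT page, COUNT(*) FROM visits GROUP BY page")
    result = dict(cursor.fetchall())
    conn.close()
    return result


def map_phase(rows):
    """Emit a (key, 1) pair for every record."""
    for _visitor, page in rows:
        yield page, 1


def shuffle_phase(pairs):
    """Collect all values that share a key; a real framework does this across machines."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in groups.items()}


def count_with_mapreduce(rows):
    """Procedural version: the developer spells out each step by hand."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_with_sql(VISITS))
    print(count_with_mapreduce(VISITS))
```

Both functions return the same per-page counts. The difference is who does the work: in the SQL version the grouping is one clause, while in the map and reduce version the developer writes it out step by step.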

I don’t think this is an either/or. My research suggests that there is a growing interest in different types of data management systems. There are clever systems emerging from a number of companies. I have written about InfoBright, for instance.

I wrote a white paper with Sue Feldman which focused on a low profile Google project to tame dataspace. The notion is a step beyond Codd and MapReduce, yet dataspace has roots and shoots in both of these systems.

What we have is a mismatch. The capabilities of the systems are different. If I were to play the Chinese table tennis star in my basement, I would probably win. He would knock himself out on the hot water pipe that dips exactly where he steps to hit a forehand.

The context of the data management problem and the meaning of the words make a difference. Use the system that solves the problem.

Stephen Arnold, April 15, 2009

