Map Reduce: The Great Database Controversy
January 18, 2008
I read with interest the article “Map Reduce: A Major Step Backwards” by David DeWitt. The article appeared in “The Database Column” on January 17, 2008. I agree that Map Reduce is not a database, not a commercial alternative for products like IBM’s DB2 or any other relational database, and definitely not the greatest thing since sliced bread.
Map Reduce is one of the innovations that seems to have come from top-notch engineers Google hired from AltaVista.com. Hewlett Packard orphaned an interesting search system because it was expensive to scale in the midst of the Compaq acquisition. Search, to Hewlett Packard’s senior management, was expensive, generated no revenue, and was a commercial dead end. But in terms of Web search, AltaVista.com was quite important because it allowed its engineers to come face to face with the potential and challenges of multi-core processors, new issues in memory management, and programming challenges for distributed, parallel systems. Google surfed on AltaVista.com’s learnings. Hewlett Packard missed the wave.
So, Map Reduce was an early Google innovation, and my research suggests it was influenced by technology that was well known among database experts. In my book The Google Legacy: How Search Became the Next Application Platform (Infonortics, Ltd. 2005) I tried to explain in layman’s terms how Map Reduce bundled and optimized two Lisp functions. The engineering wizardry of Google was making these two functions operate at scale and quickly. The engineering tricks were clever, but not like Albert Einstein sitting in a patent office thinking about relativity. Google’s “invention” of Map Reduce was driven by necessity. Traditional ways to match queries with results were slow, not flawed, just turtle-like. Google needed really fast heavy lifting. The choke points that plague some query processing systems had to be removed in an economical, reliable way. Every engineering decision involves trade-offs. Google sacrificed some of the sacred cows protected by certain vendors in order to get speed and rock bottom computational costs. (Note: I did not update my Map Reduce information in my newer Google study, Google Version 2.0 (Infonortics, Ltd. 2007). There have been a number of extensions to Map Reduce in the last three years. A search for the term MapReduce on Google will yield a treasure trove of information about this function, its libraries, its upside, and its downside.)
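For readers who want to see the two functions in action, here is a minimal sketch in Python. It is not Google's code and runs on a single machine, so it shows the programming model but none of the scale engineering. The word-count task and the names map_phase, shuffle, reduce_phase, and word_count are illustrative assumptions, not anything taken from Google's libraries.

```python
from collections import defaultdict
from functools import reduce


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(values):
    """Fold the list of values for one key into a single total."""
    return reduce(lambda total, value: total + value, values, 0)


def word_count(documents):
    """Hypothetical driver: map each document, group by word, reduce each group."""
    # Each document is mapped independently, which is what makes
    # the map step easy to spread across many machines.
    mapped = map(map_phase, documents)
    pairs = [pair for doc_pairs in mapped for pair in doc_pairs]
    grouped = shuffle(pairs)
    # Each key is reduced independently as well.
    return {word: reduce_phase(counts) for word, counts in grouped.items()}


if __name__ == "__main__":
    print(word_count(["the quick fox", "The lazy dog the end"]))
```

The point of the sketch is that neither function knows about the other documents or the other keys. That independence is what lets a system farm the work out to many inexpensive machines.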
I am constantly surprised at the amount of technical information Google makes available as Google Papers. Its public relations professionals and lawyers aren’t the best communicators. I have found Google’s engineers to be remarkably communicative in technical papers and at conferences. For example, Google engineers rely on MySQL and other tools (think Oracle) to perform some data processes. Obviously Map Reduce is only one cog in the larger Google “machine.” Those of you who have followed my work about Google’s technology know that I refer to the three dozen server farms, the software, and the infrastructure as The Googleplex. Google uses this term to refer to a building, but I think it is a useful way to describe the infrastructure Google has been constructing for the last decade. Keep in mind that Map Reduce–no matter how good, derivative, or clever–is a single component in its digital matryoshka.
My analyses of Map Reduce suggest that Google’s engineers obsess about scale, not breakthrough invention. I was surprised to learn that much of the technology Google uses is available to anyone; for example, Hadoop, an open source implementation of Map Reduce. Some of Google’s technology comes from standard collections of algorithms like Numerical Recipes with Source Code CD-ROM 3rd Edition: The Art of Scientific Computing. Other bits and pieces are based on concepts that have been tested in various university computer science labs supported by U.S. government funds. And, there’s open source code kept intact but “wrapped” in a Google technical DNA for scale and speed. Remember that Google grinds through upwards of four score petabytes of data every 24 hours. What my work tells me is that Google takes well-known engineering procedures and makes them work at Google scale on Google’s infrastructure.
Google has told two of its “partners,” if my sources are correct, that the company does not have a commercial database now, nor does it plan to make a commercial database like IBM’s, Microsoft’s, or Oracle’s available. Google and most people involved in manipulating large-scale data know that traditional databases can handle almost unlimited amounts of information. But it’s hard, expensive, and tricky work. The problem is not the vendors. The problem is that Codd databases or relational database management systems (RDBMS) were not engineered to handle the data management and manipulation tasks at Google scale. Today, many Web sites and organizations face an information technology challenge because big data in some cases bring systems to their knees, exhaust engineers and drain budgets in a nonce.
Google’s publicly-disclosed research and its acquisitions make it clear that Google wants to break free of the boundaries, costs, reliability, and performance issues of RDBMS. In my forthcoming study Beyond Search, I devote a chapter to one of Google’s most interesting engineering initiatives for the post-database era. For the data mavens among my readers, I include pointers to some of Google’s public disclosures about their approach to solving the problems of the RDBMS. Google’s work, based on the information I have been able to gather from open sources, is also not new. Like Map Reduce, the concepts have been kicking around in classes taught at the University of Illinois, the University of Wisconsin – Madison, University of California – Berkeley, and the University of Washington, among others, for about 15 years.
If Google is going to deal with its own big data challenges, it has to wrap Map Reduce and other Google innovations in a different data management framework. Map Reduce will remain for the foreseeable future one piece of a very large technology mosaic. When archeologists unearth a Roman mosaic, considerable effort is needed to reveal the entire image. Looking at a single part of the large mosaic tells us little about the overall creation. Google is like that Roman mosaic. Focusing on a single Google innovation such as Chubby, Sawzall, containers (not the XML variety), the Programmable Search Engine, the “I’m feeling doubly lucky” invention, or any one of hundreds of Google’s publicly disclosed innovations yields a distorted view.
In summary, Map Reduce is not a database. It is no more of a database than Amazon’s SimpleDB service is. It can perform some database-like functions, but it is not a database. Many in the database elite know that the “next big thing” in databases may burst upon the scene with little fanfare. In the last seven years, Map Reduce has matured and become a much more versatile animal. Map Reduce can perform tasks its original designers did not envision. What I find delightful about Google’s technology is that a single piece, like Map Reduce, often does one thing well. But when mixed with other Google innovations, unanticipated functionality comes to light. I believe Google often solves one problem and then another Googler figures out another use for that engineering process.
Google, of course, refuses to comment on my analyses. I have no affiliation with the company. But I find its approach to solving some of the well-known problems associated with big data interesting. Some Google watchers may find it more useful to ask the question, “What is Google doing to resolve the data management challenges associated with crunching petabytes of information quickly?” That’s the question I want to try to answer.
Stephen E. Arnold, January 18, 2008