Amazon: Server to Server Chattiness
July 27, 2008
In general, observers are pleased with Amazon’s explanation of the recent outage. You can read what Amazon offered as the “inside scoop” here. Center Networks has a useful wrap up plus links to the company’s earlier comments about the Amazon outage. For me, the most interesting point in the Center Networks write up was its gentleness. The same thought wove itself through Profy.com’s take on the issue, reminding me that Amazon is offering a low cost service. You can read Profy’s view here.
Message passing in distributed, parallelized environments is an issue. Too many messages and the system chokes and simply quits working. Anyone remember the nCube’s problems? Too few messages, and the massively parallel system becomes like a sleepover for your daughter’s seven middle school chums. Message passing is an issue in the older Microsoft data center architectures, which I wrote about in a series of three posts for this Web log. To solve the problem in 2006, if I interpret correctly the Microsoft diagram I included in my write up, Microsoft “threw hardware at the problem”. You can read this essay here. Redmond then implemented a variety of exotic and expensive mechanisms to keep servers in sync, processes marching like the Duke University drill team, and SQL Server happily insulated from the choking hands of Microsoft’s own code ninjas.
Is there a better or at least less troublesome way to accomplish messaging? Yes, but these techniques require specialized functions in the operating system, changes to who sends what messages and how the “master” in a procedure talks to “slaves”, and a rethink about how to keep message traffic from converging on a single “master”. “Chattiness” does not do justice to the technical complexity of the problem.
You may want to navigate to Google.com and run a query on these terms: “Google File System”, “Chubby”, and “Jeffrey Dean”. The first term refers to add ons to Linux that Google implemented nine years ago. The second refers to one piece of Google plumbing for file locking and unlocking, with references to the BigTable data management system. And “Dr. Dean” is a former AltaVista.com wizard who has become one of the people given the job of explaining how Google has tackled messaging and related problems since 1999. I summarize most of the broad ideas in The Google Legacy (Infonortics 2005), but reading the primary source information can be illuminating.
Google’s solution is not without its weaknesses, of course. “Chattiness” is not one of these vulnerabilities.
In terms of total operations cost at Amazon, the AWS services get a tiny slice of the action. Amazon is using its infrastructure to squeeze value from its plumbing. I anticipate other issues going forward, and Amazon will address them. Over time, AWS will resolve “chattiness”. Perhaps the next problem will be minimized and repositioned as “an under cooked failed soufflé”? Wordsmithing is not engineering, in my opinion. Agree? Disagree? Help me understand Amazon’s explanation of “offline”.
Stephen Arnold, July 28, 2008