Scale Fail: Amazon and Pizza Team Engineering
July 21, 2008
My news reader is chock full of glowing embers of hostility this morning. It’s 8:30 am in rural Kentucky, where nothing works very well. Power failed again last night, but we have oil lamps and candles. Based on scanning a number of posts about the Amazon S3 outage, Amazon may want to shore up Dr. Werner Vogels’ engineering team today. Shoestrings are great for keeping sneakers on my feet, but massively parallel distributed infrastructures need a bit more than shareware, clever graduate students from the Netherlands, and technical reviews by PhD candidates from University of California computer science programs.
Amazon codes using teams small enough to be fed with one pizza. The idea is that a SOCOM-type unit is better than the rigorous engineering approach found at Boeing or even Microsoft for that matter. Amazon also allows its teams considerable latitude when solving problems. In fact, some teams can use whatever programming language or method allows the team to solve the problem.
This is a burned pizza. Great ingredients, distracted chef. Source: http://msp71.photobucket.com/albums/i122/xoaleycat926ox/6298db24.jpg
This approach is fast, economical, and flexible. The downside is that if the fix triggers a fault elsewhere, the pizza team or teams must scramble to figure out what happened and why. If the previous team used some offbeat language or clever method, then the fixers have to puzzle out the solution. Some folks love puzzles, but I don’t think Amazon Web Services’ customers are too keen on the approach, if I read some of the nastygrams this morning correctly.
Om Malik’s “S3 Outage Highlights Fragility of Web Services” is among the best of the essays I reviewed. You can read his full post here. For me, the key point in his analysis was:
That said, the outage shows that cloud computing still has a long road ahead when it comes to reliability. NASDAQ, Activision, Business Objects and Hasbro are some of the large companies using Amazon’s S3 Web Services. But even as cloud computing starts to gain traction with companies like these and most of our business and communication activities are shifting online, web services are still fragile, in part because we are still using technologies built for a much less strenuous web.
I quite enjoyed Center Networks’ understatement about the problem, made by reporting Amazon’s own comment:
Amazon S3 has “elevated error rates”.
I think this means crash or fail.
Finally, ReadWriteWeb.com provided a screen shot of the Amazon Service Health Dashboard here and notes that Smugmug.com took the outage in stride. Amazon has a history of befriending certain services. Part of this method is just common sense and part of it is to position Amazon to become a financial partner with some companies. My thought was that Smugmug.com took the outage in stride because what could the company do? Criticize a partner for poor engineering? You can read SmugMug’s view here.
Observations
I don’t have any relationship with Amazon, although I attended a meeting at one of the firm’s offices in Seattle in the last six months. After that meeting with many bright, independent, and supremely confident engineers, I flipped through Amazon’s financials. What I learned was that the original management team, including its technology officers, has changed. I am not sure what type of corporate memory system Amazon has, but the current crop of top managers is relatively new on the Amazon job.
I also noted that Amazon spends hundreds of millions on technology, R&D, and other technical trifles. Google, which is only slightly bigger in annual turnover, spends billions. Google, despite its massive engineering investment in gear and wizards, also has some glitches on its record. These faults mean that Om Malik’s point is correct. Web services are not yet perfect. But the difference in investment is what struck me. I asked myself, “How can Amazon out-Google Google, or Microsoft for that matter, and spend so little?”
My hypothesis is that Amazon has taken shortcuts without addressing some of the deep engineering issues that have been with the company since its early days. The use of open source technologies, commodity servers, and pizza teams works until an unexpected dependency surfaces. No one knows what the trigger is, and engineers have to pull all-nighters to find the problem, fix it, and try to make sure the fix doesn’t create another problem somewhere else in Amazon’s sprawling system.
Google and Microsoft both have to slay similar dragons. Amazon is unique among the Web pantheon for performing Herculean feats using the budgeting methods of Dollar General or Odd Lots. Google and Microsoft invest in what each firm believes is solid, fundamental engineering. The approach is more costly than Amazon’s. Google and Microsoft struggle to tame problems such as Google’s mysterious Gmail glitches or Microsoft’s Live.com mail issues.
The situation at Amazon warrants close attention. One question I asked some of the groups I briefed last year was, “How profitable would Amazon be if it just outsourced its plumbing to Google?” The real cost of Amazon’s technology may be a significant inhibitor of the firm’s long-term profitability. Outsourcing may make more sense than Amazon’s trying to work engineering miracles by spending too little and using any other methods it can to keep its costs under control.
I like the idea of Google-zon. Agree? Disagree?
Stephen Arnold, July 21, 2008
Update, July 21, 2008, 10:10 pm Eastern: Another interesting view of Amazon Web Services plus a reminder that Googzilla is not perfect either. Click here for Delores Labs’s view.
Update, July 22, 2008, 10:15 am Eastern: Interesting write up from Good Morning Silicon Valley here. Short and snappy. “Customers” are described as of “generous spirit”. I am only generous when someone pays me. S3 customers are a different sort, I surmise.
Update, July 22, 2008, 10:31 am: Joe Wilcox explores the S3 outage. Worth reading. You can find “S3 Outage: Software Minus Services” here.
Update, July 22, 2008, 10:35 am: I don’t know how I overlooked Erick Schonfeld’s “Bezos Says He Wants S3 Uptime to Be Nothing Less than Perfection”. I’m getting old for sure. Mr. Schonfeld’s article, which you can find here, is a must read. The world’s smartest person (Jeff Bezos) talks about “the dawn of a new industry” and “perfection”. Words to clip and quote in the future. Kudos, Mr. Schonfeld. A bit of history, this write up.
Comments
2 Responses to “Scale Fail: Amazon and Pizza Team Engineering”
Wow. A single S3 outage that doesn’t even contradict the 99.9% uptime claims in their SLA is now grounds for Amazon to outsource their infrastructure to Google? Really? Which Google service will they host this on? Google App Engine, which can barely handle basic mashups and has worse uptime than Amazon’s EC2 and S3 services? Or is there some top secret hosting service for Tier 1 sites that Google has that you are privy to and that hasn’t been shipped?
Dare, thanks for writing. Microsoft has a keen interest in Amazon. However, this outage is not the first. I recall the Xbox meltdown and several others. Somehow AWS is attracting a great deal of attention. Perhaps the engineering is great for the SLA but not so great for those who expect the 100% uptime Microsoft delivers for certain services by scaling up and out plus redundancy. Amazon can’t afford this type of solution, so the “pizza” issue looms large.
Stephen Arnold, July 21, 2008