When the Best and the Brightest Tech Stars Fail
March 15, 2019
Two outages. Two explanations.
Google’s March 12, 2019, outage was explained this way on the Google Cloud Status Dashboard:
On Monday 11 March 2019, Google SREs were alerted to a significant increase in storage resources for metadata used by the internal blob service. On Tuesday 12 March, to reduce resource usage, SREs made a configuration change which had a side effect of overloading a key part of the system for looking up the location of blob data. The increased load eventually led to a cascading failure.
I like the phrase “cascading failure.” Sounds inevitable.
Facebook’s explanation of its one-day-plus outage appeared in “Biggest Facebook Outage in its History Due to Database Issues.” The explanation was:
The company’s databases were “overloaded.”
Concentration, just like in the mainframe days, can create some challenges for those downstream. If the big outfits cannot deal with failure, I don’t feel bad when my Android phone complains it cannot connect to the Google Play store where malware may still live.
Stephen E Arnold, March 15, 2019