Google and Reliability Engineering

April 30, 2016

There’s a new book about Google SRE. You can find some information about it at this link. To understand how the real world works, you may want to navigate to “Google Cloud Status.” The write-up explains why “On Monday, 11 April, 2016, Google Compute Engine instances in all regions lost external connectivity for a total of 18 minutes, from 19:09 to 19:27 Pacific Time.” Good news. According to the article “Google Blames Two Bugs for 18 minute Global Compute Engine Outage,”

Benjamin Treynor Sloss, Google’s vice president of engineering, explained on the Google Cloud Status blog that a “timing quirk” in the IP block’s removal occurred when the engineers tried to spread out the new configuration for Compute Engine.

A Google wizard is quoted in the article as saying:

“We will conduct an internal investigation and make appropriate improvements to our systems to prevent or minimise future recurrence.”

I assume that the pertinent section of the forthcoming book was not available to the Googlers with their fingertips on the keyboard prior to the outage. Books are one thing; site reliability in the real world is apparently another.

Stephen E Arnold, April 30, 2016
