Facebook Engineering: Big Is Tricky

October 12, 2021

The unthinkable happened on October 4, 2021, when Facebook went offline. Despite all the bad press Facebook has recently gotten, the social media network remains an important communication and business tool. The Facebook Engineering blog explains what happened in the post “More Details About The October 4 Outage.” The outage originated in the system that manages Facebook’s global backbone network capacity.

The backbone connects all of Facebook’s data centers through thousands of miles of fiber optic cable. The post outlines how the backbone works:

“When you open one of our apps and load up your feed or messages, the app’s request for data travels from your device to the nearest facility, which then communicates directly over our backbone network to a larger data center. That’s where the information needed by your app gets retrieved and processed, and sent back over the network to your phone.

The data traffic between all these computing facilities is managed by routers, which figure out where to send all the incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance — perhaps repairing a fiber line, adding more capacity, or updating the software on the router itself.”

A routine maintenance job issued a command to assess the global backbone’s capacity. Unfortunately, according to the post, the command unintentionally took down every backbone connection, and a bug in the audit tool that should have blocked it let it through, cutting the data centers off from each other and from the Internet. A second problem made things worse: the DNS servers were still operational but unreachable. Facebook engineers could not connect to their data centers through the normal means, and the loss of DNS broke the internal tools used to repair such problems.
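
To make the failure mode concrete, here is a minimal Python sketch of the kind of pre-flight audit that is supposed to stop a maintenance command from taking down every backbone link at once. Everything in it is hypothetical: the link names, the `audit_drain` function, and the `min_links_up` threshold are illustrative assumptions, not Facebook's actual tooling.

```python
def audit_drain(all_links, links_to_drain, min_links_up=1):
    """Return True if draining the requested links leaves enough links up.

    Hypothetical check: the names and the threshold are illustrative only.
    """
    all_links = set(all_links)
    requested = set(links_to_drain)

    # Refuse to reason about links that are not in the inventory.
    unknown = requested - all_links
    if unknown:
        raise ValueError(f"unknown links: {sorted(unknown)}")

    # The command is allowed only if enough links stay in service.
    remaining = all_links - requested
    return len(remaining) >= min_links_up


backbone = {"link-a", "link-b", "link-c"}
print(audit_drain(backbone, {"link-a"}))  # one link drained, two remain
print(audit_drain(backbone, backbone))    # every link drained, none remain
```

In this sketch the second call is the case an audit exists to reject. If a check like this has a bug and approves the command anyway, nothing else stands between the maintenance job and a global disconnect.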

Facebook engineers had to physically visit the backbone facility, which is protected by high levels of security. The facility is hard to enter, and the systems are purposely designed to be difficult to modify. It took a while, but Facebook diagnosed and resolved the problems. Baby Boomers were overjoyed to resume posting photos of their grandchildren and anti-vaxxers could read their misinformation feeds.

Perhaps this Facebook incident and the interesting Twitch data breach illustrate that big is tricky? Too big to fail becomes too big to keep working in a reliable way.

Whitney Grace, October 12, 2021
