Incident Review For the Facebook Outage
The following is an analysis of the Facebook incident on 10/4/2021.
In a highly unusual turn of events, Facebook, Instagram, WhatsApp, Messenger, and Oculus VR were down simultaneously around the world for an extended period of time on Monday.
The social network and some of its key apps started to display error messages before 16:00 UTC. They were down until 21:05 UTC, when things began to gradually return to normality.
Can humanity survive hours without the most important social media conglomerate of our time? Perish the thought! On a more serious note, as some users have been pointing out on Twitter, does the global outage highlight the challenges of such a dominant single technological point of failure?
Facebook Is Everywhere… It’s Beyond Just Social Media
What quickly became clear to us at Catchpoint was the fact that the outage was impacting the page load time of many popular websites that are not powered by Facebook. Why? Because Facebook ads and marketing tags are on almost every major website.
Here’s an aggregate of the 95th percentile of onload event time, referred to as document complete, alongside the availability and Catchpoint “bottleneck time” impact metric of the site’s embedded Facebook content, across the IR Top 100 sites, as measured from Catchpoint’s external active (synthetic) observability vantage points.
Notice that the document complete measurement spikes and stays 20+ seconds higher starting at 11:40 am EDT (15:40 UTC). This indicates that overall page load times for users were much higher than normal.
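To make the 95th percentile metric concrete, here is a minimal sketch of a nearest-rank percentile over document complete samples. The sample values are invented for illustration and are not Catchpoint measurements; the function name and the nearest-rank method are assumptions, since the article does not say how the aggregate was computed.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical document complete times in seconds (illustrative only).
baseline = [2.1, 2.4, 2.2, 3.0, 2.8, 2.5, 2.3, 2.9, 2.6, 2.7]
during_outage = [2.2, 2.5, 23.1, 24.8, 2.4, 22.7, 25.3, 2.6, 23.9, 24.1]

p95_baseline = percentile(baseline, 95)
p95_outage = percentile(during_outage, 95)
print(f"p95 baseline: {p95_baseline}s, p95 during outage: {p95_outage}s")
```

A high percentile is sensitive to the slow tail, which is why pages waiting on an unreachable third-party tag show up so clearly in this kind of metric.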
Alarms Started At Catchpoint When We Detected Server Failures
Here at Catchpoint, alarms started to trigger around 15:40 UTC. These alarms resulted from the fact that some of our HTTP tests for Facebook, WhatsApp, Instagram, and Oculus domains started to return HTTP error 503 (service unavailable). It’s worth noting that we do this type of monitoring as part of a benchmarking process. In this way, we are able to provide insights into the Internet as a whole – and clearly, the Facebook family going down hugely impacted the entire Internet.
Our historical dataset shows that Facebook itself is usually a highly stable system with only occasional outages. The business has built a scalable, reliable, global service. Therefore, when we saw alarms about a Facebook outage, it was easy to determine there was a significant problem.
The snapshot below is from Data Explorer (catchpoint.com). It shows the server failures that first alerted us to the Facebook outage.
Five minutes later, we saw that the TTL of Facebook’s DNS records had expired, and the sad truth kicked in... no Facebook nameserver was available, and every DNS query for www.facebook.com was resulting in a SERVFAIL error (meaning the DNS query failed because no answer could be given).
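For readers unfamiliar with SERVFAIL, it is response code 2 in the low four bits of the flags field of the DNS header (RFC 1035). The sketch below builds a minimal A-record query and decodes the response code from a raw response. It does not touch the network: the SERVFAIL response is simulated by rewriting the flags of the query, and the function names are hypothetical helpers for illustration.

```python
import struct

RCODES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 4: "NOTIMP", 5: "REFUSED"}

def build_query(hostname, query_id=0x1234):
    """Build a minimal DNS query for an A record in RFC 1035 wire format."""
    # Header: ID, flags (RD set), 1 question, 0 answer/authority/additional records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = hostname.rstrip(".").split(".")
    qname = b"".join(bytes([len(label)]) + label.encode("ascii") for label in labels)
    question = qname + b"\x00" + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question

def rcode_of(response):
    """Return the symbolic response code found in a raw DNS message."""
    if len(response) < 12:
        raise ValueError("truncated DNS header")
    _, flags = struct.unpack("!HH", response[:4])
    code = flags & 0x000F  # RCODE is the low 4 bits of the flags field
    return RCODES.get(code, f"RCODE{code}")

# Simulated resolver reply: same message with QR=1, RD=1, RA=1, RCODE=2.
query = build_query("www.facebook.com")
servfail = query[:2] + struct.pack("!H", 0x8182) + query[4:]
print(rcode_of(servfail))
```

A recursive resolver typically returns SERVFAIL when it cannot get a usable answer from any authoritative nameserver, which is what happened once cached records expired.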
The following screenshot is an example of the error message Facebook users saw.
Below is an example of the HTTP response headers seen with the initial 503 errors.
charset=utf-8
date: Mon, 04 Oct 2021 16:48:36 GMT
You can see that at first, the response was a server failure. While DNS records were still cached, requests could still reach Facebook’s edge, but the edge was unable to find an upstream proxy server to hand them to.
The next set of screenshots shows that when we queried Facebook’s top-level domain servers, they were not working.
Everything up to this point led us to think that the cause of the issue was DNS... But was it?
A Tale Of Badge Failure And BGP
We may never know whether Facebook’s technical staff were indeed locked out of the server room and unable to fix their routers. There is, however, some truth to the speculation about BGP: it was, indeed, heavily involved in this incident.
A Deep Dive Into The BGP Data
Facebook manages AS 32934. The networks it originates are usually stable, as can be seen in the RIPEstat view of AS32934.
Something changed, however, at around 15:40 UTC. At that time, you can clearly see a spike in the number of BGP events.
Let’s see what public route collectors were able to view in relation to that. We’ll focus on BGP data collected by RIS rrc10 collector deployed at the Milan Internet Exchange (MIX) between 15:00 UTC and 16:00 UTC.
From a quick look at the snapshot of 08:00 UTC, AS 32934 was originating 133 IPv4 networks and 216 IPv6 networks. Looking at the update messages, it’s easy to spot that Facebook withdrew the routes to reach eight of those IPv4 networks and fourteen of those IPv6 networks around 15:40 UTC. This was exactly the time when all the Catchpoint alerts started to trigger, and people began to complain about outages.
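The counting described above can be sketched as follows. A BGP withdrawal carries no AS path, so the sketch first collects the prefixes originated by AS 32934 from a table snapshot and then matches withdrawals against that set. The record layouts (tuples of prefix and AS path, and tuples of update type and prefix) are assumptions made for illustration, not the RIS or MRT format, and the prefixes are documentation ranges rather than Facebook's real networks.

```python
import ipaddress

FACEBOOK_ASN = 32934

def originated_prefixes(rib_entries, asn):
    """Prefixes whose AS path ends in `asn` (the origin AS) in a table snapshot."""
    return {prefix for prefix, as_path in rib_entries if as_path and as_path[-1] == asn}

def withdrawn_by_family(updates, tracked):
    """Split withdrawn prefixes that belong to `tracked` into IPv4 and IPv6 sets."""
    v4, v6 = set(), set()
    for kind, prefix in updates:
        if kind == "W" and prefix in tracked:
            net = ipaddress.ip_network(prefix)
            (v4 if net.version == 4 else v6).add(prefix)
    return v4, v6

# Hypothetical snapshot: (prefix, AS path as seen by the collector peer).
rib = [
    ("192.0.2.0/24", [64500, FACEBOOK_ASN]),
    ("198.51.100.0/24", [64500, FACEBOOK_ASN]),
    ("2001:db8::/32", [64501, FACEBOOK_ASN]),
    ("203.0.113.0/24", [64500, 64502]),  # originated by a different AS
]
# Hypothetical update stream: ("A" announce or "W" withdraw, prefix).
updates = [
    ("W", "192.0.2.0/24"),
    ("A", "198.51.100.0/24"),
    ("W", "2001:db8::/32"),
    ("W", "203.0.113.0/24"),
]

tracked = originated_prefixes(rib, FACEBOOK_ASN)
v4, v6 = withdrawn_by_family(updates, tracked)
print(f"withdrawn: {len(v4)} IPv4, {len(v6)} IPv6")
```

Note that this simple version counts a prefix as withdrawn even if it is announced again later in the same window; a fuller analysis would track the final state per prefix and per peer.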
Even though only a handful of networks were withdrawn, this incident demonstrates that what matters is not how many networks are affected, but which ones.
Some of the withdrawn routes were related to the Authoritative DNS nameservers of Facebook, which could not be reached anymore. This led to DNS resolutions from all over the world failing. Eventually, it resulted in DNS resolvers being flooded with requests.
Authoritative nameservers play a key role in DNS resolution, since they possess information on how to resolve a specific hostname under their authority.
The true causes lying behind the network withdrawals have not yet been disclosed by the Facebook team, but the rumors are that the underlying root cause was a BGP routine gone wrong.
Having a Quick Response Is Key, As Long As Your Badge Is Working!
The last few days have been rough, between Slack’s issues and today’s incident with Facebook and its related services. These incidents show that major outages happen to everyone, even the biggest tech companies.
How quickly you detect and get to the heart of those issues matters. Your runbooks also matter.
Sometimes resolving an incident depends on your systems being independent of one another. In this case, the badge systems your employees use to sign in and fix things should never be dependent on the thing you’re trying to fix.
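One way to check this kind of independence ahead of time is to walk a dependency map and flag any recovery tool that transitively relies on the system it is supposed to help fix. The sketch below is a minimal illustration; the service names and the dependency map are entirely hypothetical and do not describe Facebook's actual architecture.

```python
def depends_on(graph, start, target):
    """Return True if `start` transitively depends on `target` in a dependency graph."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for dep in graph.get(node, ()):
            if dep == target:
                return True
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return False

# Hypothetical dependency map: each key lists what it needs in order to work.
deps = {
    "badge_access": ["auth_service"],
    "auth_service": ["internal_dns"],
    "internal_dns": ["backbone_network"],
    "out_of_band_console": ["cellular_modem"],
}

for tool in ("badge_access", "out_of_band_console"):
    risky = depends_on(deps, tool, "backbone_network")
    print(f"{tool}: {'depends on' if risky else 'independent of'} backbone_network")
```

Running a check like this as part of runbook reviews can surface circular dependencies before an outage does.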
Troubleshooting in these types of instances is rarely straightforward. In Facebook’s case, the symptoms were HTTP and DNS errors. However, as we’ve shown, those errors traced back to BGP route withdrawals.
Update: From Facebook's Postmortem Analysis
Today, the Facebook team released a very good post-mortem analysis of the incident. The incident was caused by neither DNS nor BGP, but rather by a routine maintenance job performed by Facebook staff to assess the availability of global backbone capacity, which unintentionally took down all the connections in their backbone network. As a consequence, the Facebook routers couldn't speak to their data centers, and this triggered a safety mechanism in which the BGP routes towards their DNS servers were withdrawn from the network, as we have seen in our analysis.
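The safety mechanism described above can be pictured with a small decision function: a DNS site keeps advertising its prefix only while it can reach the data centers behind it. This is a simplified sketch under stated assumptions, not Facebook's implementation. The function names, the "at least one data center reachable" rule, and the data center labels are all hypothetical.

```python
def should_advertise(datacenter_reachable):
    """Advertise the DNS prefix only while at least one data center is reachable (assumed rule)."""
    return any(datacenter_reachable.values())

def next_action(currently_advertising, datacenter_reachable):
    """Decide what to do with the BGP advertisement given the latest health checks."""
    want = should_advertise(datacenter_reachable)
    if want and not currently_advertising:
        return "announce"
    if not want and currently_advertising:
        return "withdraw"
    return "no-op"

# Normal operation: health checks succeed, so the advertisement stays in place.
healthy = {"dc-a": True, "dc-b": True}
print(next_action(True, healthy))

# Backbone down: no data center is reachable, so the route is withdrawn.
backbone_down = {"dc-a": False, "dc-b": False}
print(next_action(True, backbone_down))
```

The design is sensible for a single unhealthy site, because traffic shifts to sites that still work. When every site loses the backbone at once, every site withdraws, and the nameservers disappear from the Internet entirely.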
Kudos to the whole Facebook team for the prompt recovery, but most importantly for their transparency!
Published at DZone with permission of Alessandro Improta. See the original article here.