Isolating Root Cause: March 13th Facebook Outage
Learn more about last week's Facebook outage.
The Facebook outage that impacted the company's globally popular suite of social media apps last week certainly caused lots of headaches. But reports that the interruptions were due to issues at the network level proved premature.
Facebook operates its own global collection of 15 data centers, which serve as a redundant, internal backbone network. As a result, the public doesn't readily have insight into how an external routing problem would impact the performance of Facebook's internal network.
However, if a BGP route leak were impacting end-user experience, as originally reported, we'd be able to track path changes for users accessing their Facebook apps over the Internet to the Facebook edge (i.e., one of the company's 15 data centers).
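As a rough illustration of what tracking those path changes could look like outside of a commercial monitoring tool, here is a minimal sketch that reruns a system traceroute toward a Facebook-facing hostname and flags any change in the hop sequence. It assumes a Unix-style traceroute binary is available, and the target hostname and polling interval are illustrative choices, not anything taken from AppNeta's configuration.

```python
# Minimal sketch (not AppNeta's implementation): periodically traceroute a
# Facebook-facing hostname and report when the hop sequence changes.
# Assumes a Unix-style `traceroute` binary is installed.
import subprocess
import time

TARGET = "www.facebook.com"      # resolves to a nearby Facebook edge POP
INTERVAL_SECONDS = 300           # illustrative polling interval


def trace(target: str) -> list[str]:
    """Return the list of responding hop addresses from one traceroute run."""
    out = subprocess.run(
        ["traceroute", "-n", "-q", "1", target],
        capture_output=True, text=True, timeout=120,
    ).stdout
    hops = []
    for line in out.splitlines()[1:]:      # skip the header line
        parts = line.split()
        if len(parts) >= 2:
            hops.append(parts[1])          # hop IP address, or "*" on timeout
    return hops


previous = None
while True:
    current = trace(TARGET)
    if previous is not None and current != previous:
        print(f"{time.ctime()}: path change detected\n  was: {previous}\n  now: {current}")
    previous = current
    time.sleep(INTERVAL_SECONDS)
```

A commercial product does this continuously from many vantage points; the point of the sketch is simply that a route leak would surface as a visible change in the forwarding path.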
On the day of the event (Wednesday, March 13), no such abnormal routing or BGP event showed up in AppNeta Performance Manager that could be attributed to the poor performance users experienced. All of this indicates that the issue had to have been with the application.
How Do We Know?
The multi-path route visualization below from the Delivery component of AppNeta Performance Manager shows that paths from across the US and several other countries were all able to reach multiple endpoints that serve up Facebook without error on the network side. While not exhaustive, it gives no indication of route changes during the roughly 12:00 p.m. to 2:00 p.m. window when the disruption was noticed. A closer look at the network view with BGP data doesn't show any additional AS-level network changes, announcements, or large-scale route changes indicative of a route leak.
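For readers who want to sanity-check public BGP data themselves, here is a hedged sketch that queries RIPEstat's public data API for BGP updates involving Facebook's AS32934 during the suspect window. The time zone conversion (treating 12:00 to 2:00 p.m. as US Eastern, i.e. 16:00 to 18:00 UTC) and the response field names are assumptions and may need adjusting against the RIPEstat documentation.

```python
# Hedged sketch: pull public BGP update activity for Facebook's ASN (AS32934)
# around the outage window using RIPEstat's data API. A route leak would
# typically show up as a burst of announcements/withdrawals well above the
# normal baseline. The "updates" field name is an assumption; check the
# RIPEstat docs if the response schema differs.
import requests

# Assumption: the 12:00-2:00 p.m. window is US Eastern, i.e. 16:00-18:00 UTC.
WINDOW_START = "2019-03-13T16:00:00"
WINDOW_END = "2019-03-13T18:00:00"

resp = requests.get(
    "https://stat.ripe.net/data/bgp-updates/data.json",
    params={"resource": "AS32934", "starttime": WINDOW_START, "endtime": WINDOW_END},
    timeout=60,
)
resp.raise_for_status()
data = resp.json().get("data", {})

updates = data.get("updates", [])
print(f"BGP updates observed for AS32934 in the window: {len(updates)}")
```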
Routing is one of the first things we can verify. Next, our continuous monitoring shows that these routes had no connectivity loss and no changes in capacity, latency, or packet loss to the various targets. All indications are that network access to Facebook was working normally for the duration of the outage.
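As a simplified stand-in for that kind of continuous monitoring (and not a reproduction of AppNeta's lightweight probing), the sketch below pings a couple of Facebook-facing hostnames on a fixed interval and logs packet loss and average latency. The targets, interval, and 200 ms threshold are all illustrative assumptions.

```python
# Simplified stand-in for continuous path monitoring: ping a few
# Facebook-facing hostnames on a fixed interval and log packet loss and
# average round-trip time. Targets, interval, and the 200 ms latency
# threshold are illustrative assumptions.
import re
import subprocess
import time

TARGETS = ["www.facebook.com", "www.instagram.com"]
PROBES_PER_CYCLE = 10
INTERVAL_SECONDS = 60


def probe(host: str) -> tuple[float, float]:
    """Return (packet_loss_percent, avg_rtt_ms) for one batch of pings."""
    out = subprocess.run(
        ["ping", "-c", str(PROBES_PER_CYCLE), host],
        capture_output=True, text=True, timeout=60,
    ).stdout
    loss_match = re.search(r"([\d.]+)% packet loss", out)
    loss = float(loss_match.group(1)) if loss_match else 100.0
    rtt_match = re.search(r"= [\d.]+/([\d.]+)/", out)   # min/avg/max(/mdev) line
    avg_rtt = float(rtt_match.group(1)) if rtt_match else float("nan")
    return loss, avg_rtt


while True:
    for host in TARGETS:
        loss, rtt = probe(host)
        status = "OK" if loss == 0 and rtt < 200 else "DEGRADED"
        print(f"{time.ctime()} {host}: loss={loss:.0f}% avg_rtt={rtt:.1f} ms [{status}]")
    time.sleep(INTERVAL_SECONDS)
```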
Based on our overall reporting, which covers far more network paths than shown above, we can see that network performance met application requirements 99.97 percent of the time between March 13 and 14, with only a 0.021 percent service outage. This was the case across paths connecting to the Facebook edge at all 15 data centers.
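To put those percentages in perspective, here is some back-of-the-envelope arithmetic, assuming the March 13-14 reporting window spans a full 48 hours (the article does not state the exact window length):

```python
# Back-of-the-envelope arithmetic for the figures quoted above. Assumption:
# the March 13-14 reporting window spans a full 48 hours; the article does
# not state the exact window length.
WINDOW_HOURS = 48
window_minutes = WINDOW_HOURS * 60          # 2,880 minutes

outage_pct = 0.021                          # reported service outage
below_target_pct = 100 - 99.97              # time not meeting app requirements

print(f"Outage time:       {window_minutes * outage_pct / 100:.1f} minutes")
print(f"Below-target time: {window_minutes * below_target_pct / 100:.1f} minutes")
```

Even taken at face value, the network-side outage works out to well under a minute over two days, far too little to explain the disruption users experienced.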
Post Mortem
On Thursday, Facebook officially blamed the disruption on a server configuration change that "triggered a cascading series of issues" affecting all of the company's apps and services. Facebook is not one app but likely hundreds of interconnected microservices. The trend toward breaking applications into services rather than keeping them monolithic is not unique to Facebook, but it highlights how hard it is to keep track of all the interdependencies, even for the most successful apps. While speculation abounded on Wednesday, most issues we see never get such a detailed resolution.
Published at DZone with permission of Alec Pinkham, DZone MVB. See the original article here.