Isolating Root Cause: March 13th Facebook Outage

Learn more about last week's Facebook outage.

By Alec Pinkham · Mar. 19, 2019 · News

The Facebook outage that impacted the company’s globally popular suite of social media apps last week certainly caused plenty of headaches. But reports that the interruptions were due to issues at the network level proved premature.

Facebook operates its own collection of 15 data centers globally, which serve as a redundant, internal backbone network. As a result, the public doesn’t readily have insight into how an external routing problem would impact the performance of Facebook’s internal network.

However, if a BGP route leak were impacting end-user experience, as originally reported, we’d be able to track path changes for users accessing their Facebook apps over the Internet to the Facebook edge (i.e., one of the company’s 15 data centers).
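
As a rough illustration of what "tracking path changes" can look like outside of a commercial tool, the sketch below records the hop-by-hop path toward a Facebook endpoint with the system traceroute and flags any deviation from a saved baseline. The target hostname, flags, and baseline logic are assumptions made for the example; this is not AppNeta Performance Manager’s actual instrumentation.

```python
# Minimal sketch, assuming the system traceroute binary is installed and
# facebook.com is a reasonable stand-in for "the Facebook edge."
import subprocess

TARGET = "facebook.com"  # hypothetical probe target

def current_path(target: str) -> list[str]:
    """Return the list of hop addresses reported by traceroute (numeric, one probe per hop)."""
    out = subprocess.run(
        ["traceroute", "-n", "-q", "1", target],
        capture_output=True, text=True, timeout=120,
    ).stdout
    hops = []
    for line in out.splitlines()[1:]:      # skip the header line
        fields = line.split()
        if len(fields) >= 2:
            hops.append(fields[1])         # second field is the hop address (or '*')
    return hops

def path_changed(baseline: list[str], target: str = TARGET) -> bool:
    """True if the observed hop sequence differs from the saved baseline."""
    return current_path(target) != baseline

if __name__ == "__main__":
    baseline = current_path(TARGET)
    print("baseline path:", baseline)
    print("changed since baseline?", path_changed(baseline))
```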

On the day of the event (Wednesday, March 13), no abnormal routing or BGP event that could account for the poor performance users experienced showed up in AppNeta Performance Manager. This all indicates that the issue had to have been with the application itself.

How Do We Know?

The multi-path route visualization below, from the Delivery component of AppNeta Performance Manager, indicates that paths from all over the US and from several countries around the world were able to reach the multiple endpoints that serve up Facebook without error on the network side. While not exhaustive, this provides no indication of route changes during the rough 12:00 p.m. to 2:00 p.m. window when the disruption was noticed. A closer look at the network view with BGP data doesn’t show any additional AS (autonomous system) changes, announcements, or large-scale route changes indicative of a route leak.
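
On the BGP side, the core check is whether announcements for Facebook prefixes are still originated by Facebook’s own autonomous system (AS32934). The snippet below is a minimal, hypothetical version of that check over a hand-built list of update records; a real analysis would pull updates from a route collector such as RouteViews or RIPE RIS, and the example prefixes are illustrative.

```python
# Sketch: flag announcements of Facebook prefixes whose origin AS is not
# Facebook (AS32934) -- the classic signature of a route leak or hijack.
FACEBOOK_ASN = 32934

# Hypothetical BGP update records; in practice these come from a collector.
updates = [
    {"prefix": "157.240.0.0/17", "as_path": [3356, 32934]},  # originated by Facebook: normal
    {"prefix": "31.13.64.0/18",  "as_path": [174, 32934]},   # normal
    # {"prefix": "157.240.0.0/17", "as_path": [174, 64512]}, # would be flagged as suspicious
]

def suspicious(update: dict) -> bool:
    """An announcement of a Facebook prefix not originated by AS32934 is suspect."""
    origin = update["as_path"][-1]   # the origin AS is the last hop in the AS path
    return origin != FACEBOOK_ASN

leaks = [u for u in updates if suspicious(u)]
print(f"{len(leaks)} suspicious announcements out of {len(updates)}")
```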

Routing is one of the first things we can verify. Next, our continuous monitoring shows that these routes had no connectivity loss and no changes in capacity, latency, or loss to the various targets. All indications are that network access to Facebook was working as normal for the duration of the outage.
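
A bare-bones stand-in for that kind of continuous check is to repeatedly time TCP connections to the Facebook edge and tally failures, as sketched below. The target, port, sample count, and timeout are arbitrary choices for illustration, not the methodology behind the measurements cited here.

```python
# Sketch: measure TCP connect latency and failure rate to a Facebook endpoint.
import socket
import time

TARGET, PORT = "facebook.com", 443   # assumed probe target
SAMPLES, TIMEOUT_S = 20, 2.0

latencies, failures = [], 0
for _ in range(SAMPLES):
    start = time.monotonic()
    try:
        with socket.create_connection((TARGET, PORT), timeout=TIMEOUT_S):
            latencies.append((time.monotonic() - start) * 1000)  # milliseconds
    except OSError:
        failures += 1
    time.sleep(0.5)

loss_pct = 100 * failures / SAMPLES
avg_ms = sum(latencies) / len(latencies) if latencies else float("nan")
print(f"connect loss: {loss_pct:.1f}%  avg latency: {avg_ms:.1f} ms")
```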

Based on our overall reporting, which covers far more network paths than shown above, we can see that network performance in fact met application requirements 99.97 percent of the time between March 13 and 14, with only a 0.021 percent service outage. This was the case across paths connecting to the Facebook edge at all 15 data centers.
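
To make the arithmetic behind a figure like "99.97 percent" concrete, the sketch below marks each monitoring interval pass or fail against assumed latency and loss targets and reports the pass ratio. The thresholds and sample data are invented for the example and are not AppNeta’s.

```python
# Sketch: availability as the fraction of monitoring intervals that met
# application requirements. Thresholds and samples are hypothetical.
THRESHOLDS = {"latency_ms": 150, "loss_pct": 1.0}

# One record per monitoring interval (illustrative values only).
intervals = [
    {"latency_ms": 42,  "loss_pct": 0.0},
    {"latency_ms": 55,  "loss_pct": 0.0},
    {"latency_ms": 310, "loss_pct": 4.2},   # the one bad interval
    {"latency_ms": 48,  "loss_pct": 0.1},
]

def meets_requirements(sample: dict) -> bool:
    return (sample["latency_ms"] <= THRESHOLDS["latency_ms"]
            and sample["loss_pct"] <= THRESHOLDS["loss_pct"])

good = sum(meets_requirements(s) for s in intervals)
print(f"met requirements {100 * good / len(intervals):.2f}% of intervals")
```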

Post Mortem

On Thursday, Facebook officially blamed the disruption on a server configuration change that “triggered a cascading series of issues” affecting all of the company’s apps and services. Facebook is not one app but likely hundreds of interconnected microservices. The trend toward breaking out services rather than leaving apps in a monolithic state is not unique to Facebook, but it highlights how difficult it is to keep track of all of the interdependencies, even for the most successful apps. While speculation abounded on Wednesday, most issues that we see never get such a detailed resolution.

Published at DZone with permission of Alec Pinkham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
