
What I Learned About How Facebook Infrastructure Serves Our Photos


Everything you need to know about Facebook's CDN infrastructure.


On July 3, users across the globe came to a standstill when they weren’t able to load photos on both Facebook and Instagram. Likewise, users of Facebook-owned WhatsApp weren’t able to send images or videos.

If Facebook were a standalone CDN, it would probably rank among the top three CDNs in the world by sheer volume of assets and traffic. Running the CDN infrastructure and networks at Facebook cannot be a small task.

Let’s start with what end users saw (or, more accurately, did not see):

End users' Facebook page

The user experience issues were caused by a failure to load images served by the Facebook CDN. Content coming from hosts matching scontent-.*-\d.xx.fbcdn.net returned either a 503 or 502 HTTP error code, or the connection timed out.
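To see what those failures look like from the client side, here is a minimal Python sketch (assuming the `requests` library is installed; the hostname is the example Singapore edge discussed later in this post, and a real check would use the exact image URLs your pages reference) that reports the status code or timeout for an scontent edge:

```python
import requests

# Example edge host matching the scontent-.*-\d.xx.fbcdn.net pattern;
# add the specific image URLs you care about for a meaningful check.
EDGE_HOSTS = ["scontent-sin2-2.xx.fbcdn.net"]

for host in EDGE_HOSTS:
    try:
        resp = requests.get(f"https://{host}/", timeout=10)
        # A 502 or 503 here corresponds to the failures observed during the incident.
        print(f"{host}: HTTP {resp.status_code}")
    except requests.exceptions.Timeout:
        print(f"{host}: connection timeout")
    except requests.exceptions.RequestException as exc:
        print(f"{host}: request failed ({exc})")
```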

Webpage Response by date and time

Here’s a quick view of the Facebook CDN domains that were failing:

Domains that were failing

503 HTTP response code:  

503 HTTP response

Connection timeout:

Connection timeout

502 HTTP response code:

502 HTTP response

The root cause might have been a configuration issue, because we saw the config version change in the response headers during the incident:

During the incident (Dallas – Cogent):

Request and response headers during the incident

After the incident:

Request and response headers post-incident

During the incident (London – Cogent):

Request and response headers during the incident

After the incident:

Request and response headers post-incident
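This post doesn't name the specific header that carries the config version, so the sketch below (assuming Python with the `requests` library) simply captures two snapshots of the response headers from an edge host and prints any header whose value differs, which is how a change like the one in the captures above would surface:

```python
import requests

URL = "https://scontent-sin2-2.xx.fbcdn.net/"  # example edge host from this post

def snapshot_headers(url):
    """Return the response headers as a plain dict (empty on failure)."""
    try:
        return dict(requests.get(url, timeout=10).headers)
    except requests.exceptions.RequestException:
        return {}

# In practice the two snapshots would be taken at different times
# (e.g. during and after an incident) and persisted in between.
before = snapshot_headers(URL)
after = snapshot_headers(URL)

for name in sorted(set(before) | set(after)):
    if before.get(name) != after.get(name):
        print(f"{name}: {before.get(name)!r} -> {after.get(name)!r}")
```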

As we begin to look at the data for the Facebook/Instagram/WhatsApp issues from Wednesday, it is interesting to understand how their CDN infrastructure works.

What We Learned

  1. Facebook CDN domains serving your photos are scontent-.*-\d.xx.fbcdn.net.

  2. The same object is served by different servers based on the user’s location. 

Take 25994550_10156466395424714_5937507471042938431_n.jpg for example. Based on the city from which the request originates, a different server serves the object:

Rate of response by date and time

  3. The hostnames have a code for the CDN edge serving the content. For example, scontent-sin2-2.xx.fbcdn.net is an edge server in Singapore.

  4. These hosts map to a static IP and are not using an Anycast network (a quick way to check this is sketched just after this list):

As mentioned before, some servers can serve requests from multiple cities:


Network traces to scontent-sin2-2.xx.fbcdn.net show a Unicast network design:

Unicast network design

  5. It was interesting to see that requests coming from a particular city weren’t necessarily served from a CDN location in the vicinity. We also saw that the request was served from a CDN location in a different country altogether:

Request served in a different country
  • Request from Atlanta served from CDN server in Hong Kong.
  • Request from Bangalore served from CDN server in Singapore.
  • Request from Seattle served from CDN server in Stockholm.

That helps us understand why the response time is high for a number of requests in these cities: the requests are doing a world tour!
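A quick way to check points 3 and 4 yourself is to parse the edge code out of the hostname and resolve it: with a Unicast design, each hostname resolves to its own static IP rather than a shared Anycast address. Here is a minimal sketch using only the Python standard library (the edge-code-to-city mapping is a partial guess based on the Singapore example above):

```python
import re
import socket

# Partial, assumed mapping of edge codes to locations, based on the
# Singapore example above (sin = Singapore); a full map would be larger.
EDGE_LOCATIONS = {"sin": "Singapore"}

HOST_PATTERN = re.compile(r"scontent-([a-z]+)(\d+)-\d+\.xx\.fbcdn\.net")

def describe_edge(hostname):
    match = HOST_PATTERN.match(hostname)
    if not match:
        return f"{hostname}: does not look like an scontent edge host"
    code = match.group(1)
    location = EDGE_LOCATIONS.get(code, "unknown location")
    # With a Unicast design, each edge hostname maps to its own static IP.
    ips = sorted({info[4][0] for info in socket.getaddrinfo(hostname, 443)})
    return f"{hostname}: edge code '{code}' ({location}), resolves to {ips}"

print(describe_edge("scontent-sin2-2.xx.fbcdn.net"))
```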

Looking at it slightly differently, we found that the commonality between the cities served by the same CDN server was the ISP. 

CDN servers serving multiple cities:

CDN servers serving multiple cities

Common ISPs between the cities:

Common ISPs between cities

The one that definitely needs fixing is NTT IPv6, where the traffic is getting routed to Hong Kong. The underlying cause for this is the peering between NTT and Tata Communication, as you can see from the network path visualization below:

Peering between NTT and Tata Communication
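One rough way to spot a peering detour like this yourself is to trace the IPv6 path toward the affected edge host and look at the reverse DNS names of the hops. Below is a minimal sketch, assuming a Linux machine with the traceroute utility installed (the hostname is the Singapore edge used as an example earlier; your vantage point and results will differ):

```python
import subprocess

# Assumes a Linux system with the traceroute utility installed.
HOST = "scontent-sin2-2.xx.fbcdn.net"

result = subprocess.run(
    ["traceroute", "-6", HOST],  # -6 forces the IPv6 path, where the bad routing was seen
    capture_output=True,
    text=True,
    timeout=120,
)
# The reverse DNS names of the hops often reveal the carriers along the path,
# which is how a detour through an unexpected country shows up.
print(result.stdout)
```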

The rollout of the fix was gradual across the CDN servers, with some recovering faster than others:

Gradual recovery

Things got back to normal across the globe around 22:06 UTC (6:06 pm EDT).

Incidents and outages are scary! We all need to ensure we have the right set of tools and processes in place to help us avoid them and, when they do happen, to help us reduce MTTR.

But don’t miss out on the amount of learning that happens in the process of understanding what went wrong! You might even uncover an underlying problem that has been around for a very long time as you triage the current incident. This knowledge becomes key when you are faced with the next incident.

The post What I Learned About How Facebook Infrastructure Serves Our Photos appeared first on Digital Experience Monitoring | Catchpoint.
