An important aspect of Facebook's development culture is the idea that developers are fully responsible for how their code behaves in production. This philosophy mirrors the "DevOps" movement, which encourages lowering the wall between software development and IT operations.
If any of the code in a Facebook update causes problems in production, the developer who wrote it is on the hook for making sure that the issue gets resolved as quickly as possible.
--Ryan Paul, Ars Technica "Exclusive: a behind-the-scenes look at Facebook release engineering"
I think we've got more evidence here that "DevOps" is just a name for what many super-successful companies have already been doing. On to the TL;DR!
Facebook's deployment process:
- Uses a custom BitTorrent P2P system
- Site updates take 15-30 minutes
- JS, CSS, and Graphics are hosted on distributed CDNs
- The update first goes to an "a2" tier, which rolls it out to a small, random collection of users.
- Check-in procedure on company IRC, where all developers who submitted code for the update have to confirm that they are ready to respond if there is a problem when the update goes out.
- Head of release engineering issues command in terminal to begin deployment
- Watch web-based monitoring dashboards as the update rolls out.
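Two of the steps above, the "a2" tier and the IRC check-in, are easy to picture in code. What follows is only a rough sketch and not Facebook's actual tooling: the function names, the hash-based bucketing, and the percentage parameter are all assumptions made for illustration.

```python
import hashlib


def in_early_tier(user_id, release, percent):
    """Hypothetical helper: place roughly `percent`% of users in the early tier.

    Hashing the release name together with the user id gives each release
    a different, effectively random set of users, while keeping a given
    user's assignment stable for the duration of that release.
    """
    digest = hashlib.sha256(f"{release}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10000  # bucket in 0..9999
    return bucket < percent * 100


def ready_to_deploy(authors, confirmed):
    """Hypothetical check-in gate: every author in the update must have confirmed."""
    return set(authors) <= set(confirmed)


if __name__ == "__main__":
    early = sum(in_early_tier(uid, "weekly-push", 5) for uid in range(10000))
    print(f"{early} of 10000 users would see the update first")
    print(ready_to_deploy({"alice", "bob"}, {"alice"}))
```

The gate is deliberately strict: a single author who hasn't confirmed holds the release, which matches the check-in rule described above.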
A small number of servers will fail during most deployments, but that usually doesn't cause any issues. Any of Facebook's servers can handle any sort of page request, so they don't have to worry about serializing and migrating user session state. The servers can keep handling incoming page requests during their software updates, which means no downtime.
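To see why stateless servers make a few failures harmless, here is a toy simulation. It assumes a simple random load balancer and made-up server names; it is not based on anything the article says about Facebook's real routing.

```python
import random


def pick_server(servers, failed, rng):
    """Return any server that is still up; no session affinity is needed."""
    healthy = [s for s in servers if s not in failed]
    if not healthy:
        raise RuntimeError("no healthy servers left")
    return rng.choice(healthy)


def serve_during_deploy(servers, failed, num_requests, seed=0):
    """Route each request to a healthy server and return the servers used."""
    rng = random.Random(seed)
    return [pick_server(servers, failed, rng) for _ in range(num_requests)]


if __name__ == "__main__":
    servers = ["web1", "web2", "web3", "web4"]  # hypothetical hosts
    handled = serve_during_deploy(servers, failed={"web2"}, num_requests=100)
    print(len(handled), "requests served;", "web2" in handled)
```

Because no request depends on state held by a particular machine, the balancer can simply skip a failed host and nothing has to be migrated.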
Frequency of deployments:
- One minor update on most business days
- One major update on a weekly basis, usually Tuesdays
"Release early and often." The mantra rings true at Google, Facebook, and many other companies that we look up to as the best in the business.
Make sure you check out the whole article for the full story on Facebook's release engineering.