By the 1st of October, we were starting to receive notifications from our monitoring system indicating individual databases were overloading to the point where they were unresponsive. In order to repair the errors users would experience from their assigned database node being overloaded, we began initiating “migrations” of those users to other less loaded database servers. The problem with this approach is that it involves forcing a “First Sync” of all the users data to a brand new node, inducing additional load on that server as well. At this point, we’re sure there’s some sort of problem with Firefox 7 and how it’s interacting with the Sync servers, but we’re unable to find any correlation between failure rates and any functionality in Fx7. Outside of 503 errors occurring between our Zeus load-balancers and the back-end platform, we’re receiving nothing indicating what the root cause may be. -- Jeff Vier
They clearly have a DevOps culture at Mozilla:
Eng/Ops side-bar: Mozilla Services is unlike any organization I’ve worked for previously. It’s not just that Engineering and Operations are friendly with each other (which I’ve encountered at a few other places), but we work very closely on a daily basis and are very cooperative. The two teams sit intermingled (those who are based out of the main Mountain View office are, anyway), the bulk of our online team chats happen in a combined chat room. When there are all-hands events at the Mozilla offices, our team dinners include both Eng & Ops. It’s fantastic, and makes for an excellent working environment even when things aren’t going to plan.-- Jeff Vier
The problem was a bug in the instant sync feature, but the real value of this post is getting a look at how a major operation like Mozilla deftly handles these kinds of meltdowns with a step by step account of monitoring, actions taken, and follow up actions for Ops and Engineering. Perhaps you'll find some strategies that can be implemented in your own systems.
Read all about it here.