The Firefox 7 Sync Meltdown: An Interesting Postmortem

If you use Firefox Sync, you may have noticed major performance issues over the last few days. The Mozilla Services Operations team has written up a comprehensive and honest postmortem, just as you would expect from an open source, transparent, non-profit organization. Jeff Vier explains:

By the 1st of October, we were starting to receive notifications from our monitoring system indicating individual databases were overloading to the point where they were unresponsive. In order to repair the errors users would experience from their assigned database node being overloaded, we began initiating “migrations” of those users to other less loaded database servers. The problem with this approach is that it involves forcing a “First Sync” of all the users' data to a brand new node, inducing additional load on that server as well. At this point, we’re sure there’s some sort of problem with Firefox 7 and how it’s interacting with the Sync servers, but we’re unable to find any correlation between failure rates and any functionality in Fx7. Outside of 503 errors occurring between our Zeus load-balancers and the back-end platform, we’re receiving nothing indicating what the root cause may be.

-- Jeff Vier

They clearly have a DevOps culture at Mozilla:

Eng/Ops side-bar: Mozilla Services is unlike any organization I’ve worked for previously. It’s not just that Engineering and Operations are friendly with each other (which I’ve encountered at a few other places), but we work very closely on a daily basis and are very cooperative. The two teams sit intermingled (those who are based out of the main Mountain View office are, anyway), the bulk of our online team chats happen in a combined chat room. When there are all-hands events at the Mozilla offices, our team dinners include both Eng & Ops. It’s fantastic, and makes for an excellent working environment even when things aren’t going to plan.

-- Jeff Vier

The problem was a bug in the instant sync feature, but the real value of the postmortem is the look it gives at how a major operation like Mozilla handles this kind of meltdown: a step-by-step account of the monitoring, the actions taken, and the follow-up actions for Ops and Engineering. Perhaps you'll find some strategies that you can implement in your own systems.

Read all about it here.
