Foursquare had a well-discussed outage last week. This wasn't good news for Foursquare, but it can be for the rest of us. By looking at what happened, we can all learn steps to take to avoid a similar occurrence. I do want to state emphatically that I am not writing this to criticize Foursquare for anything. Scaling a platform is a tricky problem, often fraught with pitfalls that are difficult to predict but completely obvious once they've occurred. Rather, the point of this article is to discuss the lessons that I took from the incident to improve future development at Rearden.
Lesson 1 - Instrument Everything
If you do something or use something, log metrics about it. You have entities, components, services, and data stores. Your system is sent requests and events. Every request and event should be instrumented with timings, the resources it uses, and metrics on any interesting volume of work it does (rendering a page, computing a graph, etc.). The simplest rule that I follow is that if it involves an interaction outside your process or the consumption of a precious resource (e.g. memory, files, sockets), there should be logs generated to track it. Of course, as I discussed last week, Flume can make this more scalable and provide you with a framework to...
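As a minimal sketch of this rule, assuming Python and its standard logging module, a small context manager can time any block of work and emit one structured record for it. The `instrumented` helper, the "metrics" logger name, and the field names are hypothetical choices for illustration, not part of any particular library.

```python
import json
import logging
import time
from contextlib import contextmanager

# Hypothetical logger name; point its handlers wherever your logs are collected.
logger = logging.getLogger("metrics")


@contextmanager
def instrumented(operation, **tags):
    """Time a block of work and log one JSON record describing it."""
    record = {"op": operation, **tags}
    start = time.perf_counter()
    try:
        # The caller may attach extra metrics to the record, e.g. rows returned.
        yield record
        record["ok"] = True
    except Exception as exc:
        record["ok"] = False
        record["error"] = type(exc).__name__
        raise
    finally:
        record["elapsed_ms"] = round((time.perf_counter() - start) * 1000, 3)
        logger.info(json.dumps(record, sort_keys=True))


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Wrap anything that leaves the process: a query, an HTTP call, a file read.
    with instrumented("db.query", shard="shard-3") as metrics:
        rows = [1, 2, 3]  # stand-in for a real database call
        metrics["rows"] = len(rows)
```

Because the record is written in the `finally` clause, failed operations are logged with their timing too, which is often where the interesting data is.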
Lesson 2 - Analyze Your Logs
Generating rich telemetry with all of your instrumentation is of limited use if you don't actually mine it. What exactly are you looking for, though? Logs can be voluminous and a bit overwhelming. Again, a simple answer is: any outliers. Most of us hope for systems that behave in a uniform manner, and in fact most do as the number of users and transactions grows into the millions and beyond. Outliers will still exist, though, and because they are often leading indicators of problems, they are worth identifying and understanding.
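One possible way to pick outliers out of a pile of timings is a robust test based on the median and the median absolute deviation (MAD), which is less distorted by the outliers themselves than a mean and standard deviation would be. This is a sketch of one common technique, not something the Foursquare post-mortem prescribes; the function name and the 3.5 threshold (a widely used convention for the modified z-score) are assumptions.

```python
from statistics import median


def find_outliers(samples, threshold=3.5):
    """Return the samples whose modified z-score exceeds the threshold."""
    med = median(samples)
    mad = median([abs(x - med) for x in samples])
    if mad == 0:
        # At least half the samples equal the median, so this test cannot
        # score the rest; a real pipeline would need a fallback here.
        return []
    # 0.6745 scales the MAD so the score is comparable to a standard z-score
    # for normally distributed data.
    return [x for x in samples if 0.6745 * abs(x - med) / mad > threshold]


if __name__ == "__main__":
    response_ms = [100, 102, 98, 101, 99, 103, 97, 450]
    print(find_outliers(response_ms))  # prints [450]
```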
Never underestimate the power of simple graphs either. Consider the graph below, which charts response time (application or database) per shard. The blue and green shards are reasonably close in response time, while the red shard is clearly responding slowly at times, which is a cause for concern. The gold shard, though, is struggling badly. There are many ways to determine this through analysis, but simply plotting it makes it immediately visible. Graphing obvious metrics can be incredibly insightful, often giving clues as to what other data might be worth analyzing.
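The same per-shard comparison can also be approximated in code, for example to drive an alert alongside the graph. The sketch below groups response times by shard and flags any shard whose 95th percentile is well above the median of all samples. The function names, the factor of 3, and the sample numbers are invented for illustration and are not data from the incident.

```python
import math
from collections import defaultdict
from statistics import median


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slow_shards(samples, factor=3.0):
    """Given (shard, response_ms) pairs, return {shard: p95} for slow shards.

    A shard counts as slow when its p95 exceeds factor times the median
    response time across all shards combined.
    """
    by_shard = defaultdict(list)
    for shard, ms in samples:
        by_shard[shard].append(ms)
    baseline = median([ms for _, ms in samples])
    p95 = {shard: percentile(times, 95) for shard, times in by_shard.items()}
    return {shard: value for shard, value in p95.items() if value > factor * baseline}


if __name__ == "__main__":
    # Made-up timings shaped like the graph described above.
    timings = {
        "blue": [20, 22, 21, 23],
        "green": [21, 24, 22, 25],
        "red": [22, 25, 90, 24],
        "gold": [150, 180, 160, 200],
    }
    samples = [(shard, ms) for shard, times in timings.items() for ms in times]
    print(slow_shards(samples))  # prints {'red': 90, 'gold': 200}
```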
Lesson 3 - Partitioned Availability is Tricky
Partitioning clearly helps performance, but it also offers the opportunity to partition your availability. If a partition goes down, it takes its users down with it, but other users can stay up. The theory is sound, but actually implementing it is significantly harder. You have to build your components to correctly handle shards coming and going. You have to understand your dependencies completely, because one wayward cross-shard dependency can render your entire plan useless. Most importantly, you have to test your failed-shard availability regularly, potentially before each deployment. The ease with which an unexpected dependency can slip in is surprising.
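A failed-shard test of this kind can be sketched with a toy in-memory store: take each shard down in turn and confirm that users on the healthy shards are still served. Everything here (`ShardedStore`, the modulo placement, the handler functions) is a hypothetical stand-in for a real system, meant only to show the shape of the check.

```python
class ShardUnavailable(Exception):
    """Raised when a request touches a shard that is marked down."""


class ShardedStore:
    """Toy in-memory store standing in for a sharded database."""

    def __init__(self, shard_count):
        self.shards = [dict() for _ in range(shard_count)]
        self.down = set()

    def shard_for(self, user_id):
        return user_id % len(self.shards)  # assumed placement rule

    def put(self, user_id, value):
        self.shards[self.shard_for(user_id)][user_id] = value

    def get(self, user_id):
        idx = self.shard_for(user_id)
        if idx in self.down:
            raise ShardUnavailable(f"shard {idx} is down")
        return self.shards[idx].get(user_id)


def check_partitioned_availability(store, handler, user_ids):
    """Return (down_shard, user_id) pairs where a healthy-shard user failed."""
    violations = []
    try:
        for idx in range(len(store.shards)):
            store.down = {idx}
            for uid in user_ids:
                if store.shard_for(uid) == idx:
                    continue  # users on the failed shard are expected to fail
                try:
                    handler(store, uid)
                except ShardUnavailable:
                    violations.append((idx, uid))
    finally:
        store.down = set()
    return violations


def load_profile(store, uid):
    return store.get(uid)  # touches only the user's own shard


def load_profile_with_leak(store, uid):
    store.get(0)  # hidden cross-shard dependency: a "global" record on shard 0
    return store.get(uid)


if __name__ == "__main__":
    store = ShardedStore(4)
    users = list(range(1, 9))
    for uid in users:
        store.put(uid, {"name": f"user-{uid}"})
    print(check_partitioned_availability(store, load_profile, users))
    print(check_partitioned_availability(store, load_profile_with_leak, users))
```

In this sketch the first check returns an empty list, while the second reports every user on a healthy shard as affected whenever shard 0 is down, which is how a single stray dependency shows up.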
Lesson 4 - Perfect Storms Will Happen
Instrument your code, analyze the metrics, design your dependencies carefully, and test your system thoroughly, but none of it will ensure constant success. Turning on your servers, pointing them at the Internet, and encouraging people to use your product is an unforgiving endeavor. The Internet will throw unexpected traffic, non-uniform distributions, and freak artifacts at you regularly just to remind you who you serve. We have to continue to learn from our own and others' disasters, improve and refine our designs and implementations, and realize that, like all "disasters" where no lives are lost, they make for great stories after enough time has passed to forget the stress and embarrassment of the moment.