Something scary happened on Saturday. At Famigo, we have several different monitoring systems for our production environment. At about 3 AM, they all collectively went nuts. I happened to be up then because my son had a coughing fit, so I checked the site, verified that nothing weird was happening, and chalked it up to code gremlins.
When I woke up a few hours later, I saw roughly 38 gazillion more alerts. Based on the amount of load we were seeing, you'd think Barack Obama had interrupted all network programming to give a Famigo plug (note to Barack: you should do this). Still, I didn't actually see many users on the site, just a lot of load. Also, we're load balanced in production, and each traffic-handling node was under heavy load; it wasn't just one machine.
Maybe It's Not a Problem?
Saturday mornings are a key time for us at Famigo because we send out a report to each activated user, showing what apps their kids played along with some personalized app recommendations. When I saw the load, I immediately thought something was wrong with the email generation process. Maybe we dropped an index somewhere and the queries were suddenly taking a long time? I resolved not to worry any more about this until the emails were done.
Here's the problem: the emails wouldn't finish. They were going so slowly due to the load that we'd be sending these emails for days. I now began to worry more. I canceled our email task and began troubleshooting in earnest.
Bizarre Facts Emerge
As I SSHed into our various boxes, I noticed, via a top command, that it was the Python process serving our website and API that was consuming 100% of the CPUs. That was intriguing. I restarted all of the usual suspects (MongoDB, lighttpd, our web app), only to see things quickly begin to degrade again. Within a few minutes, Python was once again consuming 100% of the CPU.
Like most web apps, we don't really do much work that's CPU-bound. In fact, one of the laws of performance I've learned is that if you do have work that's CPU-intensive, always do it in the background. And yet, we were clearly taxing the CPU. Maybe we accidentally pushed a commit that attempts to generate pi to 1 million digits every time someone made a web request?
It seemed clear that the issue was with our code. Here's another weird thing: we hadn't changed much lately. I went through all of the commits for the past 2 days, and it was all pretty boring stuff. Just to verify, I pointed my dev instance to our production database (don't try this at home, kids) and began to actually navigate through these recent changes. Like I initially thought, there was nothing earth-shattering; it was all pretty standard web stuff.
At this point, I began to think I was hallucinating. I hadn't seen Girl Interrupted, but I imagined that Winona Ryder got committed in that movie because of the mental strain of debugging in production.
A-ha! (Or, How to Solve Problems Through Random Mouse Clicks)
I didn't know what to do. I just began clicking around our site on my dev instance, monitoring page load time in Chrome. Everything looked just fine on my dev instance at first, and then I began to notice one trivial view taking a little bit longer than it should've. The page itself should've rendered in a second or so, and it was taking closer to 1.5 seconds. The more I reloaded, the longer it took. This was particularly interesting because this view was our application view (here's an example). We can render over 30,000 apps with that view, so while the logic is very simple, it's constantly being rendered because of all the traffic.
Now, we get to the funny part. As I mentioned, we analyze tens of thousands of apps and, if you knew the right app slug, you could actually render any app, even dirty ones, with that view. While we didn't advertise this fact, you could get to stuff like famigo.com/app/super-sexy-sex-time/. Those are clearly not the apps we want families to see, even progressive European families. So, I had recently put in just a bit of work to keep sex apps from rendering. Every time someone requested an app, we'd check to make sure it was in the set of allowed, non-sex apps before we rendered anything.
It's really just one line of code that does the no-sex-app check. The no-sex-app check wasn't being done in the database; I was basically saying, in Python, 'raise 404 if app not in good_apps'. That's so simple! On my development instance, it worked fine. It didn't slow down our unit tests. However, when I wrote a quick bash script to request that URL 10 times simultaneously, things began to explode. In production, when we regularly have 25+ visitors requesting that URL at all times, everything truly burst into flames. It turns out that this particular view was CPU-bound, I just didn't know it until it encountered some scale.
Scale Drives You Mental
I think this is a fascinating bug. It's very simple logic that would work fine if we had 3 or 5 people on the site at any moment. With 10x that traffic, it was catastrophic to the rest of the platform. Imagine how crazy this gets at 100x or 10,000x that traffic. That's what makes scalability fun: gigantic issues at scale often come from very innocuous code. At least for me, the root cause is never what I expect.
How do we prevent this from happening in the future? I'm not entirely sure. It's not really something that fits into a unit test or an integration test. As a stopgap solution, we wrote a Python decorator that wraps all of our views and logs how long they took to render. Based on that, we can calculate how long a view should take to render, and alert ourselves if the render time is outside a reasonable span of time. It's not perfect, but it's a start. Anybody have a better idea?