Fun things happen when your code lives on the Internet. Take this one component we support and maintain at TUNE. It's old and brittle, but it's a low priority problem since it's usually not on fire and only a handful of folks use it. (When you're working at scale, you must sometimes embrace that mindset.)
One day last month, this codebase went from a lovable oddball to a legitimate problem when it began alerting constantly due to failed health checks. We have thousands of instances, but this particular component drove more PagerDuty alerts than any system company-wide over the past 4 weeks. No one knew why.
Now, this system has squawked at us from time to time in the past. Our usual approach is to patch it up and move on. I'm one of the chief perpetrators of this approach so I guess I'm obligated to defend it here. My reasoning always was, it's this weird little thing with very limited usage; there's practically no way I'll make it worse! This approach usually worked fine: the alarms would silence and we could focus on more valuable problems for the next few months.
I tried this same approach again in early January, and this time, it didn't work. I mean, it mostly worked; after my change, the alarms went from 20 a day to 2 a day. That was still just enough to cross the threshold of annoyance for our on-call engineers, though.
The Problem Was, We Didn't Know the Problem
I dug into the nginx logs to understand all the other requests that were being fired during these alerting moments and I soon realized something interesting about the logs. There were a lot of them! So many logs, in fact, that it was hard to glean anything via my usual cat / grep / awk / sort / uniq toolkit.
When gigabytes of evidence conflict with your own poorly-understood models of a system, it's time to reassess. That's what I did. Rather than slap another band-aid on this codebase, I spent a couple of days getting its access and error logs into our Elasticsearch + Logstash + Kibana stack. (Side note: the ELK stack's development environment is convoluted and insane. Once you finally get it into a state where you can actually debug things, you might as well go nuts and figure out how to parse lots of different logs. The end results are impressive and informative so it's probably worth the frustration.)
Once I was able to visualize request load and latency in Kibana, I realized that absolutely none of us understood how this codebase was being used or the load it was under. If you'd told me, prior to this, that this system was taking 2,000 requests a day, I would've believed you. In fact, I would've said, "Good job, little codebase, for handling that!" And I would've had no idea what I was talking about, because the logs showed 2 million requests a day. The crazy part isn't the load; the crazy part is that our expectations were orders of magnitude off and had been for years.
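If you want a quick sanity check on request volume before you've wired up a full ELK stack, a few lines of Python can get you a rough per-day count straight from an access log. This is only a sketch, not the tooling we used: it assumes nginx's default "combined" log format (with the bracketed timestamp), and the function name `requests_per_day` is made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumption: lines follow nginx's default "combined" format, which includes
# a bracketed timestamp such as [10/Jan/2017:13:55:36 +0000].
TIMESTAMP = re.compile(
    r"\[(\d{2}/[A-Za-z]{3}/\d{4}):\d{2}:\d{2}:\d{2} [+-]\d{4}\]"
)


def requests_per_day(lines):
    """Count log lines per calendar day (hypothetical helper).

    Lines without a recognizable timestamp are skipped rather than
    treated as errors, since real logs tend to contain some junk.
    """
    counts = Counter()
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        # Only the date portion is captured; time and offset are ignored.
        day = datetime.strptime(match.group(1), "%d/%b/%Y").date()
        counts[day] += 1
    return counts
```

Because the function accepts any iterable of lines, you can pass it an open file handle and it will stream through the log instead of loading gigabytes into memory. It won't replace Kibana, but it's enough to tell you whether your guess is off by a factor of a thousand.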
Once I was able to see that traffic, I realized a few things. First, this system was way more important to our customers than I thought and had been for a while. Second, it had been stretched way beyond its limits. Third, the thinking that led us into this conundrum had probably caused us to undervalue other components.
They're Out to Get You
If I had to draw a single conclusion from all this, it's that ignorance is not an operational strategy. Don't let a lack of clanging alarms or an absence of outraged bug reports lull you into a false sense of security. If you want to write important code that performs meaningful tasks for your customers, you can't deploy it and hope for the best. You'll never spot your problems when you're operating via blind optimism.
If ignorance isn't an operational strategy, what is? Paranoia. You should code and run your systems like a large group of Internet lunatics are out to abuse the hell out of them.
With properly-directed, well-intentioned paranoia, you can spot your operational problems before they become catastrophes. Focus the paranoia on your process, and ensure that the delivery of metrics, alarms, and runbooks is required for every major push. Focus the paranoia on your metrics and review them regularly to understand where system performance is headed. Focus the paranoia on your outages and work hard to understand, then address, the root causes of your problems.
Once you've successfully changed your mindset and are running your systems like the Internet is out to get you, you've suddenly got a fascinating new lens to view your software through. You can think systematically about quality and scale. You don't have to rely on band-aids for the problem of the week. You can sleep well, knowing that 3 AM alarms will be few and legitimate. Ironically enough, you work your way to worry-free software by harnessing the power of paranoia.