Unlike previous posts in this series, this is actually something that happened to our own production server today. It resulted in our website being inaccessible for a couple of hours, and like most such stories, its root cause is tremendously simple, and through a series of unfortunate accidents, it had escalated to a major issue.
First, this post is dedicated to this book, which should be read by any self-respecting developer whose code is expected to hit production.
This post is also written a few hours only after the incident was resolved. Before we actually implemented anything except a temporary workaround. I’ll probably have another post in a couple of days to talk about the steps we are going to take to alleviate a repeat of this incident.
The incident started innocently enough when one of the guys on the team discovered that the startup time of a certain instance jumped by a lot. Investigating into why he realized that the issue was extremely slow responses from our server. That was a cause of triple concern, actually. First, why are we accepting such slow responses instead of time limiting them? Second, why are we making a remote synchronous call during startup? And third, why on Earth is our server so slow?
Logging into the server, it didn’t take long to see what the problem was. The www.RavenDB.net website (the code that runs the RavenDB website, not RavenDB itself) was consuming a lot of CPU and quite a bit of memory. In a bit to restore the other services which reside on the same box, we reset the process. Our main concern at the time was to restore service as soon as possible, and we planned on investigating further through the logs.
However, in a few minutes, the www.RavenDB.net website started consuming more and more resources. At that point, we started considering a DoS attack of some sort and looked a the logs. The logs did show a very high number of requests, much more than I would expect. But looking further into them, it looked like they were mostly bots of various kinds indexing our site.
Considering that this might be the case of Google hammering us, we configured a robots.txt on the site and waited to see if this would have an impact. It didn’t.
The next step was to take a process dump of the process and then analyze it. During this period, we had to shut down www.RavenDB.net because it was killing all other services running on the server.
Looking at the dump in WinDBG, we started with the obvious commands.
- !runaway – to find out the thread CPU times
- switching to the busiest threads
- !clrstack - to see what it is doing
Honestly, this is a much nicer way of looking at this, though:
As you can see, the threads are currently actually doing parsing of a Razor template and seems to be doing that on a fairly continuous basis, consuming all system resources.
At that point, I started getting concerned about the well-being of the poor guy’s inbox as a result of this code. That was the point where we actually did what should have probably been our first action, and looked at the error log of the website.
Previously, we looked at the live metrics and the request log but didn’t consider looking into the error log for the system. The error log for the website, for today only, was 6GB in size, and was pretty full of errors such as:
- error: (221, 88) 'HibernatingRhinos.Orders.Common.EmailProcessing.EmailTemplates.RavenDBWebsite.Models.DownloadQuestionMailInput' does not contain a definition for 'Unsubscribe' and no extension method 'Unsubscribe' accepting a first argument of type
And at that point, we had enough to point a suspicious finger. We have an email that we send out, and we used to have a valid template. At some point, the code was changed and the Unsubscribe was removed. Nothing broke because the template is just a file, not actually compiling code. However, in production, when we tried to send the email, Razor would parse the text, fail compilation because of the missing member, and basically throw a hissy fit.
Update: We investigated further, and it looks like the following was the actual “solution” to the outage:
The “solution” is in quotes, becasue this fixes the problem, but we need to still implement steps to ensure that something like that doesn’t repeat.
Unfortunately, at that point, we would consider this email as failing, and move on to the next one. That next one would also fail, and so would the next one, etc. Because all of them failed, they would get picked up again next time this run.
Once we knew where the problem was. The workaround was to deploy a version with no email sending. For this weekend, that will work. But come Sunday, someone is going to go over this piece of code with a very fine comb. I’ll post more about it once this actually rolls around.