
Bringing down servers, one character at a time


Last week, I mentioned a bug that I caused, and a few people asked me for details about it, so here is what happened.

The code in question is parsing a calendar and processing meetings. Before showing the buggy code, here is what a shortened version of the fix ended up looking like:

while (maxDays < 7 && (eventCount < 20) || attendeeCount < 40) {
    // process meetings
    // increment maxDays whenever we cross a day boundary
    // possibly increment eventCount or attendeeCount based on the meeting
}

The bug went unnoticed in production for a few days until, suddenly, the servers started pinning all their CPUs and, shortly thereafter, exhausting the memory of each instance. A restart was the only remedy.

This code loops through someone's calendar and stops once we've gathered enough information. The maxDays variable is what I call a "safety valve", a condition used to guarantee that the code doesn't run amok in case its normal processing flow never finishes. And yet, the crash seems to indicate that the code was doing exactly that.

With this introduction out of the way, here is the line that caused the bug:

Actually... you've already seen it, it's the code I pasted above. Did you spot the problem?

That's right, a simple set of parentheses in the wrong place. Because they surround the wrong expression, they completely defeat the purpose of the safety valve: as long as attendeeCount stays below 40, the loop keeps running no matter what maxDays says. The failure only shows up for people with empty calendars, which explains why it took some time to trigger the bug.
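To make the precedence problem concrete, here is a minimal sketch in Java. The class and method names are made up for illustration, and the `intended` grouping is my assumption about what the condition was meant to say (the day limit guarding both counters), not the actual production fix.

```java
class PrecedenceDemo {
    // The loop condition exactly as it appears in the article.
    static boolean asWritten(int maxDays, int eventCount, int attendeeCount) {
        return maxDays < 7 && (eventCount < 20) || attendeeCount < 40;
    }

    // The same condition with the grouping Java applies: && binds tighter than ||.
    static boolean asParsed(int maxDays, int eventCount, int attendeeCount) {
        return (maxDays < 7 && eventCount < 20) || attendeeCount < 40;
    }

    // Assumed intent: the day limit applies regardless of the two counters.
    static boolean intended(int maxDays, int eventCount, int attendeeCount) {
        return maxDays < 7 && (eventCount < 20 || attendeeCount < 40);
    }

    public static void main(String[] args) {
        // An empty calendar after a full week: no events, no attendees.
        System.out.println(asWritten(7, 0, 0)); // true: the loop keeps going
        System.out.println(asParsed(7, 0, 0));  // true: same grouping as asWritten
        System.out.println(intended(7, 0, 0));  // false: the safety valve stops the loop
    }
}
```

With an empty calendar, attendeeCount never reaches 40, so the right-hand side of the `||` stays true forever and maxDays is never consulted in any way that matters.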

Two characters to bring down an entire server...

Can we learn anything from this bug, besides the fact that testing needs to cover edge cases? A static analysis tool can't be of much help here since we're completely in the realm of runtime behavior, but is there some way that an automated tool could have warned me about the flaw in this termination expression?
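On the testing side at least, the edge case is cheap to simulate. The sketch below is not the real calendar code: `LoopCondition`, `daysUntilStop`, and the iteration cap are hypothetical names I introduced for illustration, and `REGROUPED` is again my assumed reading of the intended condition. It drives the loop with an empty calendar, where only maxDays ever changes, and reports whether the condition ever lets go.

```java
class EmptyCalendarCheck {
    // Hypothetical wrapper around the loop condition so it can be tested in isolation.
    interface LoopCondition {
        boolean keepGoing(int maxDays, int eventCount, int attendeeCount);
    }

    // The condition as written in the article.
    static final LoopCondition AS_WRITTEN =
            (d, e, a) -> d < 7 && (e < 20) || a < 40;

    // Assumed intent: the day limit guards both counters.
    static final LoopCondition REGROUPED =
            (d, e, a) -> d < 7 && (e < 20 || a < 40);

    // Simulates an empty calendar: each iteration crosses one day boundary and
    // the two counters never move. Returns the day on which the loop stops,
    // or -1 if it is still running after `cap` days.
    static int daysUntilStop(LoopCondition condition, int cap) {
        int maxDays = 0;
        int eventCount = 0;
        int attendeeCount = 0;
        while (condition.keepGoing(maxDays, eventCount, attendeeCount)) {
            maxDays++;
            if (maxDays > cap) {
                return -1; // give up: the condition would never end the loop
            }
        }
        return maxDays;
    }

    public static void main(String[] args) {
        System.out.println(daysUntilStop(AS_WRITTEN, 1000)); // -1
        System.out.println(daysUntilStop(REGROUPED, 1000));  // 7
    }
}
```

A check like this only helps if someone thinks to write it, which is the whole problem with edge cases, but it does turn a production outage into a failing test.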


Published at DZone with permission of Cedric Beust, DZone MVB. See the original article here.
