Over a million developers have joined DZone.

Non-reproducible or Intermittent Error Handling

Ayende Rahien shares an anecdote about a particularly troubling non-reproducible error while testing.

· Performance Zone

Discover 50 of the latest mobile performance statistics with the Ultimate Guide to Digital Experience Monitoring, brought to you in partnership with Catchpoint.

We recently had to deal with a stress test that was failing (very) occasionally. We looked into that, and we figured out that this was actually exposing a high severity bug, so we looked into it pretty seriously.

And we kept hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common issue is if you have a race. For example, if two threads are modifying the same variable without proper synchronization, this will cause those kinds of symptoms. We have a lot of experience in dealing with that and all the signs were there. But we still hit a wall.

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort of error is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn’t that.

Eventually, we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:

bool hasMoreWork = false;
do
{
    var sp = Stopwatch.StartNew();
    lock(lockObj)
    {
        while(sp.ElapsedMilliseconds < 150) // don't hold the lock too long, let other people go in
        {
            // do work and set hasMoreWork
        }
    }
    if(hasMoreWork)
    Thread.Sleep(32); // let other threads take the lock
}while(hasMoreWork);

Because this is a long running operation, under lock, we do this in stages, and make sure that other things can run while we do that. But this was exactly what introduced the variability in the test results, and that made it so random and annoying. Once we figured that this was the cause for the difference, all we had to do was write the proper log of operations, and execute it again.

The rest was just finding out which of the 200,000 transactions executed actually caused the problem, mere detail work.

Is your APM strategy broken? This ebook explores the latest in Gartner research to help you learn how to close the end-user experience gap in APM, brought to you in partnership with Catchpoint.

Topics:
error handling ,locks

Published at DZone with permission of Ayende Rahien, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}