Non-reproducible or Intermittent Error Handling

Ayende Rahien shares an anecdote about a particularly troubling non-reproducible error while testing.

We recently had to deal with a stress test that was failing (very) occasionally. When we investigated, we found that it was actually exposing a high-severity bug, so we took it pretty seriously.

And we kept hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common cause is a race condition. For example, if two threads modify the same variable without proper synchronization, you will see exactly these kinds of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.
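To make the race described above concrete, here is a minimal sketch (not taken from the code base in this story). The class name `RaceDemo` and its members are hypothetical. Several tasks increment one counter without synchronization and a second counter under a lock; the unsynchronized total can come up short by a different amount on each run, which is what makes such a bug look intermittent.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical illustration; these names do not appear in the original code.
public static class RaceDemo
{
    private static int _unsynchronized;
    private static int _locked;
    private static readonly object LockObj = new object();

    public static (int Unsynchronized, int Locked) Run(int threads = 4, int iterations = 100_000)
    {
        _unsynchronized = 0;
        _locked = 0;

        var tasks = new Task[threads];
        for (int t = 0; t < threads; t++)
        {
            tasks[t] = Task.Run(() =>
            {
                for (int i = 0; i < iterations; i++)
                {
                    // Read-modify-write without synchronization: two threads can
                    // read the same value and one of the increments is lost.
                    _unsynchronized++;

                    // Serialized by the lock: no increment is ever lost.
                    lock (LockObj)
                    {
                        _locked++;
                    }
                }
            });
        }

        Task.WaitAll(tasks);
        return (_unsynchronized, _locked);
    }
}
```

Calling `RaceDemo.Run()` always yields `threads * iterations` for the locked counter, while the unsynchronized counter may be lower, and by how much depends on scheduling.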

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single-threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort of error is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn’t that.

Eventually, we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:

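// requires: using System.Diagnostics; using System.Threading;
bool hasMoreWork = false;
do
{
    var sp = Stopwatch.StartNew();
    lock (lockObj)
    {
        // Don't hold the lock too long; give other threads a chance to take it.
        while (sp.ElapsedMilliseconds < 150)
        {
            // do one unit of work and set hasMoreWork
            // (break out of this loop once there is no more work)
        }
    }
    if (hasMoreWork)
        Thread.Sleep(32); // let other threads take the lock
} while (hasMoreWork);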

Because this is a long-running operation under a lock, we do it in stages and make sure that other things can run in between. But this was exactly what introduced the variability in the test results: how much work fits into each 150 ms slice depends on timing, so the points at which other threads got to run differed from one run to the next. That is what made the failure so random and annoying. Once we figured out that this was the cause of the difference, all we had to do was write a proper log of the operations and execute it again.
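The idea of writing a log of operations and executing it again can be sketched roughly as follows. This is only an illustration under assumptions: `RecordingStore`, `Put`, and `Replay` are hypothetical names, and the real system's operations are certainly richer than a key/value write. The point is that every mutation is appended to the log while the lock is held, so the log captures the order in which operations actually ran, and replaying it single-threaded reproduces that order exactly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch; none of these names come from the original code base.
public sealed class RecordingStore
{
    private readonly object _lockObj = new object();
    private readonly Dictionary<string, long> _data = new Dictionary<string, long>();
    private readonly List<(string Key, long Value)> _log = new List<(string Key, long Value)>();

    // Every mutation goes through here. The log entry is written under the
    // same lock as the change, so log order equals execution order.
    public void Put(string key, long value)
    {
        lock (_lockObj)
        {
            _log.Add((key, value));
            _data[key] = value; // last writer wins, so ordering matters
        }
    }

    public bool TryGet(string key, out long value)
    {
        lock (_lockObj)
        {
            return _data.TryGetValue(key, out value);
        }
    }

    // Returns a copy so callers can persist or replay it safely.
    public List<(string Key, long Value)> SnapshotLog()
    {
        lock (_lockObj)
        {
            return new List<(string Key, long Value)>(_log);
        }
    }

    // Re-executes a recorded log on a fresh store, on one thread, in log order.
    public static RecordingStore Replay(IEnumerable<(string Key, long Value)> log)
    {
        var store = new RecordingStore();
        foreach (var (key, value) in log)
        {
            store.Put(key, value);
        }
        return store;
    }
}
```

After a failing run, `RecordingStore.Replay(original.SnapshotLog())` rebuilds the same state deterministically, no matter how the threads happened to interleave the first time.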

The rest was just finding out which of the 200,000 transactions executed actually caused the problem, mere detail work.
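The post does not say how the offending transaction was located. One common way to narrow down a large recorded log is to bisect it, sketched below under a strong assumption: once the bad operation has been replayed, every longer prefix of the log also fails the check. `LogBisect` and the `failsAfter` callback are hypothetical; the callback would replay the first `n` operations on a fresh instance and report whether the error shows up.

```csharp
using System;

// Hypothetical helper, not the method used in the original story.
public static class LogBisect
{
    // Returns the smallest n in [1, count] for which failsAfter(n) is true,
    // or -1 if even the full log does not fail.
    // Assumes monotonic failure: if the first n operations fail,
    // any longer prefix fails as well.
    public static int FirstFailingPrefix(int count, Func<int, bool> failsAfter)
    {
        if (count <= 0 || !failsAfter(count))
        {
            return -1;
        }

        int lo = 1;
        int hi = count; // invariant: failsAfter(hi) is true

        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (failsAfter(mid))
            {
                hi = mid;      // the culprit is at mid or earlier
            }
            else
            {
                lo = mid + 1;  // the culprit is after mid
            }
        }

        return lo; // operation number lo is the first one that triggers the failure
    }
}
```

With 200,000 operations this needs about 18 replays instead of a linear scan, although each replay may itself be expensive.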

Topics:
error handling, locks

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
