Non-reproducible or Intermittent Error Handling

Ayende Rahien shares an anecdote about a particularly troubling non-reproducible error while testing.

We recently had to deal with a stress test that was failing (very) occasionally. When we investigated, we figured out that it was actually exposing a high-severity bug, so we took it pretty seriously.

And we kept hitting a wall.

There are a few reasons why you would have a non-reproducible error. The most common one is a race condition. For example, two threads modifying the same variable without proper synchronization will cause exactly these kinds of symptoms. We have a lot of experience in dealing with that, and all the signs were there. But we still hit a wall.
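To make that symptom concrete, here is a minimal sketch of such a race. It is not the code from the stress test; the names `RaceDemo`, `CountUnsynchronized`, and `CountWithLock` are made up for illustration. Two threads increment the same counter, once without synchronization and once under a lock.

```csharp
using System;
using System.Threading;

public static class RaceDemo
{
    // Hypothetical helper: two threads increment a shared variable with no synchronization.
    // counter++ is a read-modify-write, so increments can be lost when the threads interleave.
    public static int CountUnsynchronized(int perThread)
    {
        int counter = 0;
        ThreadStart body = () =>
        {
            for (int i = 0; i < perThread; i++)
                counter++;
        };
        var a = new Thread(body);
        var b = new Thread(body);
        a.Start(); b.Start();
        a.Join(); b.Join();
        return counter; // at most 2 * perThread, and it can differ from run to run
    }

    // Same work, but every increment happens while holding a lock.
    public static int CountWithLock(int perThread)
    {
        int counter = 0;
        object gate = new object();
        ThreadStart body = () =>
        {
            for (int i = 0; i < perThread; i++)
                lock (gate) { counter++; }
        };
        var a = new Thread(body);
        var b = new Thread(body);
        a.Start(); b.Start();
        a.Join(); b.Join();
        return counter; // always 2 * perThread
    }
}
```

The unsynchronized version may well return the right answer on most runs, which is exactly what makes this class of bug look intermittent.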

You see, the problem was that the entire section of the code we were looking at was protected by a lock. There was no possibility of an error because of threading since this whole part of the code just couldn’t run concurrently anyway.

So why was it failing only occasionally? If it is single-threaded, it should be predictable. In fact, the reason there was a lock there, instead of the more complex merge operations we used to have, was specifically to support reproducibility. The other kind of issue that can create this sort of error is I/O (which has its own timing issues), but the error happened in a part of the code that was operating on internal data structures, so it wasn’t that.

Eventually, we figured it out. This was a long background operation, and because we needed to have the lock in place, we had a piece of code similar to this:

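// needs: using System.Diagnostics; using System.Threading;
bool hasMoreWork = false;
do
{
    lock (lockObj)
    {
        // start timing once we hold the lock, so time spent waiting for it doesn't count
        var sp = Stopwatch.StartNew();
        do
        {
            // do one unit of work and set hasMoreWork
        } while (hasMoreWork && sp.ElapsedMilliseconds < 150); // don't hold the lock too long, let other people go in
    }
    if (hasMoreWork)
        Thread.Sleep(32); // let other threads take the lock
} while (hasMoreWork);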

Because this is a long-running operation under a lock, we do it in stages and make sure that other things can run in between. But this was exactly what introduced the variability in the test results: each stage is bounded by elapsed time rather than by the amount of work done, so the points at which other threads got the lock differed from run to run. That is what made the failure so random and annoying. Once we figured out that this was the cause of the difference, all we had to do was write a proper log of the operations and execute it again.
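The record-and-replay idea can be sketched as follows. This is an illustration under assumptions, not the actual code: `Operation`, `RecordingStore`, and the toy update rule (new value = old value * `Factor` + `Delta`) are hypothetical stand-ins for the real transactions, chosen only because the result depends on the order in which operations are applied.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical operation; the update rule is order-sensitive on purpose.
public sealed class Operation
{
    public string Key { get; }
    public long Factor { get; }
    public long Delta { get; }

    public Operation(string key, long factor, long delta)
    {
        Key = key;
        Factor = factor;
        Delta = delta;
    }
}

public sealed class RecordingStore
{
    private readonly object _lock = new object();
    private readonly Dictionary<string, long> _state = new Dictionary<string, long>();
    private readonly List<Operation> _log = new List<Operation>();

    // Apply an operation and record it in the order in which the lock was granted.
    public void Apply(Operation op)
    {
        lock (_lock)
        {
            ApplyTo(_state, op);
            _log.Add(op);
        }
    }

    public List<Operation> SnapshotLog()
    {
        lock (_lock) return new List<Operation>(_log);
    }

    public Dictionary<string, long> SnapshotState()
    {
        lock (_lock) return new Dictionary<string, long>(_state);
    }

    // Replay a recorded log on a single thread: the same log gives the same state every time.
    public static Dictionary<string, long> Replay(IEnumerable<Operation> log)
    {
        var state = new Dictionary<string, long>();
        foreach (var op in log)
            ApplyTo(state, op);
        return state;
    }

    private static void ApplyTo(Dictionary<string, long> state, Operation op)
    {
        state.TryGetValue(op.Key, out long current); // missing keys start at 0
        state[op.Key] = current * op.Factor + op.Delta;
    }
}
```

The point of the design is that the log captures the one thing that varied between runs, the order in which operations got the lock. Replaying it takes timing out of the picture.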

The rest was just finding out which of the 200,000 executed transactions actually caused the problem: mere detail work.
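One possible way to do that detail work, once the log replays deterministically, is to bisect it. The sketch below is an assumption, not a description of what was actually done: `LogBisect` and the `prefixFails` callback (replay the first n operations and report whether the failure shows up) are hypothetical, and the search is only valid if the failure is monotonic, meaning that once a prefix fails, every longer prefix fails too.

```csharp
using System;

public static class LogBisect
{
    // Returns the length of the shortest failing prefix of the log,
    // or -1 if the log is empty or even the full log does not fail.
    // Assumes monotonic failure: if a prefix fails, every longer prefix fails.
    public static int ShortestFailingPrefix(int logLength, Func<int, bool> prefixFails)
    {
        if (logLength <= 0 || !prefixFails(logLength))
            return -1;

        int lo = 1, hi = logLength; // invariant: the prefix of length hi fails
        while (lo < hi)
        {
            int mid = lo + (hi - lo) / 2;
            if (prefixFails(mid))
                hi = mid;     // the culprit is at or before mid
            else
                lo = mid + 1; // the culprit is after mid
        }
        return lo; // the last operation in this prefix is the one that triggers the failure
    }
}
```

With 200,000 operations this needs fewer than 20 replays, instead of checking the state after every single transaction.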

Topics:
error handling, locks

Published at DZone with permission of
