Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Production Crash Root Cause Analysis Step-by-Step

DZone's Guide to

Production Crash Root Cause Analysis Step-by-Step

We take a look at an example of an analysis of how a crash in a stress-testing environment happened and what needs to be done about it.

Free Resource

RavenDB vs MongoDB: Which is Better? This White Paper compares the two leading NoSQL Document Databases on 9 features to find out which is the best solution for your next project.  

One of our production servers crashed. They do that a lot because we made them do evil things in harsh environments so they would crash on our end and not when running on your systems. This one is an interesting crash.

We are running the system with auto dump, so a crash will trigger a mini dump that we can run. Here is the root cause of the error, and IOException here:

    SP               IP               Function
    00000024382FF0D0 00007FFD8313108A System_Private_CoreLib!System.IO.FileStream.WriteCore(Byte[], Int32, Int32)+0x3cf7ea
    00000024382FF140 00007FFD82D611A2 System_Private_CoreLib!System.IO.FileStream.FlushWriteBuffer(Boolean)+0x62
    00000024382FF180 00007FFD82D60FA1 System_Private_CoreLib!System.IO.FileStream.Dispose(Boolean)+0x41
    00000024382FF1D0 00007FFD82D60631 System_Private_CoreLib!System.IO.FileStream.Finalize()+0x11


And with this, we are almost done analyzing the problem, right? This is a crash because a finalizer has thrown, so we now know that we are leaking a file stream and that it throws on the finalizer. We are almost, but not quite, done. We are pretty careful about such things, and most of all of our usages are wrapped in a using statement.

The reason for the error was quite apparent when we checked: we ran out of disk space. But that is something that we extensively tested and we are supposed to work fine with that. What is going on?

No disk space is an interesting IO error, because it is:

  • transient, in the sense that the admin can clear some space, and does not expect you to crash.
  • persistent, in the sense that the application can’t really do something to recover from it.

As it turns out, we have been doing everything properly. Here is the sample code that will reproduce this error:

using (FileStream fs = new FileStream("/path/with/10/bytes/of/free/space", FileMode.CreateNew))
{
  for (int i = 0; i < 12; i++)
  {
       fs.WriteByte((byte)i);
  }
}

For the moment, ignore things like file metadata, etc. We are writing 12 bytes, and we have only place for 10. Note that calling WriteByte does not go to the disk, extend the file, etc. Instead, it will just write to the buffer and the buffer will be flushed when you call Flush or Dispose.

Pretty simple, right?

Now let us analyze what will happen in this case, shall we?

We call Dispose on the FileStream, and that ends up throwing, as expected. But let us dig a little further, shall we?

Here is the actual Dispose code:

public virtual void Close()
{
    Dispose(true);
    GC.SuppressFinalize(this);
}


This is in Stream.Dispose. Now, consider the case of a class that:

  • Implements a finalizer, such as FileStream.
  • Had a failure during the Dispose(true); call.
  • Will have the same failure the next time you call Dispose(), even if with Dispose(false);

In this case, the Dispose fails, but the SuppressFinalize is never called, so this will be called from the finalizer, which will also fail, bringing our entire system down.

The issue was reported here and I hope that they will be fixed by the time you read it.

Do you pay to use your database? What if your database paid you? Learn more with RavenDB.

Topics:
server ,code ,production ,crash ,root cause analysis ,database ,ravendb ,stress testing ,performance

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}