The Randomly Failing Test
The format of the test was flawless, but once in a blue moon, we'd have a non-reproducible test failure. Why? The reason was pretty obvious once we looked at it.
Join the DZone community and get the full member experience.Join For Free
We made low-level change in how RavenDB is writing to the journal. This was verified by multiple code reviews and a whole battery of tests and production abuse. And yet, once in a blue moon, we’ll have a test failure. It's utterly nonreproducible, and it's only happening once every week or two (out of hundreds or thousands of test runs). That was worrying because this test was checking the behavior of RavenDB when it crashed midway through a transaction, which is kind of an important metric for us.
It took a long while to finally figure out what was going on there. The first thing that we ruled out was non-reproducibility because of threading. This test was single-threaded, and nothing could inject anything to the code.
The format of the test was something like this:
- Write 1,000 random fixed-size values to the database.
- Close the database.
- Corrupt the last page of the page journal.
- Start the database again and note that all the values in the last transaction are not in the database.
So far, awesome. So why would it fail?
The underlying reason was an obvious one, once we looked at it. The only thing that differs from test to test is the random call. But we are using fixed size buffers to write, so that shouldn’t change anything. The data itself is meaningless.
As it turned out, the data was not quite meaningless. As part of the commit process, we compress the data before we write it to the journal. As it turns out, different patterns of random buffers have different compression characteristics. In other words, a buffer of 100 random bytes may compress to 90 bytes or 102 bytes. And that mattered. If the test got enough random inputs to create a new journal file, we will still corrupt the last page on that journal, but since we already are on a new journal, that last page hasn’t been used yet, and the transaction wouldn’t become corrupt and we would have the data still in the database, effectively failing the test.
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.