While benchmarking RavenDB, we have run into several instances where the entire machine would freeze for a long duration, resulting in utter nonresponsiveness.
This has been quite frustrating to us, since a frozen machine makes it rather hard to figure out what is going on. But we finally figured it out, and all the details are right here in the screenshot.
What you can see is us running our current benchmark, importing the entire StackOverflow dataset into RavenDB. Drive C is the system drive, and drive D is the data drive that we are using to test the performance.
Drive D is actually a throwaway SSD. That is, an SSD that we use purely for benchmarking and not for real work. Given the workout we give the drive, we expect it to die eventually, so we don’t want to trust it with any other data.
At any rate, you can see that due to a different issue entirely, we are now seeing data syncs in excess of 8.5 GB. Basically, we wrote 8.55 GB of data very quickly into a memory-mapped file and then called fsync. At the same time, we started increasing our scratch buffer usage, because calling fsync on 8.55 GB can take a while.

Scratch buffers are a really interesting thing; they were born because of Linux’s crazy OOM design and are basically a way for us to avoid paging. Instead of allocating memory on the heap as usual, which would subject us to paging, we allocate a file on disk (marked as temporary and delete-on-close) and then map that file into memory. That gives us a way to guarantee that Linux will always have somewhere to page out any of our memory.
This also has the advantage of making it very clear how much scratch memory we are currently using, and on Azure/AWS machines, it makes it easier to place all of those scratch files on the fast temp local storage for better performance.
So, we have a very large fsync going on, a large number of memory-mapped files, a lot of activity that modifies some of those files, and a lot of memory pressure.
That forces the kernel to evict some pages from memory to disk to free things up. Under normal conditions, it would do just that. But here we run into a wrinkle: the memory we want to evict belongs to a memory-mapped file, so the natural thing to do is write it back to its original file. This is actually what we expect the kernel to do for us. For scratch files this writeback is usually a waste, but for the data file it is exactly the behavior we want. But that is beside the point.
Look at the image above. We are supposed to be only using drive D, so why is C so busy? I’m not really sure, but I have a hypothesis.
Because we are currently running a very large fsync, I think that drive D is not processing any additional write requests. The “write a page to disk” operation has pretty strict runtime requirements; it can’t just wait for the I/O to return whenever that might be. Considering that you can open a memory-mapped file over a network drive, I think it is very reasonable for the kernel to have a timeout mechanism for this kind of I/O. When the pager sees that it can’t write to the original file fast enough, it shrugs and writes those pages to the local page file instead.
This turns an otherwise very bad situation (a very long wait, or a crash) into a manageable one. However, with the amount of work we put on the system, that fallback effectively forces us into heavy paging (on the order of GBs), which in turn leads to a machine that appears to be locked up because of all the paging. So the fallback error handling is actually causing this issue by trying to recover, at least that is what I think.
When examining this, I wondered whether it could be considered a DoS vulnerability, and after careful consideration, I don’t believe so. This issue involves using a lot of memory to cause enough paging to slow everything down. The fact that we get there in a somewhat novel way doesn’t expose anything that wasn’t possible already.