Over a million developers have joined DZone.

Seek and You Shall Find, or Maybe Delay a Lot

· Performance Zone

Download Forrester’s “Vendor Landscape, Application Performance Management” report that examines the evolving role of APM as a key driver of customer satisfaction and business success, brought to you in partnership with BMC.

I have been doing a lot more analysis of the actual disk access patterns that we see in Voron. In the beginning, I strongly suspected that the problem was with how I was using memory-mapped files. In particular, some experiments with using my better knowledge of the environment has led to substantial performance gains. Instead of calling FlushViewOfFile(0, fileLen), I was able to call FlushViewOfFile on just the ranges that I knew changed.

That helped, but it wasn’t nearly enough. So I ran some quick tests using file stream, and realized that the fault was obviously with the broken way Windows is using memory-mapped files. So I decided (no, I didn’t write my own mmap impl, thank you very much) to take manual care of how Voron is writing to disk. I used WriteFile with all the bells and whistles, even had async I/O for a while there. I was able to directly pinpoint the exact locations where we needed to write, pass that to Windows in an efficient manner, and be done with it. It required writing a malloc implementation in managed code, but it was quite efficient-looking code.

And then I ran it. Just to give you some perspective, the scenario under question here is 10,000 transactions doing 100 random writes each. Using the memory map approach, after the FlushViewOfFile range optimization, I got roughly 7,000 operations per second. Using my own hand-written, optimized I/O, I got … 262 operations per second. OK … so maybe I need to go back to the drawing board.

I sat down and started working on figuring out what is actually going on. I looked at the actual output that we had, in particular, how many writes did we have per transaction? I sat down to analyze what was going on. We were writing 100 records with a 16-byte key and 100-byte value. That means that the absolute minimum amount we could write was 11,600 bytes. However, we were writing in 4 KB pages, which bring us to three pages and 12,288 bytes per transaction. Of course, this ignores things like the writing of branch pages in the B+Tree, so let's see what the real numbers are.

 Flush     1 with  12 pages   - 48 kb writes and 1  seeks   (11 leaves, 1 branches, 0 overflows)
 Flush     2 with  13 pages   - 52 kb writes and 1  seeks   (12 leaves, 1 branches, 0 overflows)
 Flush     3 with  21 pages   - 84 kb writes and 1  seeks   (20 leaves, 1 branches, 0 overflows)
  
 Flush    27 with  76 pages   - 304 kb writes and 1 seeks  (75 leaves,  1 branches, 0 overflows)
 Flush    28 with  73 pages   - 292 kb writes and 1 seeks  (72 leaves,  1 branches, 0 overflows)
 Flush    29 with  84 pages   - 336 kb writes and 1 seeks  (80 leaves,  4 branches, 0 overflows)
  
 Flush 1,153 with 158 pages - 632 kb writes and 67  seeks (107 leaves, 51 branches, 0 overflows)
 Flush 1,154 with 168 pages - 672 kb writes and 65  seeks (113 leaves, 55 branches, 0 overflows)
 Flush 1,155 with 165 pages - 660 kb writes and 76  seeks (113 leaves, 52 branches, 0 overflows)
  
 Flush 4,441 with 199 pages - 796 kb writes and 146 seeks (111 leaves, 88 branches, 0 overflows)
 Flush 4,442 with 198 pages - 792 kb writes and 133 seeks (113 leaves, 85 branches, 0 overflows)
 Flush 4,443 with 196 pages - 784 kb writes and 146 seeks (109 leaves, 87 branches, 0 overflows)
  
Flush 7,707 with 209 pages - 836 kb writes and 170 seeks (111 leaves, 98 branches, 0 overflows)
 Flush 7,708 with 217 pages - 868 kb writes and 169 seeks (119 leaves, 98 branches, 0 overflows)
 Flush 7,709 with 197 pages - 788 kb writes and 162 seeks (108 leaves, 89 branches, 0 overflows)
  
 Flush 9,069 with 204 pages - 816 kb writes and 170 seeks (108 leaves, 96 branches, 0 overflows)
 Flush 9,070 with 206 pages - 824 kb writes and 166 seeks (112 leaves, 94 branches, 0 overflows)
 Flush 9,071 with 203 pages - 812 kb writes and 169 seeks (105 leaves, 98 branches, 0 overflows)

The very first transactions were already showing something very interesting, we were actually writing 12 - 21 pages, or 48  - 84 KB of data, instead of 12 KB. Why did we write four times as much data as we wanted?

The answer is that we were writing data that is random in nature, so it can’t all sit in the same page; we get a lot of page splits and very quickly we end up with a lot of pages. This is pretty much inevitable, since this is how trees work. But look at what happens down the road. In particular, look at lines nine and 10. You can see that we were at that point at a pretty stable state. We are writing ~160 pages per transaction. And since we write random data, we tend to touch about 100 leaf pages per transaction (the stuff after 100 is usually page splits). But something that is much more interesting can be seen in the count of seeks.

The way LMDB works, we use copy-on-write, so whenever we modify a page, we are actually modifying a copy and marking the actual page as available for future transactions to reuse. This has a great advantage in that we don’t ever need to do compaction, but it also means that when we do want to do writes, we have to make them pretty much all over the place. And it actually gets worse as more times goes by.

Now, you have to realize that this is pretty much the worst case scenario, a transaction that does a lot of writes all over the place. But it means that toward the end, we are really hurting.

You already know the numbers, right? What about just straight-out writing one MB? What is the cost of seeks? I wrote the following code:

 var buffer = new byte[1024*1024];
 var random = new Random();
 random.NextBytes(buffer);
 if(File.Exists("test.bin"))
     File.Delete("test.bin");
 using (var fs = new FileStream("test.bin", FileMode.CreateNew, FileAccess.ReadWrite))
 {
     fs.SetLength(1024 * 1024 * 768);
     // warm up
     for (int i = 0; i < 200; i++)
     {
         fs.Position = random.Next(0, (int)fs.Length);
         fs.Write(buffer,0, random.Next(0, 1024));
     }
  
     var sp = Stopwatch.StartNew();
     for (int i = 0; i < 200; i++)
     {
           fs.Position = random.Next(0, (int)fs.Length);
           fs.WriteByte(1);
     }
     fs.Flush(true);
     sp.Stop();
     Console.WriteLine("200 seeks & 200 bytes {0:#,#} ms", sp.ElapsedMilliseconds);
  
     sp = Stopwatch.StartNew();
     fs.Position = random.Next(0, (int)fs.Length);
     fs.Write(buffer, 0, buffer.Length);
     fs.Flush(true);
     sp.Stop();
     Console.WriteLine("1 MB write {0:#,#} ms", sp.ElapsedMilliseconds);
 }

The results are quite interesting:

 200 seeks & 200 bytes 146 ms
 1 MB write 6 ms

Just to note, this is when running on an SSD, the numbers are supposed to be a lot worse when running on HDD.

In other words, it ain’t the size of the write, but how you spread it around that really matters. Just to compare, here are the numbers for when we are doing sequential writes:

Flush      1 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
 Flush      2 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
 Flush      3 with   6 pages -  24 kb writes and   1 seeks (  5 leaves,   1 branches,   0 overflows)
  
 Flush    159 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
 Flush    160 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
 Flush    161 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
 Flush    162 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
  
 Flush  1,320 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
 Flush  1,321 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
 Flush  1,322 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
  
 Flush  4,316 with   7 pages -  28 kb writes and   3 seeks (  5 leaves,   2 branches,   0 overflows)
 Flush  4,317 with   7 pages -  28 kb writes and   2 seeks (  5 leaves,   2 branches,   0 overflows)
 Flush  4,318 with   8 pages -  32 kb writes and   3 seeks (  6 leaves,   2 branches,   0 overflows)
  
 Flush  7,409 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
 Flush  7,410 with   9 pages -  36 kb writes and   2 seeks (  6 leaves,   3 branches,   0 overflows)
 Flush  7,411 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)
  
 Flush  9,990 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
 Flush  9,991 with   9 pages -  36 kb writes and   4 seeks (  6 leaves,   3 branches,   0 overflows)
 Flush  9,992 with   8 pages -  32 kb writes and   3 seeks (  5 leaves,   3 branches,   0 overflows)
 Flush  9,993 with   8 pages -  32 kb writes and   4 seeks (  5 leaves,   3 branches,   0 overflows)

Because we are always writing at the end, we only need to touch a few pages. Note that even with the small number of pages, we still need to do quite a bit of seeks, relative to the number of pages we write.

I have some ideas about this, but they are still unformed. Mostly we need to balance between the amount of free space that is used to the number of seeks allowed. I think we can get there by being smart about tracking the number of pages modified by a transaction, and waiting until free space becomes available in sufficient numbers to be relevant. Something like that would allow auto-tuning of the amount of garbage we accumulate versus the number of seeks we require.

See Forrester’s Report, “Vendor Landscape, Application Performance Management” to identify the right vendor to help IT deliver better service at a lower cost, brought to you in partnership with BMC.

Topics:

Published at DZone with permission of Ayende Rahien, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

SEE AN EXAMPLE
Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.
Subscribe

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}