Note, this post was written by Federico. Where I had notes or stuff to extend, I explicitly marked it as such.
TLDR: Optimizing at this level is really hard. To achieve gains of 20%+ for Compare and from 200% to 6% in Copy (depending on the workload) we will need to dig very deep at the IL level.
Another area we looked deeply into is, how do we move and access the memory. This type of optimization work is especially relevant if you are using Voron to handle big workloads. With small databases the improvements can be felt, but where they shine is when dealing with multi-gigabyte databases or high-throughput key-value retrieves and put workloads (did anyone thought Bulk-Inserts?).
Using FreeDB as in this previous post we build an experiment which we could use to pinpoint the pain points making sure we could also repro it every single time (no indexes, no extra call around). Under the microscope 2 pain points were evident when dealing with memory. Comparing and moving memory around.
We usually compare memory straight from the mmaped unmanaged memory when looking for a particular document in Voron Trees; and to copy from and to Voron pages when storing and retrieving documents. These are very core operations for any storage engine, Voron is not an special case. Before we started the optimization effort we already had a pretty optimized routine.
What this method does is:
- If the memory blocks have zero size, there is no doubt they are equal.
- If the memory blocks are bigger than the size of a word (32 bits) we do a pre-compare over the aligned memory blocks (for performance) in order to rule out all the equals.
- As we cannot use words to calculate the output (handling the Endianness would cost us), we do a byte by byte comparison for the final check.
For our insert workload we were roughly in the 97.5 nanoseconds per memory compare in average. To put in context, if each assembler instruction could be executed in exactly 1 cycle (which usually is not true) then 3 instruction is an entire nanosecond, therefore our average instruction budget is 291 instructions. Remember this idealized processor, we will use this same comparison later for more complex analysis.
Memory compares can be of different sizes that is why controlling the environment is very important for this type of optimization work.
To deal with that and we were using many tricks from the optimization book. From ensuring that memory alignment is optimal to batch compares with bigger primitive sizes to pointer arithmetic. At first sight this one is the kind of method you won't optimize at all, it is pretty damn tight.
Ayende’s node – We have already done a optimization step on memory comparisons. We initially just shelled out to the native memcmp method, but the cost of doing a pinvoke call ended up being noticable, and we wrote the previously optimized version (and had several rounds of that) to alleviate that cost.
However, we took to the challenge because the payoff can be huge. For a very small bulk insert of 50,000 documents inserted in an empty database, we are talking about in the ballpark of 5 million compares (yeah you read it right). Even if we manage to squeeze 1% off, the sheer volume of calls will make it worthwhile. To achieve that we had to do the unthinkable, we had to resort to dig into the MSIL that method was generating. Armed with ILSpy we found out we may have a way to shave off some inefficiencies.
Here is the what this look like when we start actually putting analysis to action. You can see the method code (after decompilation, so we can be closer to the IL) as well as the issues that were discovered in the process.
Because of the size of the method the fastest way was to resort to use a C# decompile, even though we then matched it with the generated IL. The trick to use the C# decompiled version requires that we use a decompiler that is not too smart when dealing with the code. If the decompiler would have understood what was the original code intention and acted upon it, we would have never spotted some of the optimizations at this level. For example, the last loop decompiled with JetBrains dotPeek would look like this:
Always keep around an old version of a decompiler just in case you may need it .
Ayende’s note: In most cases, you can select the level of details that a decompiler can give you. With Reflector, for example, you can select how deeply it will decompile things, but even so, doing stupid decompilation can be very beneficial by showing us what is actually going on.
Understanding where the inefficiencies may be, is one thing, being able to prove them is another matter. And we will tackle all of them in future posts.
We will also leave the memcpy analysis for later because it builds on the optimizations used in memcmp and also requires a deep analysis of the Buffer.Memcpy method already available in the .Net Framework (for internal use of course).
If what we did to the poor Etags was evil. You are now arriving at the gates of the underworld.
Ayende’s note: This is a pretty complex topic, and it goes on for quite a while. In order to maintain interest, and to avoid having people getting lost in the details, I broke it apart for several posts. In the meantime, given the details in this post, how would you suggest improving this?