Excerpts from the RavenDB Performance Team Report: Optimizing Memory Comparisons, Size Does Matter
note: this post was written by federico. in the previous post, after inspecting the decompiled source with ilspy, we were able to uncover several potential optimizations. in this fragment we have a pretty optimized method that compares 4 bytes per loop iteration. what if we could do that with 8 bytes?
to achieve that we will use a ulong instead of a uint. this type of optimization makes sense for 2 reasons.
first, most of our users are already running ravendb on x64, where the native word is 8 bytes, and voron is compiled for x64 only. second, even if that were not true, since the late 2000s most cpus have a 64-byte l1 cache line with half a cycle cost for a hit. so even if you can't handle 64 bits in one go and the jit or the processor has to issue 2 instructions, you are still getting an l1 cache hit and no pipeline stall, which is great.
so without further ado, this is the resulting code:
ayende’s note: in the code, the lp += (intptr)8/8; is actually written as lp += 1; in the source. what is really happening is that we are advancing by 8 bytes (the size of a ulong), and this is how ilspy decided to represent that, for some reason.
the actual il generated for this is good:
it is just that the translation here is kind of strange.
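the pointer arithmetic behind that note can be checked in isolation. the following is a minimal sketch in c rather than c# (typed pointers step the same way in both): adding 1 to a pointer to a 64-bit unsigned integer moves it forward by the size of that integer. the function name bytes_advanced_by_one_step is hypothetical and exists only for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* hypothetical helper, for illustration only: returns how many bytes
   a uint64_t pointer moves when it is incremented by 1. */
size_t bytes_advanced_by_one_step(void)
{
    uint64_t buffer[2] = {0, 0};
    const uint64_t *lp = buffer;
    const uint64_t *next = lp + 1; /* the equivalent of lp += 1 */

    /* measure the distance in bytes rather than in elements */
    return (size_t)((const unsigned char *)next - (const unsigned char *)lp);
}
```

on mainstream platforms with 8-bit bytes the function returns 8, which is the "8 bytes per step" that the decompiler spelled as (intptr)8/8.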
therefore the question to ask here is: will skipping over the equal parts of the memory block at a faster rate compensate for the cost of doing the final check with 8 bytes instead of 4 bytes?
well, the answer is a resounding yes. it won’t have much impact in the first 32 bytes (around 3% or less). we won’t lose, but we won’t win much either. but after that it skyrockets.
// bandwidth optimization kicks in
size: 32 original: 535 optimized: 442 gain: 5.01%
size: 64 original: 607 optimized: 493 gain: 7.08%
size: 128 original: 752 optimized: 573 gain: 11.77%
size: 256 original: 1,080 optimized: 695 gain: 35.69%
size: 512 original: 1,837 optimized: 943 gain: 74.40%
size: 1,024 original: 3,200 optimized: 1,317 gain: 122.25%
size: 2,048 original: 5,135 optimized: 2,110 gain: 123.13%
size: 4,096 original: 8,753 optimized: 3,690 gain: 117.29%
those are real measurements. you can see that when the bandwidth optimization kicks in, the gains start to get really high. this means that changing the bandwidth size alone from 4 bytes to 8 bytes got us a gain that stabilizes around 120%, that is, more than twice as fast on large blocks.
not bad for 2 lines of work.
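to make the shape of the change concrete, here is a minimal sketch in c. it is not the actual ravendb code, and it is written in c rather than c# so that it compiles standalone. it assumes memcmp-style semantics, where the sign of the result reflects the first differing byte. the names compare_blocks and load_u64 are hypothetical. the main loop skips equal data 8 bytes at a time, and the final check falls back to single bytes once a difference, or the tail of the block, is reached.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* hypothetical helper: read 8 bytes without assuming the address is aligned */
static uint64_t load_u64(const unsigned char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* hypothetical function: returns <0, 0 or >0, like memcmp */
int compare_blocks(const void *a, const void *b, size_t n)
{
    const unsigned char *start = a;
    const unsigned char *lp = a;
    const unsigned char *rp = b;
    size_t words = n / 8;

    /* fast path: skip over equal data 8 bytes per iteration */
    for (size_t i = 0; i < words; i++) {
        if (load_u64(lp) != load_u64(rp))
            break; /* the difference is somewhere inside these 8 bytes */
        lp += 8;
        rp += 8;
    }

    /* final check: walk the differing word (if any) and the tail byte by byte */
    size_t remaining = n - (size_t)(lp - start);
    for (; remaining > 0; remaining--, lp++, rp++) {
        if (*lp != *rp)
            return (int)*lp - (int)*rp;
    }
    return 0;
}
```

in this sketch, the difference between a 4-byte and an 8-byte version is confined to the width of the load and the stride of the loop. the byte-wise final check is the part that gets slightly more expensive, since it may now have to walk up to 8 bytes instead of 4.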
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.