GoTo-Based Optimizations
Some of the changes we saw after optimizing our internal data structures in RavenDB and are pretty strange. This article will help you understand them.
Join the DZone community and get the full member experience.
Join For FreeOne of the things that we did recently was going over our internal data structures in RavenDB and seeing if we could optimize them. Some of those changes are pretty strange if you aren’t following what is actually going on. Here is an example:
Before:
After:
What is the point in this kind of change? Well, let us look at the actual assembly generated by this, shall we?
Before:
if (_size == 0)
00007FFB4DCA1FF0 push rsi
00007FFB4DCA1FF1 sub rsp,20h
00007FFB4DCA1FF5 mov esi,dword ptr [rcx+10h]
00007FFB4DCA1FF8 test esi,esi
00007FFB4DCA1FFA jne 00007FFB4DCA2052
{
ThrowForEmptyStack();
00007FFB4DCA1FFC mov rcx,7FFBACA881D0h
00007FFB4DCA2006 call 00007FFBAD9A79D0
00007FFB4DCA200B mov rsi,rax
00007FFB4DCA200E mov rcx,rsi
00007FFB4DCA2011 call 00007FFBAC7014C0
00007FFB4DCA2016 mov ecx,1
00007FFB4DCA201B mov rdx,7FFB4DB45010h
00007FFB4DCA2025 call 00007FFBAD76B420
00007FFB4DCA202A lea rcx,[rsi+20h]
00007FFB4DCA202E mov rdx,rax
00007FFB4DCA2031 call 00007FFBAD9A5B00
00007FFB4DCA2036 mov dword ptr [rsi+84h],80131501h
00007FFB4DCA2040 mov dword ptr [rsi+84h],80131509h
00007FFB4DCA204A mov rcx,rsi
00007FFB4DCA204D call 00007FFBAD76DC40
}
return _array[_size - 1];
00007FFB4DCA2052 mov rax,qword ptr [rcx+8]
00007FFB4DCA2056 lea edx,[rsi-1]
00007FFB4DCA2059 cmp edx,dword ptr [rax+8]
00007FFB4DCA205C jae 00007FFB4DCA206C
00007FFB4DCA205E movsxd rdx,edx
00007FFB4DCA2061 mov rax,qword ptr [rax+rdx*8+10h]
00007FFB4DCA2066 add rsp,20h
00007FFB4DCA206A pop rsi
00007FFB4DCA206B ret
After:
if (_size == 0)
00007FFB4DC91FF0 sub rsp,28h
00007FFB4DC91FF4 cmp dword ptr [rcx+10h],0
00007FFB4DC91FF8 je 00007FFB4DC92015
{
goto Error;
}
return _array[_size - 1];
00007FFB4DC91FFA mov rax,qword ptr [rcx+8]
00007FFB4DC91FFE mov ecx,dword ptr [rcx+10h]
00007FFB4DC92001 dec ecx
00007FFB4DC92003 cmp ecx,dword ptr [rax+8]
00007FFB4DC92006 jae 00007FFB4DC92026
00007FFB4DC92008 movsxd rcx,ecx
00007FFB4DC9200B mov rax,qword ptr [rax+rcx*8+10h]
00007FFB4DC92010 add rsp,28h
00007FFB4DC92014 ret
As you can see, the second option is much shorter, and in the common case, it involves no actual jumping. This ends up being extremely efficient. Note that because we return a value from the ThrowForEmptyStack
, the assembly generated is extremely short, since we can rely on the caller to clean us up.
This was run in release mode, CoreCLR, x64. I got the assembly from the debugger, so it is possible that there are some optimizations that haven't been applied because the debugger is attached, but it is fairly close to what should happen for real, I think. Note that the ThrowForEmptyStack
is inlined, even though it is an exception only method. If we use [MethodImpl(MethodImplOptions.NoInlining)]
, it will stop it, but the goto version will still generate better code.
The end result is that we are running much less code, and that makes me happy. In general, a good guide for assembly reading is that shorter == faster, and if you are reading assembly, you are very likely in optimization mode or debugging the compiler.
I’m pretty sure that the 2.0 release of CoreCLR already fixed this kind of issue, by the way, and it should allow us to write more idiomatic code that generates very tight machine code.
Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
Opinions expressed by DZone contributors are their own.
Comments