Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

GoTo-Based Optimizations

DZone's Guide to

GoTo-Based Optimizations

Some of the changes we saw after optimizing our internal data structures in RavenDB and are pretty strange. This article will help you understand them.

· Performance Zone
Free Resource

One of the things that we did recently was going over our internal data structures in RavenDB and seeing if we could optimize them. Some of those changes are pretty strange if you aren’t following what is actually going on. Here is an example:

Before:

Image title

After:

Image title

What is the point in this kind of change? Well, let us look at the actual assembly generated by this, shall we?

Before:

            if (_size == 0)
00007FFB4DCA1FF0  push        rsi  
00007FFB4DCA1FF1  sub         rsp,20h  
00007FFB4DCA1FF5  mov         esi,dword ptr [rcx+10h]  
00007FFB4DCA1FF8  test        esi,esi  
00007FFB4DCA1FFA  jne         00007FFB4DCA2052  
            {
                ThrowForEmptyStack();
00007FFB4DCA1FFC  mov         rcx,7FFBACA881D0h  
00007FFB4DCA2006  call        00007FFBAD9A79D0  
00007FFB4DCA200B  mov         rsi,rax  
00007FFB4DCA200E  mov         rcx,rsi  
00007FFB4DCA2011  call        00007FFBAC7014C0  
00007FFB4DCA2016  mov         ecx,1  
00007FFB4DCA201B  mov         rdx,7FFB4DB45010h  
00007FFB4DCA2025  call        00007FFBAD76B420  
00007FFB4DCA202A  lea         rcx,[rsi+20h]  
00007FFB4DCA202E  mov         rdx,rax  
00007FFB4DCA2031  call        00007FFBAD9A5B00  
00007FFB4DCA2036  mov         dword ptr [rsi+84h],80131501h  
00007FFB4DCA2040  mov         dword ptr [rsi+84h],80131509h  
00007FFB4DCA204A  mov         rcx,rsi  
00007FFB4DCA204D  call        00007FFBAD76DC40  
            }
            return _array[_size - 1];
00007FFB4DCA2052  mov         rax,qword ptr [rcx+8]  
00007FFB4DCA2056  lea         edx,[rsi-1]  
00007FFB4DCA2059  cmp         edx,dword ptr [rax+8]  
00007FFB4DCA205C  jae         00007FFB4DCA206C  
00007FFB4DCA205E  movsxd      rdx,edx  
00007FFB4DCA2061  mov         rax,qword ptr [rax+rdx*8+10h]  
00007FFB4DCA2066  add         rsp,20h  
00007FFB4DCA206A  pop         rsi  
00007FFB4DCA206B  ret  

After:

            if (_size == 0)
00007FFB4DC91FF0  sub         rsp,28h  
00007FFB4DC91FF4  cmp         dword ptr [rcx+10h],0  
00007FFB4DC91FF8  je          00007FFB4DC92015  
            {
                goto Error;
            }
            return _array[_size - 1];
00007FFB4DC91FFA  mov         rax,qword ptr [rcx+8]  
00007FFB4DC91FFE  mov         ecx,dword ptr [rcx+10h]  
00007FFB4DC92001  dec         ecx  
00007FFB4DC92003  cmp         ecx,dword ptr [rax+8]  
00007FFB4DC92006  jae         00007FFB4DC92026  
00007FFB4DC92008  movsxd      rcx,ecx  
00007FFB4DC9200B  mov         rax,qword ptr [rax+rcx*8+10h]  
00007FFB4DC92010  add         rsp,28h  
00007FFB4DC92014  ret  

As you can see, the second option is much shorter, and in the common case, it involves no actual jumping. This ends up being extremely efficient. Note that because we return a value from the ThrowForEmptyStack , the assembly generated is extremely short, since we can rely on the caller to clean us up.

This was run in release mode, CoreCLR, x64. I got the assembly from the debugger, so it is possible that there are some optimizations that haven't been applied because the debugger is attached, but it is fairly close to what should happen for real, I think. Note that the ThrowForEmptyStack is inlined, even though it is an exception only method. If we use [MethodImpl(MethodImplOptions.NoInlining)] , it will stop it, but the goto version will still generate better code.

The end result is that we are running much less code, and that makes me happy. In general, a good guide for assembly reading is that shorter == faster, and if you are reading assembly, you are very likely in optimization mode or debugging the compiler.

I’m pretty sure that the 2.0 release of CoreCLR already fixed this kind of issue, by the way, and it should allow us to write more idiomatic code that generates very tight machine code.

Topics:
performance ,ravendb ,optmization

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}