SIMD-Optimized C++ Code in Visual Studio 11

The C++ compiler in Visual Studio 11 has another neat optimization feature up its sleeve. Unlike intrusive features, such as running code on the GPU using the AMP extensions, this one requires no additional compilation switches and no changes – even the slightest – to the code.

The new compiler uses SIMD (Single Instruction, Multiple Data) instructions from the SSE/SSE2 and AVX families to "parallelize" loops. This is not the usual thread-level parallelism, in which different iterations of the loop run on different threads. It is the processor’s built-in ability to apply a single operation to several data elements, packed side by side in one wide register, at the same time.

The following trivial example illustrates the benefits of this optimization. Suppose you want to sum two vectors of floating-point numbers, element-by-element. The following C/C++ loop performs this task:

    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];

The current VC++ compiler compiles this loop to the following 32-bit code with optimizations:

013E105A  xor         eax,eax 
013E105C  lea         esp,[esp] 
013E1060  fld         dword ptr B[eax] 
013E1067  add         eax,28h 
013E106A  fadd        dword ptr [ebp+eax-0FCCh] 
013E1071  fstp        dword ptr [ebp+eax-2F0Ch] 
013E1078  fld         dword ptr [ebp+eax-0FC8h] 
013E107F  fadd        dword ptr [ebp+eax-1F68h] 
013E1086  fstp        dword ptr [ebp+eax-2F08h] 
013E108D  fld         dword ptr [ebp+eax-0FC4h] 
013E1094  fadd        dword ptr [ebp+eax-1F64h] 
013E109B  fstp        dword ptr [ebp+eax-2F04h] 
013E10A2  fld         dword ptr [ebp+eax-0FC0h] 
013E10A9  fadd        dword ptr [ebp+eax-1F60h] 
013E10B0  fstp        dword ptr [ebp+eax-2F00h] 
013E10B7  fld         dword ptr [ebp+eax-0FBCh] 
013E10BE  fadd        dword ptr [ebp+eax-1F5Ch] 
013E10C5  fstp        dword ptr [ebp+eax-2EFCh] 
013E10CC  fld         dword ptr [ebp+eax-0FB8h] 
013E10D3  fadd        dword ptr [ebp+eax-1F58h] 
013E10DA  fstp        dword ptr [ebp+eax-2EF8h] 
013E10E1  fld         dword ptr [ebp+eax-0FB4h] 
013E10E8  fadd        dword ptr [ebp+eax-1F54h] 
013E10EF  fstp        dword ptr [ebp+eax-2EF4h] 
013E10F6  fld         dword ptr [ebp+eax-0FB0h] 
013E10FD  fadd        dword ptr [ebp+eax-1F50h] 
013E1104  fstp        dword ptr [ebp+eax-2EF0h] 
013E110B  fld         dword ptr [ebp+eax-0FACh] 
013E1112  fadd        dword ptr [ebp+eax-1F4Ch] 
013E1119  fstp        dword ptr [ebp+eax-2EECh] 
013E1120  fld         dword ptr [ebp+eax-0FA8h] 
013E1127  fadd        dword ptr [ebp+eax-1F48h] 
013E112E  fstp        dword ptr i[eax] 
013E1135  cmp         eax,0FA0h 
013E113A  jb          wmain+60h (013E1060h) 

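Note the aggressive loop unrolling employed by the compiler. Each iteration of the compiled loop performs 10 scalar additions, each as an x87 FLD/FADD/FSTP sequence, and advances EAX by 28h (40 bytes, or ten 4-byte floats).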
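For readers who prefer source code to disassembly, here is a rough C++ sketch of what that unrolled scalar loop amounts to. The function name `add_unrolled10` and the remainder loop are assumptions added for illustration. The listing above needs no remainder handling, since its bound of 0FA0h bytes (1000 floats) is a multiple of 10.

```cpp
#include <cstddef>

// Hypothetical helper (not from the article): element-wise sum with the loop
// body unrolled by a factor of 10, mirroring the shape of the listing above.
void add_unrolled10(const float* a, const float* b, float* c, std::size_t n)
{
    std::size_t i = 0;

    // Main loop: ten independent scalar additions per trip.
    for (; i + 10 <= n; i += 10) {
        c[i + 0] = a[i + 0] + b[i + 0];
        c[i + 1] = a[i + 1] + b[i + 1];
        c[i + 2] = a[i + 2] + b[i + 2];
        c[i + 3] = a[i + 3] + b[i + 3];
        c[i + 4] = a[i + 4] + b[i + 4];
        c[i + 5] = a[i + 5] + b[i + 5];
        c[i + 6] = a[i + 6] + b[i + 6];
        c[i + 7] = a[i + 7] + b[i + 7];
        c[i + 8] = a[i + 8] + b[i + 8];
        c[i + 9] = a[i + 9] + b[i + 9];
    }

    // Remainder loop: handles the last n % 10 elements, if any.
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Unrolling reduces the loop overhead (one compare and branch per ten additions), but every addition is still a separate scalar instruction.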

The new VC++ compiler compiles the loop to the following 32-bit code with optimizations:

00381041  xor         eax,eax 
00381043  jmp         wmain+50h (0381050h) 
00381045  lea         esp,[esp] 
0038104C  lea         esp,[esp] 
00381050  movups      xmm1,xmmword ptr B[eax] 
00381058  movups      xmm0,xmmword ptr A[eax] 
00381060  add         eax,10h 
00381063  addps       xmm1,xmm0 
00381066  movups      xmmword ptr [ebp+eax-2EF4h],xmm1 
0038106E  cmp         eax,0FA0h 
00381073  jb          wmain+50h (0381050h) 

This time, each iteration of the loop performs 4 additions at once, using the SSE instructions MOVUPS and ADDPS. MOVUPS copies four packed single-precision floating-point values between memory and an XMM register, in either direction, without requiring the memory address to be aligned. ADDPS adds the four single-precision values packed in one XMM register to the corresponding four values in another.
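To make the mapping concrete, here is a minimal sketch of how the same loop could be written by hand with SSE intrinsics from `<xmmintrin.h>`. The function name `add_sse` and the scalar tail loop are assumptions for illustration. The compiler is free to choose different instructions for these intrinsics, so the generated code will not necessarily match the listing above.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hypothetical helper (not from the article): element-wise sum of two float
// arrays, processing four elements per iteration with SSE intrinsics.
void add_sse(const float* a, const float* b, float* c, std::size_t n)
{
    std::size_t i = 0;

    // Vector loop: four floats per trip.
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);   // unaligned load of 4 floats (MOVUPS)
        __m128 va = _mm_loadu_ps(a + i);   // unaligned load of 4 floats (MOVUPS)
        __m128 vc = _mm_add_ps(vb, va);    // 4 additions in one instruction (ADDPS)
        _mm_storeu_ps(c + i, vc);          // unaligned store of 4 floats (MOVUPS)
    }

    // Scalar tail: handles the last n % 4 elements, if any.
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Calling `add_sse(A, B, C, N)` gives the same results as the plain loop. The unaligned load and store variants are used here because nothing is assumed about the alignment of the arrays.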

What's the performance difference? On my Intel i7-860 processor, the loop built with the new toolset runs exactly 2x faster than the same loop built with the current toolset.*
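If you want to try a comparison like this yourself, a timing harness along the following lines is one possible starting point. It is only a sketch, not the measurement code behind the number above. The names `BenchResult` and `bench_add` and the repeat count are hypothetical. Build the same source with each toolset, with optimizations enabled, and compare the reported times.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical result type: elapsed time plus a checksum of the output.
struct BenchResult {
    double milliseconds;
    float  checksum;
};

// Hypothetical harness: runs the element-wise sum `repeats` times over
// arrays of `n` floats and reports the elapsed wall-clock time.
BenchResult bench_add(std::size_t n, int repeats)
{
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        for (std::size_t i = 0; i < n; ++i)
            c[i] = a[i] + b[i];
    auto stop = std::chrono::steady_clock::now();

    // Read the output so the compiler cannot discard the loop entirely.
    float checksum = 0.0f;
    for (float x : c)
        checksum += x;

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return { elapsed.count(), checksum };
}
```

An optimizer may still notice that every repetition computes the same values, so treat the numbers from a micro-benchmark like this with care and check the generated code before drawing conclusions.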

The loop above is a silly example, but it shows the potential of automatic optimization. Up to now, using SIMD instructions from C++ programs meant dropping to low-level intrinsics such as _mm_add_ps and low-level types such as __m128. I’m willing to bet that most C++ developers have never considered using these intrinsics in their programs. That’s why this is an important feature, even if it is just a small step in the right direction.

* It’s worth mentioning that the VC++11 compiler can also produce AVX instructions (operating on the 256-bit YMM registers), which should be even faster, but this is not the default. My first-generation i7 processor doesn’t support them – feel free to try them on a Sandy Bridge processor and let me know if it helps.


Published at DZone with permission of
