SIMD-Optimized C++ Code in Visual Studio 11
The C++ compiler in Visual Studio 11 has another neat optimization feature up its sleeve. Unlike intrusive features, such as running code on the GPU using the AMP extensions, this one requires no additional compilation switches and no changes – even the slightest – to the code.
The new compiler will use SIMD (Single Instruction Multiple Data) instructions from the SSE/SSE2 and AVX families to "parallelize" loops. This is not standard thread-level parallelism, which runs different iterations of the loop concurrently on separate threads; it is the processor's inherent ability to perform the same operation on several packed data elements with a single instruction.
The following trivial example illustrates the benefits of this optimization. Suppose you want to sum two vectors of floating-point numbers, element-by-element. The following C/C++ loop performs this task:
for (int i = 0; i < N; ++i)
C[i] = A[i] + B[i];
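For context, a complete, compilable program around that loop might look like the sketch below. The array size (1000 floats) and the initialization code are my own assumptions for illustration; they are consistent with the loop bound 0FA0h (4000 bytes) visible in the disassembly that follows, but they are not part of the original listing. Compile with optimizations enabled (/O2) to see the effect.

#include <cstdio>

int main()
{
    const int N = 1000;                  // assumed size, not from the original listing
    float A[N], B[N], C[N];              // locals, consistent with the [ebp+...] addressing below

    for (int i = 0; i < N; ++i)          // give the inputs some values
    {
        A[i] = static_cast<float>(i);
        B[i] = static_cast<float>(2 * i);
    }

    for (int i = 0; i < N; ++i)          // the loop under discussion
        C[i] = A[i] + B[i];

    std::printf("%f\n", C[N - 1]);       // use the result so the loop isn't optimized away
    return 0;
}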
The current VC++ compiler compiles this loop to the following 32-bit code with optimizations:
013E105A xor eax,eax
013E105C lea esp,[esp]
013E1060 fld dword ptr B[eax]
013E1067 add eax,28h
013E106A fadd dword ptr [ebp+eax-0FCCh]
013E1071 fstp dword ptr [ebp+eax-2F0Ch]
013E1078 fld dword ptr [ebp+eax-0FC8h]
013E107F fadd dword ptr [ebp+eax-1F68h]
013E1086 fstp dword ptr [ebp+eax-2F08h]
013E108D fld dword ptr [ebp+eax-0FC4h]
013E1094 fadd dword ptr [ebp+eax-1F64h]
013E109B fstp dword ptr [ebp+eax-2F04h]
013E10A2 fld dword ptr [ebp+eax-0FC0h]
013E10A9 fadd dword ptr [ebp+eax-1F60h]
013E10B0 fstp dword ptr [ebp+eax-2F00h]
013E10B7 fld dword ptr [ebp+eax-0FBCh]
013E10BE fadd dword ptr [ebp+eax-1F5Ch]
013E10C5 fstp dword ptr [ebp+eax-2EFCh]
013E10CC fld dword ptr [ebp+eax-0FB8h]
013E10D3 fadd dword ptr [ebp+eax-1F58h]
013E10DA fstp dword ptr [ebp+eax-2EF8h]
013E10E1 fld dword ptr [ebp+eax-0FB4h]
013E10E8 fadd dword ptr [ebp+eax-1F54h]
013E10EF fstp dword ptr [ebp+eax-2EF4h]
013E10F6 fld dword ptr [ebp+eax-0FB0h]
013E10FD fadd dword ptr [ebp+eax-1F50h]
013E1104 fstp dword ptr [ebp+eax-2EF0h]
013E110B fld dword ptr [ebp+eax-0FACh]
013E1112 fadd dword ptr [ebp+eax-1F4Ch]
013E1119 fstp dword ptr [ebp+eax-2EECh]
013E1120 fld dword ptr [ebp+eax-0FA8h]
013E1127 fadd dword ptr [ebp+eax-1F48h]
013E112E fstp dword ptr i[eax]
013E1135 cmp eax,0FA0h
013E113A jb wmain+60h (013E1060h)
Note the aggressive loop unrolling employed by the compiler: each iteration of this unrolled loop performs 10 element additions, one scalar fld/fadd/fstp sequence per element.
The new VC++ compiler compiles the loop to the following 32-bit code with optimizations:
00381041 xor eax,eax
00381043 jmp wmain+50h (0381050h)
00381045 lea esp,[esp]
0038104C lea esp,[esp]
00381050 movups xmm1,xmmword ptr B[eax]
00381058 movups xmm0,xmmword ptr A[eax]
00381060 add eax,10h
00381063 addps xmm1,xmm0
00381066 movups xmmword ptr [ebp+eax-2EF4h],xmm1
0038106E cmp eax,0FA0h
00381073 jb wmain+50h (0381050h)
This time, each iteration of the loop handles 4 elements, using the SIMD instructions MOVUPS and ADDPS. The first, MOVUPS, moves four packed single-precision floating-point values between memory and a register, in either direction. The second, ADDPS, adds the four packed single-precision values in one register to the corresponding values in another.
What's the performance difference? On my Intel i7-860 processor, there is exactly a 2x difference between the two compiler toolsets.*
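If you want to check the numbers on your own machine, one rough approach is to time many passes over the loop with std::chrono, as in the sketch below. The pass count, the feedback assignment that keeps the work from being collapsed, and the array size are all my own illustration rather than the author's benchmark harness.

#include <chrono>
#include <cstdio>

int main()
{
    const int N = 1000;
    static float A[N], B[N], C[N];       // statics keep the large arrays off the stack

    for (int i = 0; i < N; ++i)
    {
        A[i] = static_cast<float>(i);
        B[i] = 0.5f * static_cast<float>(i);
    }

    const int passes = 100000;           // assumed repetition count
    auto start = std::chrono::high_resolution_clock::now();
    for (int p = 0; p < passes; ++p)
    {
        for (int i = 0; i < N; ++i)      // the loop being measured
            C[i] = A[i] + B[i];
        A[0] = C[N - 1];                 // feed the output back so passes aren't collapsed
    }
    auto stop = std::chrono::high_resolution_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("%.1f ns per pass (check value: %f)\n", ns / passes, C[N - 1]);
    return 0;
}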
The loop above is a silly example, but it shows the potential of automatic vectorization. Until now, using SIMD instructions from C++ programs meant dropping down to low-level intrinsics such as _mm_add_ps and low-level types such as __m128. I'm willing to bet that most C++ developers have never considered using these intrinsics in their programs. That's why this is an important feature, even if it's just a tiny step in the right direction.
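For comparison, here is roughly what that manual, intrinsic-based approach looks like. The unaligned load/store intrinsics (_mm_loadu_ps, _mm_storeu_ps) are my choice because they mirror the MOVUPS instructions in the listing above, and the sketch assumes the element count is a multiple of 4; a real implementation would need a scalar tail loop for the remainder.

#include <xmmintrin.h>                   // SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

// Hand-written SSE version of C[i] = A[i] + B[i]; assumes n is a multiple of 4.
void add_vectors_sse(const float* A, const float* B, float* C, int n)
{
    for (int i = 0; i < n; i += 4)
    {
        __m128 a = _mm_loadu_ps(A + i);  // load 4 floats from A (unaligned, like MOVUPS)
        __m128 b = _mm_loadu_ps(B + i);  // load 4 floats from B
        __m128 c = _mm_add_ps(a, b);     // 4 additions with one instruction (ADDPS)
        _mm_storeu_ps(C + i, c);         // store 4 results to C
    }
}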
* It's worth mentioning that the VC++11 compiler can produce AVX instructions (operating on the 256-bit YMM registers), which should be even faster, but this is not the default. My first-generation i7 processor doesn't support them; feel free to check them out on a Sandy Bridge processor and let me know if it helps.
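For reference, the hand-written AVX equivalent operates on 8 floats at a time through the 256-bit __m256 type. This is my own sketch: it assumes an AVX-capable processor, an element count that is a multiple of 8, and (to the best of my knowledge) compilation with the /arch:AVX switch if you want VC++ itself to emit AVX code.

#include <immintrin.h>                   // AVX: __m256, _mm256_loadu_ps, _mm256_add_ps, _mm256_storeu_ps

// Hand-written AVX version of C[i] = A[i] + B[i]; assumes n is a multiple of 8
// and an AVX-capable CPU (Sandy Bridge or later).
void add_vectors_avx(const float* A, const float* B, float* C, int n)
{
    for (int i = 0; i < n; i += 8)
    {
        __m256 a = _mm256_loadu_ps(A + i);            // load 8 floats from A
        __m256 b = _mm256_loadu_ps(B + i);            // load 8 floats from B
        _mm256_storeu_ps(C + i, _mm256_add_ps(a, b)); // 8 additions, then store
    }
}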
Published at DZone with permission of Sasha Goldshtein, DZone MVB.