SIMD-Optimized C++ Code in Visual Studio 11

By Sasha Goldshtein · Dec. 22, 2012

The C++ compiler in Visual Studio 11 has another neat optimization feature up its sleeve. Unlike intrusive features, such as running code on the GPU using the AMP extensions, this one requires no additional compilation switches and no changes – even the slightest – to the code.

The new compiler will use SIMD (Single Instruction, Multiple Data) instructions from the SSE/SSE2 and AVX families to "parallelize" loops. This is not standard thread-level parallelism, which runs different iterations of the loop concurrently on separate threads; it is the processor's inherent ability to perform the same operation on multiple data elements packed into a single wide register.

The following trivial example illustrates the benefits of this optimization. Suppose you want to sum two vectors of floating-point numbers, element-by-element. The following C/C++ loop performs this task:

    for (int i = 0; i < N; ++i)
        C[i] = A[i] + B[i];
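
For context, a minimal complete program around this loop might look roughly like the sketch below. This is not the article's actual benchmark code; the array size N = 1000 is an assumption, chosen to be consistent with the 0FA0h (4000-byte) loop bound visible in the disassembly that follows, and main is used here even though the listings label the entry point wmain.

    #include <cstdio>

    int main()
    {
        const int N = 1000;            // assumed size; matches the 0FA0h (4000-byte) bound below
        float A[N], B[N], C[N];        // local arrays, consistent with the ebp-relative addressing below

        // Initialize the inputs so the optimizer has real work to do.
        for (int i = 0; i < N; ++i)
        {
            A[i] = static_cast<float>(i);
            B[i] = static_cast<float>(2 * i);
        }

        // The loop under discussion: element-wise vector addition.
        for (int i = 0; i < N; ++i)
            C[i] = A[i] + B[i];

        // Use the result so the computation is not optimized away.
        printf("%f\n", C[N - 1]);
        return 0;
    }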

The current VC++ compiler compiles this loop to the following 32-bit code with optimizations:

013E105A  xor         eax,eax 
013E105C  lea         esp,[esp] 
013E1060  fld         dword ptr B[eax] 
013E1067  add         eax,28h 
013E106A  fadd        dword ptr [ebp+eax-0FCCh] 
013E1071  fstp        dword ptr [ebp+eax-2F0Ch] 
013E1078  fld         dword ptr [ebp+eax-0FC8h] 
013E107F  fadd        dword ptr [ebp+eax-1F68h] 
013E1086  fstp        dword ptr [ebp+eax-2F08h] 
013E108D  fld         dword ptr [ebp+eax-0FC4h] 
013E1094  fadd        dword ptr [ebp+eax-1F64h] 
013E109B  fstp        dword ptr [ebp+eax-2F04h] 
013E10A2  fld         dword ptr [ebp+eax-0FC0h] 
013E10A9  fadd        dword ptr [ebp+eax-1F60h] 
013E10B0  fstp        dword ptr [ebp+eax-2F00h] 
013E10B7  fld         dword ptr [ebp+eax-0FBCh] 
013E10BE  fadd        dword ptr [ebp+eax-1F5Ch] 
013E10C5  fstp        dword ptr [ebp+eax-2EFCh] 
013E10CC  fld         dword ptr [ebp+eax-0FB8h] 
013E10D3  fadd        dword ptr [ebp+eax-1F58h] 
013E10DA  fstp        dword ptr [ebp+eax-2EF8h] 
013E10E1  fld         dword ptr [ebp+eax-0FB4h] 
013E10E8  fadd        dword ptr [ebp+eax-1F54h] 
013E10EF  fstp        dword ptr [ebp+eax-2EF4h] 
013E10F6  fld         dword ptr [ebp+eax-0FB0h] 
013E10FD  fadd        dword ptr [ebp+eax-1F50h] 
013E1104  fstp        dword ptr [ebp+eax-2EF0h] 
013E110B  fld         dword ptr [ebp+eax-0FACh] 
013E1112  fadd        dword ptr [ebp+eax-1F4Ch] 
013E1119  fstp        dword ptr [ebp+eax-2EECh] 
013E1120  fld         dword ptr [ebp+eax-0FA8h] 
013E1127  fadd        dword ptr [ebp+eax-1F48h] 
013E112E  fstp        dword ptr i[eax] 
013E1135  cmp         eax,0FA0h 
013E113A  jb          wmain+60h (013E1060h) 

Note the aggressive loop unrolling employed by the compiler: each iteration of this unrolled loop processes ten elements (the add eax,28h advances the index by 40 bytes, or ten floats).
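
In source terms, the unrolling is roughly equivalent to the following sketch (an illustration of the transformation, not output from the compiler):

    // Ten scalar additions per trip through the loop; the index advances
    // by ten elements at a time, matching the add eax,28h (40 bytes) above.
    for (int i = 0; i < N; i += 10)
    {
        C[i]     = A[i]     + B[i];
        C[i + 1] = A[i + 1] + B[i + 1];
        C[i + 2] = A[i + 2] + B[i + 2];
        // ... and so on, up to ...
        C[i + 9] = A[i + 9] + B[i + 9];
    }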

The new VC++ compiler compiles the loop to the following 32-bit code with optimizations:

00381041  xor         eax,eax 
00381043  jmp         wmain+50h (0381050h) 
00381045  lea         esp,[esp] 
0038104C  lea         esp,[esp] 
00381050  movups      xmm1,xmmword ptr B[eax] 
00381058  movups      xmm0,xmmword ptr A[eax] 
00381060  add         eax,10h 
00381063  addps       xmm1,xmm0 
00381066  movups      xmmword ptr [ebp+eax-2EF4h],xmm1 
0038106E  cmp         eax,0FA0h 
00381073  jb          wmain+50h (0381050h) 

This time, each iteration of the loop processes four elements at once (the add eax,10h advances the index by 16 bytes, or four floats), using the SIMD instructions MOVUPS and ADDPS. The first, MOVUPS, moves four packed single-precision floating-point values between memory and an XMM register (in either direction) without requiring alignment. The second, ADDPS, adds four packed single-precision values held in two XMM registers.

What's the performance difference? On my Intel i7-860 processor, there is exactly a 2x difference between the two compiler toolsets.*
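
If you want to reproduce the measurement, a simple harness along the following lines will do. This is a sketch rather than the article's benchmark code; it repeats the loop many times because a single pass over 1,000 elements is far too short to time reliably, and a real benchmark should also consume C so the work cannot be optimized away.

    #include <chrono>

    // Time many repetitions of the element-wise addition loop, in milliseconds.
    double time_vector_add(const float* A, const float* B, float* C, int N, int repetitions)
    {
        auto start = std::chrono::high_resolution_clock::now();
        for (int r = 0; r < repetitions; ++r)
        {
            for (int i = 0; i < N; ++i)
                C[i] = A[i] + B[i];
        }
        auto end = std::chrono::high_resolution_clock::now();
        return std::chrono::duration<double, std::milli>(end - start).count();
    }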

The loop above is a silly example, but it shows the potential of automatic vectorization. Until now, using SIMD instructions from C++ programs has meant dropping down to low-level intrinsics such as _mm_add_ps and low-level types such as __m128. I’m willing to bet that most C++ developers have never considered using these intrinsics in their programs. That’s why this is an important feature, and just a tiny step in the right direction.
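
For comparison, here is roughly what the same loop looks like when written by hand with those intrinsics (a sketch, not code from the article; it assumes N is a multiple of four):

    #include <xmmintrin.h>   // SSE intrinsics: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps

    // Manual SSE version of C[i] = A[i] + B[i], four elements per iteration.
    void add_sse(const float* A, const float* B, float* C, int N)
    {
        for (int i = 0; i < N; i += 4)
        {
            __m128 a = _mm_loadu_ps(&A[i]);            // unaligned load of 4 floats (MOVUPS)
            __m128 b = _mm_loadu_ps(&B[i]);
            _mm_storeu_ps(&C[i], _mm_add_ps(a, b));    // packed add (ADDPS), then unaligned store (MOVUPS)
        }
    }

This is essentially what the new compiler now emits for you, without the readability and portability costs of hand-written intrinsics.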


* It’s worth mentioning that the VC++11 compiler can produce AVX instructions (operating on 256-bit YMM registers), which should be even faster, but this is not the default. My first-generation i7 processor doesn’t support them; feel free to check them out on a Sandy Bridge processor and let me know if it helps.
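
For completeness, a hand-written AVX equivalent would look something like the sketch below. It assumes an AVX-capable processor and that N is a multiple of eight; under the VC++ compiler, AVX code generation is enabled with the /arch:AVX switch.

    #include <immintrin.h>   // AVX intrinsics: __m256, _mm256_loadu_ps, _mm256_add_ps, _mm256_storeu_ps

    // Manual AVX version: eight floats per iteration in a 256-bit YMM register.
    void add_avx(const float* A, const float* B, float* C, int N)
    {
        for (int i = 0; i < N; i += 8)
        {
            __m256 a = _mm256_loadu_ps(&A[i]);
            __m256 b = _mm256_loadu_ps(&B[i]);
            _mm256_storeu_ps(&C[i], _mm256_add_ps(a, b));
        }
    }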

Published at DZone with permission of Sasha Goldshtein, DZone MVB. See the original article here.
