Performance Gains: Using SIMD Instructions from C#

DZone 's Guide to

Performance Gains: Using SIMD Instructions from C#

· Performance Zone ·
Free Resource

Theoretically, .NET developers should never be concerned with optimizations tailored to a specific processor or instruction set. After all, the purpose of IL and JIT compilation is to allow managed applications to run on any hardware that has the .NET Framework installed, and to remain indifferent to operating system bitness, processor features, and instruction sets. However, squeezing the last bits of performance from managed applications may require reasoning at the assembly language level. At other times, understanding processor-specific features is a first step for even more significant performance gains.

Data-level parallelism, also known as Single Instruction Multiple Data (SIMD), is a feature of modern processors that enables the execution of a single instruction on a large set of data (larger than the machine word). The de-facto standard for SIMD instruction sets is SSE (Streaming SIMD Extensions), used by Intel processors since Pentium III. This instruction set adds new 128-bit registers (with the XMM prefix) as well as instructions that can operate on them. Recent Intel processors introduced Advanced Vector Extensions (AVX), which is an extension of SSE that offers 256-bit registers and even more SIMD instructions. Some examples of SSE instructions include:

  • Integer and floating-point arithmetic
  • Comparisons, shuffling, data type conversion (integer to floating-point)
  • Bitwise operations
  • Minimum, maximum, conditional copies, CRC32, population count (introduced in SSE4 and later)

You might be wondering whether instructions operating on these “new” registers are slower than their standard counterparts. If that were the case, any performance gains would be deceiving. Fortunately, that is not the case. On Intel i7 processors, a floating-point addition (FADD) instruction on 32-bit registers has throughput of one instruction per cycle and latency of 3 cycles. The equivalent ADDPS instruction on 128-bit registers also has throughput of one instruction per cycle and latency of 3 cycles.

Using these instructions in high-performance loops can provide up to 8-fold performance gains compared to naïve sequential programs that operate on a single floating-point or integer value at a time. For example, consider the following (admittedly trivial) code:

//Assume that A, B, C are equal-size float arrays
for (int i = 0; i < A.length; ++i) {
  C[i] = A[i] + B[i];

The standard code emitted by the JIT in this scenario is the following:

; ESI has A, EDI has B, ECX has C, EDX has i
xor edx,edx
cmp dword ptr [esi+4],0
fld dword ptr [esi+edx*4+8] ; load A[i], no range check
cmp edx,dword ptr [edi+4]   ; range check accessing B[i]
fadd dword ptr [edi+edx*4+8]; add B[i]
cmp edx,dword ptr [ecx+4]   ; range check accessing C[i]
fstp dword ptr [ecx+edx*4+8]; store into C[i]
inc edx
cmp dword ptr [esi+4],edx   ; are we done yet?

Each loop iteration performs a single FADD instruction that adds two 32-bit floating-point numbers. However, by using 128-bit SSE instructions, four iterations of the loop can be issued at a time, as follows (the code below performs no range checks and assumes that the number of iterations is equally divisible by 4):

xor edx, edx
; copy 16 bytes from B to xmm1
movups xmm1, xmmword ptr [edi+edx*4+8]
; copy 16 bytes from A to xmm0 
movups xmm0, xmmword ptr [esi+edx*4+8]
; add xmm0 to xmm1 and store the result in xmm1
addps xmm1, xmm0
; copy 16 bytes from xmm1 to C
movups xmmword ptr [ecx+edx*4+8], xmm1
add edx, 4 ; increase loop index by 4
cmp edx, dword ptr [esi+4]

On an AVX processor, we could move even more data around in each iteration (with the 256-bit YMM* registers), for an even bigger performance improvement:

xor edx, edx
; copy 32 bytes from B to ymm1
vmovups ymm1, ymmword ptr [edi+edx*4+8]
; copy 32 bytes from A to ymm0
vmovups ymm0, ymmword ptr [esi+edx*4+8]
; add ymm0 to ymm1 and store the result in ymm1
vaddps ymm1, ymm1, ymm0
; copy 32 bytes from ymm1 to C
vmovups ymmword ptr [ecx+edx*4+8], ymm1
add edx, 8 ; increase loop index by 8
cmp edx, dword ptr [esi+4]

The SIMD instructions used in these examples are only the tip of the iceberg. Modern applications and games use SIMD instructions to perform complex operations, including scalar product, shuffling data around in registers and memory, checksum calculation, and many others. Intel’s AVX portal is a good way to learn thoroughly what AVX can offer.

The JIT compiler uses only a small number of SSE instructions, even though they are available on practically every processor manufactured in the last 10 years. Specifically, the JIT compiler uses the SSE MOVQ instruction to copy medium-sized structures through the XMM* registers (for large structures, REP MOVS is used instead), uses SSE2 instructions for floating point to integer conversion, and other corner cases. The JIT compiler does not auto-vectorize loops by unifying iterations, as we did manually in the preceding code.

Unfortunately, C# doesn’t offer any keywords for embedding inline assembly code into your managed programs. Although you could factor out performance-sensitive parts to a C++ module and use .NET interoperability to access it, this is often clumsy and introduces a small performance penalty as the interoperability boundaries are crossed. There are two other approaches for embedding SIMD code without resorting to interoperability.

A brute-force way to run arbitrary machine code from a managed application (albeit with a light interoperability layer) is to dynamically emit the machine code and then call it. The Marshal.GetDelegateForFunctionPointer method is key, as it returns a managed delegate pointing to an unmanaged memory location, which may contain arbitrary code. The following code allocates virtual memory with the EXECUTE_READWRITE page protection, which enables us to copy code bytes into memory and then execute them. The result, on my Intel i7-860 CPU, is a more than 2-fold improvement in execution time!

delegate void VectorAddDelegate(
  float[] C, float[] B, float[] A, int length);

[DllImport("kernel32.dll", SetLastError = true)]
static extern IntPtr VirtualAlloc(
  IntPtr lpAddress, UIntPtr dwSize,
  IntPtr flAllocationType, IntPtr flProtect);

//This array of bytes has been produced from
//the SSE assembly version – it is a complete
//function that accepts four parameters (three
//vectors and length) and adds the vectors

byte[] sseAssemblyBytes = {
  0x8b, 0x5c, 0x24, 0x10, 0x8b, 0x74, 0x24, 0x0c, 0x8b,
  0x7c, 0x24, 0x08, 0x8b, 0x4c, 0x24, 0x04, 0x31, 0xd2,
  0x0f, 0x10, 0x0c, 0x97, 0x0f, 0x10, 0x04, 0x96, 0x0f,
  0x58, 0xc8, 0x0f, 0x11, 0x0c, 0x91, 0x83, 0xc2, 0x04,
  0x39, 0xda, 0x7f, 0xea, 0xc2, 0x10, 0x00 };

IntPtr codeBuffer = VirtualAlloc(
  IntPtr.Zero, new UIntPtr((uint)sseAssemblyBytes.Length),
  0x1000 | 0x2000, //MEM_COMMIT | MEM_RESERVE

Marshal.Copy(sseAssemblyBytes, 0,
  codeBuffer, sseAssemblyBytes.Length);

VectorAddDelegate addVectors = (VectorAddDelegate)
    codeBuffer, typeof(VectorAddDelegate));
//We can now use ‘addVectors’ to add vectors!

A completely different approach, which unfortunately isn’t available on the Microsoft CLR, is extending the JIT compiler to emit SIMD instructions. This is the approach taken by Mono.Simd. Managed code developers who use the Mono .NET runtime can reference the Mono.Simd assembly and use JIT compiler support that converts operations on types such as Vector16b or Vector4f to the appropriate SSE instructions.


Published at DZone with permission of Sasha Goldshtein , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}