
.NET Performance - Choosing a Collection is a Matter of Cache


This is a short excerpt from Chapter 5 (Collections and Generics) of Pro .NET Performance, scheduled to appear in August 2012. I might be publishing a few more of these before and after the book is out.

Choosing the right collection is not only a matter of its performance characteristics. For CPU-bound applications, the way data is laid out in memory is often more critical than any other criterion, and collections affect this layout greatly. The main reason to examine data layout in memory so carefully is the CPU cache.

Modern systems ship with large main memories. Eight GB of memory is fairly standard on a mid-range workstation or a gaming laptop. Fast DDR3 SDRAM memory units are capable of ∼15ns memory access latencies, and theoretical transfer rates of ∼15GB/s. Fast processors, on the other hand, can issue billions of instructions per second; theoretically speaking, stalling for 15ns while waiting for memory access can prevent the execution of dozens (and sometimes hundreds) of CPU instructions. Stalling for memory access is the phenomenon known as hitting the memory wall.
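
To put rough numbers on that claim, here is a back-of-envelope estimate; the figures below are assumptions for illustration, not measurements of any particular CPU.

double clockGHz = 3.0;             // assumed core frequency
double memoryLatencyNs = 15.0;     // the DDR3 access latency quoted above
double instructionsPerCycle = 3.0; // assumed sustained superscalar throughput

double stalledCycles = clockGHz * memoryLatencyNs;              // 3e9 * 15e-9 = 45 cycles
double lostInstructions = stalledCycles * instructionsPerCycle; // ~135 instructions

Console.WriteLine("One trip to main memory ~ {0} cycles ~ {1} instructions not executed",
    stalledCycles, lostInstructions);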

To keep applications away from this wall, modern processors are equipped with several levels of cache memory, which is built from SRAM that is much faster than the main memory’s DRAM, considerably more expensive, and therefore relatively small. For example, my desktop’s Intel i7-860 processor ships with three cache levels:

  • Level 1 cache for program instructions, 32KB, one for each core (total of 4 caches).
  • Level 1 cache for data, 32KB, one for each core (total of 4 caches).
  • Level 2 cache for data, 256KB, one for each core (total of 4 caches).
  • Level 3 cache for data, 8MB, shared (total of 1 cache).


When the processor attempts to access a memory address, it begins by checking whether the data is already in its L1 cache. If it is, the memory access is satisfied from the cache, which takes ∼5 CPU cycles (this is called a cache hit). If it isn’t, the L2 cache is checked; satisfying the access from the L2 cache takes ∼10 cycles. Similarly, satisfying the access from L3 cache takes ∼40 cycles. Finally, if the data isn’t in any of the cache levels, the processor will stall for the system’s main memory (this is called a cache miss). When the processor accesses main memory, it reads from it not a single byte or word, but a cache line, which on modern systems consists of 32 or 64 bytes. Accessing any word on the same cache line will not involve another cache miss until the line is evicted from the cache.
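
The cache line behavior is easy to observe from managed code. The sketch below (mine, not from the book) writes to every element of a large array and then to every 16th element. Although the second loop does a sixteenth of the work, both loops touch every 64-byte cache line exactly once, so on a typical machine they run in roughly comparable time: the memory traffic, not the instruction count, dominates.

int[] data = new int[64 * 1024 * 1024]; // 256MB of ints, much larger than any cache level

var sw = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < data.Length; i += 1)
  data[i] *= 3;                         // 16 writes per 64-byte cache line
Console.WriteLine("stride 1:  {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
for (int i = 0; i < data.Length; i += 16)
  data[i] *= 3;                         // 1 write per 64-byte cache line
Console.WriteLine("stride 16: {0} ms", sw.ElapsedMilliseconds);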

To calculate cache access latencies on actual hardware, you can consult hardware specs or use a live benchmarking utility like CPUID’s cache latency computation tool. Here is an example of its output on my desktop PC:

Cache latency computation, ver 1.0
www.cpuid.com

Computing ...
<diagnostic output skipped>

3 cache levels detected
Level 1         size = 32Kb     latency = 3 cycles
Level 2         size = 256Kb    latency = 10 cycles
Level 3         size = 8192Kb   latency = 39 cycles
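
If you would rather estimate these boundaries from managed code, a crude sketch like the following (my own, not part of the tool or the book) chases randomized, dependent reads through working sets of increasing size; the time per access should step up noticeably as the working set outgrows the 32KB L1, the 256KB L2, and the 8MB L3.

Random random = new Random(42);
foreach (int sizeKB in new[] { 16, 32, 64, 256, 512, 4096, 8192, 65536 })
{
  int count = sizeKB * 1024 / sizeof(int);
  int[] next = new int[count];
  // Build a single random cycle (Sattolo's algorithm) so that every load
  // depends on the previous one and the hardware prefetcher cannot help.
  for (int i = 0; i < count; ++i) next[i] = i;
  for (int i = count - 1; i > 0; --i)
  {
    int j = random.Next(i); // 0 <= j < i
    int tmp = next[i]; next[i] = next[j]; next[j] = tmp;
  }

  const int Steps = 10000000;
  int position = 0;
  var sw = System.Diagnostics.Stopwatch.StartNew();
  for (int step = 0; step < Steps; ++step)
    position = next[position]; // one dependent, unpredictable load per step
  sw.Stop();

  // 'position' is printed so the JIT cannot discard the loop as dead code.
  Console.WriteLine("{0,6} KB: {1:F1} ns/access ({2})",
      sizeKB, sw.Elapsed.TotalMilliseconds * 1000000.0 / Steps, position & 1);
}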

Although this description does not do justice to the true hardware complexities of how SRAM caches and DRAM memories operate, it provides enough food for thought and discussion of how our high-level software algorithms can be affected by data layout in memory. We now consider a simple example that involves a single core’s cache. (There are interesting cases to consider where multiple caches from different cores are involved—Chapter 6 in the book covers these in more detail.)

Suppose the task at hand is traversing a large collection of integers and performing some aggregation on them, such as finding their sum or average. Below are two alternatives; one accesses the data from a LinkedList<int> and the other from an array of integers (int[]), two of the built-in .NET collections.

LinkedList<int> numbers = new LinkedList<int>(
    Enumerable.Range(0, 20000000));
int sum = 0;
for (LinkedListNode<int> curr = numbers.First;
     curr != null; curr = curr.Next)
{
  sum += curr.Value;
}

int[] numbers = Enumerable.Range(0, 20000000).ToArray();
int sum = 0;
for (int curr = 0; curr < numbers.Length; ++curr)
{
  sum += numbers[curr];
}
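
For completeness, here is one way to reproduce the measurement. The harness below, including the SumArray and SumLinkedList method names and the warm-up calls, is my own scaffolding rather than the book's; on a setup similar to the one described above, the array version should finish in roughly half the time of the linked list version.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class TraversalBenchmark
{
  static int SumLinkedList(LinkedList<int> numbers)
  {
    int sum = 0;
    for (LinkedListNode<int> curr = numbers.First; curr != null; curr = curr.Next)
    {
      sum += curr.Value;
    }
    return sum;
  }

  static int SumArray(int[] numbers)
  {
    int sum = 0;
    for (int curr = 0; curr < numbers.Length; ++curr)
    {
      sum += numbers[curr];
    }
    return sum;
  }

  static void Main()
  {
    int[] array = Enumerable.Range(0, 20000000).ToArray();
    LinkedList<int> list = new LinkedList<int>(array);

    // Warm-up calls so that JIT compilation isn't charged to the first measurement.
    // (The sums overflow int, just as in the snippets above; only the timing matters.)
    SumArray(array);
    SumLinkedList(list);

    Stopwatch sw = Stopwatch.StartNew();
    int listSum = SumLinkedList(list);
    Console.WriteLine("LinkedList<int>: {0} ms (sum = {1})", sw.ElapsedMilliseconds, listSum);

    sw.Restart();
    int arraySum = SumArray(array);
    Console.WriteLine("int[]:           {0} ms (sum = {1})", sw.ElapsedMilliseconds, arraySum);
  }
}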

The second version of the code runs 2× faster than the first on my desktop. This is a non-negligible difference, and if you only consider the number of CPU instructions issued, you might not be convinced that there should be a difference at all. After all, traversing a linked list involves moving from one node to the next, and traversing an array involves incrementing an index into the array. (In fact, without JIT optimizations, accessing an array element would also require a range check to make sure that the index is within the array’s bounds.)

  ; possible x86 assembly for the first loop
  ; assume ‘sum’ is in EAX and ‘numbers’ is in ECX
  xor eax, eax
  mov ecx, dword ptr [ecx+4] ; curr = numbers.First
  test ecx, ecx
  jz LOOP_END
LOOP_BEGIN:
  add eax, dword ptr [ecx+10] ; sum += curr.Value
  mov ecx, dword ptr [ecx+8] ; curr = curr.Next
  test ecx, ecx
  jnz LOOP_BEGIN ; total of 4 instructions per iteration
LOOP_END:
  ...

  ; possible x86 assembly for the second loop
  ; assume ‘sum’ is in EAX and ‘numbers’ is in ECX
  mov edi, dword ptr [ecx+4] ; numbers.Length
  test edi, edi
  jz LOOP_END
  xor edx, edx ; loop index
LOOP_BEGIN:
  add eax, dword ptr [ecx+edx*4+8] ; sum += numbers[i]
  inc edx
  cmp edi, edx
  jg LOOP_BEGIN ; total of 4 instructions per iteration
LOOP_END:
  ...

Given this code generation for both loops (and barring optimizations such as using SIMD instructions to traverse the array, which is contiguous in memory), it is hard to explain the significant performance difference by inspecting only the instructions executed. Indeed, we must analyze the memory access patterns of this code to reach any acceptable conclusions.

In both loops, each integer is accessed only once, and it would seem that cache considerations are not as critical because there is no reusable data to benefit from cache hits. Still, the way the data is laid out in memory greatly affects the performance of this program: not because the data is reused, but because of the way it is brought into the cache. When accessing the array elements, a cache miss at the beginning of a cache line brings into the cache 16 consecutive integers (cache line = 64 bytes = 16 integers). Because array access is sequential, the next 15 integers are now in the cache and can be accessed without a cache miss. This is an almost-ideal scenario, with a 1:16 cache miss ratio. On the other hand, when accessing linked list elements, a cache miss at the beginning of a cache line can bring into the cache at most 3 consecutive linked list nodes, a 1:4 cache miss ratio! (A node consists of a back pointer, a forward pointer, and the integer datum, which occupy 12 bytes on a 32-bit system; the reference type header brings the tally to 20 bytes per node.)

The much higher cache miss ratio is responsible for most of the performance difference between the two pieces of code we tested above. Furthermore, we have been assuming the ideal scenario in which all the linked list nodes are positioned sequentially in memory, which would be the case only if they were all allocated at the same time with no other allocations taking place, and that is fairly unlikely. Had the linked list nodes been distributed less ideally in memory, the cache miss ratio would have been even higher and the performance even poorer.
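
To get a feel for how much a scattered layout costs, one simple experiment (a sketch of mine, not taken from the book) is to build the same list while interleaving throwaway allocations, so that consecutive nodes no longer sit next to each other on the heap, and then time a traversal of each list with the harness shown earlier:

const int Count = 5000000; // smaller than the 20,000,000 above to keep memory in check

// Contiguous case: nodes are allocated back to back, the near-ideal layout
// assumed in the discussion above.
LinkedList<int> contiguous = new LinkedList<int>(Enumerable.Range(0, Count));

// Fragmented case: a throwaway byte[] is allocated between every two nodes and
// kept alive, so that consecutive nodes land farther apart and the GC cannot
// compact them back together.
LinkedList<int> fragmented = new LinkedList<int>();
List<byte[]> padding = new List<byte[]>(Count);
for (int i = 0; i < Count; ++i)
{
  fragmented.AddLast(i);
  padding.Add(new byte[64]);
}

// Timing a traversal of each list should show the fragmented list trailing
// the contiguous one, and both trailing the int[] version.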

More practical examples—and everything you will ever want to know about CPU caches and performance considerations of system memory—are covered in Ulrich Drepper’s excellent essay, “What Every Programmer Should Know About Memory”. I recommend re-reading this paper at least once a year :-)

