On ‘stackalloc’ Performance and The Large Object Heap
The Performance Zone is presented by AppDynamics. Scalability and better performance are constant concerns for the developer and operations manager. Try AppDynamics' fully-featured performance tool for Java, .NET, PHP, & Node.js.
An interesting blog post is making the rounds on Twitter, 10 Things You Maybe Didn’t Know About C#. There are some nice points in there, such as using the FieldOffset attribute to create unions or specifying custom add/remove accessors for events.
However, item #4 on the list claims that using stackalloc is not faster than allocating a standard array. The proof is given in form of a benchmark program that allocates 10,000 element arrays – so far so good – and then proceeds to store values in them. The values are obtained by using Math.Pow. The benchmark results show that using a stackalloc-ed array is not faster than using a heap-allocated array. I was able to successfully reproduce this result on my machine.
Unfortunately, this benchmark cannot support the claim “Using stackalloc is not faster than allocating a standard array”. This benchmark shows that if you need to calculate and store powers of large numbers in an array, it doesn’t matter from a speed perspective if you store them in a heap-allocated array or in a stack-allocated array.
The benchmark also has additional problems – it does not allow the program to warm up, it does not display statistics over the results (average, min, max, standard deviation), and it only uses a single array size, namely 10,000 elements. I refer you to Eric Lippert’s blog series on micro-benchmarking, and also Chapter 2 in Pro .NET Performance [available in part on my blog], for a more detailed treatment of micro-benchmarking best practices.
I made a few changes to the benchmark program so that it:
- Allocates arrays of varying sizes and compares the performance for each size. The sizes I used were 10 – 4010 (step 500) and 6000 – 96000 (step 10,000).
- Repeats the measurement 5 times and displays the average, min, and max times.
- Warms up by calling each function at least once before measuring anything.
The final version of the benchmark I used is here.
The raw results are below (times are in milliseconds):
Array size: 10 Heap Array - AVG = 0.85394, MIN = 0.5395, MAX = 1.0125 Stack Array - AVG = 0.23302, MIN = 0.2055, MAX = 0.3028 Array size: 510 Heap Array - AVG = 7.22854, MIN = 6.9549, MAX = 7.4369 Stack Array - AVG = 6.38972, MIN = 5.8119, MAX = 6.8206 Array size: 1010 Heap Array - AVG = 12.53882, MIN = 10.0942, MAX = 14.1942 Stack Array - AVG = 10.92058, MIN = 10.5542, MAX = 11.5668 Array size: 1510 Heap Array - AVG = 17.6329, MIN = 13.9484, MAX = 19.9116 Stack Array - AVG = 15.12048, MIN = 14.0906, MAX = 16.5836 Array size: 2010 Heap Array - AVG = 21.68702, MIN = 20.5229, MAX = 22.5228 Stack Array - AVG = 19.7369, MIN = 19.0708, MAX = 21.2995 Array size: 2510 Heap Array - AVG = 22.80914, MIN = 22.2542, MAX = 23.0967 Stack Array - AVG = 23.6264, MIN = 23.4815, MAX = 23.7922 Array size: 3010 Heap Array - AVG = 27.27846, MIN = 26.4928, MAX = 28.2462 Stack Array - AVG = 27.96436, MIN = 27.798, MAX = 28.1741 Array size: 3510 Heap Array - AVG = 32.08726, MIN = 30.7842, MAX = 35.0712 Stack Array - AVG = 33.52876, MIN = 32.6167, MAX = 35.2137 Array size: 4010 Heap Array - AVG = 35.50722, MIN = 35.0341, MAX = 36.1861 Stack Array - AVG = 37.25514, MIN = 37.1412, MAX = 37.4977 Array size: 4510 Heap Array - AVG = 39.46634, MIN = 39.2239, MAX = 40.0683 Stack Array - AVG = 43.32186, MIN = 41.906, MAX = 46.0319 Array size: 6000 Heap Array - AVG = 51.58616, MIN = 51.2831, MAX = 52.1263 Stack Array - AVG = 56.75872, MIN = 55.4964, MAX = 58.3105 Array size: 16000 Heap Array - AVG = 139.4916, MIN = 136.2144, MAX = 148.4627 Stack Array - AVG = 148.48424, MIN = 148.1188, MAX = 148.7517 Array size: 26000 Heap Array - AVG = 405.05262, MIN = 398.2785, MAX = 412.4401 Stack Array - AVG = 242.50534, MIN = 242.1933, MAX = 242.9641 Array size: 36000 Heap Array - AVG = 570.28486, MIN = 549.2248, MAX = 631.8309 Stack Array - AVG = 339.10658, MIN = 335.6826, MAX = 341.6048 Array size: 46000 Heap Array - AVG = 716.25066, MIN = 692.9255, MAX = 741.0803 Stack Array - AVG = 435.29708, MIN = 431.5396, MAX = 442.7383 Array size: 56000 Heap Array - AVG = 871.52846, MIN = 862.8028, MAX = 881.2801 Stack Array - AVG = 532.46454, MIN = 526.9268, MAX = 536.4717 Array size: 66000 Heap Array - AVG = 1015.86848, MIN = 995.1994, MAX = 1040.2593 Stack Array - AVG = 627.60132, MIN = 623.1848, MAX = 637.2896 Array size: 76000 Heap Array - AVG = 1201.39392, MIN = 1159.1772, MAX = 1236.559 Stack Array - AVG = 725.7904, MIN = 717.0982, MAX = 730.3976 Array size: 86000 Heap Array - AVG = 1361.79128, MIN = 1336.9031, MAX = 1394.9041 Stack Array - AVG = 819.4298, MIN = 812.741, MAX = 827.5579 Array size: 96000 Heap Array - AVG = 1554.59876, MIN = 1538.4992, MAX = 1577.4435 Stack Array - AVG = 916.43956, MIN = 911.0777, MAX = 919.0551
If you plot the averages, you get the following:
It’s immediately evident that something happened between 16,000 and 26,000 elements that created a gap in the results (in favor of stack allocations) – and this gap is only getting bigger when you increase the array sizes.
That “something” is the large object heap. Objects larger than 85,000 bytes are allocated from the large object heap, which 1) is not compacted and 2) is collected only with a full generation 2 collection. 85,000 bytes are enough for 16,000 integers but not enough for 26,000 integers.
We can conclude that the larger the allocated arrays are, the bigger the overhead of using the large object heap to allocate them. On the other hand, we can’t really allocate arbitrarily-sized arrays from the stack at all. The default stack size for a Windows thread is 1MB, so allocating an array with 100,000 integers is already a considerable portion of that space. But that’s a whole different question.
To summarize, we were able to show that stack allocations are faster than large object heap allocations if you consider the GC times as part of the benchmark.