Optimize Memory Access to Increase Your Coding Performance
Learn about common memory coding problems as well as best practices and tips to help you increase your code performance.
While CPUs have become blindingly fast over the past 20 years, memory access times have remained relatively constant. To get the most performance out of their code, serious programmers must not only choose the best algorithms but also design for highly efficient memory access. Although compilers have improved in recent years, they cannot detect and remap most memory-related coding oversights. For this reason, avoiding the pitfalls associated with memory access requires careful analysis.
In this article, some common memory coding problems are briefly described, together with best-practice tips that can increase code performance. Many of the issues involve understanding virtual memory concepts, such as the difference between stack and heap memory, as well as the more subtle interactions of program execution with the cache hierarchy.
Memory Available to Executing Thread
Two high-level virtual memory concepts for program execution are the stack and the heap. The stack is the memory allocated to an executing thread; it holds local variables, return addresses of executing functions, and other process information. Each thread has its own stack. The heap is dynamically allocated memory; it is also referred to as the free store and is a shared memory pool accessible by all executing threads. Memory on the stack is much faster to access because variables are arranged in a contiguous LIFO layout, while memory on the heap is relatively slower because it is allocated in blocks scattered across RAM, requiring more complex, thread-safe management.
Variables stored on the stack have sizes fixed at compile time, such as fixed-size arrays (e.g., A[100]), and cannot grow during thread execution. Once such a variable goes out of scope, or the running function completes, it is removed from the stack. Variables created dynamically on the heap, with new in C++ or malloc() in C, remain on the heap until explicitly deallocated in the program code using free(), delete, or delete[].
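To make the distinction concrete, here is a minimal C++ sketch (illustrative only, not from the original text) showing a fixed-size array on the stack alongside heap allocations that must be released manually:

```cpp
#include <cstdlib>  // std::malloc, std::free

void allocationExample() {
    int onStack[100];                 // fixed-size array; lives on this thread's stack
    onStack[0] = 42;                  // reclaimed automatically when the function returns

    int* onHeap = new int[100];       // dynamically allocated on the heap (C++)
    onHeap[0] = 42;
    delete[] onHeap;                  // must be deallocated manually

    int* cStyle = static_cast<int*>(std::malloc(100 * sizeof(int)));  // C-style heap allocation
    std::free(cStyle);                // released with free()
}
```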
For coding purposes, the stack is faster because the access pattern is contiguous and allocation consists of simply incrementing or decrementing a pointer, making allocation and deallocation trivial. The stack also efficiently reuses bytes that map directly to the cache. Its disadvantages are that variables cannot be resized during execution and that deep recursion can quickly exhaust stack memory, producing a stack overflow. In contrast, variables created at runtime on the heap, which is available to all threads, have relatively slower access due to several factors associated with virtual memory management. Also, because the heap is a global resource, multithreaded synchronization overhead degrades access time. A common technique is to allocate large blocks from the heap and dole them out to arrays as needed, essentially creating a local memory manager.
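One way to build such a local memory manager is a simple arena (bump) allocator: request one large block from the heap up front and hand out slices of it by advancing an offset. The sketch below is only illustrative; the class name ArenaAllocator and its interface are assumptions, not code from this article.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical bump allocator: one large heap block, sliced out with pointer
// arithmetic, so per-array allocations avoid repeated calls into new/malloc.
class ArenaAllocator {
public:
    explicit ArenaAllocator(std::size_t bytes) : buffer_(bytes), offset_(0) {}

    // Returns nullptr when the arena is exhausted. Assumes align is a power of two.
    void* allocate(std::size_t bytes, std::size_t align = alignof(std::max_align_t)) {
        std::size_t aligned = (offset_ + align - 1) & ~(align - 1);
        if (aligned + bytes > buffer_.size()) return nullptr;
        offset_ = aligned + bytes;
        return buffer_.data() + aligned;
    }

    void reset() { offset_ = 0; }  // release every allocation at once

private:
    std::vector<char> buffer_;     // the single large block taken from the heap
    std::size_t offset_;           // current high-water mark
};
```

Because deallocation is a single reset(), this pattern fits allocations that share a common lifetime, such as per-frame or per-request data.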
In the next section, more details of the cache hierarchy and best practices of memory allocation are described.
The Memory Hierarchy and Performance
At the heart of virtual memory management is the need to coordinate blocks and pages of memory across the memory hierarchy of modern computer architectures. This is difficult because both access times and sizes vary considerably down the hierarchy [2]. The L1 cache can be accessed in 1-5 cycles and is typically ~32-64 KB in size, while the L2 cache costs 5-20 cycles per access but is larger, at ~128-512 KB. These on-chip caches are considerably faster than main memory (RAM), which requires more than 40-100 cycles per access.
Thus, significant performance degradation arises from non-optimal use of the cache hierarchy. Problems associated with reading from and writing to the caches can be summarized as the 3 C's of cache inefficiency:
- Compulsory misses: the first time data is loaded into the cache, a miss is unavoidable.
- Capacity misses: these occur when the active working set cannot fit into the available cache space, forcing cache lines to be evicted and reloaded.
- Conflict misses: when different data maps to the same cache lines, the result can be cache thrashing.
Since inefficient cache utilization lowers performance, coding practices must be cache-aware to improve program execution efficiency. While modern compilers can optimize code in many ways, they are unable to fix some of the most blatant problems that can occur in memory access. The good news is that common practices can help avoid such cases.
Practical Coding Tips
A useful mnemonic for remembering the class of methods that are used to improve memory access performance is given by the 3 R's [1][3], as follows:
- Rearrange (code, data): Change layout to increase spatial locality.
- Reduce (size, number of cache lines read): Smaller/smarter formats, compression.
- Reuse (cache lines): Increase temporal (and spatial) locality.
Rearranging code and data refers to a set of techniques that optimize how data maps onto cache pages. Two examples are memory access order and memory alignment. To avoid cache thrashing, arrays should be scanned in increasing address order, and multidimensional arrays should use the rightmost index in the innermost loop (e.g., for i to size; for j to size; A[i][j]), since C and C++ store arrays in row-major order.
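As a sketch of the access-order point, compare the two traversals below; the rightmost index addresses contiguous memory, so it belongs in the innermost loop:

```cpp
constexpr int kSize = 1024;
static float A[kSize][kSize];

// Cache-friendly: j (the rightmost index) varies fastest, so consecutive
// iterations touch consecutive addresses within the same cache line.
void rowMajorTraversal() {
    for (int i = 0; i < kSize; ++i)
        for (int j = 0; j < kSize; ++j)
            A[i][j] += 1.0f;
}

// Cache-unfriendly: each inner iteration jumps kSize * sizeof(float) bytes,
// touching a different cache line nearly every time and inviting thrashing.
void columnMajorTraversal() {
    for (int i = 0; i < kSize; ++i)
        for (int j = 0; j < kSize; ++j)
            A[j][i] += 1.0f;
}
```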
Memory alignment refers to arranging similarly typed fields together in a structure, with the most restrictively aligned types first. Compilers apply an alignment requirement to each fundamental type and insert padding so that every field sits at a multiple of its alignment. For maximum efficiency, and to avoid padding being added between fields, it is best practice to order the members of a structure from largest to smallest.
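A small illustration of the padding effect, assuming a typical 64-bit ABI (the exact sizes are platform-dependent):

```cpp
#include <cstdint>

// 12 bytes of actual data, but padding inflates the struct to 24 bytes on a
// typical 64-bit ABI because each field must sit at its natural alignment.
struct Padded {
    std::uint8_t  flag;   // 1 byte + 7 bytes padding before 'id'
    std::uint64_t id;     // 8 bytes
    std::uint8_t  kind;   // 1 byte + 1 byte padding before 'count'
    std::uint16_t count;  // 2 bytes + 4 bytes tail padding
};

// Same fields ordered largest to smallest: 16 bytes, no interior padding.
struct Packed {
    std::uint64_t id;     // 8 bytes
    std::uint16_t count;  // 2 bytes
    std::uint8_t  flag;   // 1 byte
    std::uint8_t  kind;   // 1 byte + 4 bytes tail padding
};

static_assert(sizeof(Packed) < sizeof(Padded), "reordering fields saves space");
```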
Another code-and-data rearrangement is hot/cold splitting, where locality of reference is improved by splitting data structures into frequently used (hot) and infrequently used (cold) segments and allocating all of the hot data together.
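A hypothetical example of hot/cold splitting (the GameObject fields below are invented purely for illustration):

```cpp
#include <string>

// Cold data that is rarely touched during the main update loop lives behind
// a pointer, so iterating over many GameObjects pulls fewer cache lines.
struct GameObjectCold {
    std::string name;
    std::string debugInfo;
};

struct GameObject {
    // Hot fields read every frame stay together in the object itself.
    float x, y, z;
    float velocity;
    GameObjectCold* cold;  // infrequently used fields, allocated separately
};
```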
Reduce and reuse refer to techniques that minimize memory operations by exploiting temporal locality, thereby reducing cache fetches. A common approach is to reuse data while it is still in the cache by merging loops that operate on the same data.
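The sketch below shows the idea: two separate passes over the same array fetch every cache line twice, while the fused loop touches each element once while it is still cached.

```cpp
#include <vector>

// Two separate passes read the same array twice, fetching every cache line twice.
void separatePasses(std::vector<float>& data, float scale, float offset) {
    for (float& v : data) v *= scale;
    for (float& v : data) v += offset;
}

// Merging the loops reuses each element while it is still in the cache,
// roughly halving the number of cache-line fetches for large arrays.
void fusedPass(std::vector<float>& data, float scale, float offset) {
    for (float& v : data) v = v * scale + offset;
}
```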
Conclusions
The art of writing cache-aware code requires careful analysis with memory profiling tools. While there are many more memory optimization techniques, the general techniques described in this short article will go a long way toward increasing the execution performance of many programs.
For more information on code performance improvements, see C++ code performance and C optimization.
References
[1] C. Ericson. Real-Time Collision Detection. Morgan Kaufmann, 2005. ISBN: 1558607323. (Chapter on memory optimization.)
[2] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, 2nd ed. Morgan Kaufmann, 1996.
[3] D. N. Truong, F. Bodin, and A. Seznec. "Improving cache behavior of dynamically allocated data structures." Proceedings of the 1998 International Conference on Parallel Architectures and Compilation Techniques (PACT), Paris, 1998, pp. 322-329. doi: 10.1109/PACT.1998.727268.