Dissecting the Disruptor: Why it's so fast (part two) - Magic cache line padding
We mention the phrase Mechanical Sympathy quite a lot; in fact, it's even the title of Martin's blog. It's about understanding how the underlying hardware operates and programming in a way that works with it, not against it.
We get a number of comments and questions about the mysterious cache line padding in the RingBuffer, and I referred to it in the last post. Since this lends itself to pretty pictures, it's the next thing I thought I would tackle.
Comp Sci 101
One of the things I love about working at LMAX
is all that stuff I learnt at university and in my A Level Computing
actually means something. So often as a developer you can get away with
not understanding the CPU, data structures or Big O notation -
I spent 10 years of my career forgetting all that. But it turns out
that if you do know about these things, and you apply that knowledge,
you can come up with some very clever, very fast code.
So, a refresher for those of us who studied this at school, and an intro
for those who didn't. Beware - this post contains massive
over-simplifications.
The CPU is the heart of your machine and the thing that ultimately has
to do all the operations, executing your program. Main memory (RAM) is
where your data (including the lines of your program) lives. We're
going to ignore stuff like hard drives and networks here because the Disruptor is aimed at running as much as possible in memory.
The CPU has several layers of cache between it and main memory, because
even accessing main memory is too slow. If you're doing the same
operation on a piece of data multiple times, it makes sense to load this
into a place very close to the CPU when it's performing the operation
(think a loop counter - you don't want to be going off to main memory to
fetch this to increment it every time you loop around).
The closer the cache is to the CPU, the faster it is and the smaller it is. L1 cache is small and very fast, and right next to the core that uses it. L2 is bigger and slower, and still only used by a single core. L3 is more common with modern multi-core machines, and is bigger again, slower again, and shared across cores on a single socket. Finally you have main memory, which is shared across all cores and all sockets.
When the CPU is performing an operation, it's first going to look in L1 for the data it needs, then L2, then L3, and finally if it's not in any of the caches the data needs to be fetched all the way from main memory. The further it has to go, the longer the operation will take. So if you're doing something very frequently, you want to make sure that data is in L1 cache.
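To make that concrete, here is a rough, hypothetical micro-benchmark (my own sketch, not from the original post) that repeatedly sums a long array of increasing size: once the working set no longer fits in L1, then L2, then L3, the cost per element creeps up. The sizes, the pass count and the timing approach are illustrative assumptions; a real measurement would use a harness like JMH and account for JIT warm-up and hardware prefetching.

```java
// Rough sketch of the cache hierarchy's effect: the same per-element work
// gets slower as the working set outgrows L1, then L2, then L3.
// Numbers are illustrative and will vary by machine.
public class CacheHierarchyDemo {
    public static void main(String[] args) {
        // Working sets of 32 KB, 256 KB, 8 MB and 64 MB (sizes in longs).
        int[] sizes = {4 * 1024, 32 * 1024, 1024 * 1024, 8 * 1024 * 1024};
        for (int size : sizes) {
            long[] data = new long[size];
            long sum = 0;
            long start = System.nanoTime();
            // Walk the array many times so each pass (after the first)
            // hits whichever cache level the working set fits into.
            for (int pass = 0; pass < 100; pass++) {
                for (int i = 0; i < size; i++) {
                    sum += data[i];
                }
            }
            long elapsed = System.nanoTime() - start;
            // The sum is printed only so the JIT can't remove the loop.
            System.out.printf("%,12d bytes: %.2f ns per element (sum=%d)%n",
                    size * 8L, elapsed / (100.0 * size), sum);
        }
    }
}
```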
Martin and Mike's QCon presentation gives some indicative figures for the cost of cache misses:
| Latency from CPU to... | Approx. number of CPU cycles | Approx. time in nanoseconds |
|---|---|---|
| Main memory | | ~60-80 ns |
| QPI transit (between sockets, not drawn) | | ~20 ns |
| L3 cache | ~40-45 cycles | ~15 ns |
| L2 cache | ~10 cycles | ~3 ns |
| L1 cache | ~3-4 cycles | ~1 ns |
| Register | 1 cycle | |
Caches don't deal in individual variables, though; they move data around in cache lines, typically 64 bytes at a time. If two independent values happen to sit on the same cache line and are written by threads on different cores, each write invalidates the whole line for the other core, even though the threads never touch the same variable. To stop the hot sequence number in the RingBuffer suffering this fate, it's padded with seven longs on either side, so nothing else can share its cache line:

```java
public long p1, p2, p3, p4, p5, p6, p7; // cache line padding
private volatile long cursor = INITIAL_CURSOR_VALUE;
public long p8, p9, p10, p11, p12, p13, p14; // cache line padding
```
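For the curious, here is a minimal, hypothetical sketch (my own, not from the post) of what that padding buys you: two threads hammering two counters that share a cache line, versus the same counters padded apart. On most machines the padded version is noticeably faster; the exact numbers, the field layout, and even whether this particular trick survives at all are JVM- and hardware-dependent.

```java
// Sketch of false sharing vs. padded counters. Field layout is
// JVM-specific, so the padding trick shown here is not guaranteed
// to work on every JVM.
public class FalseSharingDemo {
    // Both counters will very likely land on the same 64-byte cache line.
    static class Shared {
        volatile long a;
        volatile long b;
    }

    // Seven longs on either side of each counter push them onto separate
    // cache lines - the same trick as the RingBuffer's p1..p7 / p8..p14.
    static class Padded {
        long p1, p2, p3, p4, p5, p6, p7;
        volatile long a;
        long p8, p9, p10, p11, p12, p13, p14;
        volatile long b;
        long p15, p16, p17, p18, p19, p20, p21;
    }

    static final long ITERATIONS = 100_000_000L;

    // Runs both writers on their own threads and returns the elapsed time.
    static long run(Runnable writerA, Runnable writerB) throws InterruptedException {
        Thread t1 = new Thread(writerA);
        Thread t2 = new Thread(writerB);
        long start = System.nanoTime();
        t1.start(); t2.start();
        t1.join(); t2.join();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) throws InterruptedException {
        Shared shared = new Shared();
        Padded padded = new Padded();

        long sharedTime = run(
                () -> { for (long i = 0; i < ITERATIONS; i++) shared.a = i; },
                () -> { for (long i = 0; i < ITERATIONS; i++) shared.b = i; });

        long paddedTime = run(
                () -> { for (long i = 0; i < ITERATIONS; i++) padded.a = i; },
                () -> { for (long i = 0; i < ITERATIONS; i++) padded.b = i; });

        System.out.printf("shared: %d ms, padded: %d ms%n",
                sharedTime / 1_000_000, paddedTime / 1_000_000);
    }
}
```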
From http://mechanitis.blogspot.com/2011/07/dissecting-disruptor-why-its-so-fast_22.html