Inter Thread Latency
Message rates between threads are fundamentally determined by the
latency of memory exchange between CPU cores. The minimum unit of
transfer will be a cache line exchanged via shared caches or socket
interconnects. In a previous article I explained Memory Barriers
and why they are important to concurrent programming between threads.
These are the instructions that cause a CPU to make memory visible to
other cores in an ordered and timely manner.
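To make that concrete, here is a minimal sketch of my own (not from the original article) showing how a volatile write in Java publishes earlier plain writes to another thread; the class and field names are purely illustrative.

public final class Publication
{
    private int payload;            // plain field: written before the barrier
    private volatile boolean ready; // the volatile write is the barrier

    public void publish(final int value)
    {
        payload = value; // plain store, may sit in the store buffer
        ready = true;    // volatile store: orders and publishes the write above
    }

    public int poll()
    {
        while (!ready)
        {
            // busy spin until the volatile read observes the store
        }

        return payload; // guaranteed to see the value written before 'ready'
    }
}

The volatile store to ready acts as the store barrier: once a reader observes it, the Java Memory Model guarantees the earlier plain write to payload is also visible.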
Lately I’ve been asked a lot about how much faster the Disruptor
would be if it were written in C++ instead of Java. For sure C++ would
give more control over memory alignment and potential access to
underlying CPU instructions such as memory barriers and lock
instructions. In this article I’ll directly compare C++ and Java to
measure the cost of signalling a change between threads.
For the test we’ll use two counters, each updated by its own thread. A
simple ping-pong algorithm will be used to signal from one to the other
and back again. The exchange will be repeated millions of times to
measure the average latency between cores. This measurement will give
us the latency of exchanging a cache line between cores in a serial
manner.
For Java we’ll use volatile counters, for which the JVM kindly inserts
a lock instruction on each update, giving us an effective memory barrier.
public final class InterThreadLatency
    implements Runnable
{
    public static final long ITERATIONS = 500L * 1000L * 1000L;

    public static volatile long s1;
    public static volatile long s2;

    public static void main(final String[] args)
    {
        Thread t = new Thread(new InterThreadLatency());
        t.setDaemon(true);
        t.start();

        long start = System.nanoTime();

        long value = s1;
        while (s1 < ITERATIONS)
        {
            while (s2 != value)
            {
                // busy spin
            }
            value = ++s1;
        }

        long duration = System.nanoTime() - start;

        System.out.println("duration = " + duration);
        System.out.println("ns per op = " + duration / (ITERATIONS * 2));
        System.out.println("op/sec = " +
            (ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration);
        System.out.println("s1 = " + s1 + ", s2 = " + s2);
    }

    public void run()
    {
        long value = s2;
        while (true)
        {
            while (value == s1)
            {
                // busy spin
            }
            value = ++s2;
        }
    }
}
For C++ we’ll use the GNU Atomic Builtins, which insert a lock instruction in a similar manner to the JVM.
#include <time.h>
#include <pthread.h>
#include <stdio.h>

typedef unsigned long long uint64;
const uint64 ITERATIONS = 500LL * 1000LL * 1000LL;

volatile uint64 s1 = 0;
volatile uint64 s2 = 0;

void* run(void*)
{
    register uint64 value = s2;
    while (true)
    {
        while (value == s1)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s2, 1);
    }
}

int main (int argc, char *argv[])
{
    pthread_t threads[1];
    pthread_create(&threads[0], NULL, run, NULL);

    timespec ts_start;
    timespec ts_finish;
    clock_gettime(CLOCK_MONOTONIC, &ts_start);

    register uint64 value = s1;
    while (s1 < ITERATIONS)
    {
        while (s2 != value)
        {
            // busy spin
        }
        value = __sync_add_and_fetch(&s1, 1);
    }

    clock_gettime(CLOCK_MONOTONIC, &ts_finish);

    uint64 start = (ts_start.tv_sec * 1000000000LL) + ts_start.tv_nsec;
    uint64 finish = (ts_finish.tv_sec * 1000000000LL) + ts_finish.tv_nsec;
    uint64 duration = finish - start;

    printf("duration = %lld\n", duration);
    printf("ns per op = %lld\n", (duration / (ITERATIONS * 2)));
    printf("op/sec = %lld\n",
        ((ITERATIONS * 2L * 1000L * 1000L * 1000L) / duration));
    printf("s1 = %lld, s2 = %lld\n", s1, s2);

    return 0;
}
Results
$ taskset -c 2,4 /opt/jdk1.7.0/bin/java InterThreadLatency
duration = 50790271150
ns per op = 50
op/sec = 19,688,810
s1 = 500000000, s2 = 500000000

$ g++ -O3 -lpthread -lrt -o itl itl.cpp
$ taskset -c 2,4 ./itl
duration = 45087955393
ns per op = 45
op/sec = 22,178,872
s1 = 500000000, s2 = 500000000
The C++ version is slightly faster on my Intel Sandy Bridge laptop. So
what does this tell us? Well, that the latency between two cores on a 2.2
GHz machine is ~45ns, and that you can exchange 22m messages per second
in a serial fashion. On an Intel CPU this is fundamentally the cost of
the lock instruction enforcing total order and forcing the store buffer
and write combining buffers
to drain, followed by the resulting cache coherency traffic between the
cores. Note that each core has a 96 GB/s port onto the L3 cache ring
bus, yet 22m * 64 bytes is only ~1.4 GB/s. This is because we have
measured latency, not throughput. We could easily fit some nice fat
messages between those memory barriers as part of the exchange,
provided the data is written before the lock instruction executes.
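As a rough illustration of that point, here is a sketch of my own (not from the article) of carrying a fat message across with a single barrier: the message body is written with plain stores, and one volatile store to a sequence field publishes the lot. All names are hypothetical.

public final class FatMessageExchange
{
    private long field0, field1, field2, field3; // message body: plain writes
    private long field4, field5, field6, field7;
    private volatile long sequence = -1;         // the single published field

    public void send(final long seq, final long[] body)
    {
        field0 = body[0]; field1 = body[1]; // plain stores, no fences here
        field2 = body[2]; field3 = body[3];
        field4 = body[4]; field5 = body[5];
        field6 = body[6]; field7 = body[7];

        sequence = seq; // one volatile store publishes the whole message
    }

    public long awaitAndSum(final long expectedSeq)
    {
        while (sequence != expectedSeq)
        {
            // busy spin
        }

        return field0 + field1 + field2 + field3 // all body writes now visible
             + field4 + field5 + field6 + field7;
    }
}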
So what does this all mean for the Disruptor? Basically, the latency of
the Disruptor is about as low as we can get from Java. It would be
possible to get a ~10% latency improvement by moving to C++. I’d expect
a similar improvement in throughput for C++. The main win with C++
would be the control, and therefore, the predictability that comes with
it if used correctly. The JVM gives us nice safety features, like
garbage collection in complex applications, but we pay a little for
them in the extra instructions it inserts, which you can see if you get
Hotspot to dump the assembler it generates.
How does the Disruptor achieve more than 25m messages per second, I hear
you ask? Well, that is one of the neat parts of its design. The
“waitFor” semantics on the DependencyBarrier enable a very efficient
form of batching, which allows the BatchEventProcessor to process a
series of events that occurred since it last checked in with the
RingBuffer, all without incurring a memory barrier. For real-world
applications this batching effect is really significant. It only makes
the results more random in a micro-benchmark, where there is little
work done other than processing the message.
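As a rough sketch of that batching idea, simplified and with hypothetical interfaces standing in for the real Disruptor types, the consumer loop looks something like this: one waitFor call (ultimately a volatile read of the producer's cursor) can make a whole run of events visible, which are then processed with plain reads.

interface Barrier    { long waitFor(long sequence) throws InterruptedException; }
interface Buffer<T>  { T get(long sequence); }
interface Handler<T> { void onEvent(T event, long sequence); }

final class BatchingConsumer<T> implements Runnable
{
    private final Barrier barrier;
    private final Buffer<T> buffer;
    private final Handler<T> handler;
    private volatile boolean running = true;

    BatchingConsumer(final Barrier barrier, final Buffer<T> buffer, final Handler<T> handler)
    {
        this.barrier = barrier;
        this.buffer = buffer;
        this.handler = handler;
    }

    public void halt()
    {
        running = false;
    }

    public void run()
    {
        long nextSequence = 0;
        try
        {
            while (running)
            {
                // One waitFor (a volatile read of the cursor) may reveal a
                // whole run of events published since we last checked.
                long availableSequence = barrier.waitFor(nextSequence);

                // Process the batch with plain reads: no per-event barrier.
                while (nextSequence <= availableSequence)
                {
                    handler.onEvent(buffer.get(nextSequence), nextSequence);
                    nextSequence++;
                }
            }
        }
        catch (final InterruptedException ignore)
        {
            // exit the loop on shutdown
        }
    }
}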
Conclusion
So when processing events in series, the measurements tell us that the
current generation of processors can do between 20 and 30 million
exchanges per second at a latency of less than 50ns. The Disruptor
design allows us to get greater throughput without explicit batching on
the publisher side. In addition, the Disruptor has an explicit batching
API on the publisher side that can give over 100m messages per second,
the shape of which is sketched below.
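To show the shape of that idea, here is an illustrative sketch; EventBuffer and Event are hypothetical stand-ins, not the actual Disruptor API. The point is that one claim and one publish amortise the memory barrier over the whole batch.

final class Event { long value; }

interface EventBuffer
{
    long next(int n);               // claim n slots, return the highest sequence
    Event get(long sequence);       // access a claimed slot
    void publish(long lo, long hi); // publish the whole range at once
}

final class BatchPublisher
{
    static void publishBatch(final EventBuffer buffer, final long[] values)
    {
        final long hi = buffer.next(values.length);
        final long lo = hi - (values.length - 1);

        for (long seq = lo; seq <= hi; seq++)
        {
            buffer.get(seq).value = values[(int)(seq - lo)]; // plain writes
        }

        buffer.publish(lo, hi); // one memory barrier amortised over the batch
    }
}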
From http://mechanical-sympathy.blogspot.com/2011/08/inter-thread-latency.html