Locks & Condition Variables - Latency Impact
In a previous article on Inter-Thread Latency
I showed how it is possible to signal a state change between 2 threads
with less than 50ns of latency. To many developers, writing concurrent
code using locks is a scary experience. Writing concurrent code using
lock-free algorithms, i.e. algorithms that rely on the use of memory
barriers and an intimate understanding of the underlying memory models,
can be totally terrifying. To me, lock-free / non-blocking
algorithms are like playing with explosives or corrosive chemicals: if
you do not understand what you are doing, or fail to show them the utmost respect,
then very bad things can, and most likely will, happen!
In this article, I'd like to illustrate the impact of using locks and
the resulting latency they can impose on your designs. I want to use a
very similar algorithm to that used in my previous inter-thread latency
article to illustrate the ping-pong effect of handing control back and
forth between 2 threads. In this case, rather than using a couple of
volatile variables, I will employ a pair of condition variables to
signal a state change so control can be passed back and forth.
The Code
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class LockedSignallingLatency
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    private static final Lock lock = new ReentrantLock();
    private static final Condition sendCondition = lock.newCondition();
    private static final Condition echoCondition = lock.newCondition();

    private static long sendValue = -1L;
    private static long echoValue = -1L;

    public static void main(final String[] args) throws Exception
    {
        final Thread sendThread = new Thread(new SendRunner());
        final Thread echoThread = new Thread(new EchoRunner());

        final long start = System.nanoTime();

        echoThread.start();
        sendThread.start();

        sendThread.join();
        echoThread.join();

        final long duration = System.nanoTime() - start;

        // each iteration is a round trip, i.e. two one-way signals
        out.printf("duration %,d (ns)\n", duration);
        out.printf("%,d ns/op\n", duration / (ITERATIONS * 2L));
        out.printf("%,d ops/s\n", (ITERATIONS * 2L * 1000000000L) / duration);
    }

    // Publishes each value under the lock and signals, then waits for the echo.
    public static final class SendRunner implements Runnable
    {
        @Override
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    sendValue = i;
                    sendCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    while (echoValue != i)
                    {
                        echoCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }

    // Waits under the lock for each value, then echoes it back and signals.
    public static final class EchoRunner implements Runnable
    {
        @Override
        public void run()
        {
            for (long i = 0; i < ITERATIONS; i++)
            {
                lock.lock();
                try
                {
                    while (sendValue != i)
                    {
                        sendCondition.await();
                    }
                }
                catch (final InterruptedException ex)
                {
                    break;
                }
                finally
                {
                    lock.unlock();
                }

                lock.lock();
                try
                {
                    echoValue = i;
                    echoCondition.signal();
                }
                finally
                {
                    lock.unlock();
                }
            }
        }
    }
}
Results
Windows 7 Professional 64-bit - Oracle JDK 1.6.0 - Nehalem 2.8 GHz
$ start /AFFINITY 0x14 /B /WAIT java LockedSignallingLatency
duration 41,649,616,343 (ns)
2,082 ns/op
480,196 ops/s

$ java LockedSignallingLatency
duration 73,789,456,491 (ns)
3,689 ns/op
271,041 ops/s
Linux Fedora Core 15 64-bit - Oracle JDK 1.6.0 - Sandybridge 2.0 GHz
$ taskset -c 2,4 java LockedSignallingLatency
duration 47,209,549,484 (ns)
2,360 ns/op
423,643 ops/s

$ java LockedSignallingLatency
duration 336,168,489,093 (ns)
16,808 ns/op
59,493 ops/s
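(In both cases the first run pins the process to specific cores — the Windows affinity mask 0x14 is binary 10100, i.e. logical CPUs 2 and 4, the equivalent of taskset -c 2,4 on Linux — while the second run lets the OS scheduler place the threads wherever it likes.)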
Observations
The above is a typical set of results, taken from the middle of the
range I've seen across multiple runs. There are a couple of interesting
observations I'd like to expand on.
Firstly, the one-way latency to signal a change is pretty much the same
as what is considered current state of the art for network hops between
nodes via a switch. It is possible to get ~1µs latency with InfiniBand and less than 5µs with 10GigE and user-space IP stacks.
This is roughly two orders of magnitude greater latency than what I illustrated
in the previous article using just memory barriers to signal between
threads. This cost comes about because the kernel needs to get involved
to arbitrate between the threads for the lock, and then manage the
scheduling for the threads to awaken when the condition is signalled.
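For reference, the memory-barrier approach from the earlier article amounts to a ping-pong over a pair of volatile fields with busy-spin waits. The sketch below is my own illustrative reconstruction, not the exact code from that article, but it shows why no kernel arbitration is involved: neither thread ever blocks, so neither ever has to be parked and rescheduled.

public final class VolatileSignallingSketch
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    // volatile reads/writes provide the visibility needed for the signalling
    private static volatile long sendValue = -1L;
    private static volatile long echoValue = -1L;

    public static void main(final String[] args) throws Exception
    {
        final Thread echoThread = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < ITERATIONS; i++)
                {
                    while (sendValue != i)
                    {
                        // busy-spin until the sender publishes i
                    }
                    echoValue = i; // echo the value straight back
                }
            }
        });

        final Thread sendThread = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < ITERATIONS; i++)
                {
                    sendValue = i; // publish the next value
                    while (echoValue != i)
                    {
                        // busy-spin until the echo arrives
                    }
                }
            }
        });

        echoThread.start();
        sendThread.start();
        sendThread.join();
        echoThread.join();
    }
}

The trade-off is that both cores are kept fully busy while spinning; that is the price paid to keep the one-way signalling latency down in the tens of nanoseconds.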
Secondly, the impact is clear when letting the OS choose which CPUs the
threads get scheduled on rather than pinning them manually. I've
observed this same issue across many use cases whereby Linux, in its
default scheduler configuration, will greatly impact the performance of
a low-latency system by scheduling threads on different cores, resulting
in cache pollution. Windows, by default, seems to do a better job of
this.
I recently had an interesting discussion with Cliff Click
about using condition variables and their cost. He pointed out a
problem he was seeing: when a sleeping thread is signalled while the
signaller still holds the lock, it is made runnable, only to discover
that it cannot acquire the lock because the signalling thread still
holds it, so it is put back to sleep until the signalling thread
releases the lock, causing more work than necessary. Modern schedulers
would benefit from being more aware of the communication mechanisms
between threads, so they could make smarter placement and rescheduling
decisions. As we become more concurrent and parallel, our schedulers
need to become more aware of these IPC mechanisms.
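One way to avoid that particular wake-then-block-on-the-lock cycle, while still letting the waiting thread sleep, is to signal without holding any lock at all. The sketch below is my own illustration (it is not from the article, nor something Cliff suggested) using LockSupport.park/unpark together with volatile fields; a woken thread never has to queue behind the signaller for a lock, it simply re-checks the value and carries on.

import java.util.concurrent.locks.LockSupport;

public final class ParkSignallingSketch
{
    private static final int ITERATIONS = 10 * 1000 * 1000;

    private static volatile long sendValue = -1L;
    private static volatile long echoValue = -1L;

    private static volatile Thread sender;
    private static volatile Thread echoer;

    public static void main(final String[] args) throws Exception
    {
        echoer = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < ITERATIONS; i++)
                {
                    while (sendValue != i)
                    {
                        LockSupport.park(); // park() may return spuriously, hence the loop
                    }
                    echoValue = i;
                    LockSupport.unpark(sender); // signal without holding any lock
                }
            }
        });

        sender = new Thread(new Runnable()
        {
            public void run()
            {
                for (long i = 0; i < ITERATIONS; i++)
                {
                    sendValue = i;
                    LockSupport.unpark(echoer); // wake the echo thread directly
                    while (echoValue != i)
                    {
                        LockSupport.park();
                    }
                }
            }
        });

        echoer.start();
        sender.start();
        sender.join();
        echoer.join();
    }
}

It still pays for a kernel wake-up whenever the other thread is genuinely asleep, but it removes the extra trip through the lock's wait queue described above.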
Conclusion
When designing a low-latency system it is crucial to avoid the use of
locks and condition variables for the main transaction flows.
Non-blocking or lock-free algorithms are key to achieving ultra-low
latency but can be very difficult to prove correct. I would not
recommend designing lock-free algorithms for business logic but they can
be very effectively employed for low-level infrastructure components.
The business logic is best run on single threads following the Single Writer Principle from my previous article.
From http://mechanical-sympathy.blogspot.com/2011/11/locks-condition-variables-latency.html