Java Threads on Steroids

If you're into concurrency, you have probably come across the Disruptor concurrency framework engineered and open-sourced by LMAX.

Its performance was compared to that of ArrayBlockingQueue, which is often regarded as one of the most efficient queue implementations, if not the most efficient. The numbers are indeed pretty impressive.

I recommend downloading the source code and running the tests for yourself.


Martin Fowler recently published a nice insight into the benefits that applying the Disruptor approach brings. There is also a great series of blog posts by Trisha Gee on the Disruptor internals, which is very helpful for understanding not only how this pattern works but also why it is so insanely fast.
But does having this new wonder-weapon in hand mean we have reached the limits of concurrent processing in Java?
Well, not necessarily. The beauty of the Disruptor's approach lies in its non-blocking behaviour: the only sections involving concurrent reads and writes are coordinated with memory barriers (through volatile access). This is much cheaper than locking, but it does not eliminate the problems connected with context switching.
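
To make the idea of coordinating through volatile access rather than locks more concrete, here is a minimal sketch of a single-producer/single-consumer ring buffer. It is not the LMAX code: the class name SpscRingSketch, its methods, and the spin-with-yield waiting strategy are assumptions made purely for illustration. The two sequence counters are the only shared state that both threads read and write, and their volatile semantics (via AtomicLong) are what publish the plain array writes to the other thread.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative single-producer/single-consumer ring buffer; NOT the LMAX implementation.
final class SpscRingSketch {
    private final long[] slots;
    private final int mask;
    // AtomicLong get()/set() have volatile read/write semantics, so they act as the barriers.
    private final AtomicLong published = new AtomicLong(-1); // last sequence written by the producer
    private final AtomicLong consumed = new AtomicLong(-1);  // last sequence read by the consumer

    SpscRingSketch(int sizePowerOfTwo) {
        if (sizePowerOfTwo <= 0 || Integer.bitCount(sizePowerOfTwo) != 1) {
            throw new IllegalArgumentException("size must be a power of two");
        }
        slots = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    // Producer thread only: spins (no lock) while the ring is full.
    void put(long value) {
        long next = published.get() + 1;
        while (next - consumed.get() > slots.length) {
            Thread.yield();
        }
        slots[(int) (next & mask)] = value; // plain write...
        published.set(next);                // ...made visible by this volatile write
    }

    // Consumer thread only: spins (no lock) while the ring is empty.
    long take() {
        long next = consumed.get() + 1;
        while (published.get() < next) {
            Thread.yield();
        }
        long value = slots[(int) (next & mask)];
        consumed.set(next);                 // frees the slot for the producer
        return value;
    }

    // Hypothetical helper: pushes 0..count-1 through the ring and returns their sum.
    static long transferSum(int count, int ringSize) {
        SpscRingSketch ring = new SpscRingSketch(ringSize);
        Thread producer = new Thread(() -> {
            for (int i = 0; i < count; i++) {
                ring.put(i);
            }
        });
        producer.start();
        long sum = 0;
        for (int i = 0; i < count; i++) {
            sum += ring.take();
        }
        try {
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sum;
    }
}
```

Neither thread ever takes a lock, yet both can still be descheduled or migrated by the kernel at any point in their spin loops, which is exactly the remaining cost discussed next.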

To eliminate the cost of context switching, we would have to eliminate the switching itself. You can force a thread or a process to run only on a specified set of CPUs, which reduces the likelihood of the kernel migrating it across all the cores available to the system. This is called processor affinity. Several tools make setting processor affinity very simple, e.g. Linux control groups or the taskset utility. But what if you want to control CPU affinity for individual Java threads?
One way would be to use the capabilities of the RealtimeThread class from the Real-Time Specification for Java, but that would mean using a non-standard JVM implementation. A poor man's solution could use JNI to make native calls to the kernel's sched_setaffinity, or to pthread_setaffinity_np when working with the pthreads API. To cut the theory short and see the practical implications of this approach, let's take a look at the results.
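
As a side note before the results: an affinity setting boils down to a bitmask in which bit i stands for CPU i, and taskset accepts the same information as a CPU list. The sketch below only computes that mask and assembles a taskset argument list; it does not perform any native call. The class AffinitySketch, its method names, and the id 1234 are hypothetical, and the sketch assumes CPU indexes 0 to 63 so that the mask fits in a long.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Illustrative helpers only; nothing here changes the affinity of a running thread.
final class AffinitySketch {

    // Builds a bitmask with bit i set for every requested CPU index i (0..63 in this sketch).
    static long cpuMask(int... cpus) {
        long mask = 0L;
        for (int cpu : cpus) {
            if (cpu < 0 || cpu > 63) {
                throw new IllegalArgumentException("CPU index out of range: " + cpu);
            }
            mask |= 1L << cpu;
        }
        return mask;
    }

    // Assembles the arguments for "taskset -cp <cpu-list> <id>", where id is an OS-level id.
    static List<String> tasksetCommand(long id, int... cpus) {
        StringJoiner cpuList = new StringJoiner(",");
        for (int cpu : cpus) {
            cpuList.add(Integer.toString(cpu));
        }
        List<String> command = new ArrayList<>();
        command.add("taskset");
        command.add("-cp");
        command.add(cpuList.toString());
        command.add(Long.toString(id));
        return command;
    }

    public static void main(String[] args) {
        System.out.println("0x" + Long.toHexString(cpuMask(0, 2)));        // prints 0x5
        System.out.println(String.join(" ", tasksetCommand(1234, 0, 2))); // prints taskset -cp 0,2 1234
    }
}
```

Note that the value returned by Thread.getId() is a JVM-internal identifier, not the operating system's thread id, so it cannot be handed to such a command directly. Obtaining the native id of the calling thread is one of the reasons the JNI route comes up at all.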

This screenshot shows the load on all CPUs while the tests were running with the default processor affinity. You can see frequent changes in the individual CPU loads. This is because the system scheduler dynamically distributes the workload among the CPUs.

This one, in turn, shows how the load was distributed when the worker threads were pinned to their dedicated CPUs with fixed processor affinity.

And to illustrate the difference in performance, here is the number of operations per second achieved with each approach:

The results not only show a significant throughput benefit from fixed processor affinity, but also exhibit virtually real-time characteristics: the numbers are extremely stable and predictable, which is what real-time systems require.
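
One way to put a number on "stable and predictable" is to look at the spread of the per-run throughput rather than only its average. The following is a minimal sketch, assuming each run is summarised as an operations-per-second figure derived from an iteration count and an elapsed time in milliseconds. The class ThroughputStats and its methods are hypothetical helpers, not part of the Disruptor test suite.

```java
// Illustrative statistics helpers for comparing runs; not taken from the Disruptor sources.
final class ThroughputStats {

    // Throughput of one run, given its iteration count and wall-clock duration.
    static long opsPerSecond(long iterations, long elapsedMillis) {
        if (elapsedMillis <= 0) {
            throw new IllegalArgumentException("elapsedMillis must be positive");
        }
        return iterations * 1000L / elapsedMillis;
    }

    // Arithmetic mean of the per-run throughput samples.
    static double mean(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    // Population standard deviation divided by the mean: lower means more predictable runs.
    static double coefficientOfVariation(long[] samples) {
        double m = mean(samples);
        double sumOfSquares = 0;
        for (long s : samples) {
            sumOfSquares += (s - m) * (s - m);
        }
        return Math.sqrt(sumOfSquares / samples.length) / m;
    }
}
```

Feeding the samples of both configurations into coefficientOfVariation gives a single figure per configuration, and the run set with the smaller value is the more predictable one.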

Some details:
  • The test being executed was UniCast1P1CPerfTest from the Disruptor performance test suite
  • There were 60 runs with 50,000,000 iterations each
  • The CPUs were additionally occupied by handling IRQs, so reconfiguring IRQ load balancing with IRQBALANCE_BANNED_CPUS could yield slightly better results
  • The exact number of context switches can be measured using SystemTap or by examining the value of the ctxt property in /proc/stat
  • You can achieve better results by employing Linux cgroups to separate the application workload from system tasks, assigning a separate resource pool to each of the two groups
  • These results should not be considered a magic trick that speeds up your application in every possible scenario. The approach is effective only for CPU-intensive use cases
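
The ctxt counter mentioned in the list above can also be sampled from Java. The following is a minimal sketch, assuming a Linux system where /proc/stat is readable; the class ContextSwitchCounter and its method names are made up for this example. Keep in mind that ctxt counts context switches for the whole system since boot, so the difference between two samples includes everything else running on the machine.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Illustrative reader for the system-wide "ctxt" counter; Linux only.
final class ContextSwitchCounter {

    // Extracts the value of the "ctxt" line from /proc/stat-style content.
    static long parseCtxt(List<String> lines) {
        for (String line : lines) {
            if (line.startsWith("ctxt ")) {
                return Long.parseLong(line.substring(5).trim());
            }
        }
        throw new IllegalArgumentException("no ctxt line found");
    }

    // Reads the current counter value from the live /proc/stat file.
    static long readCtxt() throws IOException {
        return parseCtxt(Files.readAllLines(Paths.get("/proc/stat")));
    }

    public static void main(String[] args) throws Exception {
        long before = readCtxt();
        Thread.sleep(1000); // replace with the workload under test
        long after = readCtxt();
        System.out.println("System-wide context switches during the interval: " + (after - before));
    }
}
```

For per-process figures, the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields in /proc/<pid>/status are likely a better starting point than the system-wide counter.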
