After my last post on Java Lock Implementations,
I got a lot of good feedback about my results and micro-benchmark
design approach. As a result I now understand JVM warmup, On Stack
Replacement (OSR) and Biased Locking somewhat better than before.
Special thanks to Dave Dice from Oracle, and Cliff Click & Gil Tene from Azul, for their very useful feedback.
In the last post I concluded, based on my experiments, that biased
locking was no longer necessary on modern CPUs. While this conclusion
is understandable given the data gathered in the experiment, it was not
valid because the experiment did not take into account some JVM warm-up
behaviour that I was unaware of.
In this post I will re-run the experiment taking the feedback into
account and present some new results. I shall also expand on the
changes I've made to the test, and why it is important to consider
JVM warm-up behaviour when writing micro-benchmarks, or even very lean
Java applications that need to start up quickly.
On Stack Replacement (OSR)
Java virtual machines compile code at runtime, guided by profiling, to
achieve greater performance. Some VMs interpret the majority of code
and replace only the hot areas with compiled code, following the
80/20 rule. Other VMs compile all code with a simple compiler first,
then replace that code with more optimised versions based on profiling.
Oracle Hotspot and Azul are examples of the first type; Oracle JRockit
is an example of the second.
Oracle Hotspot counts method invocations plus loop back-branches within
a method; when the combined count exceeds 10K in server mode the method
is compiled. With normal JIT'ing the compiled code is used when the
method is next called. However, if a loop is still iterating it may
make sense to replace the method before the loop completes, especially
if it has many iterations to go. OSR is the means by which a running
method gets replaced with a compiled version part way through iterating
a loop.
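This trigger can be observed directly. In the sketch below (class name hypothetical), main() is invoked exactly once, so the invocation counter alone can never reach the compile threshold; only the loop's back-branch counter can trip it. Running with -XX:+PrintCompilation, Hotspot marks OSR compilations with a '%' in its output.

```java
public final class OsrDemo
{
    public static void main(final String[] args)
    {
        // main() is entered only once, so the back-branch counter for
        // this loop is what drives the compile decision.  Run with
        //   java -XX:+PrintCompilation OsrDemo
        // and look for the '%' marker flagging an OSR compilation.
        long sum = 0L;
        for (int i = 0; i < 50_000_000; i++)
        {
            sum += i;
        }

        System.out.println(sum); // keep the loop from being eliminated
    }
}
```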
I was under the impression that normal JIT'ing and OSR would produce
similar code. Cliff Click pointed out that it is much harder for a
runtime to optimise a loop part way through, and especially difficult
if the loops are nested. For example, it may not be possible to
eliminate bounds checking within the loop. Cliff will blog in more
detail on this shortly.
What this means is that you are likely to get better optimised code by
doing a small number of shorter warm ups than a single large one. You
can see in the code below how I do 10 shorter runs in a loop before the
main large run compared to the last article where I did a single large
warm-up run.
Biased Locking
With biased locking, a lock is biased towards the first thread to
acquire it, so that subsequent un-contended acquisitions by that thread
avoid atomic instructions; the bias must be revoked if another thread
then takes the lock. Dave Dice pointed out that Hotspot does not enable
objects for biased locking in the first few seconds (4s at present) of
JVM startup. This is because some benchmarks, and NetBeans, have a lot
of thread contention on start-up and the revocation cost is
significant. Other VMs, such as Azul's, use biased locking right from
the start; this is not an issue for them because their revocation is
cheap.
After this start-up delay, all objects in Oracle Hotspot are created
with biased locking enabled by default. The delay can be removed with
-XX:BiasedLockingStartupDelay=0.
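You can inspect the values in effect on your own JVM; on Hotspot builds that support -XX:+PrintFlagsFinal the biased-locking flags can be listed as follows (exact output format varies by JDK build):

```shell
# Print the final values of the biased-locking related flags.
# Expect UseBiasedLocking = true and, on current Hotspot builds,
# BiasedLockingStartupDelay = 4000 (milliseconds) by default.
java -XX:+PrintFlagsFinal -version | grep -i BiasedLocking
```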
This point, combined with knowing more about OSR, is important for
micro-benchmarks. It is also important to be aware of these points if
you have a lean Java application that starts in a few seconds.
The Code
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

import static java.lang.System.out;

public final class TestLocks implements Runnable
{
    public enum LockType { JVM, JUC }
    public static LockType lockType;

    public static final long WARMUP_ITERATIONS = 100L * 1000L;
    public static final long ITERATIONS = 500L * 1000L * 1000L;
    public static long counter = 0L;

    public static final Object jvmLock = new Object();
    public static final Lock jucLock = new ReentrantLock();
    private static int numThreads;

    private final long iterationLimit;
    private final CyclicBarrier barrier;

    public TestLocks(final CyclicBarrier barrier, final long iterationLimit)
    {
        this.barrier = barrier;
        this.iterationLimit = iterationLimit;
    }

    public static void main(final String[] args) throws Exception
    {
        lockType = LockType.valueOf(args[0]);
        numThreads = Integer.parseInt(args[1]);

        // A number of short warm-up runs so the lock methods get fully
        // compiled via the normal JIT path rather than via OSR.
        for (int i = 0; i < 10; i++)
        {
            runTest(numThreads, WARMUP_ITERATIONS);
            counter = 0L;
        }

        final long start = System.nanoTime();
        runTest(numThreads, ITERATIONS);
        final long duration = System.nanoTime() - start;

        out.printf("%d threads, duration %,d (ns)\n", numThreads, duration);
        out.printf("%,d ns/op\n", duration / ITERATIONS);
        out.printf("%,d ops/s\n", (ITERATIONS * 1000000000L) / duration);
        out.println("counter = " + counter);
    }

    private static void runTest(final int numThreads, final long iterationLimit)
        throws Exception
    {
        CyclicBarrier barrier = new CyclicBarrier(numThreads);
        Thread[] threads = new Thread[numThreads];

        for (int i = 0; i < threads.length; i++)
        {
            threads[i] = new Thread(new TestLocks(barrier, iterationLimit));
        }

        for (Thread t : threads)
        {
            t.start();
        }

        for (Thread t : threads)
        {
            t.join();
        }
    }

    public void run()
    {
        try
        {
            // Wait so all threads start the timed section together.
            barrier.await();
        }
        catch (Exception e)
        {
            // don't care
        }

        switch (lockType)
        {
            case JVM: jvmLockInc(); break;
            case JUC: jucLockInc(); break;
        }
    }

    private void jvmLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            synchronized (jvmLock)
            {
                ++counter;
            }
        }
    }

    private void jucLockInc()
    {
        long count = iterationLimit / numThreads;
        while (0 != count--)
        {
            jucLock.lock();
            try
            {
                ++counter;
            }
            finally
            {
                jucLock.unlock();
            }
        }
    }
}
Script to run tests:
set -x
for i in {1..8}
do
java -server -XX:-UseBiasedLocking TestLocks JVM $i
done
for i in {1..8}
do
java -server -XX:+UseBiasedLocking -XX:BiasedLockingStartupDelay=0 TestLocks JVM $i
done
for i in {1..8}
do
java -server TestLocks JUC $i
done
Results
The tests are carried out with 64-bit Linux (Fedora Core 15) and Oracle JDK 1.6.0_29.
Nehalem 2.8GHz - Ops/Sec

Threads | -UseBiasedLocking | +UseBiasedLocking | ReentrantLock
      1 |        53,283,461 |                   |
      2 |        18,519,295 |                   |
      3 |        13,349,605 |                   |
      4 |         8,120,172 |                   |
      5 |         4,725,114 |                   |
      6 |         5,133,706 |                   |
      7 |         5,473,652 |                   |
      8 |         5,514,056 |                   |

Sandybridge 2.0GHz - Ops/Sec

Threads | -UseBiasedLocking | +UseBiasedLocking | ReentrantLock
      1 |                   |                   |
      2 |                   |                   |
      3 |                   |                   |
      4 |                   |                   |
      5 |                   |                   |
      6 |                   |                   |
      7 |                   |                   |
      8 |                   |                   |
Observations
- Biased locking has a huge benefit in the un-contended single threaded case.
- When un-contended and not revoked, biased locking adds only 4-5
cycles of cost. This is the cost, assuming a cache hit on the lock
structures, on top of the work protected in the critical section.
- -XX:BiasedLockingStartupDelay=0 needs to be set for lean applications and micro-benchmarks.
- Avoiding OSR does not make a material difference to this set of test
results. This is likely because the loop is so simple, or because
other costs dominate.
- For the current implementations, ReentrantLock scales better than
synchronised locks under contention, except in the case of 2 contending
threads.
Conclusion
My tests in the last
post
are invalid for measuring an un-contended biased lock, because the
lock was never actually biased. If you are designing code following the
single writer principle,
and therefore have un-contended locks when using 3rd-party libraries,
then having biased locking enabled is a significant performance boost.
From http://mechanical-sympathy.blogspot.com/2011/11/biased-locking-osr-and-benchmarking-fun.html