Microbenchmarking Comes to Java 9
Microbenchmarking measures the performance of a small code fragment and is supported out of the box in Java 9.
I have not written an article here for a few months, and this break will continue, with this one exception. I plan to return to writing around March next year. The reason will be explained towards the end of this article. Wait! Not exactly at the end, because then you could just scroll down. It is somewhere towards the end of the article. Just read on!
Three years ago, I wrote about how the Java compiler optimizes the code it executes. Or rather, how javac does not do that and how, at the same time, the JIT compiler does. I made some benchmarks, some really bad ones, as Esko Luontola pointed out. These benchmarks were meant to show that the JIT compiler optimized the code even before it could gather significant statistical data about the execution of the code.
The article was written in January 2013, and the very first source code upload of JMH (Java Microbenchmark Harness) happened two months later. Since that time, the harness has been developed a lot, and next year it will become part of the next release of Java. I have a contract to write a book about Java 9, and Chapter 5 should cover, among other things, the microbenchmarking possibilities of Java 9. That is a good reason to start playing around with JMH.
Before getting into the details of how to use JMH and what it is good for, let's talk a bit about microbenchmarking.
Microbenchmarking
Microbenchmarking is measuring the performance of some small code fragment. It is rarely needed, and before starting a microbenchmark for a real commercial environment, we have to think twice. Remember that premature optimization is the root of all evil. Some developers have created a generalization of this statement, saying that optimization itself is the root of all evil, which may be true, especially when it comes to microbenchmarking.
Microbenchmarking is an alluring tool for optimizing something small without knowing whether that code is worth optimizing. When we have a huge application with several modules running on several servers, how can we be sure that improving some special part of it drastically improves the overall performance? Will it pay off in increased revenue that covers the cost we sank into performance testing and development? I am reluctant to say that you cannot know that, but only because such a statement would be too broad. Statistically, you can be almost sure that such an optimization, microbenchmarking included, will not pay off most of the time. It will hurt. You just may not notice it, or may even enjoy it, but that is a totally different story.
When should we use microbenchmarking? I can see three scenarios:
- You want to write an article about microbenchmarking.
- You have identified the code segment that eats most of the resources in your application, and the improvement can be tested with microbenchmarks.
- You cannot identify the code segment that eats most of the resources in an application, but you suspect it.
The first area is a joke. Or not: you can play around with microbenchmarking to understand how it works, and then to understand how Java code works: what runs fast and what does not. Last year, Takipi posted an article in which they tried to measure the speed of lambdas. Read it; it is a very good article, and it clearly demonstrates the major advantage of blogging over writing for print: readers commented, pointed out errors, and those errors were corrected in the article.
The second is the usual case. Okay, before a reader corrects me: the second should have been the usual case. The third is when you develop a library and you simply do not know all the applications that will use it. In that case, you try to optimize the part that you think is the most crucial for most of the imagined, suspected applications. Even then, it is better to take some sample applications.
Pitfalls
What are the pitfalls of microbenchmarking? Benchmarking is done as an experiment. The first programs I wrote were TI calculator code, and I could just count the number of steps the program took to factor two large (10-digit, at the time) prime numbers. Even then, I used an old Russian stopwatch to measure the time, being too lazy to calculate the number of steps. Experimentation and measurement were easier.
Today, you cannot count the number of steps the CPU takes. There are so many small factors that may change the performance of an application, factors that are out of the programmer's control, that it is impossible to calculate the steps.
What is the biggest problem with measurements? We are interested in something, say X, and we usually cannot measure that. So we measure Y instead and hope that the values of Y and X are coupled. We want to measure the length of the room, but instead we measure the time it takes for a laser beam to travel from one end to the other. In this case, the length X and the time Y are strongly coupled. Much of the time, X and Y only correlate more or less. And many times, when people measure, X and Y have no relation to each other at all. Still, people bet their money, and more, on decisions backed by such measurements. Think of political elections as an example.
Microbenchmarking is no different. It is hardly ever done well. If you are interested in the details and possible pitfalls, Aleksey Shipilev has a good one-hour video. The first question is how to measure execution time. A small code fragment runs for a short time, and System.currentTimeMillis() may return the same value when the measurement starts and when it ends, because we are still in the same millisecond. Even if the execution takes 10ms, the error of the measurement is still at least 10%, purely because of the quantization of the time as we measure it. Luckily, there is System.nanoTime(). We happy, Vincent?
Not really. nanoTime() returns "the current value of the running Java Virtual Machine's high-resolution time source, in nanoseconds," as the documentation says. What is "current"? When the invocation was made? Or when the value was returned? Or sometime in between? Select the one you want, and you may still fail. That current value could have been the same during the last 1000ns, which is something all Java implementations should guarantee.
And there is another caveat before using nanoTime(), from the documentation: "Differences in successive calls that span greater than approximately 292 years (2^63 nanoseconds) will not correctly compute elapsed time due to numerical overflow." 292 years? Really?
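To see the quantization problem in action, here is a minimal sketch. The class name and the loop workload are mine, purely illustrative: timing a small piece of work with both clocks will frequently show zero elapsed milliseconds, while the nanosecond clock still reports something measurable.

public class TimerResolution {
    public static void main(String[] args) {
        long startMillis = System.currentTimeMillis();
        long startNanos = System.nanoTime();
        double sum = 0;
        for (int i = 0; i < 10_000; i++) { // a "small" piece of work
            sum += Math.sqrt(i);
        }
        long elapsedMillis = System.currentTimeMillis() - startMillis;
        long elapsedNanos = System.nanoTime() - startNanos;
        // elapsedMillis is frequently 0: start and end fall into the same
        // millisecond, while elapsedNanos still reports a usable value
        System.out.println("millis: " + elapsedMillis
                + ", nanos: " + elapsedNanos + " (sum=" + sum + ")");
    }
}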
There are other problems as well. When you start up Java code, the first few thousand executions of the code will be interpreted or executed without run-time optimization. JIT has the advantage over compilers of statically compiled languages like Swift, C, C++, or Golang that it can gather run-time information from the execution of the code. When it sees that the compilation it performed last time could have been better based on recent run-time statistics, it compiles the code again. The same may be true for the garbage collector, which also tries to use statistics to tune its operational parameters. Because of this, well-written server applications gain a bit of performance over time. They start up a bit slower, and then they just become faster. If you restart the server, the whole iteration starts again.
If you do microbenchmarks, you should care about this behavior. Do you want to measure the performance of the application during warm-up time, or how it really executes in operation?
The solution is a microbenchmarking harness that tries to consider all these caveats. The one that will be included in Java 9 is JMH.
What is JMH?
“JMH is a Java harness for building, running, and analyzing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.” (From the official site)
You can run JMH as a separate project, independent from the actual project you measure, or you can just store the measurement code in a separate directory. The harness will compile against the production class files and will execute the benchmark. The easiest way to do this, as I see it, is to use the Gradle plugin to execute JMH. You store the benchmark code in a directory called jmh (at the same level as main and test) and create a main class that can start the benchmark.
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.io.IOException;

public class MicroBenchmark {

    public static void main(String... args) throws IOException, RunnerException {
        // select the benchmarks to run by class name and fork a single JVM
        Options opt = new OptionsBuilder()
                .include(MicroBenchmark.class.getSimpleName())
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}
There is a nice builder interface for the configuration and a Runner class that can execute the benchmarks.
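Returning to the warm-up question raised earlier: JMH lets you decide explicitly how many un-measured warm-up iterations run before the measured ones. Here is a sketch of the same builder with warm-up configured; the iteration counts are illustrative, not recommendations.

Options opt = new OptionsBuilder()
        .include(MicroBenchmark.class.getSimpleName())
        .forks(1)
        .warmupIterations(5)       // executed but not measured, so JIT can optimize first
        .measurementIterations(20) // only these contribute to the score
        .build();
new Runner(opt).run();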
Playing a Bit
In the book Java 9 Programming By Example, one of the examples is the Mastermind game. Chapter 5 is all about solving the game in parallel to speed up the guessing. (If you do not know the game, please read about it on Wikipedia. I do not want to explain it here, but you will need to know it to understand the rest of the article.)
The normal guessing is simple. There is a hidden secret: four pegs of four different colors selected out of six colors. When we guess, we take the possible color variations one after the other and ask the question, looking at the table: if this selection were the secret, would all the answers we have received so far be correct? In other words: can this variation be the hidden secret, or is there some contradiction with the answers to the previous guesses? If this guess can be the secret, then we give it a try, putting the pegs on the table. The answer may be 4/0 (alleluia) or something else. In the latter case, we go on searching. This way, the six-color, four-column table can be solved in five steps.
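A sketch may make that consistency check clearer. The types and names here (Guess, Row, couldBeSecret) are my own illustration, not the code from the book; the partial-match calculation relies on the fact that all colors within one guess are different in this variant of the game.

import java.util.List;

public class ConsistencyCheck {

    // "If this selection were the secret, would all answers so far be correct?"
    static boolean couldBeSecret(Guess candidate, List<Row> table) {
        for (Row row : table) {
            // replaying an earlier guess against the candidate must reproduce
            // exactly the answer that is already on the table
            if (row.guess.fullMatches(candidate) != row.full
                    || row.guess.partialMatches(candidate) != row.partial) {
                return false; // contradiction: this candidate cannot be the secret
            }
        }
        return true; // no contradiction: worth putting these pegs on the table
    }

    static class Guess {
        final int[] colors;

        Guess(int... colors) { this.colors = colors; }

        // pegs with the right color in the right position
        int fullMatches(Guess other) {
            int n = 0;
            for (int i = 0; i < colors.length; i++) {
                if (colors[i] == other.colors[i]) n++;
            }
            return n;
        }

        // right color, wrong position (valid because all colors are different)
        int partialMatches(Guess other) {
            int common = 0;
            for (int c : colors) {
                for (int o : other.colors) {
                    if (c == o) { common++; break; }
                }
            }
            return common - fullMatches(other);
        }
    }

    static class Row {
        final Guess guess;
        final int full, partial;

        Row(Guess guess, int full, int partial) {
            this.guess = guess;
            this.full = full;
            this.partial = partial;
        }
    }
}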
For the sake of simplicity and visualization, we name the colors with numbers, like 0123456789 (we have ten colors in the JMH benchmark, since six colors are just not enough), and we use six pegs. The secret is 987654, because this is the last guess as we go from 123456, 123457, and so on.
When I first coded this game in August 1983 on a Swedish school computer (an ABC80) in BASIC, each guess took 20 to 30 seconds on the Z80 processor running at 4MHz, with six colors and four positions. Today, my MacBook Pro can play the whole game approximately 7 times a second using a single thread, with ten colors and six pegs. But that is not enough when I have 4 cores in the machine supporting 8 parallel threads.
To speed up the execution, I split the guess space into equal intervals, and I start separate guessers, each spitting guesses into a blocking queue. The main thread reads from the queue and puts the guesses on the table as they come. Some post-processing may be needed in case a thread creates a guess that becomes outdated by the time the main thread tries to play it, but we still expect a huge increase in speed.
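Reduced to its skeleton, the structure looks something like the following. This is a self-contained sketch, not the code from the book: the guessers are simulated by threads pushing numbers from their own slice of a numeric search space into a bounded queue, which is where the queueSize parameter of the benchmark exerts its back-pressure.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class ParallelGuessSketch {
    public static void main(String[] args) throws InterruptedException {
        final int nrThreads = 4;
        final int queueSize = 10;        // the bound the benchmark varies
        final int spaceSize = 1_000_000; // stand-in for the guess space
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(queueSize);
        int slice = spaceSize / nrThreads;
        for (int t = 0; t < nrThreads; t++) {
            final int start = t * slice;
            final int end = start + slice;
            Thread guesser = new Thread(() -> {
                try {
                    for (int guess = start; guess < end; guess++) {
                        queue.put(guess); // blocks while the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            guesser.setDaemon(true); // let the JVM exit while producers still run
            guesser.start();
        }
        // main thread: consume until "solved" (here simply a fixed count)
        for (int taken = 0; taken < 100; taken++) {
            int guess = queue.take();
            // the real game checks the guess is still consistent with the
            // table, which may have grown since the guesser produced it
            System.out.println("playing guess " + guess);
        }
    }
}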
Does it really speed up the guessing? That is what JMH is here for.
To run the benchmark, we need some code that actually executes the game:
@State(Scope.Benchmark)
public static class ThreadsAndQueueSizes {
    @Param(value = {"1", "4", "8", "16", "32"})
    String nrThreads;
    @Param(value = {"1", "10", "100", "1000000"})
    String queueSize;
}

@Benchmark
@Fork(1)
public void playParallel(ThreadsAndQueueSizes t3qs) throws InterruptedException {
    int nrThreads = Integer.valueOf(t3qs.nrThreads);
    int queueSize = Integer.valueOf(t3qs.queueSize);
    new ParallelGamePlayer(nrThreads, queueSize).play();
}

@Benchmark
@Fork(1)
public void playSimple() {
    new SimpleGamePlayer().play();
}
The JMH framework will execute the code several times, measuring the run time for each combination of the parameters. The method playParallel is executed with 1, 4, 8, 16, and 32 threads, each with a maximum queue length of 1, 10, 100, and one million. When the queue is full, the individual guessers stop guessing until the main thread pulls at least one guess off the queue.
I suspected that if we have many threads and we do not limit the length of the queue, the worker threads will fill the queue with initial guesses that are based only on an empty table and thus do not deliver much value. What do we see after almost 15 minutes of execution?
Benchmark (nrThreads) (queueSize) Mode Cnt Score Error Units
MicroBenchmark.playParallel 1 1 thrpt 20 6.871 ± 0.720 ops/s
MicroBenchmark.playParallel 1 10 thrpt 20 7.481 ± 0.463 ops/s
MicroBenchmark.playParallel 1 100 thrpt 20 7.491 ± 0.577 ops/s
MicroBenchmark.playParallel 1 1000000 thrpt 20 7.667 ± 0.110 ops/s
MicroBenchmark.playParallel 4 1 thrpt 20 13.786 ± 0.260 ops/s
MicroBenchmark.playParallel 4 10 thrpt 20 13.407 ± 0.517 ops/s
MicroBenchmark.playParallel 4 100 thrpt 20 13.251 ± 0.296 ops/s
MicroBenchmark.playParallel 4 1000000 thrpt 20 11.829 ± 0.232 ops/s
MicroBenchmark.playParallel 8 1 thrpt 20 14.030 ± 0.252 ops/s
MicroBenchmark.playParallel 8 10 thrpt 20 13.565 ± 0.345 ops/s
MicroBenchmark.playParallel 8 100 thrpt 20 12.944 ± 0.265 ops/s
MicroBenchmark.playParallel 8 1000000 thrpt 20 10.870 ± 0.388 ops/s
MicroBenchmark.playParallel 16 1 thrpt 20 16.698 ± 0.364 ops/s
MicroBenchmark.playParallel 16 10 thrpt 20 16.726 ± 0.288 ops/s
MicroBenchmark.playParallel 16 100 thrpt 20 16.662 ± 0.202 ops/s
MicroBenchmark.playParallel 16 1000000 thrpt 20 10.139 ± 0.783 ops/s
MicroBenchmark.playParallel 32 1 thrpt 20 16.109 ± 0.472 ops/s
MicroBenchmark.playParallel 32 10 thrpt 20 16.598 ± 0.415 ops/s
MicroBenchmark.playParallel 32 100 thrpt 20 15.883 ± 0.454 ops/s
MicroBenchmark.playParallel 32 1000000 thrpt 20 6.103 ± 0.867 ops/s
MicroBenchmark.playSimple N/A N/A thrpt 20 6.354 ± 0.200 ops/s
(For the score, the greater the value, the better.) It shows that we get the best performance if we start 16 threads and somewhat limit the length of the queue. Running the parallel algorithm on one thread (a master and a worker) is somewhat slower than the single-thread implementation. That seems to be okay: we have the overhead of starting an extra thread and of the communication between the threads. The maximum performance is around 16 threads. Since we have 8 hardware threads in the machine, we expected a peak around 8. Why is that?
What happens if we replace the standard secret 987654 (which is boring after a while even for a CPU) with something random?
Benchmark (nrThreads) (queueSize) Mode Cnt Score Error Units
MicroBenchmark.playParallel 1 1 thrpt 20 12.141 ± 1.385 ops/s
MicroBenchmark.playParallel 1 10 thrpt 20 12.522 ± 1.496 ops/s
MicroBenchmark.playParallel 1 100 thrpt 20 12.516 ± 1.712 ops/s
MicroBenchmark.playParallel 1 1000000 thrpt 20 11.930 ± 1.188 ops/s
MicroBenchmark.playParallel 4 1 thrpt 20 19.412 ± 0.877 ops/s
MicroBenchmark.playParallel 4 10 thrpt 20 17.989 ± 1.248 ops/s
MicroBenchmark.playParallel 4 100 thrpt 20 16.826 ± 1.703 ops/s
MicroBenchmark.playParallel 4 1000000 thrpt 20 15.814 ± 0.697 ops/s
MicroBenchmark.playParallel 8 1 thrpt 20 19.733 ± 0.687 ops/s
MicroBenchmark.playParallel 8 10 thrpt 20 19.356 ± 1.004 ops/s
MicroBenchmark.playParallel 8 100 thrpt 20 19.571 ± 0.542 ops/s
MicroBenchmark.playParallel 8 1000000 thrpt 20 12.640 ± 0.694 ops/s
MicroBenchmark.playParallel 16 1 thrpt 20 16.527 ± 0.372 ops/s
MicroBenchmark.playParallel 16 10 thrpt 20 19.021 ± 0.475 ops/s
MicroBenchmark.playParallel 16 100 thrpt 20 18.465 ± 0.504 ops/s
MicroBenchmark.playParallel 16 1000000 thrpt 20 10.220 ± 1.043 ops/s
MicroBenchmark.playParallel 32 1 thrpt 20 17.816 ± 0.468 ops/s
MicroBenchmark.playParallel 32 10 thrpt 20 17.555 ± 0.465 ops/s
MicroBenchmark.playParallel 32 100 thrpt 20 17.236 ± 0.605 ops/s
MicroBenchmark.playParallel 32 1000000 thrpt 20 6.861 ± 1.017 ops/s
The performance increases, since we do not need to go through all the possible variations. In the case of one thread, the increase is almost 100%. With multiple threads, the gain is not that much. And note that this does not speed up the code itself; it only measures it more realistically, using statistically random secrets. What we can also see is that the gain of 16 threads over 8 threads is no longer significant. It matters only when we select a secret that is towards the end of the variations. Why? From what you have seen here and from the source code available on GitHub, you can work out the answer yourself.
Summary
The book Java 9 Programming By Example is planned to be released in February 2017. But since we live in an open-source world, you can get access, controlled by the publisher, to the 1.x.x-SNAPSHOT versions. Now that I have told you about the preliminary GitHub repository I use while developing the code for the book, you can also preorder the eBook and give feedback to help me create a better book!