Java is the most popular application development platform on the planet, but that doesn’t mean it's always simple to understand, certainly not from the point of view of performance. Unlike most other platforms, Java runs on the Java Virtual Machine (JVM), which introduces an additional layer between your application code and the physical machine on which it runs. This virtualization has many advantages, but it does make performance analysis more involved.
The JVM does two things in particular that make it harder to identify the source of performance problems.
- It manages memory for your application. In languages like C, you need to allocate memory for data explicitly and then (in C++ as well) remember to explicitly de-allocate that memory when you’ve finished with it. In Java, when you instantiate an object, the JVM allocates the required memory on the heap. When you have no more references to the object, it becomes eligible to be reclaimed by the (appropriately named) garbage collector (GC).
- Java’s fabled “write once, run anywhere” mantra means that a Java class file does not contain native instructions. Instead, it contains a semi-compiled version of your code (similar in many ways to p-code if you’ve studied compiler design). These bytecodes are converted at runtime to the appropriate native instructions for whatever platform is being used. Interpreting bytecodes adds overhead compared with simply executing native instructions directly from an executable file.
When Java was first launched, the algorithms used internally in the JVM to support these two features were pretty rudimentary. At that time, therefore, it was fair to say that Java applications did not perform as well as the equivalent C and C++ versions. Java picked up a reputation for being slow.
Clearly, in the last twenty-one years, the engineers at Sun, Oracle, and a host of other companies involved in Java have been working industriously to improve JVM performance. The internals of the JVM are completely different today. There are a variety of different GC algorithms available from JVM providers, most (although not all) of which use a heap divided into generations to take advantage of the fact that most applications conform to the weak generational hypothesis (most objects are only used for a very short amount of time). With pointer bumping, Java object space is allocated in something like six machine instructions, far faster than a call to malloc in C.
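To make the pointer-bumping idea concrete, here is a toy sketch. It is not how any real JVM is implemented; the class name `BumpAllocator`, the byte-array "region" and the 8-byte alignment are all assumptions chosen for illustration. The point is only that allocation reduces to a bounds check plus an addition.

```java
/** Toy bump-pointer allocator: illustrative only, not real JVM internals. */
final class BumpAllocator {
    private final byte[] region; // stands in for a contiguous allocation area
    private int top = 0;         // offset of the next free byte

    BumpAllocator(int capacityBytes) {
        region = new byte[capacityBytes];
    }

    /** Returns the start offset of the new block, or -1 if the region is full. */
    int allocate(int sizeBytes) {
        if (sizeBytes <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        // Round up to a multiple of 8 (assumed alignment); long avoids overflow.
        long aligned = ((long) sizeBytes + 7) & ~7L;
        if (aligned > region.length - top) {
            return -1; // a real runtime would fall back to a slower path here
        }
        int start = top;
        top += (int) aligned; // the "bump": allocation is just an addition
        return start;
    }

    public static void main(String[] args) {
        BumpAllocator area = new BumpAllocator(64);
        System.out.println(area.allocate(10)); // 0
        System.out.println(area.allocate(10)); // 16
        System.out.println(area.allocate(32)); // 32
        System.out.println(area.allocate(1));  // -1, region exhausted
    }
}
```

There is no free list to search and no per-object bookkeeping at allocation time, which is why the fast path can be so short.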
The use of concurrent and parallel collectors, especially in the case of the Zing C4 collector, makes GC far less of an overhead than it used to be. For bytecodes, the use of an adaptive compiler with different internal algorithms (often referred to as the C1 and C2 compilers) means that after a period of warm-up, Java applications now perform very close to (and in certain cases even better than) compiled native code.
The JVM can be more aggressive in its optimizations through its knowledge of exactly which classes are loaded at any point in time. Even the problem of application warmup can be alleviated with technologies like Azul’s Zing ReadyNow!, which stores a profile of an application during execution. At startup, the profile can be used to substantially reduce the time required to analyze and compile frequently used sections of code.
The problem is that, when looking at Java performance, some people still believe Java works the way it did twenty years ago. One of my colleagues, when I was at Sun, had an excellent way to illustrate this. When talking to people with Java performance problems, his approach was something like this:
If you have a C or C++ application that runs slowly, your first thought is that there is some problem with your code. You need to analyze your code and fix it. If you have a Java application that runs slowly, why would your first assumption be that the JVM is running slowly and causing your problems?
His statement is certainly true, but it doesn’t mean that in all cases the poor performance of a Java application comes down to poorly written code. The sheer volume of material written about tuning the JVM is a testament to this.
The question you should start with is, “Is my poor performance caused by my application code or the JVM?” Of course, we are now faced with a tough question to answer. Or are we?
Now, when we talk about performance in this context, what we’re going to look at is latency. Latency is how long it takes for a system to respond to a request, and many different factors can contribute to it.
To help separate out latency caused by our application from latency caused by the underlying system (that’s everything from the JVM down through the OS to the hardware), Azul created jHiccup. As I wrote in an earlier blog, the idea of jHiccup is to run Java code alongside your application so that it experiences the same effects your application does. It doesn’t require any modifications to your code, as it simply runs in a separate thread alongside your code. There’s no interaction, so there's no adverse performance impact. To minimize the impact of jHiccup itself, it spends most of its time asleep. When it does wake up, all it needs to do is record the difference in time (measured using nanosecond resolution) between when it wakes up and when it thought it should wake up. There is a little more work involved, but that is the crux of it.
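The core measurement described above can be sketched in a few lines of Java. This is a minimal illustration of the idea, not jHiccup's actual code: the class name `HiccupSketch`, the `measure` method and the 1 ms interval are assumptions, and the results are kept in a plain list rather than a histogram.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Minimal sketch of a hiccup meter; not the real jHiccup implementation. */
final class HiccupSketch {

    /** Sleeps repeatedly and records how late each wake-up was, in nanoseconds. */
    static List<Long> measure(int samples, long intervalMillis) {
        long intervalNanos = TimeUnit.MILLISECONDS.toNanos(intervalMillis);
        List<Long> hiccups = new ArrayList<>();
        for (int i = 0; i < samples; i++) {
            long before = System.nanoTime();
            try {
                Thread.sleep(intervalMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the flag, stop sampling
                break;
            }
            // Time actually spent asleep minus the time we asked for.
            long overshoot = System.nanoTime() - before - intervalNanos;
            hiccups.add(Math.max(0L, overshoot));
        }
        return hiccups;
    }

    public static void main(String[] args) throws InterruptedException {
        // Run the meter on its own thread, alongside the application's threads.
        Thread meter = new Thread(() -> {
            List<Long> hiccups = measure(1000, 1);
            long worst = 0L;
            for (long h : hiccups) {
                worst = Math.max(worst, h);
            }
            System.out.printf("samples=%d worst hiccup=%.3f ms%n",
                    hiccups.size(), worst / 1_000_000.0);
        }, "hiccup-sketch");
        meter.start();
        // The application's own work would run here on other threads.
        meter.join();
    }
}
```

Because the measuring thread does nothing but sleep and read the clock, any overshoot it records comes from the platform (JVM, OS, hardware) rather than from application logic.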
The raw results from jHiccup are a good start; now, we have a way to measure the latency of the platform and separate it from the latency of the application (which is much easier to measure using whatever benchmarking tools you use). However, what we also need is a clear way to understand the results rather than having to pore over a large set of numeric values to figure out what’s going on. Azul has also developed a graphical tool that will take the output of jHiccup (which generates high-dynamic-range histogram files) and produce easy-to-interpret graphs.
The tool is called HistogramLogAnalyzer, and its source is available on GitHub. HistogramLogAnalyzer takes the output of jHiccup and generates a simple pair of graphs, an example of which is shown below:
The top graph shows how much the jHiccup thread varied from its expected wake-up time over the period of the data collection. This graph can be useful for identifying significant spikes that may indicate something specific happening on the platform (garbage collection, I/O blocking, etc.), as well as helping you get a feel for the overall platform performance when running your application. In this example, there is quite a lot of variation, but looking at the scale of the graph, this is only in the region of 1 to 4 ms, which would not worry most people.
The lower graph shows the same results, but plotted in terms of percentiles. This graph can be very useful for establishing whether the platform can meet the requirements of an SLA that you have for your application. The data from this graph can then be used to help tune both the platform and the application code.
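To show how a percentile view relates to an SLA check, here is a small sketch. It uses a sorted array and the nearest-rank method rather than the high-dynamic-range histograms that jHiccup produces, and the class name, method names and sample values are made up for illustration.

```java
import java.util.Arrays;

/** Nearest-rank percentiles over raw samples; a stand-in for a real histogram. */
final class PercentileSketch {

    /** Smallest sample such that at least p percent of samples are <= it. */
    static long percentile(long[] samples, double p) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0.0 || p > 100.0) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[Math.max(rank, 1) - 1];
    }

    /** True if the p-th percentile is within the given limit. */
    static boolean meetsSla(long[] samples, double p, long limit) {
        return percentile(samples, p) <= limit;
    }

    public static void main(String[] args) {
        // Hypothetical hiccup samples in microseconds.
        long[] micros = {200, 300, 250, 4000, 220, 210, 260, 240, 230, 1500};
        System.out.println(percentile(micros, 50));       // 240
        System.out.println(percentile(micros, 90));       // 1500
        System.out.println(percentile(micros, 99));       // 4000
        System.out.println(meetsSla(micros, 90, 2000));   // true
        System.out.println(meetsSla(micros, 99, 2000));   // false
    }
}
```

With these made-up numbers, a requirement of "90% of hiccups under 2 ms" is met, while "99% under 2 ms" is not, which is exactly the kind of question the percentile graph answers at a glance.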
Timing your Java platform hiccups can be a very useful tool when trying to analyze and correct latency performance problems. Why not give it a go?