But before jumping to the case itself, some background about the JVM internals being covered. All following is relevant for Oracle Hotspot 7. Other vendors or older releases of the Hotspot JVM’s most likely ship with different defaults.
JVM default options
First stop: JVM tries to determine whether it is running on a server on a client environment. It does it via looking into the architecture and OS combination. In simple summary:
|32-bit SPARC||2+ cores & > 2GB RAM||Solaris||Server|
|32-bit SPARC||1 core or < 2GB RAM||Solaris||Client|
|i568||2+ cores & > 2GB RAM||Linux or Solaris||Server|
|i568||1 core or < 2GB RAM||Linux or Solaris||Client|
As an example – if you are running on an Amazone EC2 m1.medium instance on 32-bit Linux you would be considered running on a client machine by default.
This is important because JVM optimizes completely differently on client and on server – on client machines it tries to reduce startup time and skips some optimizations during startup. On server environments some startup time is sacrificed to achieve higher throughput later.
Second set of defaults: Heap sizing. If your environment is considered to be a Server determined according to the previous guidelines, your initial heap allocated will be 1/64 of the memory available on the machine. On 4G machine, it would mean that your initial heap size will be 64MB. If running on extremely low memory conditions (<1GB) it can be smaller, but in this case I would seriously doubt you are doing anything reasonable. Have not seen a server in this millenia with less than gig of memory. And if you have, I’ll remind you that a GB of DDR costs less than $20 nowadays …
But this will be the initial heap size. The maximum heap size will be the smallest of either ¼ of your total memory available or 1GB. So in our 1.7GB Amazon EC2 m1.small instance the maximum heap size available for the JVM would be approximately 435MB.
Next along the line: default garbage collector used. If you are considered to be running on a client JVM, the default applied by the JVM would be Serial GC (-XX:+UseSerialGC). On server-class machines (again, see the first section) the default would be Parallel GC (-XX:+UseParallelGC).
There is a lot more going on on with the defaults, such as PermGen sizing, different generation tweaks, GC pausing limits, etc. But in order to keep the size of the post under control, lets just stick with the aforementioned configurations. For the curious ones – you can read further about the defaults from the following materials:
Now lets see how our case study behaves. And whether we should trust the JVM with the decisions or jump in ourselves.
Our application at hand was an issue tracker. Namely JIRA. Which is a web application with a relational database in the back-end. Deployed on Tomcat. Behaving badly in one of our client environments. And not because of any leaks but due to the different configuration issues in deployment. This misbehaving configuration resulted in significant losses in both throughput and latency due to the extremely long-running GC pauses. We managed to help out the customer, but for privacy’s sake we are not going to cover the exact details here. But the case was good, so we went ahead and downloaded the JIRA by ourselves to demonstrate some of the concepts we discovered from this real-world case study.
We carefully unboxed our newly acquired JIRA and installed it on a 64-bit Linux Amazon EC2 m1.medium instance. And ran the bundled tests. Without changing anything in the defaults. Which were set by the Atlassian team to -Xms256m -Xmx768m -XX:MaxPermSize=256m
During each run we have collected GC logs using -XX:+PrintGCTimeStamps -Xloggc:/tmp/gc.log -XX:+PrintGCDetails and analyzed this statistics with the help of GCViewer.
The results were not too bad actually. We ran the tests for an hour, and out of this we lost just 151 seconds to the garbage collection pauses. Or 4.2% of the total runtime. And on the single worst-case gc pause was 2 seconds. So GC pauses were affecting both throughput and latency of this particular application. But not too much. But enough to serve as the baseline for this case study – in our real-world customer the GC pauses were spanning up to 25 seconds.
Digging into the GC logs surfaced an immediate problem. Most of the Full GC’s run were caused by the PermGen size expanding over time. Logs demonstrated that in total around 155MB of PermGen were used during tests. So we have increased the initial size of the PermGen to a bit more than actually used by adding -XX:PermSize=170m to the startup scripts. This decreased the total accumulated pauses from 151 seconds to 134 seconds. And decreased the maximal latency from 2,000ms to 1,300ms.
Then we discovered something completely unexpected. The GC used by our JVM was in fact Serial GC. Which, if you have carefully followed our post should not be the case – 64-bit Linux machines should always be considered server-class machines and the GC used should be Parallel GC. But apparently this is not the case. Our best guess at this point was that – even though the JVM launches in server mode, it still selects the GC used based on the memory and cores available. And as this m1.medium instance has 3.75GB memory but only one virtual core, the GC chosen is still serial. But if any of you guys have more insights on the topic, we are eager to find out more.
Nevertheless we changed the algorithm to -XX:+UseParallelGC and re-ran the tests. Results – accumulated pauses decreased further to 92 seconds. Worst-case latency was also reduced to 1,200ms.
For the final test we attempted to try out Concurrent Mark and Sweep mode. But this algorithm failed completely on us – pauses were increased to 300 seconds and latency to more than 5,000ms. Here we gave up and decided to call it a night.
So just playing with two JVM startup parameters and spending few hours on configuration and interpretation of the results we had effectively increased the throughput and latency of the application. The absolute numbers might not sound too impressive – GC pauses reducing from 151 seconds to 92 seconds and worst-case latency from 2,000ms to 1,200ms, but lets bear in mind this was just a small test with only two configuration settings. And looking from the % point of view – hey, we have both improved the GC pause-related throughput and reduced the latency by 40%!
In any case – we now have one more case to show you that – performance tuning is all about setting the goals, measuring, tuning and measuring again. And maybe you are just as lucky as we and can make your users 40% happier by just changing two configuration options …