[This article by Peter Lawrey comes to you from the DZone Guide to Performance & Monitoring -- 2015 Edition.]
When you want your application to run faster, you might start with some CPU profiling. However, when I'm looking for quick wins in optimization, the memory profiler is what I reach for first.
Allocating Memory Is Cheap
Allocating memory has never been cheaper. You can buy 16 GB of memory for less than $200, and machines with hundreds of GB are affordable. The allocation operation itself is also cheaper than it used to be, and it is multi-threaded, so it scales reasonably well. However, memory allocation is not free.
Your CPU cache is a precious resource, especially if you are trying to use multiple threads. While you can buy 16 GB of main memory easily, you might only have 2 MB of cache per logical CPU. If you want these CPUs to run independently, you want to spend as much time as possible within the 256 KB L2 cache (see table, top right).
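To see why staying within cache matters, here is a minimal sketch (the class and method names are my own, not from the article): the same summation done with cache-friendly sequential access and with a large stride that defeats the cache. On most hardware the sequential pass is noticeably faster, even though both do identical arithmetic.

```java
// Illustrative only: timings vary by CPU, JVM, and cache sizes.
public class CacheLocalityDemo {
    // Sequential access: the hardware prefetcher keeps the data in cache.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++)
            sum += data[i];
        return sum;
    }

    // Strided access: each read lands on a different cache line far away,
    // so most reads miss the L1/L2 caches. Every element is still visited once.
    static long sumStrided(int[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++)
            for (int i = start; i < data.length; i += stride)
                sum += data[i];
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[64 * 1024 * 1024]; // 256 MB, far larger than any L2
        java.util.Arrays.fill(data, 1);

        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumStrided(data, 4096); // jump ~16 KB between consecutive reads
        long t2 = System.nanoTime();

        System.out.printf("sequential: %d in %d ms%n", a, (t1 - t0) / 1_000_000);
        System.out.printf("strided:    %d in %d ms%n", b, (t2 - t1) / 1_000_000);
    }
}
```

Both methods return the same total; only the memory access pattern differs, which is the point.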
Allocating Memory Is Not Linear
The cost of allocating memory on the heap is not linear. The CPU is very good at doing things in parallel, so if memory bandwidth is not your main bottleneck, the rate at which you produce garbage has less impact than whatever your actual bottleneck is. However, if the allocation rate is high enough (and in most Java systems it is high), it becomes a serious bottleneck. You can tell that the allocation rate is a bottleneck if:
- You are close to the maximum allocation rate of the machine. Write a small test that creates a lot of garbage and measures the allocation rate; if your application approaches that figure, you have a problem.
- When you reduce the garbage produced by, say, 10%, the 99th percentile latency of your application improves by 10%, and yet the allocation rate hardly drops. This means your application sped up until it hit the same bottleneck again.
- You have very long GC pause times (e.g. into the seconds). At this point, your memory consumption is having a very high impact on your performance, so reducing the memory consumption and allocation rate can improve scalability (how many requests you can process concurrently) and reduce the amount of time during which the application freezes.
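The "small test" suggested in the first bullet can be sketched as follows. This is a rough, illustrative benchmark (the class and method names are my own): it allocates short-lived byte arrays as fast as possible and reports an approximate rate in GB/s. Results vary by JVM, GC, and hardware, and the JIT's escape analysis can eliminate allocations it can prove are unused, so the arrays are written to and read back to keep them real.

```java
// Rough allocation-rate probe; treat the number as indicative, not exact.
public class AllocationRateTest {
    static volatile long sink; // prevents the JIT from discarding the arrays

    public static double measureAllocationRateGBps(int seconds) {
        long bytesAllocated = 0;
        long start = System.nanoTime();
        long end = start + seconds * 1_000_000_000L;
        while (System.nanoTime() < end) {
            for (int i = 0; i < 1000; i++) {
                byte[] garbage = new byte[1024]; // 1 KB of garbage per iteration
                garbage[i & 1023] = (byte) i;    // touch the array so it is not elided
                sink += garbage[512];
                bytesAllocated += garbage.length;
            }
        }
        double elapsedSec = (System.nanoTime() - start) / 1e9;
        return bytesAllocated / elapsedSec / 1e9;
    }

    public static void main(String[] args) {
        System.out.printf("~%.1f GB/s allocated%n", measureAllocationRateGBps(5));
    }
}
```

Compare the figure your real application sustains (visible in GC logs or a profiler) against what this test reports on the same box; the closer the two, the more likely allocation is your bottleneck.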
Combining the CPU and Memory Views
After reducing the memory allocation rate, I look at CPU consumption with memory tracing turned on. This weights the profile toward memory allocations and provides an alternative to looking at the CPU alone. When this combined CPU and memory view shows that the application is spending most of its time doing essential work, and there are no more easy performance gains to be made, I then look at CPU profiling alone. Using these techniques as a starting point, my aim is typically to reduce the 99th percentile latency (the worst 1%) by a factor of 10. This approach can also increase the throughput of each thread and allow you to run more threads concurrently in an efficient manner.
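As a concrete example of the kind of easy win a memory profiler surfaces, consider temporary objects on a hot path. The sketch below (names are illustrative, not from the article) contrasts a loop that autoboxes a `Long` on every addition with an allocation-free primitive equivalent; both compute the same answer, but the boxed version creates a new object per iteration.

```java
public class GarbageReduction {
    // Allocates roughly one Long object per iteration due to autoboxing.
    static long sumBoxed(int n) {
        Long total = 0L;
        for (int i = 0; i < n; i++)
            total += i; // unbox, add, box a new Long
        return total;
    }

    // Allocation-free equivalent: same result, no garbage on the hot path.
    static long sumPrimitive(int n) {
        long total = 0;
        for (int i = 0; i < n; i++)
            total += i;
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumBoxed(1_000_000));     // 499999500000
        System.out.println(sumPrimitive(1_000_000)); // 499999500000, far less garbage
    }
}
```

A memory profiler makes this pattern obvious: the `Long` allocations dominate the allocation view even when the CPU view looks unremarkable, which is exactly why I start with the memory profiler.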