Quite frequently due to memory leaks or other memory problems applications freeze leaving only the garbage collector (GC) process running unsuccessfully trying to free some space. This happens until watchdog (or frustrated administrator) restarts the application and the problem is never solved.
The goal of this article is to show how to identify excessive GC and to get a heap dump whenever it occurs. It assumes Hotspot JVM version 1.6 or higher.
Relevant JVM flags
Using following flags we can direct Hotspot JVM to throw a heap dump when an application becomes GC driven.
First of all -XX:+HeapDumpOnOutOfMemoryError flag should be added.
Our goal is to have an OutOfMemory error generated when throughput of an application drops because of non stopping GC. There are two JVM flags that will help:
- -XX:GCTimeLimit=98 - defines a limit of proportion of time spent in GC before an OutOfMemory error is thrown
- -XX:GCHeapFreeLimit=2 - defines minimum percentage of free space after a full GC before an OutOfMemoryError is thrown
Both flags are enabled by default, but the out of memory error is not triggered frequently. Hence, deeper understanding is needed in order to decide when each flag is useful and what should be the defaults.
Excessive GC Time and OutOfMemoryError.
The concurrent collector will throw an OutOfMemoryError if too much time is being spent in garbage collection: if more than 98% of the total time is spent in garbage collection and less than 2% of the heap is recovered, an OutOfMemoryError will be thrown. This feature is designed to prevent applications from running for an extended period of time while making little or no progress because the heap is too small. If necessary, this feature can be disabled by adding the option -XX:-UseGCOverheadLimit to the command line.
The policy is the same as that in the parallel collector, except that time spent performing concurrent collections is not counted toward the 98% time limit. In other words, only collections performed while the application is stopped count toward excessive GC time. Such collections are typically due to a concurrent mode failure or an explicit collection request (e.g., a call to System.gc()).
As the second passage states, GCTimeLimit does not work well with parallel collector as time spent in GC is a less important factor in that case.
Choosing defaults for GCHeapFreeLimit
The defaults should be chosen in such a way that when an application becomes unresponsive (or driven by GC) out of memory will be thrown. Unresponsiveness is a subjective definition that may vary from application to application, therefore the resulting value may be (slightly) different for different applications.
First a simple test should be created. Class that runs inside an application and allocates huge amounts of objects. Part of the objects are collected fast and part live for a long time.
GC logging flags should be added before running the test. For my application following flags were added:
-Xmx960m -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintTenuringDistribution -Xloggc:/root/gc.log -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/root/dumps
The output of GC log given below. First full GC cycle - application is responsive:
[Full GC [PSYoungGen: 290112K->33213K(308672K)] [PSOldGen: 655359K->655359K(655360K)] 945471K->688573K(964032K)]
After few minutes - application in unresponsive:
[Full GC [PSYoungGen: 290112K->213269K(308672K)] [PSOldGen: 655359K->655359K(655360K)] 945471K->868629K(964032K)]
For 950 Mb heap the distribution of memory spaces is: old gen = 650Mb, young = 300Mb. When the application is responsive, old generation becomes full with strong referenced data, while most of the young generation is garbage.
Therefore one estimate for GCHeapFreeLimit is (all young is cleared and nothing is cleared from old) 300/950 ~ 32%.
However when the time passes, objects from the young space can not be promoted to the old one, because it's full. As it says in the Oracle documentation: The youngest generation collection does not require a guarantee of full promotion of all live objects. (-XX:+HandlePromotionFailure flag).
Because of failed promotions there are much more referenced objects than expected at the young generation. In order to be on the safe side (less false positives) I assume that when more than 1/3 of the young space is garbage and old space is full the system is driven by GC.
Therefore the ratio I advice for application similar to mine is 200/950 ~ 20%.
Experiments show that OOM happens after 1-2 minutes of excessive GC above the 20% limit with 30-35 full GCs happening during that time.
Identifying excessive garbage collection is a difficult task in Java with no silver bullet. Hotspot JVM developers provided methods that will help frequently with identification of root cause. GCHeapFreeLimit and GCTimeLimit flags can shed the light on the problem when used properly: with appropriate values.