This is the first in a series of posts about Garbage Collection (GC). I hope to be able to cover a bit of theory and all the major collectors in the hotspot virtual machine over the course of the series. This post just explains what garbage collection is and elements common to different collectors.
Why should I care?
Your Java virtual machine manages memory for you - which is highly convenient - but it might not be optimally tuned by default. By understanding some of the theory behind garbage collection you can more easily tune your collector. A common concern is collector efficiency, that is to say how much time your program spends executing program code rather than collecting garbage. Another common concern is long that application pauses for.
There's also a lot of hearsay and folklore out there about garbage collection and so understanding the algorithms in a bit more detail really helps avoid falling into common pitfalls and traps. Besides - for anyone interested in how computer science principles are applied and used, JVM internals are a great thing to look at.
What does stop-the-world mean?
Your program (or mutator in GC-Speak) will be allocating objects as it runs. At some point your heap needs to be collected and all of the collectors in hotspot pause your application. The term 'stop-the-world' is used to mean that all of the mutator's threads are paused.
Its possible to implement a garbage collector that doesn't need to pause. Azul have implemented an effectively pauseless collector in their Zing virtual machine. I won't be covering how it works but there's a really interesting whitepaper if you want to know more.
The Young/Weak Generational Hypothesis
Simply stated: Most allocated objects die young1. This concept was demonstrated by empirically analysing the memory allocation and liveness patterns of a large group of programs during the 1980s. What researchers found was that not only do most objects die young but once they live past a certain age they tend to live for a long time. The graph below is taken from a SUN/Oracle study looking at the lifespan of objects as a histogram.
How is Heap organised?
The young generational hypothesis has given rise to the idea of generational garbage collection in which heap is split up into several regions, and the placement of objects within each region corresponds to their age. One element that is common to the above these garbage collectors (other than G1) is the way that heap is organised into different spaces.
When objects are initially allocated, if they fit, they are stored in the Eden space. If the object survives a collection then it ends up in a survivor space. If it survives a few times (your tenuring threshold) then the object ends up in the tenured space. The specifics of the algorithms for collecting these spaces differs by collector, so I'll be covering them seperately in a future blog post.
This split is beneficial because it allows you to use different algorithms on different spaces. Some GC algorithms are more efficient if most of your objects are dead and some are more efficient if most of your objects are alive. Due to the generational hypothesis usually when it comes time to collect most objects in Eden and survivor spaces are dead, and most objects in tenured are alive.
There is also the permgen - or permanent generation. This is a special generation that holds objects that are related to the Java language itself. For example information about loaded classes is held here. Historically Strings that were interened or were constants were also held here. The permanent generation is being removed in favour of metaspace.
The hotspot virtual machine actually has a variety of different Garbage Collectors. Each has a different set of performance characteristics and is more (or less) suited for different tasks. The key Garbage Collectors that I'll be looking at are:
- Parallel Scavenge (PS): the default collector in recently released JVMs. This stops-the-world in order to collect, but collects in parallel (ie using multiple threads).
- Concurrent Mark Sweep (CMS): this collector has several phases, some of which stop the world, but runs concurrently with the program for several of its phases as well.
- Incremental Concurrent Mark Sweep (iCMS): a variant of CMS designed for lower pauses. It sometimes achieves this!
- Garbage First (G1): a newish collector that's recently become more stable and is in slowly increasing usage.
I've given a few introductory points of thought about garbage collection, in the next post I'll be covering the Parallel Scavenge collector - which is currently the default collector. I'd also like to provide a link to my employer who have a GC log analyser which we think is pretty useful.
- "hotspot" is the name given to the codebase common behind openjdk and the official Oracle JVM. As of Java 7 openjdk is the reference implementation for Java SE.
- Technically what I described above is the 'weak generational hypothesis' which has empirical validation. There's also a strong variant which can be stated as The mean lifetime of a heap allocated object is equal to the mean amount of reachable storage. This is actually mathematically provable by taking Little's Law and setting Λ to 1. Simple proof!
- I'll cover the way heap is organised within G1 on a G1-specific blog post.