This article was originally written by Ariel Weisberg for the VoltDB blog.
Some users of Java, and specifically Java as it is used in in-memory database technology, worry Java’s ‘Garbage Collection’ can impact performance. Garbage Collection is intended to simplify memory management (allocation and de-allocation) and therefore reduce the amount of code written, speed development, and avoid memory management bugs. Garbage collection typically works by copying actively used (live) objects or tracking free memory occupied by unused (garbage) objects.
The primary concern is that most implementations of garbage collection stop (pause) application execution for some amount of time. Any requests that arrive while the application is paused will not receive a response until the garbage collection completes. Garbage collection pauses can occur anywhere from several times a second, to every handful of seconds, to every few days, and can last anywhere from milliseconds to seconds and, in extreme cases, minutes to hours.
There are many factors that impact the frequency and length of garbage collection pauses, but the biggest factor is application design. So where on the continuum of garbage collection frequency and pause time does VoltDB fall?
Fortunately, garbage collection pause times are not a factor in response times for VoltDB.
VoltDB stores table data off the heap. The memory is allocated by the SQL Execution Engine processes, which are written in C++. The Java heap in VoltDB is used for relatively static deployment and schema-related data, and for short-term data as it handles requests and responses. Much of this is kept off-heap using direct byte buffers and other structures (read more about that here).
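To make the idea of a direct byte buffer concrete, here is a minimal sketch of writing and reading a length-prefixed value in memory allocated outside the Java heap. The class and method names are hypothetical and chosen only for illustration; this is not VoltDB's actual code, just the standard java.nio API that the term "direct byte buffer" refers to.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of off-heap storage with java.nio; not VoltDB code.
final class DirectBufferSketch {

    // Writes a length-prefixed value into a direct (off-heap) buffer.
    static ByteBuffer writeValue(byte[] value) {
        // allocateDirect reserves memory outside the Java heap; only the small
        // ByteBuffer wrapper object itself lives on the heap.
        ByteBuffer buf = ByteBuffer.allocateDirect(Integer.BYTES + value.length);
        buf.putInt(value.length);
        buf.put(value);
        buf.flip(); // switch the buffer from writing to reading
        return buf;
    }

    // Reads the value back through a duplicate view, so the caller's
    // position and limit are left untouched.
    static byte[] readValue(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        int length = view.getInt();
        byte[] out = new byte[length];
        view.get(out);
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer buf = writeValue("hello".getBytes(StandardCharsets.UTF_8));
        System.out.println("direct: " + buf.isDirect());
        System.out.println(new String(readValue(buf), StandardCharsets.UTF_8));
    }
}
```

The payload bytes in such a buffer are not objects the collector has to trace or copy, which is why keeping bulk data there reduces GC work.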
In VoltDB, the heap is only used for scratch space during transaction routing and stored procedure execution. As a result, data for each transaction lives only for a few milliseconds and almost never ends up being promoted or live during a GC. Actual SQL execution takes place off heap as well, so temp tables don’t generate garbage.
GCs in VoltDB should represent less than 1% of execution time. You can choose the percentage by sizing the young generation appropriately. Real-world deployments tend to do a young gen GC every handful of seconds, during which the GCs should only block for single-digit milliseconds. Old gen GCs should be infrequent, on the order of days, and should only block for tens of milliseconds. These GCs can be invoked manually to ensure they happen during off-peak times.
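One way to check the "less than 1%" figure on a running JVM is to compare accumulated collection time against uptime using the standard java.lang.management beans. The sketch below is an assumption about how you might measure this yourself, not a VoltDB tool; the class and method names are hypothetical. Note that for collectors that do part of their work concurrently, the reported collection time may include more than stop-the-world pause time, so treat the result as an approximation.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical measurement sketch; names are illustrative only.
final class GcOverheadSketch {

    // Fraction of elapsed time attributed to GC, e.g. 0.01 means 1%.
    static double overheadFraction(long gcMillis, long uptimeMillis) {
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return (double) gcMillis / uptimeMillis;
    }

    // Sums the accumulated collection time over every collector the JVM reports.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gc = totalGcMillis();
        System.out.printf("GC time: %d ms of %d ms uptime (%.3f%%)%n",
                gc, uptime, 100.0 * overheadFraction(gc, uptime));
    }
}
```

From inside the JVM, a collection can be requested with System.gc(), although the JVM treats that call as a request rather than a guarantee.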
Concurrent GCs across nodes typically aren’t an issue. In a worst-case scenario, in which every node is a dependency for a transaction and the nodes do GC back-to-back, the latency impact will be the sum of the pauses on the involved nodes. Organizations with this set-up should measure to see if GC impacts throughput for a period of time that matters.
VoltDB put a lot of effort into latency in the most recent release; below we’ve shared one of the KPIs.
The example is a three-node benchmark of 50/50 read/write of 32 byte keys and 1024 byte values. There is a single client with 50 threads. There is a node failure during the benchmark and the benchmark runs for 30 minutes. This is not a throughput benchmark, so there is only one client instance with a smallish number of threads.
Average throughput: 94,114 txns/sec
Average latency: 0.46 ms
10th percentile latency: 0.26 ms
25th percentile latency: 0.32 ms
50th percentile latency: 0.45 ms
75th percentile latency: 0.54 ms
90th percentile latency: 0.61 ms
95th percentile latency: 0.67 ms
99th percentile latency: 0.83 ms
99.5th percentile latency: 1.44 ms
99.9th percentile latency: 3.65 ms
99.999th percentile latency: 16.00 ms
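For readers who want to produce the same kind of percentile breakdown from their own latency samples, here is a minimal sketch using the nearest-rank definition of a percentile. This is an assumption for illustration; the benchmark above does not state which percentile method it used, and the class name and sample values are hypothetical.

```java
import java.util.Arrays;

// Hypothetical helper; the sample data below is made up for illustration.
final class LatencyPercentileSketch {

    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static double percentile(double[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        double[] sorted = samplesMs.clone(); // leave the caller's array unsorted
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Nine fast responses and one slow outlier, in milliseconds.
        double[] samples = {0.3, 0.4, 0.4, 0.5, 0.5, 0.5, 0.6, 0.6, 0.7, 16.0};
        System.out.println("p50 = " + percentile(samples, 50) + " ms");
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
    }
}
```

With data like this, the median barely moves while the top percentile is dominated by the single outlier, which is why pauses show up only in the highest percentiles.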
A quick analysis of the numbers, correlated with other events and metrics, demonstrates that GC is not a factor even at high percentiles. HotSpot’s ParNew collector performs well for applications that can keep the working set small and avoid promotion.
Databases that store data on heap do have to be more concerned about GC pauses. At VoltDB we are only concerned about them because we are frequently evaluated by maximum pause time, not average pause time or pause time at some percentile.
I’d be interested in how you manage GC in your VoltDB implementations. Looking forward to your comments.