The JVM Pause That Wasn't: A War Story

A high-throughput Java service was stalling. The culprit? Stop-the-World GC pauses were blocked by synchronous log writes to a busy disk.

Nataraj Mocherla

Nov. 06, 25 · Analysis

Likes (11)

Comment

Save

3.7K Views

In high-performance computing, we are trained to hunt for bottlenecks in our code, our algorithms, or our infrastructure. But my favorite bug was not in any of those. It was an invisible interaction between the JVM's garbage collector and the server's disk, resulting in 15+ second, stop-the-world (STW) pauses on a service handling millions of requests per second.

The Mystery: The 503 Spikes

I was working on a large-scale Java service handling millions of user requests per second. The system was designed for extreme throughput, but we were plagued by intermittent spikes in load balancer timeouts, causing 503 responses to be returned to the users.

We found that under load, a subset of our web servers would stall and stop accepting new connections for several seconds, causing requests to pile up and fail. The only clue was that this behavior correlated with periods of heavy disk I/O, which a separate, disk-based caching system on the same host was generating.

The Smoking Gun: Decoding the GC Log

After weeks of debugging, we finally captured the "smoking gun" in the garbage collection logs. A typical Young GC pause should be in the tens or hundreds of milliseconds. We saw this:

    Shell
   
   [timestamp]: 184512.789: [GC [PSYoungGen: 1058042K->17224K(1069568K)] 
    3112024K->2018456K(3258112K), 15.3495220 secs] 
[Times: user=0.25 sys=0.05, real=15.35 secs]

At first glance, this looks like a horribly long GC pause. But the "Aha!" moment is in the [Times] section:

user=0.25 sys=0.05 (Total CPU Time: 0.30s): This is the actual CPU time the GC process used. The GC algorithm itself was incredibly fast and efficient.
real=15.35 secs (Wall-Clock Time): This is the total time that passed in the real world from the moment the pause started to the moment it ended.

The discrepancy was stark: the JVM was in a Stop-the-World state for over 15 seconds, but it was only working (on-CPU) for 0.3 seconds. For the other ~15 seconds, the STW thread was "off-CPU," stuck in a waiting state.

A young GC is a "Stop-the-World" event. The JVM literally halts all application threads to safely move memory around. We discovered that the very last step of this GC operation was to synchronously write the log entry to the GC log file. That simple write() syscall was the killer. Because the disk was under heavy contention from the other caching process, the kernel's I/O queue was saturated. The GC thread's log write, a seemingly harmless operation, got stuck in that queue, waiting for the physical disk. And because the JVM was in an STW pause, the entire application was frozen, waiting for that single log line to be written.

The fix was simple: we stopped writing the GC log to that contended disk.

Two Solutions for This Problem

This fundamental issue, namely a critical, blocking thread waiting on I/O, is as relevant today as it ever was. Here are the two primary ways to solve it.

1. The Filesystem Level Fix (Our solution)

This was our original solution. We changed the GC log path from its default location on a physical disk to a RAM-backed filesystem.

How: We pointed the log output (e.g., -Xloggc:/var/log/my-app/gc.log) to a path in tmpfs, like -Xloggc:/dev/shm/my-app-gc.log.

Why it works: Writes to tmpfs are not true disk I/O. They are memory-to-memory copies that are virtually instantaneous. The write() call returns immediately, and the STW pause ends.

What about memory overruns? This was a valid concern. Writing logs to RAM could, in theory, consume all available memory and crash the server. We mitigated this by using the JVM's built-in log rotation flags. By setting:

     Shell
    
    -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=6 -XX:GCLogFileSize=20M

we capped the total memory usage for GC logs at a predictable 120MB, eliminating the risk of a runaway process.

Trade-offs (Cons):

Logs are ephemeral: The logs in /dev/shm are lost on every system reboot or container restart.
Archiving is lost: This change meant the logs were no longer automatically picked up by our centralized, persistent logging system. This required a separate log-shipping agent (like a sidecar) to be configured if persistence is still needed.

2. The JVM-Level Fix (The Modern Approach)

For many years, the tmpfs hack was the only solution. More recently, a formal JVM feature to add asynchronous GC logging was developed and contributed by the Amazon Corretto team and became a standard feature in OpenJDK 17.

How: Use the async decorator to your -Xlog flag.

     Shell
    
    -Xlog:async -XX:AsyncLogBufferSize=100M

Why it works: The STW GC thread no longer performs the I/O. It writes the log message to a small, in-memory buffer and immediately resumes application threads. A separate, low-priority background thread is then responsible for flushing that buffer to the disk.

Pros:

This is the "official" and intended solution.
Requires no OS-level tricks.
Logs are written to a standard file path, making collection easy.

Cons:

In extreme burst scenarios (unlikely for GC logs), the async buffer could fill, which might cause the main thread to stall.
Not an option on older Java deployments

Why This Still Matters

This problem has been reborn in the age of containers. The modern "best practice" is to have applications log directly to stdout/stderr. But stdout is not a magic void; it's a pipe. Some other process must be reading from the other end of that pipe.

This is typically the container runtime (like containerd) or a logging agent (like Fluentd, Vector, or Logstash). If that logging agent is slow, misconfigured, or it gets blocked on its own network or disk I/O, its read buffer fills up. This backpressure travels up the pipe, and your application's next write() to stdout will block. If the JVM tries to write a GC log to stdout during an STW pause and the logging agent is overwhelmed, you will be right back in the same stall.

Key Takeaways

real vs. user+sys is your signal: When you see a high real time but low user + sys time in any log, it's not a CPU problem. It's an I/O problem (disk, network) or an OS scheduler problem (CPU starvation).
No I/O on a critical path: Never perform blocking I/O, including "simple" logging, on a thread that the entire application depends on.
Use async logging: For modern JVMs, use the -XLog:async flag. It's the cleanest way to move this I/O off the critical path.
Be paranoid about stdout: In a containerized world, logging to stdout is still a blocking I/O call. Ensure your cluster's logging pipeline is robust and non-blocking, or you risk turning a logging delay into an application-wide stall.

CPU time Java virtual machine garbage collection

Opinions expressed by DZone contributors are their own.

Related

Trending