JDK 17 Memory Bloat in Containers: A Post-Mortem
Upgrading from JDK 8 to JDK 17 spiked container memory from ~50% to 100% due to excessive JVM threads, glibc malloc arenas, and G1GC native allocation.
When engineering teams modernize Java applications, the shift from JDK 8 to newer Long-Term Support (LTS) versions, such as JDK 11, 17, and 21, might seem straightforward at first. Since Java maintains backward compatibility, it's easy to assume that the runtime behavior will remain largely unchanged. However, that's far from reality.
In 2025, our team completed a major modernization initiative to migrate all of our Java microservices from JDK 8 to JDK 17. The development and QA phases went smoothly, with no major issues arising. But within hours of deploying to production, we faced a complete system breakdown.
Memory usage, which had been consistently reliable for years, roughly doubled from ~50% of the container limit to nearly 100%, while native memory grew more than fourfold. Containers that had previously operated without issue began to restart repeatedly. Our service level agreements (SLAs) degraded, and incident severity levels escalated. This prompted a multi-day diagnostic effort involving several teams, including platform experts, Java Virtual Machine (JVM) specialists, and service owners.
This post-mortem will cover the following:
- Key differences between JDK 8 and JDK 17
- How containerized environments amplify hidden JVM behaviors
- The distinctions between native memory and heap memory
- The reasons behind thread proliferation and its impact on memory
- The specific commands, flags, and environment variables that resolved our issues
- A validated checklist for anyone upgrading to JDK 17 (or 21)
The problems we faced were subtle and nearly invisible to standard Java monitoring tools. However, the lessons we learned reshaped our approach to upgrading JVM versions and transformed our understanding of memory usage in containerized environments.
The Incident
We deployed the JDK 17 version of our primary service to Kubernetes. The rollout was smooth: health checks were green, request latencies remained stable, and the logs showed no errors.
However, 2–3 hours later, our dashboards began lighting up.
Symptoms Observed
| Metric | JDK 8 (Before) | JDK 17 (After) |
|---|---|---|
| Memory usage | ~50% of container | 95–100% (frequent OOMKills) |
| Thread count | ~400 | 1600+ threads |
| Total native memory | ~800 MB | 3.4–3.6 GB |
| Container restarts | None | Multiple/hour |
| GC behavior | Stable | G1GC overhead spikes |
Services that had been stable for years suddenly began to fail unpredictably.
The Challenge: Heap Monitoring Misled Us
Every Java engineer knows to keep an eye on heap usage. Initially, the heap looked perfectly fine, remaining constant around the configured Xmx. However, it was native memory that was surging.
Native memory includes:
- Thread stacks
- glibc malloc arenas
- Garbage collector (GC) auxiliary structures
- JIT compiler buffers
- Metaspace, Code Cache
- NIO buffers
- Internal JVM C++ structures
Unfortunately, none of this is visible in heap dumps or captured by standard, heap-focused Java monitoring. It is exactly what got our containers OOMKilled.
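To make the gap between heap and native memory concrete, here is a minimal sketch that prints the heap the JVM reports next to the resident set size (RSS) of the whole process. It assumes Linux, because it reads `/proc/self/status`; the class and method names are illustrative and not part of our services.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class HeapVsRss {
    /** Extracts the VmRSS value in kB from /proc/<pid>/status lines, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like: "VmRSS:     123456 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        long heapUsedKb = ManagementFactory.getMemoryMXBean()
                .getHeapMemoryUsage().getUsed() / 1024;

        Path status = Path.of("/proc/self/status"); // Linux only (assumption)
        long rssKb = Files.exists(status)
                ? parseVmRssKb(Files.readAllLines(status))
                : -1;

        System.out.println("Heap used (kB):   " + heapUsedKb);
        System.out.println("Process RSS (kB): " + rssKb);
        if (rssKb >= 0) {
            // Everything in this gap is invisible to heap-only monitoring.
            System.out.println("RSS not explained by used heap (kB): " + (rssKb - heapUsedKb));
        }
    }
}
```

A steady heap line combined with a growing gap is the pattern described above: the heap looks healthy while the process as a whole approaches the container limit.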
Root Cause Analysis
After three days of thorough analysis (reviewing heap data, using Native Memory Tracking via jcmd VM.native_memory, sampling thread dumps, examining GC logs, and inspecting container cgroups), we identified three independent root causes. Each is a JVM or allocator behavior that is amplified inside containers, and together they created a “perfect memory storm.”
Root Cause #1: Thread Proliferation Due to CPU Mis-Detection
What Happened
JDK 17 introduced changes to how Runtime.availableProcessors() functions. Specifically, in versions 17.0.5 and later, a regression caused the Java Virtual Machine (JVM) to ignore cgroup CPU limits and instead read the physical CPU count of the host.
Example:
Container CPU limit: 2 vCPUs
Host machine CPUs: 96
JVM detected: 96 CPUs ❌
This miscalculation caused various parts of the JVM to scale thread creation based on the inflated CPU count, including:
- GC worker threads
- JIT compiler threads
- ForkJoin common pool
- JVMTI threads
- Async logging threads
So instead of:
~50–80 JVM system threads
the JVM spawned:
300–400+ threads
When factoring in application threads (async tasks, thread pools, I/O threads), the total count shot to:
1600+ threads
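The scaling effect can be illustrated with a short sketch. It prints the CPU count the JVM reports and the ForkJoin common pool parallelism, and estimates the default number of parallel GC workers using the heuristic HotSpot applies as we understand it (all CPUs up to 8, then 5/8 of the remainder). Treat that formula as an approximation to verify against your JDK build; the class name is hypothetical.

```java
import java.util.concurrent.ForkJoinPool;

class CpuScalingReport {
    /**
     * Approximation of HotSpot's default ParallelGCThreads heuristic:
     * one worker per CPU up to 8 CPUs, then 5/8 of each additional CPU.
     */
    static int defaultParallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + ((cpus - 8) * 5) / 8;
    }

    public static void main(String[] args) {
        int cpus = Runtime.getRuntime().availableProcessors();
        System.out.println("availableProcessors: " + cpus);
        System.out.println("ForkJoin common pool parallelism: "
                + ForkJoinPool.getCommonPoolParallelism());
        System.out.println("Estimated default GC workers: " + defaultParallelGcThreads(cpus));

        // What the same heuristic yields for the two CPU counts from the example above.
        System.out.println("GC workers at 2 CPUs:  " + defaultParallelGcThreads(2));
        System.out.println("GC workers at 96 CPUs: " + defaultParallelGcThreads(96));
    }
}
```

Running this inside the container is a quick way to see whether the JVM is sizing itself for the container or for the host.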
Why Threads Matter for Memory
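Every thread reserves a native stack outside the heap, typically 1–2 MB by default depending on the platform and the -Xss setting. At the upper end:
1600 threads × 2 MB = ~3.2 GB reserved stack memory
Even if those threads remain idle, the stack address space is reserved, and every page a thread touches counts toward the container's resident memory. This thread bloat alone pushed us dangerously close to the memory limit of our container.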
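The arithmetic above is easy to reproduce. The following sketch uses the ~2 MB per-thread figure from this article as an assumption (your platform default or -Xss value may differ) and also applies it to the live thread count of the current JVM. The class name is illustrative.

```java
import java.lang.management.ManagementFactory;

class StackReservation {
    /** Reserved stack memory in MB for a thread count and a per-thread stack size in KB. */
    static long reservedStackMb(int threadCount, long stackSizeKb) {
        return (long) threadCount * stackSizeKb / 1024;
    }

    public static void main(String[] args) {
        long assumedStackKb = 2048; // ~2 MB per thread, the figure used in this article

        System.out.println("JDK 8  (~400 threads):  " + reservedStackMb(400, assumedStackKb) + " MB");
        System.out.println("JDK 17 (~1600 threads): " + reservedStackMb(1600, assumedStackKb) + " MB");

        // ThreadMXBean reports live Java threads only; JVM-internal threads
        // such as GC workers are not included in this count.
        int liveJavaThreads = ManagementFactory.getThreadMXBean().getThreadCount();
        System.out.println("This JVM (" + liveJavaThreads + " Java threads): "
                + reservedStackMb(liveJavaThreads, assumedStackKb) + " MB");
    }
}
```

The result is reserved address space, so it is an upper bound on what thread stacks can contribute to resident memory rather than a measurement of it.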
Root Cause #2: glibc malloc Arena Fragmentation
The thread explosion made things much worse. glibc manages memory using malloc arenas, and by default on 64-bit systems it allows up to:
8 × CPU_COUNT arenas
With the host's 96 CPUs visible instead of the container's 2, the ceiling became:
8 × 96 = 768 arenas
Arenas are created lazily as threads contend for the allocator, so 768 is a ceiling rather than a guaranteed count, but the more threads contend, the more arenas get populated. A populated arena can consume 10 to 30 MB, depending on fragmentation patterns, and even sparsely used arenas hold pages that contribute to Resident Set Size (RSS). In our case, this resulted in:
~1.5–2.0 GB consumed by glibc arenas
This was invisible to Java monitoring tools and heap analysis.
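A small worked example helps relate these numbers. The sketch below computes the default arena ceiling for a given CPU count, shows how an explicit cap overrides it, and derives how many populated arenas the observed 1.5–2.0 GB would imply at 10–30 MB each. The method names are hypothetical, and the per-arena sizes are the rough figures quoted above, not measurements.

```java
class ArenaEstimate {
    /**
     * Default glibc arena ceiling on 64-bit systems (8 per visible CPU),
     * unless an explicit cap such as MALLOC_ARENA_MAX is supplied.
     */
    static int arenaCeiling(int visibleCpus, Integer explicitCap) {
        return explicitCap != null ? explicitCap : 8 * visibleCpus;
    }

    /** Number of populated arenas implied by an observed footprint and a per-arena size. */
    static long impliedArenas(long observedMb, long perArenaMb) {
        return observedMb / perArenaMb;
    }

    public static void main(String[] args) {
        System.out.println("Ceiling with 96 CPUs visible: " + arenaCeiling(96, null));
        System.out.println("Ceiling with 2 CPUs visible:  " + arenaCeiling(2, null));
        System.out.println("Ceiling with an explicit cap of 2: " + arenaCeiling(96, 2));

        // The observed footprint corresponds to a fraction of the 768-arena ceiling.
        System.out.println("Arenas implied by 1500 MB at 30 MB each: " + impliedArenas(1500, 30));
        System.out.println("Arenas implied by 2000 MB at 10 MB each: " + impliedArenas(2000, 10));
    }
}
```

Under these assumptions the observed overhead corresponds to roughly 50 to 200 populated arenas, well below the ceiling but far more than a 2-CPU container needs.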
Root Cause #3: G1GC Native Memory Overhead (650–800 MB Higher)
Another factor to consider is the shift to Garbage-First Garbage Collector (G1GC) in JDK 17, while JDK 8 commonly used ParallelGC. G1GC is known for using significantly more native memory:
| Component | Approx Native Memory |
|---|---|
| Remembered Sets | 300–400 MB |
| Card Tables | 100–200 MB |
| Region metadata | 200 MB |
| Marking bitmaps | 150+ MB |
| Concurrent refinement buffers | 100 MB |
Total for G1GC:
~800–1000 MB native memory
ParallelGC in JDK 8:
~150–200 MB
Difference:
+650–800 MB
This put us well beyond our container’s 4 GB memory limit.
Combined Memory Explosion Model
Let's look at the combined impact of the three root causes:
Under JDK 8 (~2.8 GB Total)
Heap: 2048 MB
Metaspace: 200 MB
Code Cache: 240 MB
Threads: 80 MB
Native GC: 150 MB
Other native: 100 MB
----------------------------------
Total: ~2.8 GB
Under JDK 17 (~5.4 GB Total)
Heap: 2048 MB
Metaspace: 250 MB
Code Cache: 240 MB
Threads: 200 MB
G1GC: 1000 MB
glibc arenas: 1500 MB
Other native: 150 MB
----------------------------------
Total: ~5.4 GB ❌
This puts us 1.4 GB over the container limit. No amount of heap tuning could have fixed this, because the heap itself was not the underlying problem.
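The two budgets can be checked with a few lines of code. This sketch simply sums the component figures listed above and compares each total with the 4 GB limit; the class name is illustrative and the numbers are the approximations from this article.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class MemoryBudget {
    static Map<String, Integer> jdk8() {
        Map<String, Integer> mb = new LinkedHashMap<>();
        mb.put("Heap", 2048);
        mb.put("Metaspace", 200);
        mb.put("Code Cache", 240);
        mb.put("Threads", 80);
        mb.put("Native GC", 150);
        mb.put("Other native", 100);
        return mb;
    }

    static Map<String, Integer> jdk17() {
        Map<String, Integer> mb = new LinkedHashMap<>();
        mb.put("Heap", 2048);
        mb.put("Metaspace", 250);
        mb.put("Code Cache", 240);
        mb.put("Threads", 200);
        mb.put("G1GC", 1000);
        mb.put("glibc arenas", 1500);
        mb.put("Other native", 150);
        return mb;
    }

    /** Sums all component sizes in MB. */
    static int totalMb(Map<String, Integer> componentsMb) {
        int total = 0;
        for (int value : componentsMb.values()) {
            total += value;
        }
        return total;
    }

    /** Positive result: MB over the limit. Negative result: remaining headroom. */
    static int overLimitMb(Map<String, Integer> componentsMb, int limitMb) {
        return totalMb(componentsMb) - limitMb;
    }

    public static void main(String[] args) {
        int limitMb = 4096; // 4 GB container limit
        System.out.println("JDK 8 total:  " + totalMb(jdk8()) + " MB, over limit by "
                + overLimitMb(jdk8(), limitMb) + " MB");
        System.out.println("JDK 17 total: " + totalMb(jdk17()) + " MB, over limit by "
                + overLimitMb(jdk17(), limitMb) + " MB");
    }
}
```

With the limit taken as exactly 4096 MB the overshoot comes out near 1.3 GB; the 1.4 GB figure above comes from the rounded totals (5.4 GB minus 4.0 GB).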
The Fix: A Three-Part Solution
Fix #1: Explicitly Set CPU Count
-XX:ActiveProcessorCount=2
This is the most important setting for containerized Java on JDK 11 and above. It prevents the JVM from scaling threads based on the CPU count of the node.
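One way to guard against a wrong CPU count is a startup check that compares what the JVM reports with the container's own quota. The sketch below is not what we shipped; it assumes cgroup v2 mounted at `/sys/fs/cgroup` (cgroup v1 exposes the quota through different files), and the class name is hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class CgroupCpuCheck {
    /**
     * Parses the content of a cgroup v2 cpu.max file ("<quota> <period>" or "max <period>").
     * Returns the CPU limit rounded up to whole CPUs, or -1 when no limit is set.
     */
    static int parseCpuLimit(String cpuMax) {
        String[] parts = cpuMax.trim().split("\\s+");
        if (parts.length != 2 || parts[0].equals("max")) {
            return -1;
        }
        long quota = Long.parseLong(parts[0]);
        long period = Long.parseLong(parts[1]);
        return (int) Math.ceil((double) quota / period);
    }

    public static void main(String[] args) throws IOException {
        int jvmCpus = Runtime.getRuntime().availableProcessors();
        Path cpuMax = Path.of("/sys/fs/cgroup/cpu.max"); // cgroup v2 path (assumption)

        if (!Files.isReadable(cpuMax)) {
            System.out.println("No cgroup v2 cpu.max found; JVM reports " + jvmCpus + " CPUs");
            return;
        }
        int limit = parseCpuLimit(Files.readString(cpuMax));
        if (limit > 0 && jvmCpus > limit) {
            System.out.println("WARNING: JVM sees " + jvmCpus + " CPUs but the container limit is "
                    + limit + "; consider -XX:ActiveProcessorCount=" + limit);
        } else {
            System.out.println("JVM reports " + jvmCpus + " CPUs; container limit: "
                    + (limit > 0 ? String.valueOf(limit) : "none"));
        }
    }
}
```

Logging this once at startup makes a mis-detected CPU count visible in the first seconds of a rollout instead of hours later.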
Fix #2: Limit glibc Malloc Arenas
Set the environment variable:
export MALLOC_ARENA_MAX=2
This reduced native arena overhead from approximately 1.5 GB to below 200 MB. If you're dealing with very tight memory constraints, consider using:
export MALLOC_ARENA_MAX=1
Fix #3: Tune or Replace G1GC
You have two options here:
- Keep G1GC, but tune it, or
- Switch to ParallelGC, particularly for memory-sensitive workloads.
ParallelGC has a far smaller native memory footprint than G1GC in modern Java (only SerialGC is leaner).
Our tuning:
-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:G1HeapRegionSize=16m
After implementing these fixes, we observed that memory usage stabilized in the range of 65% to 70%.
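Whichever option you choose, it is worth confirming at runtime which collector the JVM actually selected, since the default can differ between JDK versions and container sizes. A minimal sketch, assuming the usual HotSpot MXBean names ("G1 ..." for G1GC, "PS ..." for ParallelGC); the class name is illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

class ActiveCollector {
    /** Maps HotSpot GC MXBean names to a collector label. */
    static String classify(List<String> beanNames) {
        for (String name : beanNames) {
            if (name.startsWith("G1 ")) {
                return "G1GC";
            }
            if (name.startsWith("PS ")) {
                return "ParallelGC";
            }
        }
        return "other";
    }

    public static void main(String[] args) {
        List<String> names = ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
        System.out.println("GC beans: " + names);
        System.out.println("Active collector: " + classify(names));
    }
}
```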
Additional Detection and Observability Improvements
The biggest operational takeaway is clear: relying solely on heap monitoring is not enough. JVM upgrades also require native memory monitoring.
Here's what we've implemented:
Native Memory Tracking (NMT)
We enabled NMT with the JVM flag:
-XX:NativeMemoryTracking=summary
From there, we used:
jcmd <pid> VM.native_memory summary
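This gave us a detailed breakdown of JVM-tracked native memory by category (thread stacks, GC, compiler, metaspace, and so on). glibc arena overhead is not itemized by NMT; it shows up as the gap between NMT's committed total and the container's RSS.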
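To feed NMT output into dashboards, the summary can be parsed into per-category numbers. The sketch below assumes the default KB-scaled summary format, in which each category line looks like `-  Thread (reserved=...KB, committed=...KB)`; the sample text is made up for illustration and is not data from the incident.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NmtSummaryParser {
    // Matches category lines such as:
    // "-                    Thread (reserved=1644800KB, committed=180000KB)"
    private static final Pattern CATEGORY = Pattern.compile(
            "^-\\s+(.+?) \\(reserved=(\\d+)KB, committed=(\\d+)KB\\)",
            Pattern.MULTILINE);

    /** Returns committed KB per NMT category, in the order the categories appear. */
    static Map<String, Long> committedKbByCategory(String nmtOutput) {
        Map<String, Long> result = new LinkedHashMap<>();
        Matcher matcher = CATEGORY.matcher(nmtOutput);
        while (matcher.find()) {
            result.put(matcher.group(1).trim(), Long.parseLong(matcher.group(3)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative sample only; real output contains more categories and detail lines.
        String sample = String.join("\n",
                "Total: reserved=5000000KB, committed=3000000KB",
                "-                 Java Heap (reserved=2097152KB, committed=2097152KB)",
                "-                    Thread (reserved=1644800KB, committed=180000KB)",
                "-                        GC (reserved=900000KB, committed=850000KB)");
        committedKbByCategory(sample).forEach((category, kb) ->
                System.out.println(category + ": " + (kb / 1024) + " MB committed"));
    }
}
```

In practice the input would be the captured output of `jcmd <pid> VM.native_memory summary`, collected on a schedule and compared against the per-service baseline.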
Thread Count Alerts
We established the following:
- Baseline thread counts per service
- Alerts for any increase exceeding 50%
- Dashboards showing thread growth patterns
Increases in thread counts often signal potential native memory leaks.
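The alert rule itself is simple enough to express in code. This sketch checks a current thread count against a baseline with a 50% allowance; the baseline value in `main` is a placeholder, and in a real setup the rule would live in the alerting system rather than in the service.

```java
import java.lang.management.ManagementFactory;

class ThreadGrowthAlert {
    /** True when the current count exceeds the baseline by more than the allowed fraction. */
    static boolean exceedsBaseline(int baseline, int current, double allowedGrowth) {
        return current > baseline * (1.0 + allowedGrowth);
    }

    public static void main(String[] args) {
        int baseline = 400; // hypothetical per-service baseline
        // Live Java threads only; JVM-internal threads are not included.
        int current = ManagementFactory.getThreadMXBean().getThreadCount();

        if (exceedsBaseline(baseline, current, 0.5)) {
            System.out.println("ALERT: " + current + " threads exceeds baseline "
                    + baseline + " by more than 50%");
        } else {
            System.out.println("OK: " + current + " threads (baseline " + baseline + ")");
        }
    }
}
```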
Monitoring Container-Level Memory Metrics
We shifted our focus to monitoring container-level memory instead of pod-level memory, which aggregates data from multiple containers:
container_memory_working_set_bytes
By concentrating on container-level metrics, we were able to identify memory overshoots sooner and with greater accuracy.
How We Reproduced the Issue Locally
To validate that the issue was inherent to JDK 17, we set up a local environment that mirrored the original setup.
Step 1: Run the Application in Docker
docker run \
--cpus=2 \
--memory=4g \
-e MALLOC_ARENA_MAX=2 \
myservice:java17
Step 2: Inspect CPU Detection
docker exec -it <container> bash
java -Xlog:os+container=trace -version | grep -i active_processor_count
Here's What We Found:
Before the fix:
active_processor_count = 96
After the fix:
active_processor_count = 2
Step 3: Inspect Native Memory
jcmd <pid> VM.native_memory summary
The arena counts correlated exactly with the detected CPU count.
Why This Problem Is Becoming More Common
A number of companies migrating from Java 8 to Java 17 (or 21) are encountering similar challenges. The reasons for this include:
- Containerization exposes previously hidden JVM behaviors.
- Local development machines typically have plenty of RAM and CPU power, unlike Kubernetes containers.
- G1GC has now become the default garbage collector, and its overhead is greater than that of ParallelGC.
- Many servers are equipped with 64 to 128 CPUs, and JVM thread scaling explodes if the CPU count is mis-detected.
- Native memory usage in Java applications is rarely monitored, even in large organizations.
- The behavior of glibc malloc arenas is poorly understood outside the realm of low-level systems engineering.
This combination of factors creates a “trap,” where JVM upgrades can pass all QA tests and still break within hours of reaching production.
What We Would Do Differently Next Time
JVM Version Soak Testing
Moving forward, we will implement the following requirements:
- A 48-hour load soak
- A 24-hour canary production soak
- Monitoring of thread counts
- Oversight of native memory
- Analysis of GC behavior logs
We've learned that a functional test suite alone is not sufficient.
JVM Upgrade Runbooks
We have developed a runbook that includes:
- Required flags for containers
- Required environment variables (MALLOC_ARENA_MAX)
- Monitoring dashboards to check before promotion
- A rollback decision tree
Rigorous Baseline Establishment
For each service, we will establish baselines for:
- Heap usage
- Native memory
- Thread counts
- GC overhead
Once these baselines are defined, comparing JDK 8 to JDK 17 will become straightforward.
Upgrade Checklist
Pre-Upgrade Steps
- Set -XX:ActiveProcessorCount explicitly
- Set MALLOC_ARENA_MAX=1 or 2
- Choose your garbage collector: G1GC or ParallelGC
- Enable Native Memory Tracking
- Establish memory baselines for both heap and native memory
- Take note of thread count baselines
- Enable container-level memory metrics
- Conduct soak tests for 24 to 48 hours
- Monitor and validate GC pause times while under load
Post-Deployment Actions
- Observe thread counts for 2 to 6 hours
- Compare native memory usage against your baseline
- Check and validate arena counts
- Ensure CPU detection is accurate
- Roll back immediately if native memory rises more than 10–15% beyond the baseline
Conclusion
The upgrade to JDK 17 served as one of the most instructive incidents our team has encountered.
It highlighted several crucial points:
- Native memory dominates JVM behavior in containers
- CPU detection bugs can silently cripple services
- GC changes between JDK releases can add 500 MB+ overhead
- glibc malloc arenas can expand due to excessive thread proliferation
- Monitoring heuristics from JDK 8 become less reliable when transitioning to JDK 17
- Upgrading the JVM must be treated with the same caution as a major infrastructure overhaul, rather than simply a minor version update
The good news?
After applying the recommended fixes, our services now operate more efficiently on JDK 17 than they ever did on JDK 8. We're seeing higher GC throughput, shorter pause times, and better overall performance.
However, this experience serves as a critical reminder:
Modern Java is fast and powerful, but only when configured with an understanding of how the JVM interacts with container runtimes, native memory systems, and Linux allocators.
If you are planning a JDK 17 upgrade, use this guide, validate your assumptions, and closely monitor native memory alongside heap memory.