JDK 17 Memory Bloat in Containers: A Post-Mortem

Upgrading from JDK 8 to JDK 17 spiked container memory from ~50% to 100% due to excessive JVM threads, glibc malloc arenas, and G1GC native allocation.

Saumya Tyagi

Dec. 02, 25 · Analysis

When engineering teams modernize Java applications, the shift from JDK 8 to newer Long-Term Support (LTS) versions, such as JDK 11, 17, and 21, might seem straightforward at first. Since Java maintains backward compatibility, it's easy to assume that the runtime behavior will remain largely unchanged. However, that's far from reality.

In 2025, our team completed a major modernization initiative to migrate all of our Java microservices from JDK 8 to JDK 17. The development and QA phases went smoothly, with no major issues arising. But within hours of deploying to production, we faced a complete system breakdown.

Memory usage, which had been stable for years, climbed from about half of each container's limit to the limit itself, and native memory grew more than fourfold. Containers that had previously operated without issue began to restart repeatedly. Our service level agreements (SLAs) degraded, and incident severity levels escalated. This prompted a multi-day diagnostic effort involving several teams, including platform experts, Java Virtual Machine (JVM) specialists, and service owners.

This post-mortem will cover the following:

Key differences between JDK 8 and JDK 17
How containerized environments amplify hidden JVM behaviors
The distinctions between native memory and heap memory
The reasons behind thread proliferation and its impact on memory
The specific commands, flags, and environment variables that resolved our issues
A validated checklist for anyone upgrading to JDK 17 (or 21)

The problems we faced were subtle and nearly invisible to standard Java monitoring tools. However, the lessons we learned reshaped our approach to upgrading JVM versions and transformed our understanding of memory usage in containerized environments.

The Incident

We deployed the JDK 17 version of our primary service to Kubernetes. The rollout was smooth: health checks were green, request latencies remained stable, and the logs showed no errors.

However, 2–3 hours later, our dashboards began lighting up.

Symptoms Observed

Metric	JDK 8 (Before)	JDK 17 (After)
Memory usage	~50% of container	95–100% (frequent OOMKills)
Thread count	~400	1600+ threads
Total native memory	~800 MB	3.4–3.6 GB
Container restarts	None	Multiple/hour
GC behavior	Stable	G1GC overhead spikes

Services that had been stable for years suddenly began to fail unpredictably.

The Challenge: Heap Monitoring Misled Us

Every Java engineer knows to keep an eye on heap usage. Initially, the heap looked perfectly fine, holding steady within the configured -Xmx. The growth was in native memory.

Native memory includes:

Thread stacks
glibc malloc arenas
Auxiliary structures in Garbage Collector (GC)
JIT compiler buffers
Metaspace, Code Cache
NIO buffers
Internal JVM C++ structures

None of this is visible in heap dumps or in standard heap-centric Java monitoring, and it is exactly what got our containers OOMKilled.
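To make the gap between heap and native memory concrete, here is a minimal sketch that prints the JVM's committed heap next to the process resident set size. It assumes Linux, where /proc/self/status exposes a VmRSS line; the class and method names are illustrative and not part of our tooling. The difference is only a rough proxy for native memory, since committed heap is not always fully resident.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Compares JVM-reported heap with the process resident set size (Linux only). */
final class HeapVsRss {

    /** Extracts the VmRSS value in kB from /proc/<pid>/status lines, or -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // The line looks like "VmRSS:\t  123456 kB"
                String[] parts = line.substring("VmRSS:".length()).trim().split("\\s+");
                return Long.parseLong(parts[0]);
            }
        }
        return -1L;
    }

    public static void main(String[] args) throws IOException {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long heapCommittedKb = heap.getCommitted() / 1024;

        Path status = Path.of("/proc/self/status");
        long rssKb = Files.exists(status) ? parseVmRssKb(Files.readAllLines(status)) : -1L;

        System.out.println("Heap committed (kB): " + heapCommittedKb);
        System.out.println("Process RSS (kB):    " + rssKb);
        if (rssKb >= 0) {
            // RSS not explained by committed heap: a rough proxy for native memory
            System.out.println("RSS outside heap (kB): " + (rssKb - heapCommittedKb));
        }
    }
}
```

A heap dashboard would only ever show the first number; the last one is where our growth was hiding.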

Root Cause Analysis

After three days of analysis (reviewing heap data, using Native Memory Tracking via jcmd VM.native_memory, sampling thread dumps, examining GC logs, and inspecting container cgroups), we found that three independent JVM behaviors, each amplified by the container environment, had combined into a “perfect memory storm.”

Root Cause #1: Thread Proliferation Due to CPU Mis-Detection

What Happened

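JDK 17 changed how Runtime.availableProcessors() derives its value inside containers. In our environment, the JVM ended up reporting the host's physical CPU count instead of the container's allocation. One documented change in this area is that JDK 17.0.5 and later no longer derive the processor count from cgroup CPU shares (JDK-8281181), so a container whose CPU allocation reaches the JVM only as shares now sees every host CPU.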

Example:

    Container CPU limit: 2 vCPUs
    Host machine CPUs:   96
    JVM detected:        96 CPUs ❌

This miscalculation caused various parts of the JVM to scale thread creation based on the inflated CPU count, including:

GC worker threads
JIT compiler threads
ForkJoin common pool
JVMTI threads
Async logging threads

So instead of:

    ~50–80 JVM system threads

the JVM spawned:

    300–400+ threads

When factoring in application threads (async tasks, thread pools, I/O threads), the total count shot to:

    1600+ threads
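The scaling effect can be sketched in a few lines. The common ForkJoinPool defaults to one thread fewer than the detected CPU count (minimum one), and the ParallelGCThreads formula below is the commonly cited HotSpot heuristic, included here as an assumption that should be verified with -XX:+PrintFlagsFinal on your own build. The class and method names are illustrative.

```java
import java.util.concurrent.ForkJoinPool;

/** Shows how two JVM thread pools scale with the detected CPU count. */
final class CpuDrivenSizing {

    /** Default common-pool parallelism: one less than the CPU count, but at least 1. */
    static int commonPoolParallelism(int cpus) {
        return Math.max(1, cpus - 1);
    }

    /** Commonly cited HotSpot default for ParallelGCThreads (assumption; verify on your JVM). */
    static int parallelGcThreads(int cpus) {
        return cpus <= 8 ? cpus : 8 + (cpus - 8) * 5 / 8;
    }

    public static void main(String[] args) {
        // What this JVM actually detected
        System.out.println("availableProcessors     = " + Runtime.getRuntime().availableProcessors());
        System.out.println("common pool parallelism = " + ForkJoinPool.getCommonPoolParallelism());

        // Compare the intended 2-CPU container with a mis-detected 96-CPU host
        for (int cpus : new int[] {2, 96}) {
            System.out.printf("cpus=%d -> common pool %d, GC workers %d%n",
                    cpus, commonPoolParallelism(cpus), parallelGcThreads(cpus));
        }
    }
}
```

Under these assumptions, the two pools alone go from 3 threads at 2 CPUs to 158 threads at 96 CPUs, before any application pool sized from availableProcessors() is counted.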

Why Threads Matter for Memory

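Every thread reserves its own stack in native memory: 1 MB by default on 64-bit Linux x86 and 2 MB on AArch64, unless -Xss overrides it. At 2 MB per thread:

    1600 threads × 2 MB = ~3.2 GB of reserved stack

The stack stays reserved even while a thread is idle. Only the pages a thread has actually touched count toward resident memory, which is why the combined model later in this post budgets far less than 3.2 GB for threads. Even so, the thread bloat added steady pressure on our container's memory limit.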
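The arithmetic above can be sketched as a small helper. The stack size is a parameter because the default depends on the platform and on -Xss; the 2048 kB default in main is the per-thread figure used in this post, and the class and method names are illustrative.

```java
import java.lang.management.ManagementFactory;

/** Estimates reserved thread-stack memory from a thread count and stack size. */
final class ThreadStackEstimate {

    /** Reserved stack in MB for the given thread count and per-thread stack size in kB. */
    static long reservedStackMb(int threadCount, long stackSizeKb) {
        return threadCount * stackSizeKb / 1024;
    }

    public static void main(String[] args) {
        // Per-thread stack size in kB; pass your own -Xss value as the first argument
        long stackSizeKb = args.length > 0 ? Long.parseLong(args[0]) : 2048L;

        // Live Java threads only; JVM-internal threads such as GC workers may not be included
        int liveThreads = ManagementFactory.getThreadMXBean().getThreadCount();

        System.out.println("Live Java threads:   " + liveThreads);
        System.out.println("Reserved stack (MB): " + reservedStackMb(liveThreads, stackSizeKb));
        System.out.println("At 1600 threads (MB): " + reservedStackMb(1600, stackSizeKb));
    }
}
```

This reports reserved address space, not resident memory, so treat it as an upper bound on what thread stacks can cost.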

Root Cause #2: glibc malloc Arena Fragmentation

The thread explosion made things much worse. glibc manages memory using malloc arenas and creates new ones as threads contend for allocation locks. By default, on 64-bit systems, it allows up to:

    8 × CPU_COUNT arenas

glibc counts CPUs on its own, and like the JVM it saw all 96 host CPUs rather than our 2-vCPU allocation, which raised the cap to:

    8 × 96 = 768 arenas

A typical arena can consume 10 to 30 MB, depending on fragmentation patterns. Even sparsely used arenas reserve virtual memory, and every page they have touched counts toward Resident Set Size (RSS). With hundreds of extra threads allocating concurrently, this resulted in:

    ~1.5–2.0 GB consumed by glibc arenas

This was invisible to Java monitoring tools and heap analysis.

Root Cause #3: G1GC Native Memory Overhead (800–1000 MB Higher)

Another factor to consider is the shift to Garbage-First Garbage Collector (G1GC) in JDK 17, while JDK 8 commonly used ParallelGC. G1GC is known for using significantly more native memory:

Component	Approx Native Memory
Remembered Sets	300–400 MB
Card Tables	100–200 MB
Region metadata	200 MB
Marking bitmaps	150+ MB
Concurrent refinement buffers	100 MB

Total for G1GC:

    ~800–1000 MB native memory

ParallelGC in JDK 8:

    ~150–200 MB

Difference:

    +650–800 MB

This put us well beyond our container’s 4 GB memory limit.

Combined Memory Explosion Model

Let's look at the combined impact of the three root causes:

Under JDK 8 (~2.8 GB Total)

    Heap:              2048 MB
    Metaspace:          200 MB
    Code Cache:         240 MB
    Threads:             80 MB
    Native GC:          150 MB
    Other native:       100 MB
    ----------------------------------
    Total:             ~2.8 GB

Under JDK 17 (~5.4 GB Total)

    Heap:              2048 MB
    Metaspace:          250 MB
    Code Cache:         240 MB
    Threads:            200 MB
    G1GC:              1000 MB
    glibc arenas:      1500 MB
    Other native:       150 MB
    ----------------------------------
    Total:             ~5.4 GB ❌

This puts us 1.4 GB over the container limit. No amount of heap tuning could have fixed this, because the heap itself was not the underlying problem.
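The two budgets can be checked with a short sketch that sums the components listed above and compares them with the container limit. The component values are the ones from this post; the 4096 MB limit is an assumption about how the 4 GB limit is expressed, and the class name is illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sums the per-component memory budgets and compares them with a container limit. */
final class MemoryBudget {

    static Map<String, Long> jdk8() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("Heap", 2048L);
        m.put("Metaspace", 200L);
        m.put("Code Cache", 240L);
        m.put("Threads", 80L);
        m.put("Native GC", 150L);
        m.put("Other native", 100L);
        return m;
    }

    static Map<String, Long> jdk17() {
        Map<String, Long> m = new LinkedHashMap<>();
        m.put("Heap", 2048L);
        m.put("Metaspace", 250L);
        m.put("Code Cache", 240L);
        m.put("Threads", 200L);
        m.put("G1GC", 1000L);
        m.put("glibc arenas", 1500L);
        m.put("Other native", 150L);
        return m;
    }

    /** Total of all components in MB. */
    static long totalMb(Map<String, Long> budget) {
        long total = 0;
        for (long mb : budget.values()) {
            total += mb;
        }
        return total;
    }

    public static void main(String[] args) {
        long limitMb = 4096L; // assumed value for the 4 GB container limit
        System.out.println("JDK 8 total (MB):  " + totalMb(jdk8()));
        System.out.println("JDK 17 total (MB): " + totalMb(jdk17()));
        System.out.println("JDK 17 over limit (MB): " + (totalMb(jdk17()) - limitMb));
    }
}
```

Heap is identical in both budgets, so every megabyte of the difference comes from native memory.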

The Fix: A Three-Part Solution

Fix #1: Explicitly Set CPU Count

    -XX:ActiveProcessorCount=2

This is the most important setting for containerized Java on JDK 11 and above. It prevents the JVM from scaling threads based on the CPU count of the node.
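A startup check can catch a mismatch before it turns into thread bloat. The sketch below compares what the JVM detected with the CPU quota in cgroup v2's cpu.max file, whose content is either "<quota> <period>" or "max <period>". The file path assumes a cgroup v2 host with the default mount point, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.OptionalInt;

/** Warns when the JVM sees more CPUs than the container's cgroup v2 quota allows. */
final class CpuLimitCheck {

    /** Parses cpu.max content into whole CPUs, rounded up; empty when no quota is set. */
    static OptionalInt parseCpuMax(String content) {
        String[] parts = content.trim().split("\\s+");
        if (parts.length != 2 || parts[0].equals("max")) {
            return OptionalInt.empty(); // no quota configured
        }
        long quota = Long.parseLong(parts[0]);
        long period = Long.parseLong(parts[1]);
        return OptionalInt.of((int) Math.max(1, (quota + period - 1) / period));
    }

    public static void main(String[] args) throws IOException {
        int detected = Runtime.getRuntime().availableProcessors();
        Path cpuMax = Path.of("/sys/fs/cgroup/cpu.max"); // cgroup v2 location (assumption)

        if (!Files.isReadable(cpuMax)) {
            System.out.println("No cgroup v2 cpu.max found; JVM detected " + detected + " CPUs");
            return;
        }
        OptionalInt limit = parseCpuMax(Files.readString(cpuMax));
        if (limit.isPresent() && detected > limit.getAsInt()) {
            System.out.println("WARNING: JVM sees " + detected + " CPUs but the quota allows "
                    + limit.getAsInt() + "; consider -XX:ActiveProcessorCount=" + limit.getAsInt());
        } else {
            System.out.println("JVM detected " + detected + " CPUs; quota: " + limit);
        }
    }
}
```

Running this at service startup, or in a pre-promotion smoke test, makes the detection visible in logs instead of leaving it to be inferred from thread counts.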

Fix #2: Limit glibc Malloc Arenas

Set the environment variable:

    export MALLOC_ARENA_MAX=2

This reduced native arena overhead from approximately 1.5 GB to below 200 MB. If you're dealing with very tight memory constraints, consider using:

    export MALLOC_ARENA_MAX=1
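Because the variable is easy to lose between a Dockerfile, a Helm chart, and a deployment manifest, it can help to log the effective arena cap at startup. This is a minimal sketch: it treats a positive MALLOC_ARENA_MAX as the cap and otherwise falls back to the 8-per-core default described earlier. The JVM's CPU count stands in for glibc's own count here, which is an approximation, and the class and method names are illustrative.

```java
/** Reports the glibc arena cap implied by MALLOC_ARENA_MAX and the CPU count. */
final class ArenaCap {

    /** glibc's default arena cap on 64-bit systems: 8 per core. */
    static int defaultCap(int cores) {
        return 8 * cores;
    }

    /** A positive MALLOC_ARENA_MAX wins; anything else falls back to the default cap. */
    static int effectiveCap(String mallocArenaMax, int cores) {
        if (mallocArenaMax != null) {
            try {
                int value = Integer.parseInt(mallocArenaMax.trim());
                if (value > 0) {
                    return value;
                }
            } catch (NumberFormatException ignored) {
                // Unparseable value: treat it as unset
            }
        }
        return defaultCap(cores);
    }

    public static void main(String[] args) {
        String env = System.getenv("MALLOC_ARENA_MAX");
        // glibc counts CPUs itself; the JVM's view is used only as an approximation
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("MALLOC_ARENA_MAX = " + env);
        System.out.println("Effective arena cap (approx.) = " + effectiveCap(env, cores));
    }
}
```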

Fix #3: Tune or Replace G1GC

You have two options here:

Keep G1GC, but tune it, or
Switch to ParallelGC, particularly for memory-sensitive workloads.

ParallelGC has a much smaller native memory footprint than G1GC, and of the production collectors only SerialGC is typically leaner.

Our tuning:

    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -XX:G1HeapRegionSize=16m

After implementing these fixes, we observed that memory usage stabilized in the range of 65% to 70%.
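Since the default collector changed between JDK 8 and JDK 17, it is worth confirming which one a running service actually uses rather than trusting the flags in a manifest. The sketch below classifies the collector from the names of the garbage collector MXBeans: G1 registers beans whose names start with "G1 ", and ParallelGC registers "PS Scavenge" and "PS MarkSweep". The class and method names are illustrative.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

/** Identifies the active garbage collector from its MXBean names. */
final class GcProbe {

    /** Returns "G1", "Parallel", or "Other" based on the collector bean names. */
    static String classify(List<String> beanNames) {
        for (String name : beanNames) {
            if (name.startsWith("G1 ")) {
                return "G1";
            }
            if (name.startsWith("PS ")) {
                return "Parallel";
            }
        }
        return "Other";
    }

    public static void main(String[] args) {
        List<String> names = ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
        System.out.println("Collector beans: " + names);
        System.out.println("Active collector: " + classify(names));
    }
}
```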

Additional Detection and Observability Improvements

The biggest operational takeaway is clear: relying solely on heap monitoring is not enough. JVM upgrades also require native memory monitoring.

Here's what we've implemented:

Native Memory Tracking (NMT)

We enabled NMT with the JVM flag:

    -XX:NativeMemoryTracking=summary

From there, we used:

    jcmd <pid> VM.native_memory summary

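This gave us a detailed breakdown of the JVM's own native allocations by category: thread stacks, GC, compiler, class metadata, and so on. NMT covers only memory the JVM itself allocates, so allocator overhead such as glibc arena fragmentation shows up as the gap between NMT's committed total and the container's RSS.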
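The same summary can be pulled from inside the process, which is handy for exporting the committed total as a metric. This is a sketch under two assumptions: that the HotSpot DiagnosticCommand MBean exposes the jcmd command as the vmNativeMemory operation, and that the summary contains a line of the form "Total: reserved=<n>KB, committed=<n>KB" as it does on JDK 17. The class and method names are illustrative.

```java
import java.lang.management.ManagementFactory;
import java.util.OptionalLong;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.management.ObjectName;

/** Reads the NMT summary in-process and extracts the committed total. */
final class NmtSummary {

    // Assumed JDK 17 summary format: "Total: reserved=5894313KB, committed=489893KB"
    private static final Pattern TOTAL =
            Pattern.compile("Total: reserved=(\\d+)KB, committed=(\\d+)KB");

    /** Committed native memory in kB, or empty when the summary has no Total line. */
    static OptionalLong committedKb(String summary) {
        Matcher m = TOTAL.matcher(summary);
        return m.find() ? OptionalLong.of(Long.parseLong(m.group(2))) : OptionalLong.empty();
    }

    /** Invokes the equivalent of "jcmd <pid> VM.native_memory summary" on this JVM. */
    static String fetchSummary() throws Exception {
        ObjectName name = new ObjectName("com.sun.management:type=DiagnosticCommand");
        return (String) ManagementFactory.getPlatformMBeanServer().invoke(
                name,
                "vmNativeMemory",
                new Object[] {new String[] {"summary"}},
                new String[] {String[].class.getName()});
    }

    public static void main(String[] args) {
        try {
            OptionalLong committed = committedKb(fetchSummary());
            System.out.println(committed.isPresent()
                    ? "NMT committed (kB): " + committed.getAsLong()
                    : "No NMT total found; is -XX:NativeMemoryTracking=summary set?");
        } catch (Exception e) {
            System.out.println("Could not query NMT: " + e);
        }
    }
}
```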

Thread Count Alerts

We established the following:

Baseline thread counts per service
Alerts for any increase exceeding 50%
Dashboards showing thread growth patterns

Increases in thread counts often signal potential native memory leaks.
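The alert rule itself is simple enough to sketch. The helper below flags a thread count that exceeds a per-service baseline by more than an allowed fraction; the 400-thread baseline and the 50% threshold in main are the figures from this post, and the class and method names are illustrative rather than taken from our alerting stack.

```java
import java.lang.management.ManagementFactory;

/** Flags thread counts that grow beyond a baseline by more than an allowed fraction. */
final class ThreadCountAlert {

    /** True when current is strictly above baseline * (1 + allowedGrowth). */
    static boolean exceedsBaseline(int current, int baseline, double allowedGrowth) {
        return current > baseline * (1.0 + allowedGrowth);
    }

    public static void main(String[] args) {
        // Baseline per service; 400 is the JDK 8 figure from this post
        int baseline = args.length > 0 ? Integer.parseInt(args[0]) : 400;
        int current = ManagementFactory.getThreadMXBean().getThreadCount();

        System.out.println("Live threads: " + current + ", baseline: " + baseline);
        if (exceedsBaseline(current, baseline, 0.5)) {
            System.out.println("ALERT: thread count is more than 50% above baseline");
        }
    }
}
```

With these numbers, the 1600 threads we saw on JDK 17 would have fired the alert long before the first OOMKill.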

Monitoring Container-Level Memory Metrics

We shifted our focus to monitoring container-level memory instead of pod-level memory, which aggregates data from multiple containers:

    container_memory_working_set_bytes

By concentrating on container-level metrics, we were able to identify memory overshoots sooner and with greater accuracy.
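For local debugging without a metrics stack, roughly the same number can be derived from inside the container. cAdvisor computes the working set as memory usage minus inactive file cache, floored at zero; the sketch below does the same from cgroup v2 files. The file paths assume a cgroup v2 host with the default mount point, and the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

/** Approximates container_memory_working_set_bytes from cgroup v2 files. */
final class WorkingSet {

    private static final String INACTIVE_FILE = "inactive_file ";

    /** Reads the inactive_file value in bytes from memory.stat lines, or 0 if absent. */
    static long parseInactiveFile(List<String> memoryStatLines) {
        for (String line : memoryStatLines) {
            if (line.startsWith(INACTIVE_FILE)) {
                return Long.parseLong(line.substring(INACTIVE_FILE.length()).trim());
            }
        }
        return 0L;
    }

    /** Working set: usage minus inactive file cache, never below zero. */
    static long workingSetBytes(long usageBytes, long inactiveFileBytes) {
        return Math.max(0L, usageBytes - inactiveFileBytes);
    }

    public static void main(String[] args) throws IOException {
        Path current = Path.of("/sys/fs/cgroup/memory.current"); // cgroup v2 (assumption)
        Path stat = Path.of("/sys/fs/cgroup/memory.stat");
        if (!Files.isReadable(current) || !Files.isReadable(stat)) {
            System.out.println("cgroup v2 memory files not found");
            return;
        }
        long usage = Long.parseLong(Files.readString(current).trim());
        long inactive = parseInactiveFile(Files.readAllLines(stat));
        System.out.println("Working set (bytes): " + workingSetBytes(usage, inactive));
    }
}
```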

How We Reproduced the Issue Locally

To validate that the issue was inherent to JDK 17, we set up a local environment that mirrored the original setup.

Step 1: Run the Application in Docker

    docker run \
      --cpus=2 \
      --memory=4g \
      -e MALLOC_ARENA_MAX=2 \
      myservice:java17

Step 2: Inspect CPU Detection

    docker exec -it <container> bash
    java -Xlog:os+container=trace -version | grep -i active_processor_count

Here's What We Found:

Before the fix:

    active_processor_count = 96

After the fix:

    active_processor_count = 2

Step 3: Inspect Native Memory

    jcmd <pid> VM.native_memory summary

The arena counts correlated exactly with the detected CPU count.

Why This Problem Is Becoming More Common

A number of companies migrating from Java 8 to Java 17 (or 21) are encountering similar challenges. The reasons for this include:

Containerization exposes previously hidden JVM behaviors.
Local development machines typically have plenty of RAM and CPU power, unlike Kubernetes containers.
G1GC has now become the default garbage collector, and its overhead is greater than that of ParallelGC.
Many servers are equipped with 64 to 128 CPUs, and JVM thread scaling explodes if mis-detected.
Native memory usage in Java applications is rarely monitored, even in large organizations.
The behavior of glibc malloc arenas is poorly understood outside the realm of low-level systems engineering.

This combination of factors creates a “trap,” where a JVM upgrade can pass every QA test and still fail within hours of reaching production.

What We Would Do Differently Next Time

JVM Version Soak Testing

Moving forward, we will implement the following requirements:

A 48-hour load soak
A 24-hour canary production soak
Monitoring of thread counts
Oversight of native memory
Analysis of GC behavior logs

We've learned that a functional test suite alone is not sufficient.

JVM Upgrade Runbooks

We have developed a runbook that includes:

Required flags for containers
Required environment variables (MALLOC_ARENA_MAX)
Monitoring dashboards to check before promotion
A rollback decision tree

Rigorous Baseline Establishment

For each service, we will establish baselines for:

Heap usage
Native memory
Thread counts
GC overhead

Once these baselines are defined, comparing JDK 8 to JDK 17 will become straightforward.

Upgrade Checklist

Pre-Upgrade Steps

Set -XX:ActiveProcessorCount explicitly
Set MALLOC_ARENA_MAX=1 or 2
Choose your garbage collection method: G1GC or ParallelGC
Enable Native Memory Tracking
Establish memory baselines for both heap and native memory
Take note of thread count baselines
Enable container-level memory metrics
Conduct soak tests for 24 to 48 hours
Monitor and validate GC pause times while under load

Post-Deployment Actions

Observe thread counts for 2 to 6 hours
Compare native memory usage against your baseline
Check and validate arena counts
Ensure CPU detection is accurate
Rollback immediately if native memory rises more than 10–15% beyond the baseline

Conclusion

The upgrade to JDK 17 was one of the most instructive incidents our team has encountered. It highlighted several crucial points:

Native memory dominates JVM behavior in containers
CPU detection bugs can silently cripple services
GC changes between JDK releases can add 500 MB+ overhead
glibc malloc arenas can expand due to excessive thread proliferation
Monitoring heuristics from JDK 8 become less reliable when transitioning to JDK 17
Upgrading the JVM must be treated with the same caution as a major infrastructure overhaul, rather than simply a minor version update

The good news?

After applying the recommended fixes, our services now operate more efficiently on JDK 17 than they ever did on JDK 8. We're seeing higher GC throughput, shorter pause times, and better overall performance.

However, this experience serves as a critical reminder:

Modern Java is fast and powerful, but only when configured with an understanding of how the JVM interacts with container runtimes, native memory systems, and Linux allocators.

If you are planning a JDK 17 upgrade, use this guide, validate your assumptions, and closely monitor native memory alongside heap memory.

Opinions expressed by DZone contributors are their own.