Disabling UseNUMA Flag When CPU and Memory Node Misalign in JDK

In this blog, I share why we should disable UseNUMA flag in JDK in case both CPU and memory node misalign. Please go through the blog for details.

Swati Sharma

Sep. 18, 25 · Analysis

Likes (0)

Comment

Save

2.5K Views

Today, Java is still one of the widely used languages to build and run applications, and for the same reason, organizations prioritize measuring its performance.

When running a Java application on a multi-NUMA (Non-Uniform Memory Access) memory node, we need to pay attention to the remote accesses, if any, otherwise, that can result in increased latencies and hence result in reduced performance. The libnuma kernel library provides several policies, including localalloc, preferred, membind, and interleave, which enable users to affinitize their applications and run them with optimal utilization of the server nodes as per their requirements.

Before understanding the issue, we must have an idea about these policies:

Localalloc is the policy that allows the user to run an application in such a way that memory is allowed to allocate on nodes that are bound through cpunodebind, and in case these NUMA nodes fall short for memory allocation, the application can consume the memory from other NUMA nodes, which means that this policy does not restrict the memory allocation to only a single NUMA node.
Preferred is the policy that allows users to specify the node for allocation, but again does not restrict the allocation to the same node.
Membind policy is more restrictive for memory allocation, which allows the application to allocate memory on only the nodes that are specified.
Interleave policy allows the application to span across the specified nodes equally.

Along with the policy, there are kernel calls to set the mode for memory allocation; further details can be found here.

What Is the Problem With the Current Scenario?

We identified an issue in the memory allocation in JDK with the above policies, as when the application’s affinity is set to a single NUMA node using localalloc, the memory gets interleaved and causes performance degradation, which we observed with all types of GC, such as ParallelGC, G1GC, and ZGC.

We observed the performance issues related to the flag UseNUMA, as when the flag is enabled, the JVM decides to interleave memory among all NUMA nodes, and the CPU threads' access is remote, which causes the performance degradation.

With the figure below, you can understand the binding scenario of CPU and memory. Here, CPU node 0 threads are using memory node 0, and there is a remote access to memory nodes 1, 2, and 3, which is detrimental to performance.

With the GC logs, the issue becomes clear:

Test Case

Bind CPU nodes, and set --localalloc.

    Shell
   
   numactl -N 0 --localalloc java -Xlog:gc*=info -XX:+UseParallelGC -XX:+UseNUMA -version

Here, we observe that the failed output shows half of Eden being wasted for threads from a node that will never allocate:

    Plain Text
   
   [0.230s][info][gc,heap,exit ] eden space 524800K, 4% used [0x0000000580100000,0x0000000581580260,0x00000005a0180000)
[0.230s][info][gc,heap,exit ] lgrp 0 space 262400K, 8% used [0x0000000580100000,0x0000000581580260,0x0000000590140000)
[0.230s][info][gc,heap,exit ] lgrp 1 space 262400K, 0% used [0x0000000590140000,0x0000000590140000,0x00000005a0180000)

Before the fix, we observed that, as per the above GC logs, the Eden space gets further divided into lgroups based on the total number of NUMA nodes configured on the system, and ignoring the user bindings.

The Proposed Solution

We proposed a solution to disable the UseNUMA flag when the process gets invoked with incorrect node alignment. We check the NUMA node misalignment through the method numa_equal_bitmask, which returns true if there is a CPU and memory node misalignment with the user-specified binding.

We check the cpunodebind and membind (or Interleave for interleave policy) bitmask equality and disable UseNUMA when they are not equal.

Node Number 0123

    YAML
   
   1100 cpunodebind 
1111 membind

Disable UseNUMA as CPU and memory bitmask are not equal.

The figure below shows that there is remote access to other NUMA nodes.

Node Number 0123

    YAML
   
   1100 cpunodebind 
1100 membind

Enable UseNUMA as CPU and memory bitmask are equal.

The figure below shows a cleaner approach when there is no remote access.

This covers all the cases with all policies, and we have tested with below command:

    Ruby
   
   numactl --cpunodebind=0 --localalloc  java -Xlog:gc*=info -XX:+UseParallelGC -XX:+UseNUMA -version

For Localalloc and Preferred policies the membind bitmask returns true for all nodes, hence if cpunodebind is not bound to all nodes the UseNUMA will get disabled.

Bug-ID: Poor performance with UseNUMA when CPU and memory nodes are misaligned.

PR link: (https://github.com/openjdk/jdk/pull/22395)

With profiling data this is evident that remote access exists, also the CPU utilization drops by 20% before applying the patch.

After fix we observed that no lgroups are created as UseNUMA gets disabled when bound to a single node, here good output is not listing any lgrp in the output:

    Perl
   
   [0.231s][info][gc,heap,exit ] eden space 524800K, 8% used [0x0000000580100000,0x0000000582a00260,0x00000005a0180000)

  ... (no lgrps)

As a solution, a warning was printed when the UseNUMA flag was disabled, but this caused a few tests to fail, so another fix removed the log warning for a single NUMA node.

Bug-ID: https://bugs.openjdk.org/browse/JDK-8346834

PR link: https://github.com/openjdk/jdk/pull/22948

How You Can Test

Both flags UseNUMA and UseNUMAInterleave are disabled or enabled based on the numa binding and configuration.

test cases
CPU Binding	Memory Binding	Node Binding	Before Patch	After Patch	Expected Outcome
cpunodebind=0	Membind=0	single node	Disabled	Disabled	Disable NUMA
cpunodebind=0	Membind=0,1	node mismatch	Enabled	Disabled	Disable NUMA
cpunodebind=0,1	Membind=0,1	multi node with node match	Enabled	Enabled	Enable NUMA
N/A	N/A	-	Enabled	Enabled	Enable NUMA

cpunodebind	interleave	single node	Enabled	Disabled	Disable NUMA
cpunodebind	interleave	node mismatch	Enabled	Disabled	Disable NUMA
cpunodebind	interleave	multi node	Enabled	Enabled	Enable NUMA

cpunodebind	localalloc	single node	Enabled	Disabled	Disable NUMA
cpunodebind	localalloc	node mismatch	Enabled	Disabled	Disable NUMA
cpunodebind	localalloc	multi node	Enabled	Enabled	Enable NUMA

cpunodebind	preferred=0	single node	Enabled	Disabled	Disable NUMA
cpunodebind	preferred=0	node mismatch	Enabled	Disabled	Disable NUMA
cpunodebind	preferred	multi node	Enabled	Enabled	Enable NUMA
N/A	preferred	multi node	Enabled	Enabled	Enable NUMA

With the cases above, you can test by running with previous JDK (JDK 24 and older) versus JDK 25 to verify the effect. The command below is sample to test with different policies and with GC algorithms:

    Shell
   
   numactl --cpunodebind=0 --localalloc  java -Xlog:gc*=info -XX:+UseParallelGC -XX:+UseNUMA -version

What Is the Impact?

We observed the below performance improvement in both server-side Java throughput and server-side Java throughput under SLA after applying the patch on a 2 NUMA node system with, the similar impact can be observed on a higher (more than 2) NUMA node system.

The performance improvement we observed after applying the fix:

Server-side Java Application	Performance Improvement %
GC Type	Server-side Java throughput (in %)	Server-side Java throughput under SLA (in %)
Parallel GC	8.41	7.19
G1GC	24.39	25.59
ZGC	18.43	21.58

Conclusion

JDK25 and onwards UseNUMA flag will be disabled in case the memory and CPU are not aligned. The perception generally developers have is that localalloc is same as membind but that’s not the scenario as localalloc allows memory allocation on other NUMA nodes in case the current node is out of memory whereas membind restricts the allocation to the bound node.

References:

Note: I would like to thank Derek White and Sourav Paul from Intel for their guidance and a thorough technical review of this blog.

Disclaimer: The views, information or opinions expressed in this technical blog are my personal views/opinions and not of my employer Intel Technology India Pvt Ltd.

Performance improvement Memory (storage engine) Performance

Opinions expressed by DZone contributors are their own.

Related

Trending

Disabling UseNUMA Flag When CPU and Memory Node Misalign in JDK

In this blog, I share why we should disable UseNUMA flag in JDK in case both CPU and memory node misalign. Please go through the blog for details.

What Is the Problem With the Current Scenario?

Test Case

The Proposed Solution

Node Number 0123

Node Number 0123

How You Can Test

What Is the Impact?

Conclusion

Related

Partner Resources