DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • A Fresh Look at Optimizing Apache Spark Programs
  • Docker Performance Optimization: Real-World Strategies
  • Detecting and Solving Segmentation Faults in Linux Containers
  • Feature Flag Debt: Performance Impact in Enterprise Applications

Trending

  • Stop Poisoning Your Models: How I Built a CV Dataset Quality Toolkit I Can Reuse Forever
  • Introduction to Retrieval Augmented Generation (RAG)
  • How to Format Articles for DZone
  • A 5-Step SOC Guide That Meets RBI Expectations and Strengthens Security Operations
  1. DZone
  2. Software Design and Architecture
  3. Performance
  4. Disabling UseNUMA Flag When CPU and Memory Node Misalign in JDK

Disabling UseNUMA Flag When CPU and Memory Node Misalign in JDK

In this blog, I share why we should disable UseNUMA flag in JDK in case both CPU and memory node misalign. Please go through the blog for details.

By 
Swati Sharma user avatar
Swati Sharma
·
Sep. 18, 25 · Analysis
Likes (0)
Comment
Save
Tweet
Share
2.3K Views

Join the DZone community and get the full member experience.

Join For Free

Today, Java is still one of the widely used languages to build and run applications, and for the same reason, organizations prioritize measuring its performance. 

When running a Java application on a multi-NUMA (Non-Uniform Memory Access) memory node, we need to pay attention to the remote accesses, if any, otherwise, that can result in increased latencies and hence result in reduced performance. The libnuma kernel library provides several policies, including localalloc, preferred, membind, and interleave, which enable users to affinitize their applications and run them with optimal utilization of the server nodes as per their requirements.  

Before understanding the issue, we must have an idea about these policies: 

  1. Localalloc is the policy that allows the user to run an application in such a way that memory is allowed to allocate on nodes that are bound through cpunodebind, and in case these NUMA nodes fall short for memory allocation, the application can consume the memory from other NUMA nodes, which means that this policy does not restrict the memory allocation to only a single NUMA node. 
  2. Preferred is the policy that allows users to specify the node for allocation, but again does not restrict the allocation to the same node. 
  3. Membind policy is more restrictive for memory allocation, which allows the application to allocate memory on only the nodes that are specified. 
  4. Interleave policy allows the application to span across the specified nodes equally.

Along with the policy, there are kernel calls to set the mode for memory allocation; further details can be found here.

What Is the Problem With the Current Scenario?

We identified an issue in the memory allocation in JDK with the above policies, as when the application’s affinity is set to a single NUMA node using localalloc, the memory gets interleaved and causes performance degradation, which we observed with all types of GC, such as ParallelGC, G1GC, and ZGC. 

We observed the performance issues related to the flag UseNUMA, as when the flag is enabled, the JVM decides to interleave memory among all NUMA nodes, and the CPU threads' access is remote, which causes the performance degradation.

With the figure below, you can understand the binding scenario of CPU and memory. Here, CPU node 0 threads are using memory node 0, and there is a remote access to memory nodes 1, 2, and 3, which is detrimental to performance.


With the GC logs, the issue becomes clear:

Test Case

Bind CPU nodes, and set --localalloc.

Shell
 
numactl -N 0 --localalloc java -Xlog:gc*=info -XX:+UseParallelGC -XX:+UseNUMA -version 


Here, we observe that the failed output shows half of Eden being wasted for threads from a node that will never allocate:

Plain Text
 
[0.230s][info][gc,heap,exit ] eden space 524800K, 4% used [0x0000000580100000,0x0000000581580260,0x00000005a0180000)
[0.230s][info][gc,heap,exit ] lgrp 0 space 262400K, 8% used [0x0000000580100000,0x0000000581580260,0x0000000590140000)
[0.230s][info][gc,heap,exit ] lgrp 1 space 262400K, 0% used [0x0000000590140000,0x0000000590140000,0x00000005a0180000) 


Before the fix, we observed that, as per the above GC logs, the Eden space gets further divided into lgroups based on the total number of NUMA nodes configured on the system, and ignoring the user bindings.

The Proposed Solution 

We proposed a solution to disable the UseNUMA flag when the process gets invoked with incorrect node alignment. We check the NUMA node misalignment through the method numa_equal_bitmask, which returns true if there is a CPU and memory node misalignment with the user-specified binding. 

We check the cpunodebind and membind (or Interleave for interleave policy) bitmask equality and disable UseNUMA when they are not equal.  

Node Number 0123 

YAML
 
1100 cpunodebind 
1111 membind 


Disable UseNUMA as CPU and memory bitmask are not equal. 

The figure below shows that there is remote access to other NUMA nodes.


Node Number 0123 

YAML
 
1100 cpunodebind 
1100 membind 


Enable UseNUMA as CPU and memory bitmask are equal. 

The figure below shows a cleaner approach when there is no remote access.


This covers all the cases with all policies, and we have tested with below command: 

Ruby
 
numactl --cpunodebind=0 --localalloc  java -Xlog:gc*=info -XX:+UseParallelGC -XX:+UseNUMA -version 


For Localalloc and Preferred policies the membind bitmask returns true for all nodes, hence if cpunodebind is not bound to all nodes the UseNUMA will get disabled.

Bug-ID: Poor performance with UseNUMA when CPU and memory nodes are misaligned.

PR link: (https://github.com/openjdk/jdk/pull/22395)

With profiling data this is evident that remote access exists, also the CPU utilization drops by 20% before applying the patch.

After fix we observed that no lgroups are created as UseNUMA gets disabled when bound to a single node, here good output is not listing any lgrp in the output:

Perl
 
[0.231s][info][gc,heap,exit ] eden space 524800K, 8% used [0x0000000580100000,0x0000000582a00260,0x00000005a0180000)

  ... (no lgrps)


As a solution, a warning was printed when the UseNUMA flag was disabled, but this caused a few tests to fail, so another fix removed the log warning for a single NUMA node.

Bug-ID: https://bugs.openjdk.org/browse/JDK-8346834

PR link: https://github.com/openjdk/jdk/pull/22948

How You Can Test

Both flags UseNUMA and UseNUMAInterleave are disabled or enabled based on the numa binding and configuration.

test cases




CPU Binding

Memory Binding

Node Binding

Before Patch

After Patch

 Expected Outcome

cpunodebind=0

Membind=0

single node

Disabled

Disabled

Disable NUMA

cpunodebind=0

Membind=0,1

node mismatch

Enabled

Disabled

Disable NUMA

cpunodebind=0,1

Membind=0,1

multi node with node match

Enabled

Enabled

Enable NUMA

 N/A

 N/A

 -

Enabled

Enabled

 Enable NUMA




 

 


cpunodebind

interleave

single node

Enabled

Disabled

Disable NUMA

cpunodebind

interleave

node mismatch

Enabled

Disabled

Disable NUMA

cpunodebind

interleave

multi node 

Enabled

Enabled

Enable NUMA

 

 

 

 

 

 

cpunodebind

localalloc

single node

Enabled

Disabled

Disable NUMA

cpunodebind

localalloc

node mismatch

Enabled

Disabled

Disable NUMA

cpunodebind

localalloc

multi node 

Enabled

Enabled

Enable NUMA

 

 

 

 

 

 

cpunodebind

preferred=0

single node

Enabled

Disabled

Disable NUMA

cpunodebind

preferred=0

node mismatch

Enabled

Disabled

Disable NUMA

cpunodebind

preferred

multi node 

Enabled

Enabled

Enable NUMA

N/A

preferred

multi node 

Enabled

Enabled

Enable NUMA

 

With the cases above, you can test by running with previous JDK (JDK 24 and older) versus JDK 25 to verify the effect. The command below is sample to test with different policies and with GC algorithms:

Shell
 
numactl --cpunodebind=0 --localalloc  java -Xlog:gc*=info -XX:+UseParallelGC -XX:+UseNUMA -version 


What Is the Impact?

We observed the below performance improvement in both server-side Java throughput and server-side Java throughput under SLA after applying the patch on a 2 NUMA node system with, the similar impact can be observed on a higher (more than 2) NUMA node system. 

The performance improvement we observed after applying the fix: 


Server-side Java Application
Performance Improvement %

GC Type 

Server-side Java throughput (in %) 

Server-side Java throughput under SLA (in %) 

Parallel GC 

8.41

7.19

G1GC 

24.39

25.59

ZGC 

18.43

21.58

 

Conclusion

JDK25 and onwards UseNUMA flag will be disabled in case the memory and CPU are not aligned. The perception generally developers have is that localalloc is same as membind  but that’s not the scenario as localalloc allows memory allocation on other NUMA nodes in case the current node is out of memory whereas membind restricts the allocation to the bound node.  

References:

  1. https://man7.org/linux/man-pages/man2/set_mempolicy.2.html
  2. https://man7.org/linux/man-pages/man8/numactl.8.html

Note: I would like to thank Derek White and Sourav Paul from Intel for their guidance and a thorough technical review of this blog.

Disclaimer: The views, information or opinions expressed in this technical blog are my personal views/opinions and not of my employer Intel Technology India Pvt Ltd.

Performance improvement Memory (storage engine) Performance

Opinions expressed by DZone contributors are their own.

Related

  • A Fresh Look at Optimizing Apache Spark Programs
  • Docker Performance Optimization: Real-World Strategies
  • Detecting and Solving Segmentation Faults in Linux Containers
  • Feature Flag Debt: Performance Impact in Enterprise Applications

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook