Monitoring Out-Of-Memory Errors in Your Servers

Out-of-memory (OOM) errors happen when your machines run very low on memory. There is a lot you can do to detect and avoid them.

By Denny Zhang · Mar. 07, 17 · Opinion

In DevOps, you maintain DB instances and RAM-intensive services. You see OOM issues occasionally, don’t you? Yes, the scary out-of-memory issues.

Nobody enjoys OOM issues. When they do happen, what should be checked? More importantly, how do we monitor OOM issues and get alerts before something actually happens?

Here are some of my thoughts. Take a look and discuss with me!

What Is OOM?

OOM issues happen when a machine runs very low on memory. Rather than risk a kernel panic, the OS protects itself by choosing a victim, usually the process using the most RAM, killing it, and releasing its memory.

How to Confirm an OOM Issue

When it happens, the kernel log will contain “killed process” entries, so we can grep for them like this: dmesg -T | grep -C 5 -i 'killed process'.

denny@devops:/# dmesg -T | grep -C 5 -i 'killed process'
...
[Tue Feb 21 00:16:39 2017] Out of memory: Kill process 12098 (java) score 655 or sacrifice child
[Tue Feb 21 00:16:39 2017] Killed process 12098 (java) total-vm:223934456kB, anon-rss:17696224kB, file-rss:1153744kB
...
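If you only want the victims rather than the surrounding context, a small filter helps. Here is a minimal sketch, assuming the "Killed process <pid> (<name>)" wording from the sample output above; oom_victims is a name I made up for illustration, and other kernel versions may word the line differently.

```shell
# Print "pid name" for every OOM kill found in kernel log text on stdin.
# Assumes the "Killed process <pid> (<name>)" wording shown in the sample;
# the earlier "Kill process ..." announcement line is deliberately ignored.
oom_victims() {
    sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage (dmesg may need root on systems that restrict it):
#   dmesg -T | oom_victims
```

For the sample log above, this prints a single line with the pid and the process name.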

Which Process Is Killed?

When the OS kills the process, it will:

  1. List all processes before killing the victim.
  2. Log the process ID. In our example, we know pid(12098) has been killed.

Grep the log for that pid to find out which process was killed.

denny@devops:/# export pid=12098
denny@devops:/# dmesg -T | grep -C 1 "\[$pid\]"
[Tue Feb 21 00:16:39 2017] [11740]     0   11740     3763         42      12       3        0             0 rpc.idmapd
[Tue Feb 21 00:16:39 2017] [12098]   999   12098 55983614    4712492   11091     105        0             0 java
[Tue Feb 21 00:16:39 2017] [22050]     0   22050     3998        629      14       3        0             0 tmux

In the above example, we know that some Java program has been killed. Well, I admit, it’s not crystal clear which one. The good thing is that the OOM killer usually picks processes using a huge amount of RAM. So, in practice, we can almost always work out which process was killed.
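The numbers in that process table are easier to read once converted. The sketch below assumes the last seven columns are total_vm, rss, nr_ptes, nr_pmds, swapents, oom_score_adj, and name (which matches the sample, though other kernel versions print different columns), that sizes are counted in pages, and that a page is 4 kB. The function name oom_table_to_kb is hypothetical.

```shell
# Convert one line of the OOM killer's process table from pages to kB.
# Assumptions: the last seven columns are
#   total_vm rss nr_ptes nr_pmds swapents oom_score_adj name
# and sizes are in pages. The page size in kB is the optional first
# argument (default 4). Process names containing spaces are not handled.
oom_table_to_kb() {
    awk -v page_kb="${1:-4}" '{
        printf "%s total_vm_kb=%d rss_kb=%d\n", $NF, $(NF-6) * page_kb, $(NF-5) * page_kb
    }'
}

# Usage:
#   dmesg -T | grep "\[$pid\]" | oom_table_to_kb
```

For the java line above, this gives total_vm_kb=223934456 and rss_kb=18849968, which match total-vm and the sum of anon-rss and file-rss in the "Killed process" entry.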

OOM Exclusion: Show Mercy to My Critical Processes

The OS chooses a victim by scoring all processes. We can explicitly lower the score of certain processes so that they might survive while other, less critical processes are sacrificed. This could buy us more time before things get worse.

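To do that, write an adjustment value to /proc/$pid/oom_score_adj (the file already exists for every process). Values range from -1000 to 1000: the higher the value, the more likely the process will be killed first, and -1000 exempts the process from the OOM killer entirely. The -17 you often see in older guides belongs to the legacy /proc/$pid/oom_adj file; written to oom_score_adj, it barely changes the score.

echo -1000 > /proc/$pid/oom_score_adj

Remember: Anything short of -1000 is no guarantee that your processes will be safe. The machines may run into very low RAM, and the OS might need to sacrifice more processes, yours included.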
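Since a typo in that value is easy to miss, it can help to wrap the write in a small check. This is only a sketch: set_oom_score_adj is a hypothetical helper, and its third argument (an alternative proc root) exists purely so the function can be tried against a scratch directory. The valid range of -1000 to 1000 is the kernel's range for oom_score_adj.

```shell
# Set the OOM score adjustment for a process after validating the value.
# Usage: set_oom_score_adj PID VALUE [PROC_ROOT]
# PROC_ROOT is a hypothetical testing hook and defaults to /proc.
set_oom_score_adj() {
    pid=$1 value=$2 proc_root=${3:-/proc}
    # Accept only an optional leading minus sign followed by digits.
    case $value in
        ''|-|*[!0-9-]*|?*-*) echo "not an integer: $value" >&2; return 1 ;;
    esac
    if [ "$value" -lt -1000 ] || [ "$value" -gt 1000 ]; then
        echo "out of range (-1000..1000): $value" >&2
        return 1
    fi
    echo "$value" > "$proc_root/$pid/oom_score_adj"
}

# Usage (lowering a score usually needs root):
#   set_oom_score_adj "$pid" -500
```

The setting belongs to the running process, so it has to be applied again whenever the service restarts.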

How to Avoid OOM

Well, you have to make sure processes won’t take all the RAM. Usually, this means:

  1. Reasonable capacity planning. If your cluster needs 100GB RAM in total, but you physically only have 90GB, it simply won’t work.
  2. Enforce a RAM quota for given services. Let’s say process A can only use 2GB RAM at most, and process B can only use 4GB RAM at most. If you’re using Java, the -Xmx (maximum heap size) and -Xms (initial heap size) options would be your friends. Keep in mind that the heap is only part of a JVM’s memory footprint, so leave some headroom.
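The capacity check in point 1 is plain arithmetic, and it is worth scripting so that it runs whenever quotas change. A minimal sketch follows; quotas_fit is a hypothetical helper, and all numbers are whole GB.

```shell
# Return 0 if the per-service RAM quotas (whole GB) fit into TOTAL_GB.
# Usage: quotas_fit TOTAL_GB QUOTA_GB...
quotas_fit() {
    total_gb=$1; shift
    sum=0
    for q in "$@"; do
        sum=$((sum + q))
    done
    echo "quotas=${sum}GB total=${total_gb}GB"
    [ "$sum" -le "$total_gb" ]
}

# Usage:
#   quotas_fit 8 2 4     # process A (2GB) and process B (4GB) on an 8GB host
```

With quotas adding up to 100GB on a 90GB host, the function returns non-zero, which is the "it simply won't work" case.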

Honestly speaking, I think you will have little room to avoid OOM if there is not enough available RAM in your servers.

How to Detect OOM Issues

Here is what I do.

Monitor OS Memory Usage

OOM only happens when free memory is very low at the OS level. So, monitoring this helps us stay on top of potential OOM issues. If you don’t have a script for this yet, take a look at check_linux_stats.pl.
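If you would rather roll your own check, the idea fits in a few lines. The sketch below is not check_linux_stats.pl. It is a hypothetical check_mem_available function that reads MemTotal and MemAvailable from /proc/meminfo (MemAvailable needs a reasonably recent kernel) and returns exit codes in the common Nagios plugin style. The default thresholds of 20% and 10% are arbitrary assumptions.

```shell
# Nagios-style check of available memory as a percentage of total.
# Usage: check_mem_available [WARN_PCT] [CRIT_PCT] [MEMINFO_FILE]
# Returns 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
check_mem_available() {
    warn=${1:-20} crit=${2:-10} meminfo=${3:-/proc/meminfo}
    pct=$(awk 'BEGIN { t = 0; a = -1 }
               /^MemTotal:/     { t = $2 }
               /^MemAvailable:/ { a = $2 }
               END { if (t > 0 && a >= 0) printf "%d\n", a * 100 / t; else print -1 }' \
              "$meminfo" 2>/dev/null) || pct=""
    case $pct in
        ''|-1) echo "UNKNOWN: cannot read MemTotal/MemAvailable from $meminfo"; return 3 ;;
    esac
    if [ "$pct" -le "$crit" ]; then echo "CRITICAL: ${pct}% memory available"; return 2; fi
    if [ "$pct" -le "$warn" ]; then echo "WARNING: ${pct}% memory available"; return 1; fi
    echo "OK: ${pct}% memory available"
}

# Usage:
#   check_mem_available 20 10
```

The optional file argument is there only so the function can be tried against a sample meminfo file.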

Monitor Memory Usage of Critical Processes

Usually, OOM happens only if certain heavy-weight processes keep taking more and more RAM. See how to monitor process memory usage: check_proc_mem.sh.
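For a rough idea of what such a check involves, here is a sketch that reads VmRSS from /proc/$pid/status. It is not check_proc_mem.sh; proc_rss_kb and check_proc_rss are hypothetical names, the limit is whatever you pass in, and the optional proc root argument is only a testing hook.

```shell
# Print the resident set size (kB) of a process from its status file.
# Usage: proc_rss_kb PID [PROC_ROOT]
proc_rss_kb() {
    awk '/^VmRSS:/ { print $2; found = 1 } END { exit !found }' \
        "${2:-/proc}/$1/status" 2>/dev/null
}

# Return 2 (critical) when the process uses more than LIMIT_KB of RAM,
# 3 (unknown) when no VmRSS value can be read, 0 otherwise.
# Usage: check_proc_rss PID LIMIT_KB [PROC_ROOT]
check_proc_rss() {
    rss=$(proc_rss_kb "$1" "${3:-/proc}") || { echo "UNKNOWN: no VmRSS for pid $1"; return 3; }
    if [ "$rss" -gt "$2" ]; then
        echo "CRITICAL: pid $1 rss=${rss}kB limit=${2}kB"
        return 2
    fi
    echo "OK: pid $1 rss=${rss}kB"
}

# Usage: alert when the process goes above 16GB of resident memory
#   check_proc_rss "$pid" 16777216
```

Tracking this value over time shows whether a process keeps growing, well before the OOM killer gets involved.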

Monitor System Log to Detect OOM Incidents

We can keep polling system logs and get alerted when OOM does happen. Currently, I don’t have scripts to check this. If it turns out to be a strong requirement, I’d be happy to implement one and open-source it. 
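Until then, the polling idea can be sketched in a few lines. This is an illustration only, not a published script: check_oom_log is a hypothetical function that counts "killed process" entries in whatever log text it receives on stdin and returns a Nagios-style exit code.

```shell
# Report OOM kills found in kernel log text on stdin.
# Returns 2 (critical) if any "killed process" entry is present, else 0.
check_oom_log() {
    count=$(grep -ci 'killed process' || true)
    if [ "$count" -gt 0 ]; then
        echo "CRITICAL: $count OOM kill(s) found"
        return 2
    fi
    echo "OK: no OOM kills found"
}

# Usage:
#   dmesg -T | check_oom_log
```

A real check would also need to remember what it has already reported. Otherwise the same old entry keeps alerting for as long as it stays in the kernel log buffer.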

Published at DZone with permission of Denny Zhang, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
