Monitoring Out-Of-Memory Errors in Your Servers

Out-of-memory (OOM) errors happen when a machine runs critically low on memory. There is a lot you can do to avoid them.

Denny Zhang

Mar. 07, 17 · Opinion

In DevOps, you maintain DB instances and RAM-intensive services. You see OOM issues occasionally, don’t you? Yes, the scary out-of-memory issues.

Nobody enjoys OOM issues. When they do happen, what should be checked? More importantly, how do we monitor OOM issues and get alerts before something actually happens?

Here are some of my thoughts. Take a look and discuss with me!

What Is OOM?

OOM issues happen when a machine runs critically low on memory. Rather than risk a kernel panic, the OS protects itself by choosing a victim, usually the process using the most RAM. It kills that process and reclaims its memory.

How to Confirm an OOM Issue

When it happens, the system log will contain a “Killed process” entry. We can grep for it like this: dmesg -T | grep -C 5 -i 'killed process'.

denny@devops:/# dmesg -T | grep -C 5 -i 'killed process'
...
[Tue Feb 21 00:16:39 2017] Out of memory: Kill process 12098 (java) score 655 or sacrifice child
[Tue Feb 21 00:16:39 2017] Killed process 12098 (java) total-vm:223934456kB, anon-rss:17696224kB, file-rss:1153744kB
...
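
To turn that log output into something a script can act on, the victim's PID and name can be pulled out of each “Killed process” line. This is a minimal sketch, assuming the message format shown above (other kernel versions may word it differently); oom_victims is a hypothetical helper name, not part of any tool.

```shell
# Print "PID NAME" for every OOM victim found in kernel log text on stdin.
# Assumes the "Killed process 12098 (java) ..." wording shown above.
oom_victims() {
    sed -n 's/.*[Kk]illed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage (reading the kernel log may require root on some systems):
#   dmesg -T | oom_victims
```

The “Out of memory: Kill process ...” line is deliberately not matched, so each kill is reported once.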

Which Process Is Killed?

When the OS kills the process, it will:

List all processes before killing the victim.
Log the process ID. In our example, we know that PID 12098 has been killed.

Grep for the PID in that process list to find out which process was killed.

denny@devops:/# export pid=12098
denny@devops:/# dmesg -T | grep -C 1 "\[$pid\]"
[Tue Feb 21 00:16:39 2017] [11740]     0   11740     3763         42      12       3        0             0 rpc.idmapd
[Tue Feb 21 00:16:39 2017] [12098]   999   12098 55983614    4712492   11091     105        0             0 java
[Tue Feb 21 00:16:39 2017] [22050]     0   22050     3998        629      14       3        0             0 tmux

In the above example, we know some Java program has been killed. Well, I admit, it’s not crystal clear. The good thing is that the OOM killer usually targets processes using a huge amount of RAM. So, in practice, we can usually guess which process was killed.
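
The process list also shows how much memory the victim held. As a rough sketch, assuming the kernel's usual column layout for this table (the sixth field from the end is the resident set size in pages; the header row is not shown above) and 4 KiB pages, a row can be converted to kilobytes. The helper name oom_row_rss_kb is hypothetical.

```shell
# Print "NAME RSS_KB" for one row of the OOM process list.
#   $1: the row text as printed by dmesg
#   $2: page size in kB (assumed 4; check with: getconf PAGESIZE)
# Fields are counted from the end of the line, so the timestamp format
# does not matter, but a process name containing spaces would break it.
oom_row_rss_kb() {
    printf '%s\n' "$1" |
        awk -v page_kb="${2:-4}" '{ printf "%s %d\n", $NF, $(NF-5) * page_kb }'
}
```

For the java row above this gives 4712492 pages × 4 kB = 18849968 kB, which equals anon-rss plus file-rss (17696224 kB + 1153744 kB) in the “Killed process” line.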

OOM Exclusion: Show Mercy to My Critical Processes

The OS chooses its victim by scoring all processes. We can explicitly lower the score of certain processes so that they might survive while other, less critical processes are sacrificed. This could buy us more time before things get worse.

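Adjust the score by writing to /proc/$pid/oom_score_adj (lowering it typically requires root). The value ranges from -1000 to 1000: the higher the value, the more likely the process will be killed first, and -1000 exempts the process from the OOM killer entirely. Note that -17 was the “never kill” value of the older oom_adj interface; in oom_score_adj it is only a small reduction.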

echo -17 > /proc/$pid/oom_score_adj

Remember: Lowering the score is no guarantee that your processes will be safe. The machine may run so low on RAM that the OS needs to sacrifice more processes, yours included.
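
Since a bad value is easy to write by accident, the adjustment can be wrapped in a small helper that checks the range first. This is a sketch under assumptions: the function names and the optional proc_root argument are hypothetical (the latter exists only so the helper can be tried against a scratch directory), and the accepted range of -1000 to 1000 is the one documented for oom_score_adj.

```shell
# Succeed only if $1 is an integer that oom_score_adj accepts (-1000..1000).
valid_oom_score_adj() {
    case "$1" in
        ''|-|*[!0-9-]*|?*-*) return 1 ;;   # empty, lone dash, non-digit, misplaced dash
    esac
    [ "$1" -ge -1000 ] && [ "$1" -le 1000 ]
}

# Write VALUE into PID's oom_score_adj.
#   $1: pid   $2: value   $3: proc root (hypothetical override, default /proc)
set_oom_score_adj() {
    pid="$1"; value="$2"; proc_root="${3:-/proc}"
    valid_oom_score_adj "$value" || {
        echo "invalid oom_score_adj: $value" >&2
        return 1
    }
    printf '%s\n' "$value" > "$proc_root/$pid/oom_score_adj"
}

# Usage (as root):
#   set_oom_score_adj "$pid" -500
```

The setting belongs to the running process, so it has to be applied again after the service restarts.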

How to Avoid OOM

Well, you have to make sure processes won’t take all the RAM. Usually, this means:

Reasonable capacity planning. If your cluster needs 100GB of RAM in total but you physically have only 90GB, it simply won’t work.
Enforce a RAM quota for given services. Let’s say process A can only use 2GB RAM at most, and process B can only use 4GB RAM at most. If you’re using Java, the -Xmx (maximum heap size) and -Xms (initial heap size) options would be your friends.

Honestly speaking, I think you will have little room to avoid OOM if there is not enough available RAM in your servers.

How to Detect OOM Issues

Here is what I do.

Monitor OS Memory Usage

OOM only happens when free memory is very low at the OS level. So, monitoring it helps us stay on top of potential OOM issues. If you don’t have scripts for this, check out check_linux_stats.pl.
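
For a sense of what such a check does, here is a minimal sketch (not check_linux_stats.pl itself) that reads MemTotal and MemAvailable from /proc/meminfo and compares the available percentage with a threshold. The function names, the optional file argument, and the Nagios-style exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN) are assumptions made for illustration. MemAvailable is missing on very old kernels, in which case this sketch reports 0%.

```shell
# Print available memory as a whole percentage of total memory.
#   $1: meminfo file (default /proc/meminfo)
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }' \
        "${1:-/proc/meminfo}"
}

# Alert when the available percentage drops below a threshold.
#   $1: minimum acceptable percentage   $2: meminfo file (optional)
check_memory() {
    threshold="$1"
    pct="$(mem_available_pct "${2:-/proc/meminfo}")" || {
        echo "UNKNOWN: cannot read memory stats"
        return 3
    }
    if [ "$pct" -lt "$threshold" ]; then
        echo "CRITICAL: only ${pct}% memory available"
        return 2
    fi
    echo "OK: ${pct}% memory available"
}

# Usage:
#   check_memory 10
```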

Monitor Memory Usage of Critical Processes

Usually, OOM happens only if certain heavy-weight processes keep taking more and more RAM. See how to monitor process memory usage: check_proc_mem.sh.
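
The idea behind a per-process check can be sketched as follows. This is not check_proc_mem.sh; it is a simplified illustration that reads the VmRSS line from /proc/$pid/status and compares it with a limit in kB. The function names, the optional proc_root argument, and the Nagios-style exit codes are assumptions.

```shell
# Print the resident set size of a process in kB.
#   $1: pid   $2: proc root (hypothetical override, default /proc)
# Fails if the status file is unreadable or has no VmRSS line
# (kernel threads have none).
proc_rss_kb() {
    awk '/^VmRSS:/ { print $2; found = 1 } END { exit !found }' \
        "${2:-/proc}/$1/status"
}

# Alert when a process uses more resident memory than allowed.
#   $1: pid   $2: limit in kB   $3: proc root (optional)
check_proc_rss() {
    pid="$1"; limit_kb="$2"
    rss="$(proc_rss_kb "$pid" "${3:-/proc}" 2>/dev/null)" || {
        echo "UNKNOWN: cannot read RSS of PID $pid"
        return 3
    }
    if [ "$rss" -gt "$limit_kb" ]; then
        echo "CRITICAL: PID $pid uses ${rss} kB (limit ${limit_kb} kB)"
        return 2
    fi
    echo "OK: PID $pid uses ${rss} kB"
}

# Usage: alert if the process exceeds 4 GB (4194304 kB)
#   check_proc_rss "$pid" 4194304
```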

Monitor System Log to Detect OOM Incidents

We can keep polling system logs and get alerted when OOM does happen. Currently, I don’t have scripts to check this. If it turns out to be a strong requirement, I’d be happy to implement one and open-source it.
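
Until such a script exists, the idea can be sketched in a few lines. The following is only an illustration under assumptions: check_oom_log is a hypothetical name, the match relies on the “Killed process” wording seen in the log earlier in this article, and the exit codes follow the Nagios convention (0 OK, 2 CRITICAL).

```shell
# Read kernel log text on stdin and alert if it contains any OOM kill.
check_oom_log() {
    # grep -c prints the match count; it exits non-zero on zero matches,
    # which is not an error here.
    count="$(grep -c -i 'killed process')" || true
    if [ "${count:-0}" -gt 0 ]; then
        echo "CRITICAL: $count OOM kill(s) found in kernel log"
        return 2
    fi
    echo "OK: no OOM kills found"
}

# Usage:
#   dmesg -T | check_oom_log
```

Because dmesg keeps old messages until they are overwritten or the machine reboots, a check like this keeps firing long after the incident. On systemd hosts, feeding it a recent window instead, such as journalctl -k --since "10 minutes ago", is one way to avoid that.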

Published at DZone with permission of Denny Zhang, DZone MVB. See the original article here.