In DevOps, you maintain DB instances and RAM-intensive services. You see OOM issues occasionally, don’t you? Yes, the scary out-of-memory issues.
Nobody enjoys OOM issues. When they do happen, what should be checked? More importantly, how do we monitor OOM issues and get alerts before something actually happens?
Here are some of my thoughts. Take a look and discuss with me!
What Is OOM?
OOM issues happen when a machine runs very low on memory. The OS doesn’t want to run into a kernel panic, so, as self-protection, it chooses a victim, usually the process using the most RAM, kills it, and releases its memory.
How to Confirm an OOM Issue
When it happens, the system log will contain “Killed process” entries, so we can grep for them like this:
dmesg -T | grep -C 5 -i 'killed process'
denny@devops:/# dmesg -T | grep -C 5 -i 'killed process'
...
[Tue Feb 21 00:16:39 2017] Out of memory: Kill process 12098 (java) score 655 or sacrifice child
[Tue Feb 21 00:16:39 2017] Killed process 12098 (java) total-vm:223934456kB, anon-rss:17696224kB, file-rss:1153744kB
...
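To pull just the victims out of the kernel log, the same search can be turned into a small parser. This is a minimal sketch, assuming the log lines look like the example above; list_oom_victims is a hypothetical helper name, not an existing tool.

```shell
# list_oom_victims: read kernel log text on stdin and print "PID NAME"
# for every "Killed process" line.
# Hypothetical helper; assumes the message format shown in the example above.
list_oom_victims() {
    sed -n 's/.*Killed process \([0-9][0-9]*\) (\([^)]*\)).*/\1 \2/p'
}

# Usage: dmesg -T | list_oom_victims
```

The "Kill process ... or sacrifice child" line is deliberately not matched, so each kill is reported once.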
Which Process Is Killed?
When the OS kills the process, it will:
- List all processes before killing the victim.
- Log the process ID. In our example, we know pid 12098 has been killed.
Grep for $pid to find out which process was killed.
denny@devops:/# export pid=12098
denny@devops:/# dmesg -T | grep -C 1 "\[$pid\]"
[Tue Feb 21 00:16:39 2017]  0 11740 3763 42 12 3 0 0 rpc.idmapd
[Tue Feb 21 00:16:39 2017]  999 12098 55983614 4712492 11091 105 0 0 java
[Tue Feb 21 00:16:39 2017]  0 22050 3998 629 14 3 0 0 tmux
In the above example, we know a Java program has been killed. Well, I admit, it’s not crystal clear which one. The good thing is that the OOM killer favors processes using a huge amount of RAM. So, in practice, we can usually guess which process was killed.
OOM Exclusion: Show Mercy to My Critical Processes
The OS chooses its victim by scoring all processes. We can explicitly lower the score of certain processes so that they might survive while other, less critical processes are sacrificed. This could buy us more time before things get worse.
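The current score of every process is exposed in /proc/$pid/oom_score, so the likely victims can be listed ahead of time. Here is a minimal sketch; top_oom_scores is a hypothetical helper name, and its second argument exists only so the function can be pointed at a fake directory for testing.

```shell
# top_oom_scores [COUNT] [PROC_ROOT]
# Print "SCORE PID NAME" for the COUNT processes the kernel currently
# rates as the most likely OOM victims (highest score first).
# PROC_ROOT is a testing hook and defaults to /proc.
top_oom_scores() {
    count="${1:-5}"
    proc_root="${2:-/proc}"
    for dir in "$proc_root"/[0-9]*; do
        # Processes can exit while we iterate, so skip unreadable entries.
        [ -r "$dir/oom_score" ] || continue
        score=$(cat "$dir/oom_score" 2>/dev/null) || continue
        name=$(cat "$dir/comm" 2>/dev/null) || name="?"
        echo "$score ${dir##*/} $name"
    done | sort -rn | head -n "$count"
}

# Usage: top_oom_scores 10
```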
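To exclude a process, write an adjustment value to /proc/$pid/oom_score_adj. The value ranges from -1000 to 1000. The higher the value you set for oom_score_adj, the more likely the process will be killed first; -1000 exempts the process from the OOM killer entirely. (The value -17 comes from the legacy /proc/$pid/oom_adj file, where it meant "never kill." On the oom_score_adj scale, it lowers the score only slightly.)
echo -17 > /proc/$pid/oom_score_adj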
Remember: There is no guarantee that your processes will be safe. The machines may run into very low RAM, and the OS might need to sacrifice more processes, yours included.
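Since a typo in the value or PID is easy to make, the write can be wrapped in a small function that validates its input first. This is only a sketch: set_oom_score_adj is a hypothetical helper name, the third argument is a testing hook, and lowering the value on a real system normally requires root.

```shell
# set_oom_score_adj PID VALUE [PROC_ROOT]
# Check VALUE against the kernel's -1000..1000 range, then write it to
# the process's oom_score_adj file.
# PROC_ROOT is a testing hook and defaults to /proc.
set_oom_score_adj() {
    pid="$1"; value="$2"; proc_root="${3:-/proc}"
    # Accept only an optional leading minus sign followed by digits.
    case "$value" in
        ''|-|*[!0-9-]*|?*-*) echo "not an integer: $value" >&2; return 2 ;;
    esac
    if [ "$value" -lt -1000 ] || [ "$value" -gt 1000 ]; then
        echo "value must be between -1000 and 1000" >&2
        return 2
    fi
    target="$proc_root/$pid/oom_score_adj"
    [ -e "$target" ] || { echo "no such process: $pid" >&2; return 1; }
    echo "$value" > "$target"
}

# Usage (as root): set_oom_score_adj "$pid" -500
```

The setting belongs to the running process, so it has to be applied again whenever the service restarts.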
How to Avoid OOM
Well, you have to make sure processes won’t take all the RAM. Usually, this means:
- Reasonable capacity planning. If your cluster needs 100GB RAM in total but you physically have only 90GB, it simply won’t work.
- Enforce a RAM quota for given services. Let’s say process A can only use 2GB RAM at most, and process B can only use 4GB RAM at most. If you’re using Java, -Xmx (maximum heap size) and -Xms (initial heap size) would be your friends.
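For processes that have no built-in memory setting, one possible way to enforce a quota is the shell's ulimit. The sketch below is an illustration rather than a recommendation: run_with_mem_limit is a hypothetical helper name, and ulimit -v caps virtual address space (in kB), which is not the same thing as resident RAM. A process that hits the cap sees failed allocations instead of being OOM-killed.

```shell
# run_with_mem_limit LIMIT_KB COMMAND [ARGS...]
# Run COMMAND in a subshell whose virtual memory is capped with ulimit -v,
# so the limit does not leak into the calling shell.
run_with_mem_limit() {
    limit_kb="$1"; shift
    (
        ulimit -v "$limit_kb" || exit 1
        exec "$@"
    )
}

# Usage: cap a command at roughly 2GB of virtual memory.
#   run_with_mem_limit 2097152 some_command
#
# For a JVM, the heap itself is capped with -Xmx (app.jar is a placeholder):
#   java -Xms512m -Xmx2g -jar app.jar
```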
Honestly speaking, I think you will have little room to avoid OOM if there is not enough available RAM on your servers.
How to Detect OOM Issues
Here is what I do.
Monitor OS Memory Usage
OOM only happens when available memory is very low at the OS level. So, monitoring it helps us stay on top of potential OOM issues. If you don’t have scripts for this, take a look at check_linux_stats.pl.
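If you would rather roll your own check, the numbers are available in /proc/meminfo. The following is a minimal sketch, not a replacement for the script mentioned above. The function names and thresholds are hypothetical, the exit codes follow the common Nagios convention, and the MemAvailable field requires Linux 3.14 or later.

```shell
# mem_available_pct [MEMINFO_FILE]
# Print MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
    awk '/^MemTotal:/ {total=$2} /^MemAvailable:/ {avail=$2}
         END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }' \
        "${1:-/proc/meminfo}"
}

# check_memory WARN_PCT CRIT_PCT [MEMINFO_FILE]
# Nagios-style exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
check_memory() {
    pct=$(mem_available_pct "$3") || { echo "UNKNOWN: cannot read meminfo"; return 3; }
    if [ "$pct" -le "$2" ]; then
        echo "CRITICAL: ${pct}% memory available"; return 2
    elif [ "$pct" -le "$1" ]; then
        echo "WARNING: ${pct}% memory available"; return 1
    fi
    echo "OK: ${pct}% memory available"
}

# Usage: warn at 20% available or less, go critical at 10% or less.
#   check_memory 20 10
```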
Monitor Memory Usage of Critical Processes
Usually, OOM happens only if certain heavy-weight processes keep taking more and more RAM. See how to monitor process memory usage: check_proc_mem.sh.
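The idea behind such a check can be sketched in a few lines by reading VmRSS from /proc/$pid/status. This is an illustrative sketch only, not the content of the script named above. The function names are hypothetical, and the last argument is a testing hook.

```shell
# proc_rss_kb PID [PROC_ROOT]
# Print the resident set size (VmRSS) of a process in kB.
# Fails if the process is gone or reports no VmRSS (e.g. kernel threads).
proc_rss_kb() {
    awk '/^VmRSS:/ {print $2; found=1} END {exit !found}' \
        "${2:-/proc}/$1/status" 2>/dev/null
}

# check_proc_rss PID MAX_KB [PROC_ROOT]
# Return 0 while RSS is within MAX_KB, 2 once it exceeds it, 3 if unknown.
check_proc_rss() {
    rss=$(proc_rss_kb "$1" "$3") || { echo "UNKNOWN: no RSS for pid $1"; return 3; }
    if [ "$rss" -gt "$2" ]; then
        echo "CRITICAL: pid $1 uses ${rss} kB (limit $2 kB)"; return 2
    fi
    echo "OK: pid $1 uses ${rss} kB"
}

# Usage: alert once the process exceeds 4GB of resident memory.
#   check_proc_rss "$pid" 4194304
```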
Monitor System Log to Detect OOM Incidents
We can keep polling system logs and get alerted when OOM does happen. Currently, I don’t have scripts to check this. If it turns out to be a strong requirement, I’d be happy to implement one and open-source it.
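In the meantime, a rough starting point could look like the sketch below. It assumes the "Killed process" message format from the earlier example and uses Nagios-style exit codes. Both function names are hypothetical.

```shell
# count_oom_kills: count "Killed process" lines in kernel log text on stdin.
# grep -c exits non-zero when nothing matches, hence the "|| true".
count_oom_kills() {
    grep -c 'Killed process' || true
}

# check_oom_log: read kernel log text on stdin.
# Return 2 (CRITICAL) if any OOM kill is found, 0 (OK) otherwise.
check_oom_log() {
    kills=$(count_oom_kills)
    if [ "$kills" -gt 0 ]; then
        echo "CRITICAL: $kills OOM kill(s) found in kernel log"
        return 2
    fi
    echo "OK: no OOM kills found"
}

# Usage (e.g. from cron or a monitoring agent): dmesg | check_oom_log
```

Note that the kernel ring buffer keeps old messages until they are overwritten or cleared, so a check like this keeps firing for a past incident until then.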