
Out of Memory: Kill Process or Sacrifice Child

by Nikita Salnikov-Tarnovski · Jun. 07, 14 · Interview

It is 6 AM. I am awake, summarizing the sequence of events leading to my way-too-early wake-up call. As such stories start, my phone alarm went off. Sleepy and grumpy, I checked the phone to see whether I was really crazy enough to set the wake-up alarm for 5 AM. No, it was our monitoring system indicating that one of the Plumbr services had gone down.

As a seasoned veteran in the domain, I made the first correct step towards a solution by turning on the espresso machine. With a cup of coffee, I was equipped to tackle the problem. The first suspect, the application itself, seemed to have behaved completely normally before the crash: no errors, no warning signs, no trace of anything suspicious in the application logs.

The monitoring we have in place had noticed the death of the process and had already restarted the crashed service. But as I already had caffeine in my bloodstream, I started to gather more evidence. 30 minutes later I found myself staring at the following in /var/log/kern.log:

Jun  4 07:41:59 plumbr kernel: [70667120.897649] Out of memory: Kill process 29957 (java) score 366 or sacrifice child
Jun  4 07:41:59 plumbr kernel: [70667120.897701] Killed process 29957 (java) total-vm:2532680kB, anon-rss:1416508kB, file-rss:0kB

Apparently we had become victims of the Linux kernel internals. As you all know, Linux is built with a bunch of unholy creatures (called ‘daemons’). Those daemons are shepherded by several kernel jobs, one of which seems to be an especially sinister entity. Apparently all modern Linux kernels have a built-in mechanism called the “Out Of Memory killer”, which can annihilate your processes under extremely low memory conditions. When such a condition is detected, the killer is activated and picks a process to kill. The target is picked using a set of heuristics that score all processes and select the one with the worst (that is, the highest) score, such as the 366 reported in the log above.
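To see how the kernel currently rates a process, you can read the score it exposes under /proc. The following is a minimal sketch, not part of the original investigation: the class name OomScoreReader and its helper methods are made up for illustration, and it assumes a Linux system where /proc/<pid>/oom_score and /proc/<pid>/oom_score_adj exist. On other systems it simply reports an empty value.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.OptionalInt;

// Hypothetical helper, named for illustration only.
class OomScoreReader {

    // Parses the single integer the kernel writes into files such as /proc/<pid>/oom_score.
    static int parseScore(String fileContent) {
        return Integer.parseInt(fileContent.trim());
    }

    // Reads a score file; returns empty if it is missing or unreadable (e.g. not on Linux).
    static OptionalInt readScore(Path file) {
        try {
            return OptionalInt.of(parseScore(new String(Files.readAllBytes(file))));
        } catch (IOException | NumberFormatException e) {
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Pass a PID as the first argument, or inspect this JVM itself via /proc/self.
        String pid = args.length > 0 ? args[0] : "self";
        Path dir = Paths.get("/proc", pid);
        System.out.println("oom_score     = " + readScore(dir.resolve("oom_score")));
        System.out.println("oom_score_adj = " + readScore(dir.resolve("oom_score_adj")));
    }
}
```

The higher the oom_score, the more likely the process is to be picked. The oom_score_adj value (from -1000 to 1000) is the knob an administrator can use to shift that score up or down for a given process.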

Understanding the “Out Of Memory killer”

By default, Linux kernels allow processes to request more memory than is currently available in the system. This makes all the sense in the world, considering that most processes never actually use all of the memory they allocate. The easiest comparison is with cable operators. They sell every consumer a 100Mbit download promise, far exceeding the actual bandwidth present in their network. The bet is again that the users will not all use their allocated download limit simultaneously. Thus one 10Gbit link can successfully serve way more than the 100 users our simple math (10Gbit / 100Mbit) would permit.

A side effect of this approach becomes visible when one of your programs is on its way to depleting the system’s memory. This can lead to extremely low memory conditions, where no pages can be allocated to a process. You might have faced such a situation, where not even the root account can kill the offending task. To prevent such situations, the killer activates and identifies the process to be killed.

You can read more about fine-tuning the behaviour of the “Out of memory killer” in this article in the Red Hat documentation.

What was triggering the Out of memory killer?

Now that we have the context, it is still unclear what triggered the “killer” and woke me up at 5 AM. Some more investigation revealed that:

  • The configuration in /proc/sys/vm/overcommit_memory allowed overcommitting memory – it was set to 1, indicating that every malloc() should succeed.
  • The application was running on an EC2 m1.small instance. EC2 instances have swapping disabled by default.
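Both findings can be checked on any Linux box by reading two files under /proc. Below is a rough sketch of such a check. The class MemoryConfigCheck and its method names are invented for this example, and it assumes the usual layout of /proc/sys/vm/overcommit_memory (a single integer) and /proc/meminfo (a SwapTotal line in kB).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper, named for illustration only.
class MemoryConfigCheck {

    // Meaning of the three values documented for vm.overcommit_memory.
    static String describeOvercommit(int mode) {
        switch (mode) {
            case 0: return "heuristic overcommit (kernel default)";
            case 1: return "always overcommit";
            case 2: return "strict accounting, no overcommit beyond the commit limit";
            default: return "unknown mode " + mode;
        }
    }

    // Extracts the "SwapTotal: <n> kB" value from /proc/meminfo lines; -1 if the line is absent.
    static long swapTotalKb(List<String> meminfoLines) {
        for (String line : meminfoLines) {
            if (line.startsWith("SwapTotal:")) {
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        int mode = Integer.parseInt(
                new String(Files.readAllBytes(Paths.get("/proc/sys/vm/overcommit_memory"))).trim());
        long swapKb = swapTotalKb(Files.readAllLines(Paths.get("/proc/meminfo")));
        System.out.println("overcommit_memory = " + mode + " (" + describeOvercommit(mode) + ")");
        System.out.println("SwapTotal         = " + swapKb + " kB"
                + (swapKb == 0 ? " (no swap configured)" : ""));
    }
}
```

A result of mode 1 together with a SwapTotal of 0 kB would match the situation described in the two points above.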

Those two facts, combined with the sudden spike in traffic to our services, resulted in the application requesting more and more memory to support those extra users. The overcommit configuration allowed this greedy process to allocate more and more memory, eventually triggering the “Out of memory killer”, which did exactly what it is meant to do: it killed our application and woke me up in the middle of the night.

Example

When I described the behaviour to engineers, one of them was interested enough to create a small test case reproducing the error. When you compile and launch the following Java code snippet on Linux (I used the latest stable Ubuntu version):

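package eu.plumbr.demo;

import java.util.ArrayList;
import java.util.List;

public class OOM {

    public static void main(String[] args) {
        // Every array stays referenced from the list, so the GC can never reclaim it.
        List<int[]> l = new ArrayList<>();
        for (int i = 10000; i < 100000; i++) {
            try {
                // Each int[100_000_000] needs roughly 400 MB of heap.
                l.add(new int[100_000_000]);
            } catch (Throwable t) {
                // Report the failure (e.g. OutOfMemoryError) and keep allocating.
                t.printStackTrace();
            }
        }
    }
}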

then you will face the very same Out of memory: Kill process <PID> (java) score <SCORE> or sacrifice child message.
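If you want your monitoring to recognise this message automatically, a regular expression over the kernel log is enough. Here is a small sketch. The class OomLogParser and its OomKill holder are made-up names, and the pattern matches only the exact wording shown in this article; other kernel versions may phrase the message differently.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, named for illustration only.
class OomLogParser {

    // Simple value holder for the fields found in the kernel message.
    static final class OomKill {
        final int pid;
        final String command;
        final int score;

        OomKill(int pid, String command, int score) {
            this.pid = pid;
            this.command = command;
            this.score = score;
        }
    }

    // Matches the wording quoted in this article and captures PID, command name and score.
    private static final Pattern OOM_LINE = Pattern.compile(
            "Out of memory: Kill process (\\d+) \\(([^)]+)\\) score (\\d+) or sacrifice child");

    // Returns the parsed fields, or empty if the line is not an OOM killer message.
    static Optional<OomKill> parse(String logLine) {
        Matcher m = OOM_LINE.matcher(logLine);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new OomKill(
                Integer.parseInt(m.group(1)), m.group(2), Integer.parseInt(m.group(3))));
    }

    public static void main(String[] args) {
        String line = "Jun  4 07:41:59 plumbr kernel: [70667120.897649] "
                + "Out of memory: Kill process 29957 (java) score 366 or sacrifice child";
        parse(line).ifPresent(k -> System.out.println(
                "pid=" + k.pid + " command=" + k.command + " score=" + k.score));
        // prints: pid=29957 command=java score=366
    }
}
```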

Note that you might need to tweak the swapfile and heap sizes. In my test case I used a 2 GB heap, specified via -Xmx2g, and the following configuration for swap:

# Turn off all existing swap, then create and enable a 640 MiB swapfile (run as root)
swapoff -a
dd if=/dev/zero of=swapfile bs=1024 count=655360
mkswap swapfile
swapon swapfile

Solution?

There are several ways to handle such a situation. In our case, we just migrated the system to an instance with more memory. I also considered allowing swapping, but after consulting with engineering I was reminded that garbage collection on the JVM does not cope well with swapping, so this option was off the table.

Other possibilities would involve fine-tuning the OOM killer, scaling the load horizontally across several small instances, or reducing the memory requirements of the application.


Published at DZone with permission of Nikita Salnikov-Tarnovski, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
