Jumping Off the Ruby Memory Cliff

In this solution, the bad news is that you’re still jumping off a cliff. The good news is that at least you know why and the approximate location of the cliff.

By Richard Schneeman · Apr. 13, 17 · Performance Zone · Tutorial


The memory use of a healthy app is like the heartbeat of a patient: regular and predictable. You should see a slow steady climb that eventually plateaus — hopefully, before you hit the RAM limit on your server:

[Figure: memory use of a healthy app, climbing slowly and then plateauing below the RAM limit]

Like a patient suffering from a heart ailment, the shape of our metrics can help us determine what dire problem our app is having. Here’s one example of a problem:

[Figure: memory spiking well above the limit with heavy swapping, then dropping off sharply]

Memory spikes way up. The dark purple at the bottom indicates that we are now swapping heavily, and the dotted line is our memory limit, which we blow well past. A sharp spike like this indicates a problem. Curiously, though, the memory then simply drops off and goes back down after a bit.

If you’re running on Heroku, you’ll likely see a cluster of H12 - Request timeout errors at the same time. What is going on?

If you’re using a web server that runs multiple processes (called forks or workers), one explanation is that some request, or series of requests, came in that was so expensive it locked up a child process. Maybe you accidentally coded an infinite loop, or perhaps you’re trying to load a million records from the database at once; whatever the reason, that request needs a lot of resources. When this happens, things grind to a halt. The process starts aggressively swapping to disk and can’t run any code.
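As a purely hypothetical illustration (the ReportsController, User model, and ExportWriter helper below are made-up names, not from the article), this is the kind of action that can pin a worker, shown next to a batched version that keeps memory flat:

    # app/controllers/reports_controller.rb (hypothetical example)
    class ReportsController < ApplicationController
      # Dangerous: User.all.to_a materializes every row in memory at once.
      # With millions of rows, this can exhaust RAM and push the process into swap.
      def index
        render json: User.all.to_a
      end

      # Safer: find_each loads records in batches (1,000 by default),
      # so memory use stays roughly constant regardless of table size.
      def export
        User.find_each do |user|
          ExportWriter.append(user) # hypothetical helper
        end
        head :ok
      end
    end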

Lucky for us, when this happens, web servers like Puma have a failsafe in place. Every so often, each child process sends a “heartbeat” notification to the parent to let it know that it’s still alive and doing well. If the parent process does not get this notification for a period of time, it will reap the child by sending it a SIGTERM, telling it to shut down, and start a new replacement process.

That’s what’s going on here. The child process was hung and was using a lot of resources. Eventually, it hits the “worker timeout” (the default is 60 seconds) and the process gets killed. There are a few reasons why the problem can persist longer than one minute: there may be multiple problem processes, the child process might not shut down right away, or the server may be using so many resources that the parent process has a hard time getting enough CPU to even check for a timeout.
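That heartbeat timeout is configurable. Here is a minimal sketch of a config/puma.rb; the worker and thread counts are illustrative placeholders, not recommendations:

    # config/puma.rb -- minimal sketch
    workers Integer(ENV.fetch("WEB_CONCURRENCY", 2))            # forked child processes
    threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 5))
    threads threads_count, threads_count

    # If a worker stops sending its heartbeat for this many seconds,
    # the parent reaps it with SIGTERM and boots a replacement.
    # 60 seconds is Puma's default.
    worker_timeout 60

    preload_app!
    port ENV.fetch("PORT", 3000)

Raising worker_timeout only delays the reaping; the expensive request still has to be found and fixed.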

Another memory signature you might see looks like this:

[Figure: sawtooth memory pattern from repeated worker restarts, ending in a sharp drop]

It’s kinda like a sawtooth or a group of sharks. In this case, the memory isn’t suddenly spiking as badly as before and something is making it go back down. It still has a drop-off cliff at the end. This is likely due to an app using something like Puma Worker Killer with rolling restarts. Recent versions of Ruby will free memory back to the operating system, but very conservatively. It doesn’t want to give memory back to the OS just to turn around and have to ask for more a second later. So essentially that cliff indicates that something is dying, either intentionally or otherwise.
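Rolling restarts like that are commonly wired up with the puma_worker_killer gem. A sketch based on its documented usage looks roughly like this; the 12-hour interval is just an example value:

    # Gemfile
    gem "puma_worker_killer"

    # config/puma.rb
    before_fork do
      require "puma_worker_killer"
      # Restart workers on a rolling schedule (here, roughly every 12 hours)
      # instead of waiting for them to blow past the memory limit.
      PumaWorkerKiller.enable_rolling_restart(12 * 3600)
    end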

Now that you know why your app is “jumping off the memory cliff,” what can you do? You need to find the endpoint that is taking up all your resources and fix it. This is extremely tricky because a performance tool such as Scout won’t have the resources to report back analytics while the process it is running in is stuck. This means you’re more or less flying blind. My best advice would be to start with your H12 errors and use a logging add-on that you can search through. Find which endpoints are causing H12 errors, then try to reproduce the problem locally using rack-mini-profiler or derailed benchmarks. Start with the first H12 error in the series.
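Both of those tools are plain gems. Here is a sketch of a local setup; the /slow_report path is a placeholder for whatever endpoint your H12 logs point at:

    # Gemfile
    group :development do
      gem "rack-mini-profiler"   # per-request timing badge in the browser
      gem "memory_profiler"      # enables rack-mini-profiler's ?pp=profile-memory view
      gem "derailed_benchmarks"
    end

    # With derailed_benchmarks installed, you can hammer a single endpoint
    # and watch process memory, e.g. from a shell:
    #   PATH_TO_HIT=/slow_report TEST_COUNT=2000 bundle exec derailed exec perf:test
    #   PATH_TO_HIT=/slow_report bundle exec derailed exec perf:mem_over_time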

If you can’t reproduce, add more logging around that endpoint — maybe it only happens with a specific user or at a certain time of day, etc. The idea is to get your local app as close to the conditions causing the problem as possible.
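One lightweight way to add that logging is to record process memory around the suspect action. This sketch uses the get_process_mem gem; the SlowReportsController name and the current_user call are assumptions about your app, so adjust them to match:

    # Gemfile
    gem "get_process_mem"

    # app/controllers/slow_reports_controller.rb (hypothetical controller)
    class SlowReportsController < ApplicationController
      around_action :log_memory_delta, only: :show

      def show
        # ... the suspect endpoint ...
      end

      private

      # Logs resident memory before and after the action so you can see
      # which requests, users, or times of day correlate with the spikes.
      def log_memory_delta
        before_mb = GetProcessMem.new.mb
        yield
      ensure
        after_mb = GetProcessMem.new.mb
        Rails.logger.info(
          "memory path=#{request.path} user_id=#{current_user&.id} " \
          "before=#{before_mb.round(1)}MB after=#{after_mb.round(1)}MB"
        )
      end
    end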

The bad news is that you’re still jumping off a cliff. The good news is that at least you know why and the approximate location of the cliff. It might take a while to narrow down the problem and make it go away, but after that, you can hit the ground running.


Published at DZone with permission of Richard Schneeman, DZone MVB. See the original article here.
