Puppet Server: Advanced Memory Debugging
How to monitor, prevent, and debug memory issues in the open source and proprietary Puppet Server tools.
Over the past year, we’ve had a handful of issues reported where Puppet Server would accumulate memory over time until it exceeded its configured maximum heap size and shut down with an OutOfMemory error. We’ve put a lot of work into trying to make sure we do everything possible to prevent memory issues like this, and have made some significant improvements as the server has matured. We have multi-master test environments running over 100,000 nodes with no trouble. Even so, there are still some edge cases that can cause issues, and they can be tricky to wrap your head around. Read on for more info.
This post is a deep dive into various types of memory issues, and you'll see some pretty advanced mechanisms you can use to monitor and debug these issues. Step-by-step instructions let you play along at home if you’re interested. The focus here is on Puppet Server, but many of the concepts and tools I cover here are relevant for advanced debugging in any JVM application.
This post is for you if you are an advanced user, and you want to learn some pretty low-level details about things going on inside Puppet Server, or if you want to learn about JVM memory usage. For the most part, we expect and hope that Puppet Server memory settings will be something you need to tune just once, when you’re first getting started, to get things matched up with the load you’ll be putting on the server. If you run into persistent memory issues after that, we’d encourage you to open a bug report on our issue tracker, because we definitely do not expect the average user to need to be aware of all of the details I’ll be covering in this post. The material below is really intended just for folks who’d enjoy an additional peek under the hood.
TL;DR
Before we get to the walkthrough, I’ll start with a “TL;DR” (probably a little “TL” on its own) in case you’re just here to get the view from 10,000 feet. Here's that view:
- Puppet Server can use a lot of memory. This is especially true if you have a lot of Puppet environments in use, and if you’re using the PE node classifier in versions prior to PE 2016.1. If you’re experiencing OutOfMemory issues, as a first step it’s probably not a bad idea to just increase the maximum heap size. It’s hard to provide good recommendations on default values, because these vary so much based on how many environments and how much Puppet code you have. But if you’ve given the server more than, say, 2GB per JRuby instance, that should usually be way more than enough, so something else is probably going on. (There are some rough docs about server memory usage available at https://docs.puppetlabs.com/puppetserver/latest/tuning_guide.html.)
- For leaks in normal Ruby code, we introduced the max-requests-per-instance setting, which causes the server to clean up and recycle JRuby instances after they’ve handled a certain number of requests. This should be sufficient for eliminating the vast majority of possible memory leaks.
- If, however, you’re using a gem or other Ruby code that is written to be aware that it could be running in the JVM, and if that gem does anything really fancy by calling directly into the JDK API (for example, spawning Java threads), and it’s not cleaning up after itself properly, then it is possible for the gem to have allocated memory that can’t easily be garbage collected by the JVM. In this case, Puppet Server may not be able to do anything about it. The highline gem, which provides something akin to readline support for Ruby programs, is an example of this. When run under JRuby, this gem calls into some Java APIs that create background threads, and the gem doesn’t clean up after itself properly (as of version 1.6.21, anyway). The moral of the story for this bullet point is: try to avoid loading gems that do fancy Java things on the server, when possible. If there is one that you really need to use, and you find yourself running into OutOfMemory errors on the server, such a gem might be a good place to start your troubleshooting efforts.
Stuff You’ll Need
So now we’ll get into the meat of the post. As I said, this is intended to allow you to play along at home if you desire, so let’s start off by talking about the dev environment and the tools we’re going to be using:
- A memory analyzer. These are tools that are built to analyze a JVM heap dump. I mention this first so you can go ahead and get a download started. :) For the examples in this post, we’re going to use Eclipse MAT, version 1.5.0. I find YourKit Java Profiler to be a superior tool for this kind of work, but MAT is sufficient for what we’ll be doing here, and it’s free, whereas a YourKit license is somewhat spendy.
- A running instance of Puppet Server. I mean, duh, that’s what we’re trying to debug :) But I do want to be specific about version numbers for the examples in this post. I’m running the PE version of Puppet Server from PE 2016.1.1. The easiest way to get yourself set up for this is probably just to install the full PE build. (Note that the open source version of Puppet Server that shipped in PE 2016.1.1 is version 2.3.0, and everything in this post should be doable in open source Puppet as well as PE. You’ll just want to ignore the parts about some of the PE-only features like the node classifier.)
- Curl. Once you have Puppet Server up and running, we’re going to be using curl to simulate catalog requests to the server. So you’ll want to make sure you’ve got your certs all sorted (see the note just after this list for one way to set the $SSLDIR and $CERTNAME variables used below). Here’s the command you need to be able to run:
curl --key $SSLDIR/private_keys/$CERTNAME.pem --cacert $SSLDIR/certs/ca.pem --cert $SSLDIR/certs/$CERTNAME.pem https://localhost:8140/puppet/v3/catalog/$CERTNAME\?environment\=production
- JDK CLI tools. We’re going to be relying on some debugging tools that ship with the JDK, specifically jmap and jstack. If you’re on a PE install, you can find these in /opt/puppetlabs/server/apps/java/lib/jvm/java/bin.
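The curl command above relies on two shell variables. Here’s one way you might set them, a minimal sketch that assumes the puppet binary is on your PATH (on PE it lives under /opt/puppetlabs/bin):
# Derive the node's certname and SSL directory from the local Puppet configuration.
export CERTNAME=$(puppet config print certname)
export SSLDIR=$(puppet config print ssldir)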
Tuning Your Environment for Memory Debugging
Now we’re going to make some config changes to your environment, to make it easier to follow along with what’s happening. Some of these changes are things that you would probably never do in a production environment, but for the purposes of this post they’re worthwhile, because they’ll simplify everything a great deal.
- Disable the Puppet agent on your Puppet Server node. You can do this via systemctl stop puppet or similar. Doing this has two benefits: a) it will prevent occasional agent runs from generating traffic on your server, so you won’t have to worry about that polluting your experiments, and b) it will prevent the PE modules from messing with your config files, so you can hack on the various settings with Vim and such, rather than needing to go through Hiera or the console.
- Set the node classifier’s synchronization-period setting to 0 by editing the file /etc/puppetlabs/console-services/conf.d/classifier.conf. This setting controls how often the classifier makes requests to Puppet Server to get data about the Puppet classes; a value of zero indicates that the classifier should not perform this synchronization at all. We want to minimize traffic to the server so that the only requests coming in are the ones from our curl commands. You’ll need to restart pe-console-services after changing this setting.
- Set the number of JRuby instances to 1 by editing the file /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf and modifying the max-active-instances setting. Limiting the system to one JRuby will make it much clearer what’s going on when we look at memory dumps.
- Set the max heap size for Puppet Server to 128m by editing the value of -Xmx in /etc/sysconfig/pe-puppetserver. We’re intentionally setting this low so it’ll be easier to exhaust all the memory in the server and cause an OOM.
- Disable Puppet Server’s max-requests-per-instance feature by setting it to 0 in /etc/puppetlabs/pe-puppetserver/conf.d/puppetserver.conf. This will prevent any JRuby instances from being flushed. We’ll toggle this setting a few times during the debugging exercise, but for now let’s start with it disabled.
- Configure the server to take a heap dump and shut down if an OutOfMemory error occurs. To do this, edit /etc/sysconfig/pe-puppetserver and make sure that the JVM args include -XX:+HeapDumpOnOutOfMemoryError. This will cause the server to dump an .hprof file containing a memory snapshot when it runs out of memory. You may also set -XX:HeapDumpPath=/path/to/save/heap/dump.hprof if you would like to be explicit about the directory where the .hprof file should be created.
- Restart puppetserver for these last several changes to take effect. (A rough sketch of what these file edits might look like appears just after this list.)
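For reference, here’s a rough sketch of what the relevant pieces of those two files might look like after the edits above. The JAVA_ARGS variable name and the jruby-puppet section layout are assumptions based on a typical PE 2016.1 install, so check them against your own files rather than pasting these in verbatim:
# /etc/sysconfig/pe-puppetserver -- illustrative excerpt; keep whatever other flags are already present
JAVA_ARGS="-Xmx128m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/path/to/save/heap/dump.hprof"

# /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf (HOCON) -- illustrative excerpt
jruby-puppet: {
    max-active-instances: 1
    max-requests-per-instance: 0
}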
A Pathological Ruby Memory Leak
All right, we’re ready to get started. In this first example, we’re going to write a custom Puppet parser function that intentionally leaks memory so we can force the server to OOM. This will allow us to start exploring heap dump files (.hprof), learn about settings that will make them more useful, and, a little later on, observe the max-requests-per-instance setting in action. Ready to kill us a server?
Create a Puppet Function
First thing we need to do is create a Puppet function. We can do that with some commands like this:
mkdir -p /etc/puppetlabs/code/environments/production/modules/oom/lib/puppet/parser/functions
touch /etc/puppetlabs/code/environments/production/modules/oom/lib/puppet/parser/functions/oomer.rb
Now, edit the oomer.rb file and put these contents into it:
require 'puppet'
class Oomer
def initialize()
Puppet.warning('Created an instance of OOMER!')
end
end
module Puppet::Parser::Functions
newfunction(:oomer) do |args|
o = Oomer.new
end
end
Create a Puppet class that consumes the function
Next, we create a class that consumes the function:
mkdir -p /etc/puppetlabs/code/environments/production/modules/oom/manifests
touch /etc/puppetlabs/code/environments/production/modules/oom/manifests/init.pp
Put these contents in the init.pp file:
class oom {
oomer()
}
Classify your node to use the Puppet class
You can do this by finding your node in the PE console GUI and adding the new oom class to it, or by putting these contents into a file called /etc/puppetlabs/code/environments/production/manifests/site.pp:
node default {
include oom
}
At this point, we should be able to issue catalog requests and observe that they are calling our custom function. Run this command:
curl --key $SSLDIR/private_keys/$CERTNAME.pem --cacert $SSLDIR/certs/ca.pem --cert $SSLDIR/certs/$CERTNAME.pem https://localhost:8140/puppet/v3/catalog/$CERTNAME\?environment\=production
You should get some catalog JSON back, and then if you look in the Puppet Server log file (/var/log/puppetlabs/puppetserver/puppetserver.log), you should see a line like this for each time you run your curl command:
2016-02-25 16:54:22,520 WARN [puppet-server] Puppet Created an instance of OOMER!
All right, so now we’ve got our custom function wired up. But it’s not doing anything evil yet, so it’s not living up to its name. Let’s fix that.
Make Your Function More Evil
A really easy way to create a memory leak is by using Ruby global variables or class variables (which are, effectively, pretty close to the same thing). Class variables aren’t quite as obviously dangerous as global variables, so we’ll use one of those in our example.
Edit your oomer.rb file to look like this:
require 'puppet'
class Oomer
# class variable
@@oomer_array ||= []
def initialize()
Puppet.warning('Created an instance of OOMER!')
@@oomer_array.push(self)
Puppet.warning("OOMER array size is now: '#{@@oomer_array.size}'")
end
end
module Puppet::Parser::Functions
newfunction(:oomer) do |args|
o = Oomer.new
end
end
Now, if you do some additional curl runs to simulate catalog requests, you should start seeing output like this in the Puppet Server logs:
2016-02-25 17:08:12,464 WARN [puppet-server] Puppet Created an instance of OOMER!
2016-02-25 17:08:12,464 WARN [puppet-server] Puppet OOMER array size is now: '9'
You should see the counter increment with each catalog request.
So this is a little evil. We’re definitely leaking some memory now. But it’s not enough to bring the server down for a long time, so let’s kick it up a notch. Update the Oomer Ruby class to look like this:
class Oomer
# class variable
@@oomer_array ||= []
def initialize()
Puppet.warning('Created an instance of OOMER!')
# instance variable
@giant_string = "foo" * 100000
@@oomer_array.push(self)
Puppet.warning("OOMER array size is now: '#{@@oomer_array.size}'")
end
end
Now we’re climbing up the evil ladder. With this extra line, we are now creating a String that uses approximately 300kb each time our function is called. Since we’re leaking a reference to each new instance of this object, that means that every time we compile a catalog, we’re leaking 300kb of memory.
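If you’d like to sanity-check that 300kb figure, a quick irb one-liner shows where it comes from (this is just the raw byte count; the JVM’s per-object overhead adds a bit more on top):
# Raw size of the string leaked on each catalog compile.
("foo" * 100000).bytesize   # => 300000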
You can probably guess what we’re going to do next >:) Create a bash script, perhaps called something like catalog-loop.sh, and put these contents into it:
#!/usr/bin/env bash
COUNTER=1
while true ; do
curl -sS -w "\n" --key $SSLDIR/private_keys/$CERTNAME.pem --cacert $SSLDIR/certs/ca.pem --cert $SSLDIR/certs/$CERTNAME.pem https://localhost:8140/puppet/v3/catalog/$CERTNAME\?environment\=production
if [ $? -eq 0 ]
then
echo "catalog request ${COUNTER} successful"
let COUNTER=COUNTER+1
else
echo "catalog request ${COUNTER} failed"
break
fi
done
Then run it, preferably with time, like so:
$ time ./catalog-loop.sh
...
catalog request 92 successful
curl: (52) Empty reply from server
catalog request 93 failed
./catalog-loop.sh 1.73s user 0.26s system 6% cpu 32.604 total
NOTE: if you are running on a systemd operating system such as RHEL 7, then the systemd service script for puppetserver will automatically restart the server when it detects a failure. You may need to watch the /var/log/puppetlabs/puppetserver/puppetserver.log file or journalctl -u pe-puppetserver to observe when the server has crashed. This should not interfere with the creation of the .hprof file.
As you can see, on my box this script brought the server down in less than 100 catalog requests, in just over 30 seconds. In addition, there is now a file on my disk called java_pid14480.hprof. (This file will be created in the working directory of the process, which in PE will be /opt/puppetlabs/share/puppetserver by default. Note that this path can be changed via the JVM arg -XX:HeapDumpPath.) The pid number will obviously vary in your environment.
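If you’re not sure where your dump landed, a quick look in that default working directory (adjust the path if you set -XX:HeapDumpPath) should turn it up:
# Find the heap dump that was written when the server ran out of memory.
ls -lh /opt/puppetlabs/share/puppetserver/*.hprof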
Welp, you’ve done it. You’ve crashed the server. I hope you’re proud of yourself :) Now let’s see what we can do with this .hprof file that it generated.
Our First HPROF File
So now we’re going to open this hprof file up in Eclipse MAT. On Linux, there’s a program called MemoryAnalyzer in the root directory of the unpacked MAT distribution tarball. Running that launches the MAT GUI. From there, click File->Open Heap Dump and browse to your hprof file. Once you’ve done that, you should see a wizard dialog box pop up. (Full disclosure: I normally click cancel on this wizard dialog, and open up the histogram view of the heap dump, and poke around from there. But for the purposes of this example, the wizard actually does something relatively useful, so let’s go with it.)
Here’s what the wizard looks like:
We’re looking for a leak, so what the heck, let’s check out this Leak Suspects Report. Opening that gives you a view that looks like this:
This is showing us that 25 percent of our memory is consumed by a single object whose type is org.jruby.RubyClass. That sounds promising; let’s click on the Details link and see what else it can show us.
So, a few interesting things here. The first is that, on the right, under the retained heap column, there is a very obvious place where the referenced memory drops off, from 28MB to 24 bytes. If we look to see what the object is that's holding all that memory, we can see it is a property called classVariables on an object of type org.jruby.RubyClass. That makes a fair amount of sense, given what we know about our pathological leak. Let’s see what other information is available in this view. If we scroll down we find a section called Accumulated Objects in Dominator Tree:
In the Dominator Tree section, we can very clearly see a point in the output where we start getting into the elements of a RubyArray, and there are objects in here that are about 300kb apiece. These are almost certainly our pathological string objects. From here, digging around further in MAT might help us figure out which class these were coming from, and potentially even which variable in the class … but we’re going to pause here for one moment to cheat a bit.
One annoying thing about this data is that everything just shows up as really generic types, such as RubyObject. You have to dig pretty deep into the memory data to find anything that would give you a clue as to what the name of the actual Ruby class was. In some cases it may not even be possible to determine that.
Having spent a fair amount of time doing dev and debug work involving Puppet Server’s use of JRuby, we’ve come across a setting that is exposed by JRuby itself, and that is intended to help in just this sort of situation. In the next section, I'll discuss this setting, and look at how it changes the heap dump output.
Enabling jruby.reify.classes
Edit your /etc/sysconfig/pe-puppetserver and add the following to the list of JVM arguments: -Djruby.reify.classes=true. Restart your Puppet Server for this change to take effect.
The flag above tells JRuby that it should take an extra step during its processing of Ruby code, and create actual classes for them in the JVM bytecode. These classes will have names that match the names of the classes in the Ruby code, and will show up in your heap dumps and other memory analysis tools. This can make a huge difference in our ability to identify which parts of the code are causing issues. (It does incur a performance penalty, probably somewhere in the ballpark of 25-33 percent, which is why we don’t enable it by default.)
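After this change, the JVM arguments line in /etc/sysconfig/pe-puppetserver might look roughly like the sketch below (illustrative only; keep your existing flags, and note that the JAVA_ARGS variable name is an assumption based on typical PE installs):
# Illustrative /etc/sysconfig/pe-puppetserver excerpt with class reification enabled.
JAVA_ARGS="-Xmx128m -XX:+HeapDumpOnOutOfMemoryError -Djruby.reify.classes=true"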
Let’s run our catalog loop script again to reproduce our pathological leak, and get a new heap dump that includes the reified class info:
$ time ./catalog-loop.sh
...
catalog request 90 successful
curl: (35) Unknown SSL protocol error in connection to localhost:8140
catalog request 91 failed
./catalog-loop.sh 1.48s user 0.20s system 5% cpu 32.521 total
Now, let’s open up the new heap dump in MAT. If we do the Leak Suspects Report, Details, and then scroll down to the Objects in Dominator Tree section as we did before, it looks like this:
Hooray! This is a nice improvement; now it is much more clear that the offending memory leak has something to do with the Oomer Ruby class.
Let’s see what else we can see in MAT. We want to look more closely at the actual objects in memory and see what is holding the references to them. If we click on one of the Oomer instances, we get a pop-up menu that includes a List objects action, which will allow us to inspect one of these instances in more detail. We’ll choose with incoming references because we’re really interested in seeing where the reference to the Oomer instances is originating:
In the next screen, we’re looking at an individual instance of Oomer, and seeing the chain of objects that hold references to it:
Often, when I’m looking at these Incoming References views of Puppet Server memory, my goal is to find some object in the reference tree that looks suspicious or interesting, and then focus in on that object and its outgoing references. It’s a lot easier to inspect an object when you’re in the outgoing view, but the incoming view is sometimes crucial for figuring out where the offending memory reference is coming from.
In this case, based on the numbers in the retained heap column, the RubyClass object looks like a good place to poke. It doesn’t hurt that the item above that in the tree is called classVariables. That sounds promising, given what we know about the leak.
So now we right-click on that RubyClass, and select list objects->with outgoing references:
So, lots to see here. First of all, notice that the RubyClass has a reference to itself. You can tell this by looking at the Object ID (hex string: 0xfd666760), and noting that the same one appears in several places with the exact same retained heap size. So you can safely ignore all of these except the first one.
Secondly, notice that the Oomer class does show up in this output; it’s the reifiedClass property for this RubyClass object. This confirms that the class we’re looking at is our Oomer class. As far as getting to the bottom of the leak, though, when we expand the reifiedClass/Oomer node, the only thing underneath it that references the huge chunk of memory is just another reference to the same RubyClass object that appears in two other places on this screen.
The green circles above show the most interesting part of this screen: an object called classVariables whose retained heap size accounts for the vast majority of the memory, but is a slightly different number than what shows up for the RubyClass object. Let’s look into that node a bit further:
In the view above, you can see that we just clicked into the children of the classVariables node wherever the retained heap sizes were large. Eventually, we get down to what looks like a HashMap that's used to store the key/value pairs for the classVariables, and sure enough, we find a single entry in there with a key called @@oomer_array, plus a value of type RubyArray. Drilling into the contents of that array, we can see all of the Oomer instances that we created with our pathological Puppet parser function.
At this point (had our leak not been introduced intentionally), we would have some pretty good clues as to where to start looking in our Ruby code for the leak: the @@oomer_array class variable in the Oomer class.
If we wanted to dig even deeper, we could poke around inside of one of the Oomer objects to see where the 300kb are being referenced:
Here we can see that there is a varTable array inside the object, and that the zero-th entry in that varTable is our big string of foos. It’d be helpful to be able to see what the variable name is, though; because of the way that JRuby stores this data, we need to go back up and look elsewhere in the class object to find that:
In the image above, you can see that we’ve scrolled up from the classVariables node and found a node called variableNames; in there we can see that the zero-th variable name is @giant_string.
Remediating the Leak With max-requests-per-instance
Thus far, we’ve been running the server with an extremely pathological setup, with the goal of crashing the server and getting a heap dump as quickly as possible. In order to illustrate how max-requests-per-instance can alleviate this particular class of memory issue, we’ll need to take our foot off of the pedal just a bit, and give the server enough breathing room to recover the leaked memory when it has an opportunity to do so.
To that end, we’ll change two things:
- Bump the max heap size up from 128m to 256m (via -Xmx256m in the JVM args in /etc/sysconfig/pe-puppetserver).
- Make our pathological leak slightly less egregious. Edit $codedir/environments/production/modules/oom/lib/puppet/parser/functions/oomer.rb and change the line where we create the giant string to create only 50000 repetitions of foo instead of 100000: @giant_string = "foo" * 50000
After restarting the server, we should validate that the OOM still occurs with our catalog loop script:
$ time ./catalog-loop.sh
…
catalog request 902 successful
curl: (52) Empty reply from server
catalog request 903 failed
./catalog-loop.sh 18.56s user 2.89s system 12% cpu 2:55.38 total
After lightening up the leak a little and giving the server more memory, we can get through about 900 catalog compilations in about three minutes before the server OOMs.
Now we’ll change the max-requests-per-instance setting to 100 by editing the /etc/puppetlabs/puppetserver/conf.d/puppetserver.conf file, restart the server, and try the loop again:
$ time ./catalog-loop.sh
…
catalog request 1299 successful
…
catalog request 2500 successful
…
You should see messages like this appearing in your Puppet Server log file:
2016-03-01 10:17:28,757 INFO [p.s.j.jruby-puppet-internal] Flushing JRuby instance 1 because it has exceeded the maximum number of requests (100)
2016-03-01 10:17:28,764 INFO [p.s.j.jruby-puppet-internal] Cleaned up old JRuby instance with id 1.
2016-03-01 10:17:28,765 INFO [p.s.j.jruby-puppet-internal] Creating JRuby instance with id 2.
With the max-requests-per-instance setting set to 100, the server flushes the JRuby instance after every 100 requests. This effectively frees up all the objects that the Oomer class was referencing, makes them eligible for garbage collection, and spins up a fresh JRuby instance to handle subsequent agent requests. The new instance will be subject to the same leaks in the Ruby code, but will also be flushed once it’s handled 100 requests. This has an impact on server performance; things will be slower because we have to keep creating and destroying new JRuby instances, but it’s definitely preferable to the OOM error that would occur otherwise.
Taking Heap Dumps Without An OOM, Comparing HPROF Files
In some development and debugging situations, it can be extremely useful to be able to get a heap dump from a running JVM process without having to wait for the process to OOM. Thankfully there is a CLI command that allows you to achieve just that:
jmap -dump:live,format=b,file=./jvm_heap_dump.hprof <pid>
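To run this against Puppet Server, you first need the server’s pid. Here’s one sketch of how you might grab it on a systemd-based PE install and take the dump with the PE-bundled jmap (the service name and jmap path are taken from earlier in this post; adjust for your setup):
# Look up the pe-puppetserver pid from systemd, then take a live-object heap dump.
PID=$(systemctl show -p MainPID pe-puppetserver | cut -d= -f2)
/opt/puppetlabs/server/apps/java/lib/jvm/java/bin/jmap -dump:live,format=b,file=./jvm_heap_dump.hprof "$PID"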
If you’re curious, take a handful of these while your catalog loop is running with the max-requests-per-instance fix. If you open one of these HPROF files up in MAT and click the histogram button, and then click the regex box and type in Oomer, you’ll see some info about all of the Oomer objects in memory. What you should see is no more than two or three Oomer class objects; one is the live one for the current JRuby, and there may be one or two others that are eligible for GC, but haven’t been collected yet. In any case, none of these class objects should have more than 100 instances that show up in MAT, because the max-requests-per-instance setting will flush each JRuby instance after it’s hit the 100 request limit.
Here's another useful tip. If you have more than one heap dump open in MAT, and you’re in the histogram view for one of them, a new button will appear on the right: Compare to another Heap Dump. Click this, and it will prompt you for any other heap dump you’d like to compare it to. Then it will show you a histogram that gives a delta of the counts of the objects, by type, between the two snapshots. For example:
This particular screenshot doesn’t necessarily show anything compelling, except for maybe the fact that the number of Oomers between these two snapshots hasn’t increased by more than 100, which is the value that we set max-requests-per-instance to. If we took several hprofs and compared them over time, we’d expect that delta to always be within the range of 100. In any case, I included the screenshot here because in certain cases, this compare to another heap dump feature can be extremely useful, and it’s not necessarily super-intuitive to find it in the UI.
Monitoring Object Counts Without a Heap Dump
Another tool that can be useful for debugging memory usage is the jmap CLI tool. When used with the -histo:live flag, it shows a count of each type of object to be found in the heap of a process, while that process is running. The syntax is simply jmap -histo:live <pid>. Here’s an example of grepping that output for Oomer over a few points in time during the max-requests-per-instance run:
$ jmap -histo:live 10560 |grep Oomer
471: 100 3200 rubyobj.Oomer
475: 99 3168 rubyobj.Oomer
$ jmap -histo:live 10560 |grep Oomer
482: 100 3200 rubyobj.Oomer
486: 99 3168 rubyobj.Oomer
631: 51 1632 rubyobj.Oomer
$ jmap -histo:live 10560 |grep Oomer
465: 99 3168 rubyobj.Oomer
515: 80 2560 rubyobj.Oomer
A few interesting things to call out from this data:
- There are multiple distinct entries for the single classname rubyobj.Oomer. This is because JRuby creates a unique classloader for each JRuby instance, and you can have distinct classes with the same name in different classloaders.
- With max-requests-per-instance set to 100, we know that the JRuby instances should be flushed after 100 requests. So we should never see more than 100 instances of Oomer on a single line before that JRuby (and its classloader) become inactive and eligible for garbage collection.
- If the garbage collection were instantaneous, we should never see more than one listing for Oomer in these outputs. However, as you can see here, we’ll usually see two or three lines of output, because it can take a little while for the JRuby instances to be GC’d after we swap in a new one.
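If you’d rather sample these counts automatically than re-run jmap by hand, a small polling loop along these lines will do it (assuming $PID is set as before; note that -histo:live forces a full GC on each sample, so it’s not something to leave running against a busy production server):
# Record a timestamped count of Oomer objects every 10 seconds.
while true; do
  date
  jmap -histo:live "$PID" | grep Oomer
  sleep 10
done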
Rare Leaks That max-requests-per-instance Can’t Fix
For any leak that exists purely in Ruby code, and involves accumulating memory in a global, class, or instance variable, max-requests-per-instance should be able to remediate the issue. However, in some rare cases where there is Ruby code calling directly into a Java API, it’s possible to leak memory in a way that can’t be fixed via max-requests-per-instance.
We had an example of this come up from a customer support issue recently; in this section I will walk through the steps we took to debug the issue and determine what was going on.
The customer was able to send us an hprof file from a system where they were having OOM issues. They had been able to turn on jruby.reify.classes=true before capturing the heap dump. After a few iterations and some conversation around what we saw in the heap dump, we were able to determine what was causing the leak.
For the purposes of this post, I set up a reproduction environment that triggered the same issue, and generated a smaller heap dump from that. If you’re interested you can download the heap dump here: https://drive.google.com/open?id=0B8v7HryXOF5NYlZOaVZuU29PWlU
Opening this heap dump and trying the basic things like Leak Suspects didn’t immediately yield anything that seemed too promising. In cases like that, if you know the server in question was using max-requests-per-instance, it can be useful to look through the histogram for instances of the class org.jruby.Ruby. There should generally be one of these for each JRuby instance that you’ve configured in max-active-instances (and possibly an additional one or two that are used for internal JRuby plumbing). If you see a lot more of them, that can be a good indication of a leak, and figuring out where the references to them are being held can be a useful strategy. So, let’s look at the histogram view, with org.jruby.Ruby$ typed into the regex box:
On this server, we had max-active-instances set to 1. Depending on how quickly we’re triggering flushes of JRuby instances via max-requests-per-instance, I can imagine a healthy server where the count of org.jruby.Ruby instances that we see here was 2, or maybe even 3 if we got really lucky/unlucky in terms of catching the heap dump between GC cycles. Anything higher than that is very fishy, so this looks like an issue. Now we have a suspicion that we are somehow leaking these Ruby instances when they are flushed, and they’re not being freed up properly for garbage collection. The next step is to try to determine where the references to these objects are coming from.
In the JVM, there are different types of memory references. Weak or soft references point at things that can be garbage-collected if the process needs extra memory, though the GC may be lazy about it; if you want to use weak or soft references, you have to construct them explicitly in your code. The more common and typical type of reference is called a strong reference; a strong reference prevents the object it points to from being garbage collected at all.
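As a quick illustration in Ruby terms (a minimal sketch using the standard-library WeakRef class; this is not something Puppet Server itself does):
require 'weakref'

big  = "foo" * 100000     # strong reference: the string cannot be collected while 'big' exists
weak = WeakRef.new(big)   # weak reference: does not, on its own, keep the string alive
big  = nil                # once the last strong reference is gone, the GC may reclaim the string,
                          # after which weak.weakref_alive? returns false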
We’re much more interested in strong references, so now we’re going to ask MAT to tell us about the paths from these org.jruby.Ruby objects to the garbage collection roots, but only for strong references. We can do this by right-clicking on the org.jruby.Ruby line:
That gives us a screen that looks like this:
There are three root entries here. The first one is a thread which has the word clojure in the name (and which I happen to know from experience is probably a normal reference). Expanding that node, we also see that there are some puppetlabs namespaces in the reference tree. We’re looking for something that came out of a non-standard library, like a gem, so this one doesn’t seem like it’s likely to be the culprit. We can move on to the other two nodes.
The other two nodes look like they show the same pattern, and the appearance of the word readline in some of those entries sounds super-fishy. There probably aren’t any good use cases for a readline library on the server, since readline is all about CLI user I/O. So let’s dig into these nodes a little further.
My first attempt to dig from here was to look at the incoming references for the readline object, but I didn’t find anything super-compelling in there. It did seem like it kept referencing back to this jline.internal.NonBlockingInputStream object, so I decided to look into that next. Right-clicking on that and choosing list objects -> with incoming references yielded a screen like this:
This view is yielding some much more promising leads. For example, thanks to the jruby.reify flag, we can see a Ruby class name in this stack: HighLine. That’s one avenue we can explore further in a moment.
Another thing that jumps out from this screen is that there is a Thread named NonBlockingInputStreamThread that is referencing this variable. If you’ve spent any time digging through Puppet Server debugging output, you start to become somewhat familiar with the threads that should be running under normal circumstances, and this one does not look familiar. We can take a more detailed look at the list of threads by clicking on the little gears button in MAT, which says something like, “if available, show each thread’s stack frame.” When sorted by name, that yields a view that looks like this:
Again, it takes a bit of experience before you will be able to start recognizing things that stand apart from a normal Puppet Server thread dump. There’s always a handful of default JVM threads like I/O dispatcher, Signal dispatcher, etc. Then there are some Clojure threads with names like clojure-agent-* and async-dispatch-*. There are often some scheduler threads named things like pool-1-thread-*. And then there will be a bunch of Jetty threads that begin with qtp. These NonBlockingInputStream threads and JLine threads stand out as unusual.
The reason I’m focusing so much on threads here is that threads have two very important properties in the JVM:
- They are garbage collection roots, meaning that any memory that they are referencing cannot be freed up until the reference is cleared, or the thread exits.
- If you execute some code that starts up a thread, and the thread’s body does something like looping and waiting for input, and you don’t ever explicitly clean up that thread, it will hang around forever.
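To make that concrete, here’s a minimal JRuby sketch of the kind of thread leak that survives a JRuby flush. This is a hypothetical illustration, not the actual highline/jline code:
require 'java'

# Start a Java thread from Ruby and never stop it. The thread is a GC root, so anything it
# references stays reachable even after the JRuby instance that created it has been flushed.
stray = java.lang.Thread.new do
  loop { sleep 1 }   # e.g., waiting forever for console input that never arrives
end
stray.name = 'stray-background-thread'
stray.start
# Without an explicit interrupt/shutdown, this thread (and whatever it references) lives on.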
At this point, we have a couple of useful clues as to what is going on with this memory leak. The first is that there are some very weird classes in memory (jline.internal.NonBlockingInputStream, jline.console.ConsoleReader) that appear to have to do with a Ruby class called HighLine. The second is that there appear to be some threads in the thread list that are unaccounted for and are related to jline.
From here, we headed to Google and found that highline is indeed a gem related to providing readline support. So we asked the customer to send us the output of puppetserver gem list, and, while we were waiting to hear back on that, we started digging into the source code for highline.
When we got the gem list back from the customer, it did have highline in it, and after a little dialog we realized that this gem is a dependency of hiera-eyaml, and that is how it ended up on the customer’s system.
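If you suspect something similar on your own system, it’s easy to check whether the gem is present in Puppet Server’s JRuby gem environment, for example:
# List the gems installed for Puppet Server's JRuby and look for highline or hiera-eyaml.
puppetserver gem list | grep -iE 'highline|hiera-eyaml'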
Poking through the source code of highline, we found some code where it was trying to be clever and detect that it was running under JRuby. If it was running under JRuby, it would construct a global variable of type jline.console.ConsoleReader. Then we dug into the jline source code and saw that indeed, ConsoleReader was creating new threads called NonBlockingInputStreamThread and expecting the caller to call a shutdown method when they were done with the object to clean everything up. The highline library was not doing that. Worse still, the simple act of doing a require on the highline library was enough to trigger this behavior and create these threads that would never exit, even if you didn’t call into any other code in the highline library.
The gory details can be found in SERVER-1154 if you are interested, so I’ll stop there for now. The key points to take away here are:
- max-requests-per-instance should be able to alleviate any memory leaks that exist in pure Ruby code. But if a Ruby library/gem is doing clever things where it drops down into JDK APIs, it is possible for it to do things that we can’t easily recover from.
- If you (or a customer) are using a gem that drops down into Java/JDK APIs, you should be very careful to make sure that it is cleaning up after itself properly. If you are experiencing a memory leak that is not alleviated by max-requests-per-instance, there is a good chance that it’s something like the highline issue highlighted in this tutorial. Looking out for stray threads that are getting created but not cleaned up properly is a worthwhile direction to explore (one way to do that with jstack is sketched below).