RailsConf 2013 has begun! To celebrate, we’re publishing a series of blog post that highlight what’s new and exciting in the world of New Relic’s Ruby support. Don’t forget to check out the entire series so far: Cross Application Tracing, Thread Profiling, Living on the Edge with Rails 4 & Ruby 2, Thread Safe APIs & Sidekiq Support for Your Threading, and Security Banners (For If You Forgot to Apply the Latest CVE Patch).
If you’ve spent enough time working with a production system (or even just a continuous integration server), you’ve surely encountered a Ruby process that was ‘stuck’. You know the type – not crashing, but not making any real progress either. And giving you no obvious indication of what it was doing.
When finding a stuck process, your primary goal usually is to just get the wheels of progress turning again. And the quickest path to that goal is usually to `kill -9` the stuck process and re-start it.
But I want to you to make your future self a promise. The next time this happens, don’t just kill that stuck process and move on. Dig in and figure out what’s really going on. You might just discover the root cause of your problem, hone your debugging techniques or discover a clue that helps you fix that tricky bug in your favorite open source project that no one has been able to pin down.
Since encountering a stuck process (especially in production) is often stressful, I find it helps to think about the debugging steps you plan to take ahead of time. Create a ‘checklist’ of things to run through so you don’t forget anything or skip any steps. Failing to capture enough information to reproduce a bug usually means you’ll hit it again. So make sure you seize the opportunity when it’s available.
Here are the steps I follow when debugging a stuck Ruby process:
Who Said What?
First, figure out exactly what process is stuck. This sounds trivial, but it’s especially important for systems like Unicorn, Passenger, Resque, etc. where the system is composed of multiple processes that have different roles (often a single master process plus a number of workers).
Identifying exactly which process is stuck is important in order to get a full picture of the behavior of the system as a whole. The trusty ps utility is usually my first stop here — a quick `ps aux | grep <process name>`.
I like to keep a transcript of the exact commands I’ve run, along with their output. And the ps aux output is usually the first thing that goes into that transcript.
Tools like htop and pstree can also be useful here. They give you a visual tree of the relationship between the stuck process, its sibling and its parent process (e.g. a Unicorn master and its workers).
Grab Any Logs that are Available
Again this may seem obvious, but the goal of the checklist is to remind you of everything you need to track. This will give you the best possible chance of debugging the problem offline. Before getting started, consider these questions:
* Where is STDOUT of the stuck process being sent?
* Where is STDERR of the stuck process being sent?
* Are there any other log files created by the stuck process? Check the command line and config file used for the process. Did either specify a log path?
* Are there any other log files created by processes related to the stuck process?
Other Non-Invasive Basics
Try gathering the following before you move into the more invasive actions:
* Grab the list of open files and sockets from the process using lsof -p <PID>
* If you’re on Linux, poke around in /proc/<PID> and consider gathering:
* cmdline contains the command line that was used to launch the process
* cwd is a link to the current working directory of the process
* environ is a list of environment variables set for the process
Check the CPU Usage of the Stuck Process
There are two basic flavors of hangs: static and dynamic. (Note: This is my own terminology, but I think it’s a useful taxonomy.) In a static hang, the threads of the stuck process are stuck in the same exact state for an extended period of time. Examples include traditional deadlocks and processes blocked for extended periods in system calls (often things like ‘read`, `select`, etc.). In a dynamic hang, the threads of the stuck process are changing state, but are caught in some kind of loop that prevents them from making progress (this is sometimes called a ‘livelock’).
The techniques used to debug the two types of hangs are different. So it’s useful to be able to differentiate between them in the wild. A quick heuristic for doing this is to check the CPU usage of the stuck process. Static hangs will generally result in processes with no CPU usage, whereas dynamic hangs will result in processes with non-zero CPU usage (often very high, if the process is stuck in a tight loop).
There are a variety of tools you can use to do this, but a quick look at `top` or `ps -o cpu <pid>` is often enough.
For static hangs where the stuck process demonstrates no CPU usage, gather information on the backtraces of all threads in the stuck process. This is often the most important data. Since the threads aren’t changing state, a single snapshot of thread backtraces is usually sufficient.
There are a variety of different tools for accomplishing this goal, depending on the platform you’re running on and the Ruby implementation you’re using, but I’m going to focus on what is perhaps the most common case: MRI on a Linux host.
`gdb` is a time-tested and widely available tool that can be used to interrogate a running process on a variety of UNIX-ish platforms. You can attach to a running process with gdb by using the -p option, like this:
``` gdb -p ```
Once gdb has successfully attached to the target process, it will drop you at a prompt. The very first thing you should do here is to gather C-level backtraces for all threads of the stuck process, like this:
``` (gdb) t a a bt ```
This is short for `thread apply all backtrace`. Copy this output and save it to your transcript file, even if it’s not immediately informative.
Sometimes C-level backtraces are all you need in order to debug a hang. In other cases though, you really need a Ruby-level backtrace to see what’s going on. Luckily, gdb can help you generate those, too.
Getting a Ruby Backtrace with gdb
Getting a Ruby backtrace out of gdb involves interacting with the Ruby interpreter in the running process that you’ve attached to. And to get that, the interpreter needs to be in a semi-working state. If the root cause of the issue is a Ruby bug, the process may be stuck in such a way that it isn’t possible to get the backtrace at all.
That said, it’s always worth trying. Here’s the technique I use:
Most mechanisms for dumping Ruby backtraces will output them to stdout or stderr. It’s fine if you know where the stdout and stderr for your process is going, and you have access to it. If not, first you’ll need to redirect the output streams to an accessible location.
To do this, we rely on POSIX conventions and some trickery. stdout and stderr are file descriptors 1 and 2 on POSIX-compliant systems. We want to close these file descriptors, and re-open them attached to files that we can actually read from. Closing is easy:
``` (gdb) call (void) close(1) (gdb) call (void) close(2) ```
The call command tells gdb to invoke close from within the target process. Next, we need to re-associate file descriptors 1 and 2 with a file we can actually read from. We could use a file on the filesystem, but it’s even more convenient to see the output directly in gdb. To do that, we want to figure out the device file corresponding to our current TTY and open it twice. Since file descriptors are assigned sequentially, descriptors 1 and 2 will end up being re-associated with our TTY device. Here’s how this looks:
``` (gdb) shell tty /dev/pts/0 (gdb) call (int) open("/dev/pts/0", 2, 0) $1 = 1 (gdb) call (int) open("/dev/pts/0", 2, 0) $2 = 2 ```
Now, any output generated by the target process will be echoed directly to our console so we can see it in our gdb session.
Finally, we need to get the Ruby interpreter in the target process to spit out a backtrace. There are many ways to do this. But the easiest way is just to use gdb to call the rb_backtrace() function from the Ruby interpreter directly:
``` (gdb) call (void)rb_backtrace() ```
With luck, you’ll see a Ruby backtrace dumped to your console.
Digging Deeper with gdb
If you know what you’re looking for in the stuck process, there’s a lot more you can do with gdb. Jon Yurek at thoughtbot has a great post on this topic that explains how you can get gdb to evaluate arbitrary Ruby code in the target process to inspect the values of Ruby variables, get backtraces for all threads and any number of other things
A stuck process that shows high CPU usage is likely stuck in a tight loop. This means that the exact backtraces you get out of it may vary from one sample to the next, but backtraces can still be very helpful to point you in the right direction.
If you find a stuck process with high CPU usage, it may be worth gathering backtraces 2 or 3 times to get a more representative sample. (You can use gdb’s continue command to resume the process after you’ve attached).
You can also use other tracing tools to examine the behavior of the looping process. On Linux, strace -p <pid> allows you to view the system calls being made by the process. If you’re on an OS that has dtrace available, you can use dtruss -p <PID> instead to get a similar output.
Thanks, from your Future Self
You might not always have the luxury of being able to spend time debugging a stuck process. But when you do dive in, you’ll probably be glad you did. I’ve only scratched the surface of the tools available for analyzing problems like this, but hopefully I’ve given you enough information that you’ll be comfortable doing more than just sighing, SIGKILL-ing, and waiting for your problem to reappear.
Bonus: Mac OS X
If you happen to be debugging a stuck process on Mac OS X, check out this sample utility (a handy tool that ships with OS X). It will automate the collection of multiple C-level backtraces, aggregate them together and print out a nicely formatted call tree report showing where the stuck process is spending most of its time.