Monitoring Background Jobs in Ruby’s Resque
Here at AppNeta, we get to see a lot about how people build their web applications. From simple PHP scripts to heavily service-oriented Java clouds to monolithic Django apps, everybody’s product is architected a little differently. We’re still out to trace everything, and today I want to talk how to get visibility into an important component of any complex system: the messaging queue. Specifically, let’s look at how to trace a job from Rails using Resque.
If you haven’t used a messaging queue in your app, the idea is simple. Instead of forcing all the work to happen during the request, while the user is waiting, you can delay some of the more time-consuming tasks. You can do anything in these tasks, ranging from a simple insert to kicking off a series of user analytics that touch all parts of your infrastructure. The advantage is that you can return a speedy response to the user, or, if they are actually waiting on the task results, give them a better loading interface than a white screen and browser loading bar.
A Quick Resque Tutorial
In Ruby, Resque is a task runner, which by default stores the task descriptions in Redis (though other options are available). Resque jobs are just Ruby classes, with a single mandatory method
perform. Resque will call
with the arguments given in the task description. Let’s look at a
minimal task, that takes a single argument and prints it. (Useless, I
@queue variable defines a name that a worker can
bind to, in case you want to spread different types of jobs across
different machines. To create a task that this worker could run, we just
call it from our request:
And that’s our job! Maybe not the most interesting job, and probably not prone to performance issues, but we don’t know that yet. So let’s measure it!
Tracing a Resque Task
Now that we’ve added this to our system, we should have monitoring around it. The easiest way to do this would be to just measure the time each task takes, and log that information:
Unfortunately, the data presentation here leaves a bit to be desired, so I’m going to use TraceView
to log this information instead. This also has the benefit of logging
any SQL queries, cache accesses, or service calls that we might do in a
more complex task, as well as reporting errors. To start a trace
fresh, we can wrap this call in the
That’s a start! We’ve now got some visibility into our Resque jobs, and we can rest easy knowing that this is running smoothly in production.
Tracing a Resque Job (with multiple tasks!)
For cron-style jobs, the approach of tracing each task individually works fine. For reference, let’s look at the events we’re generating with that code:
Pretty straightforward. Now let’s consider a more complicated set of tasks: a document-processing pipeline. That code might look like this:
In this case, our first task takes a document, and the second one archives it. If we have multiple tasks, each one gets logged separately, and we can figure out same statistics for each — average, std. dev., percentiles, and the like. But what if you have a job that spans multiple tasks? We can further aggregate the stats, but we might be starting to miss things, like large inputs that cause the entire pipeline to slow down.
What we’d really like is to correlate the related tasks, so instead of timing the each task, we’re timing the entire job. Under the hood, TraceView generates a token for each request. If we pass this ID (generally stored in xtrace, after the X-Trace header it’s passed around in) to each task, we can correlate those timings before storing them, and retrieve them all together. To do this, we can modify each task to take this token, and trace using that ID. ProcessDoc then becomes:
Now we need to start the trace somewhere, but we’re not doing it in the job. We could start it in the first task, or we could link this one step further up the chain and tie it back to the web request that started it in the first place. In a default rails stack, that request generates the following events:
To add in the task queue call to the logged request, we can call the following function:
We have to force a fork in the execution path to indicate that we’re
running an asynchronous task, possibly in parallel, with the rest of
the web request, which is done with the call to
fromString. Aside from that, this is the same underlying call as is done by
start_trace above — log that we’re entering a named block of code, and start timing it.
When we put it all together, we get a secondary execution path attached to the web request, and the logged events look like this:
Now we’ve got everything: the original request, all tasks, individual timing information, and a global view of how the process performed. Not that we now have an additional timing measurement here: the delay before starting the task at all. In this case, we waited a full 500ms between queuing the job actually executing it! Once we were in the pipeline, the tasks happened much faster (only 25ms between processing and archiving).
Lest you think that everything was easy, there’s a couple things to keep in mind when you use this in your own application.
- Because we’re starting the timing in the web request and ending it in a task queue, we’re relying on those two processes to have an identical clock. If they’re on the same machine, it won’t be a problem, but on different machines, any clock skew will effect the timing.
- I’ve quietly assumed everything in this system is reliable, which is almost certainly wrong. Whatever your error handling is, make sure you always log the exit event for ‘job’, or you may never know that you have errors!
As long as I haven’t totally dissuaded you from trying this out, all the code is available in one place in this gist, and you can try in out in your application today with our free version of TraceView!