New Relic on Heroku: Measuring Queue Time Using Ruby Agent
The Cloud Zone is brought to you in partnership with Mendix. Better understand the aPaaS landscape and how the right platform can accelerate your software delivery cadence and capacity with the Gartner 2015 Magic Quadrant for Enterprise Application Platform as a Service.
Curator's Note: The content of this article was written by Sam Goldstein on the New Relic blog .
On March 5, we released version 3.5.7 of the New Relic Ruby agent, which contains several changes to how Request Queueing is reported. In sum, these changes provide a more accurate interpretation of queue time data for Ruby applications running on Heroku. (For more information on frontend time reporting, please see the documentation in our support site.)
What is Request Queueing?
Request Queuing, which shows up as a green band in New Relic’s response time overview graph, represents the time after a request initially hits your production infrastructure but before your Ruby application process receives it. For example, it could represent the time after nginx or HAProxy begins processing a request but before a Rails controller action begins processing it.
This metric can be very useful for identifying certain types of performance problems, common in Ruby applications. For example, it helps expose when there are too few Rails worker processes and nginx is having to wait for one to become available. It can also show the effects of a deploy, where frequently you’re waiting for Rails processes to respawn and nginx is holding the requests until an app worker becomes available.
Since the Ruby agent lives in your Rails or Ruby application we can’t measure queue time directly. There’s no way for the agent to intercept a request before the application (and thus the agent) has received it. Therefore we rely on the frontend web server to set an HTTP header which contains a timestamp of when it was received. When the request reaches the application we calculate the delta between the reported timestamp and the current time and report this duration as Request Queue Time. We recommend setting this timestamp as soon after the request hits your infrastructure to avoid masking performance problems that occur before the header is set.
Late last week we were notified by our customers at RapGenius that the Ruby agent seemed to be under-reporting Request Queuing for applications hosted on Heroku. Heroku sends applications two pieces of information that can be used by our agent to report queue time. The first header they send is a X-Heroku-Queue-Wait-Time which represents the number of milliseconds spent in their routing code. The second header they send is X-Request-Start whose value is a timestamp represented as the number of milliseconds since the Unix epoch.
Previous versions of the agent (i.e. 3.5.6 and earlier) reported the value in X-Heroku-Queue-Wait-Time as Request Queuing, under the assumption that this was the full duration spent in Heroku’s infrastructure before reaching the application. In fact this value only represents time spent in Heroku’s routing system, which is typically a small fraction of the total Request Queueing time. In the 3.5.7 release we ignore this header and we no longer provide a mechanism to report a Request Queue duration directly to the agent. We now require the Request Queueing duration to be calculated in the agent.
We’ve also improved the agent’s ability to parse and interpret X-Request-Start and X-Queue-Start headers (which are treated identically). Previous versions of the agent required that the value of these headers be formatted as t=MICRO_SECONDS_SINCE_EPOCH. On Heroku the X-Request-Start header is sent as the number of milliseconds (not microseconds) since epoch, without the leading t=. The latest version of the Ruby agent can now reliably interpret timestamps sent in milliseconds, microseconds, or seconds. If the timestamp includes a decimal point we assume seconds with a fractional unit (this makes configuration of nginx much easier). Otherwise we determine the unit based on the order of magnitude, since a millisecond timestamp, interpreted as microseconds would result in a >40 year queue time.
The bottom line is that upgrading to version 3.5.7 of the Ruby agent should provide more accurate reporting of time spent in Request Queueing for apps running on Heroku. We’ve independently verified that queue time is reported correctly on our test Heroku apps, by timing request made to these apps and subtracting time spent in the application as well as network latency.
Limitations in Request Queue Time Reporting
Although it provides an extremely valuable performance metric, two factors have the potential to affect the accuracy of New Relic’s Request Queueing feature.
Clock Skew: If the frontend web server (e.g. nginx) and your application don’t reside on the same physical server it is possible for reported Request Queuing to be effected by clock skew. While ntp provides an excellent way to keep server clocks in sync, they still will drift relative to each other. Since the Ruby agent relies on a timestamp set by the frontend server it may over or under-report Request queuing if the clock on that server is not closely synchronized with the clock on the app server. While this may seem like a major problem with the feature, clock skew is unlikely to result in sudden spikes in reported Request Queueing. Sudden spikes generally occur when an app is restarted or becomes overloaded with requests. Our experience is that Request Queueing reporting can be very useful for identifying real performance problems, but you should consider clock skew when interpreting this data.
Dependency on Header: The other major limitation of this feature is that it requires data collected by the frontend web server, which is passed along in the header. Since the agent can’t time a request that the app has not yet received it has to trust that the header passed by the frontend server is accurate. In this release we’ve minimized the likelihood of collecting and processing erroneous header data by automatically interpreting the time units (as described above) and allowing the t= prefix to be omitted. In our continuous effort to provide the most accurate data to our customers, we’ve also been working very closely with the Heroku team and other partners to ensure the fidelity of both the information passed to New Relic and our interpretation of it.
Tools for Visibility
At New Relic we strive to provide tools that highlight performance problems that might otherwise be invisible. Request Queue reporting can be extremely valuable in exposing certain types of performance problems. We appreciate the attention the Ruby community has brought to this feature and their impact on Heroku apps.
Sign up here. It's free, so why not?