Timing Rack Middlewares with metaprogramming, recursive monkey-patching, and a sprinkle of statistics
Analysing performance of your Rails or Sinatra app is easy enough with New Relic, but figuring out whether the soft outer shell of your stack is under-performing is more of a challenge. We’ve written rack-timer to figure things out.
The most typical tool Ruby web stacks use to monitor runtime performance is New Relic. It does a great job at spotting what happens inside transactions, which database queries are slow, etc. It also reports on something called queue time, as the infamous green slab at the bottom of its graphs.
Misleadingly, queue time it not request queuing: it’s the delay between when a requests hits your web frontend (Apache, NGinx) and the beginning of your action. Thus, it includes three things:
- time spent in your HTTP dispatcher (Passenger, Unicorn)
- time spent in your Rack middlewares
- time spent in
before/after_filters(if using Rails).
We had a hunch that something was awry in the outer layers of our stack, and neither New Relic nor any beautiful gem gave us any intel.
rack-timer aims at solving that by providing timings for dispatch time and middlewares.
It works in a borg-like way: add it to the top of your middleware stack (if
using Rails, even before
Rack::Lock), and it will “assimilate” every other
middleware by injecting its code recursively into every other middleware.
RackTimer::Middleware initializer looks like this:
def initialize(app) @app = app self.extend(Borg) end
Starting the chain reaction, it injects the
Borg in itself. Then, once the
middleware if part of the “collective” (simplified code):
module Borg def self.extended(object) object.singleton_class.class_eval do alias_method :call_without_timing, :call alias_method :call, :call_with_timing end object.instance_eval do recursive_borg end end def recursive_borg @app.extend(Borg) end
Borg wraps the
call method of each middleware it’s injected into with
time logging, then injects itself in the next middleware down the stack
At runtime, the
call wrappers will transparently call the original
then output timing information.
Within our biggest application, we let 40 out of 520 workers work with the
place until they’d collected information about 10,000 requests—enough to provide
us with statistically significant data.
rack-timer defaults, logs go to standard error, i.e. Apache’s
error.log. We grabbed those, grep’d for the timer’s output, whipped some
sed(1) magic along the lines of
$ bzcat error.web[1-4].bz2 | \ grep '^\[rack-timer\]' | \ sed -e 's/^.rack-timer. //; s/ took /,/; s/ us$//' \ > middlewares.csv
and voilà, raw data ready to be digested. Good old Excel then graphed the middleware timings for us:
The reason we graph both median and mean is that the latter is sensitive to outliers, whereas the median is a “robust metric”. A significant discrepancy between the two usually hints either at a skewed, non-normal distribution, or more typically presence of extreme outlier.
None of this is visible here. Not only the middleware all respond consistently, but the aggregate median time spent in middlewares is (despite their number) a low 7.5ms.
Note the left-most pair of bars in that graph—that’s the tail of the middleware chain, i.e. the application itself (including filters).
Moving on to the queue timings, we can apply more
sed magic. This time
Excel doesn’t cut it, but R is probably an old
friend to anyone serious with performance analysis, and stats in general.
Distribution of the queueing timings used to look like this:
This is clearly bimodal: the left mode (clustered around 1ms) is expected. Passenger does need to do some work do move requests around, and 1ms is fine.
The second mode is much more worrisome: its overall area (i.e. the overall number of requests spent in that failure mode) is about half of the total, and it’s fairly progressive (not a nice bell curve at all), hinting at some kind of step/non-linear phenomenon.
The conclusion was that something what causing progressively more queuing in some cases.
That’s when we remembered that
- We’d added out-of-band garbage collection to relieve the Ruby VM back in the 1.8 days
- We’d recently upgraded Passenger to 3.0+
It turned out our out-of-band garbage collection hack (based on rack_after_reply) was no longer compatible with Passenger: what used to be run out-of-band was now run in-band with th next request on a particular worker.
Removing the out-of-band garbage collection solved the issue:
On top of that we’ve cut our average response time by a further 15 to 20%, and obviously suffer less from the spikiness in response time due to random GC hits.
And finally, having a proper excuse to write an arguably dangerous metaprogramming / recursive monkey-patching combo was fun!