Go in 42 lines of Ruby

Go’s Goroutines are like sugar. They give me a rush, and my IO-bound work as well!

But for a one-off script, my Go’s not fluent enough, so why not bring some of it to Ruby?

I’m writing a script that scrapes a slow API pretty intensely. The naive approach would have me do this:

things_ids.each do |thing_id|
  data = HttpClient.get("http://example.com/things/#{thing_id}")
  save_to_file(data)
end

… but I have 3,000 things to get, the API takes 1 second per call, and I’m not that patient.

(For the record, the API in question is Pivotal Tracker’s.)

So what I’d really like to do is something like:

things_ids.each do |thing_id|
  go do
    data = HttpClient.get("http://example.com/things/#{thing_id}")
    save_to_file(data)
  end
end

and have it magically parallelize my go block.

Sure, I could use Celluloid’s beautiful futures, or Faraday’s parallelism support, but today I have a moment to tinker under the hood.

All I really need is a pool of threads, futures, and some syntax sugar:

require 'monitor'

class Pool
  def initialize(max_size = nil)
    @threads  = []
    @lock     = Monitor.new
    @signal   = @lock.new_cond # condition variable tied to @lock
    @max_size = max_size
  end

  def go(&block)
    @lock.synchronize do
      # Block the caller until the pool has a free slot.
      @signal.wait until available?
      @threads << Thread.new do
        Thread.current.abort_on_exception = true

        begin
          value = block.call
        ensure
          # Always free the slot, even if the block raised.
          release Thread.current
        end
        value
      end
      return @threads.last
    end
  end

  private

  def release(thread)
    @lock.synchronize do
      @threads.delete thread
      @signal.broadcast # wake up callers waiting in #go
    end
  end

  def available?
    @max_size.nil? || @threads.length < @max_size
  end
end

Our pool just has one go method. Pass it a block, and it will spawn a thread to do your work (possibly waiting on the pool to have a free slot).

abort_on_exception makes sure that if one of your “goroutines” bursts into flames, you’re told right away: the exception is re-raised in the main thread (by default it only surfaces when you call join or value on the thread).

The @lock / @signal dance keeps your pool size under control: callers of go wait on @signal while the pool is full, and release broadcasts on it to wake them up whenever a slot becomes available.
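To see the wait/broadcast pattern on its own, here is a minimal sketch that strips the pool down to a counter of free slots. The Slots class and its acquire/release methods are hypothetical names made up for this illustration, not part of the Pool above; it only assumes Ruby's standard monitor library.

```ruby
require 'monitor'

# Hypothetical helper: a counter of free slots guarded by a Monitor and
# a condition variable created from that monitor.
class Slots
  def initialize(size)
    @free   = size
    @lock   = Monitor.new
    @signal = @lock.new_cond
  end

  # Blocks until a slot is free, then takes it.
  def acquire
    @lock.synchronize do
      @signal.wait until @free > 0
      @free -= 1
    end
  end

  # Gives the slot back and wakes up every waiting thread.
  def release
    @lock.synchronize do
      @free += 1
      @signal.broadcast
    end
  end
end

slots   = Slots.new(2)
stats   = Monitor.new
running = 0
peak    = 0

workers = 6.times.map do
  Thread.new do
    slots.acquire
    begin
      stats.synchronize do
        running += 1
        peak = [peak, running].max
      end
      sleep 0.05 # stand-in for a slow API call
      stats.synchronize { running -= 1 }
    ensure
      slots.release
    end
  end
end
workers.each(&:join)

puts "peak concurrency: #{peak}" # never more than the 2 slots we allowed
```

Waiting in a loop (wait until ...) rather than a single wait matters here: broadcast wakes every waiter, and only some of them will find a slot still free when they get the lock back.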

Threads in Ruby can be used as promises/futures, and the go method returns a thread object (leaky API, but bear with me):

go { :hello }.value
#=> :hello
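The future-like behaviour comes from plain Thread#value, so it can be tried without the pool. A small sketch, assuming nothing beyond core Ruby (the concurrent_squares helper is a made-up name for illustration):

```ruby
# Hypothetical helper: square each number in its own thread, then collect.
def concurrent_squares(numbers)
  futures = numbers.map do |n|
    Thread.new do
      sleep 0.01 # stand-in for slow IO
      n * n
    end
  end
  futures.map(&:value) # waits for each thread in turn, preserving order
end

puts concurrent_squares([1, 2, 3]).inspect # [1, 4, 9]

# An exception raised inside a thread is re-raised by #value.
failing = Thread.new do
  Thread.current.report_on_exception = false # keep stderr quiet for the demo
  raise ArgumentError, "boom"
end

begin
  failing.value
rescue ArgumentError => e
  puts "caught: #{e.message}" # caught: boom
end
```

Note that the second half only works this way for a bare thread: inside the pool, abort_on_exception is switched on, so the failure reaches the main thread before you get to ask for the value.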

The shiny

I almost forgot the syntax sugar that lets you say go in your favourite Ruby REPL:

module Kernel
  def go(&block)
    ($pool ||= Pool.new).go(&block)
  end
end

Adding the go method directly on Kernel means it’ll be globally available absolutely everywhere in your VM. Probably not the best idea, but hey, Ruby lets you do it!

Caveat emptor

There are a few major caveats here, so please don’t do the above in production code. Experts have built better libraries for this; this one is for educational purposes. Don’t say you weren’t warned:

  • Threaded code is bloody hard to integration test, as is everything with threads and parallelism.
  • If you limit the pool size, you’ll get deadlocks rapidly unless you very carefully think about what you’re spawning (for instance, a go block that itself calls go can sit on the last free slot forever).
  • If you don’t, your machine may or may not catch fire (or in my case, your API token may well get banned!)

Threads v processes

At HouseTrip, we generally avoid the use of threading for scaling purposes, and follow the Unix Way (and, incidentally, the 12 factor way) of scaling out through the process model.

But cases exist where concurrency (not scaling) is useful. The example I’ve used (calling on a bunch of external resources and processing the results) is a typical one. Another classic, which will sound more familiar to web engineers, is firing database queries “in parallel” before rendering your views.

To elaborate: in this concurrency example you want the “threads” to interact, so they can for instance share/reuse HTTP connections, and then provide an aggregate result.

In a “scaling out with threads” example, the threads also interact (you don’t have a choice: the memory and object space are shared), but there it’s actually a problem, as one thread has the ability to corrupt or disrupt all the others.
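The concurrency case (fan out, then aggregate) can be sketched with bare threads. Everything here is a stand-in: fetch_user, fetch_bookings and load_page_data are hypothetical names, and the sleep calls merely simulate slow queries.

```ruby
# Hypothetical stand-ins for two independent slow queries.
def fetch_user
  sleep 0.2
  { name: "Ada" }
end

def fetch_bookings
  sleep 0.2
  [:paris, :london]
end

# Fire both lookups at once, then wait for both and combine the results.
def load_page_data
  user     = Thread.new { fetch_user }
  bookings = Thread.new { fetch_bookings }
  { user: user.value, bookings: bookings.value }
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
page    = load_page_data
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

puts page.inspect
puts format("took %.2fs", elapsed)
```

Because both threads spend their time waiting rather than computing, the total should land close to the slower of the two calls instead of their sum.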

Tools of the thread

Here are some pointers to get you started, should you want to leverage concurrency in your Ruby VM:

  • Ilya Grigorik has built agent, a much more elaborate (and tested) version of the above.
  • Another classic library, offering the Actor model in a more traditional OO fashion, is Celluloid.
  • For a different approach to the same problem, check out async/await in .NET 4.5. Always good to see how the other side does it.

That’s all, go have fun with threads now!

Julien Letessier

Software Engineer