The evolution of background job frameworks in Ruby
A benefit of spending a long time in the industry is context. Instead of seeing the technology we use today as an inevitable conclusion, you get to see how it developed over time, generation after generation, as successors patched the weaknesses of their predecessors. As one common example, as obtuse as a technology like CORS appears at first glance, it's easier to understand if you were around at a time when it was trivially easy to send a forged cross-site request by injecting a little JavaScript.
While helping to run the Heroku API circa 2011 to 2015, I was lucky enough to be around to witness the long evolution of async job frameworks in Ruby, a language that was my daily driver for a long time. Just for old time's sake, I thought it'd be fun to do a little time traveling and see what background jobs looked like at every step.
BackgrounDRb: The OG
With commits going back to 2008, BackgrounDRb was one of the original frameworks aimed at use with Rails. The "DRb" in its name comes from "distributed Ruby", a built-in protocol that lets Ruby processes invoke each other over a network, though BackgrounDRb would later drop its use.

As I was walking back through its Git history, I was surprised to find that BackgrounDRb had database-persisted jobs way back in 2008! Here's the original table definition:
```ruby
table_creation = <<-EOD
create table bdrb_job_queues (
  id integer not null auto_increment primary key,
  args blob,
  worker_name varchar(255),
  worker_method varchar(255),
  job_key varchar(255),
  taken tinyint,
  finished tinyint,
  timeout int,
  priority int,
  submitted_at datetime,
  started_at datetime,
  finished_at datetime,
  archived_at datetime,
  tag varchar(255),
  submitter_info varchar(255),
  runner_info varchar(255),
  worker_key varchar(255)
) ENGINE=InnoDB;
EOD

connection = ActiveRecord::Base.connection
begin
  connection.execute(table_creation)
end
```

Notably, we didn't have our elaborate modern constructs like `jsonb` back then, and you can even see near the bottom that MySQL's InnoDB engine is invoked specifically. This was before Postgres was widely used, and MySQL was still transitioning away from MyISAM, which was fast for reads but didn't have support for niceties like transactions, foreign keys, or crash safety.
BackgrounDRb prescribed manual lifecycle management where implementations invoked `#finish!` to signal a successful run and `#release_job` to hand it back to the queue for a re-run:
```ruby
class EmailWorker < BackgrounDRb::MetaWorker
  set_worker_name :email_worker

  def create(args = nil)
    logger.info "email worker started"
  end

  def send_email(args)
    address = args[:address]
    body = args[:body]

    logger.info "sending email to #{address}"
    EmailService.deliver(address, body)

    # Mark the job as finished so it's not picked up again.
    persistent_job.finish! if persistent_job
  rescue StandardError => e
    logger.error "failed to send email to #{address}: #{e.message}"

    # Release the job back to the queue so another worker
    # picks it up on the next poll cycle.
    persistent_job.release_job if persistent_job
  end
end
```

It's functional, but had a few flaws that'd lead to it being superseded by new generations of async frameworks:
- The API introduced some risk of error in case the caller forgot a `#finish!` or `#release_job`.
- No built-in retry mechanism. Even if implementations all handled all their errors correctly, there wasn't an easy way to assign reasonable defaults around a backoff schedule, or permanently discard chronically failing jobs.
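For contrast, the kind of reasonable retry default that later frameworks would bundle in takes only a few lines. Here's a hypothetical sketch of a polynomial backoff schedule with a discard cap (the numbers and helper names are illustrative, not from any real framework):

```ruby
# Illustrative retry policy of the kind later frameworks bundled in by
# default. The constants and names here are made up for the example.
MAX_ATTEMPTS = 5

# Polynomial backoff: the delay grows quickly with each failed attempt,
# giving transient problems time to clear up before the next try.
def next_run_delay(attempts)
  attempts**4 + 5 # seconds: 5, 6, 21, 86, 261, ...
end

# After enough failures, stop retrying and discard (or archive) the job.
def discard?(attempts)
  attempts >= MAX_ATTEMPTS
end
```

Delayed::Job, covered next, would ship a default along very similar lines.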
Delayed::Job: Shopify scale
A competing framework from around the same time was Delayed::Job (DJ), extracted out of Shopify in 2008. Indeed, all the original commits from that period are from Tobias Lütke himself.
Looking at its original schema, there isn't too much to complain about. It takes a step up over BackgroundDRb in that it adds an attempts column to track the number of retries the job's needed along with a run_at column to schedule processing in the future. Permanently failed jobs have either their failed_at timestamp set, or are removed from the table:
```ruby
create_table :delayed_jobs, :force => true do |table|
  table.integer  :priority, :default => 0 # Allows some jobs to jump to the front of the queue
  table.integer  :attempts, :default => 0 # Provides for retries, but still fail eventually.
  table.text     :handler                 # YAML-encoded string of the object that will do work
  table.text     :last_error              # reason for last failure (See Note below)
  table.datetime :run_at                  # When to run. Could be Time.zone.now for immediately, or sometime in the future.
  table.datetime :locked_at               # Set when a client is working on this object
  table.datetime :failed_at               # Set when all retries have failed (actually, by default, the record is deleted instead)
  table.string   :locked_by               # Who is working on this object (if locked)
  table.string   :queue                   # The name of the queue this job is in
  table.timestamps
end
```

DJ locks jobs and distributes them to workers, trying to lock up to five at a time (`worker.read_ahead`) for economy:
```ruby
def reserve(worker, max_run_time = Worker.max_run_time)
  find_available(worker.name, worker.read_ahead, max_run_time).detect do |job|
    job.lock_exclusively!(max_run_time, worker.name)
  end
end
```

`#lock_exclusively!` performs an atomic compare-and-swap with `UPDATE ... WHERE locked_at IS NULL`, getting an empty result and moving on to the next row in case another worker has already reserved it.
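The compare-and-swap idea can be modeled in a few lines of plain Ruby. This is an illustrative in-memory sketch, not DJ's actual code, with a mutex standing in for the atomicity the database provides:

```ruby
# In-memory model of Delayed::Job-style locking. The mutex stands in
# for the atomicity of a single SQL UPDATE statement.
Job = Struct.new(:id, :locked_at, :locked_by)

LOCK = Mutex.new

def lock_exclusively(job, worker_name)
  LOCK.synchronize do
    # The CAS predicate: only proceed if nobody holds the job yet, just
    # like UPDATE ... WHERE locked_at IS NULL matching zero rows.
    return false unless job.locked_at.nil?
    job.locked_at = Time.now
    job.locked_by = worker_name
    true
  end
end

job = Job.new(1)
lock_exclusively(job, "worker-a") # => true: worker-a wins the job
lock_exclusively(job, "worker-b") # => false: already reserved
```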
It's not too far off what we use today, but it had a few drawbacks:

- Written before we had better support for this type of use case in databases. Notably, no use of `SKIP LOCKED`, which would become a powerful built-in for reducing contention between database clients.
- Though largely a Ruby problem that wouldn't see traction for a long time, Delayed::Job uses a classic forking model where every worker becomes its own process. This was the right thing to do at the time and aimed to work around Ruby's lack of real concurrency, but results in enormously memory-hungry apps, especially in places with large quantities of Ruby code (e.g. Shopify).
- Jobs are serialized in the table's `handler` field as YAML. Not the end of the world, but a bit of a Ruby-ism and a little weird:

```yaml
--- !ruby/object:Delayed::PerformableMethod
object: !ruby/ActiveRecord:User
  attributes:
    id: 42
method_name: :activate!
args: []
```
Another notable aspect of Delayed::Job is that it was our first async job framework to be put to use in the API at Heroku! We'd later migrate to a number of others over the years, but DJ got Heroku all the way through its Salesforce acquisition.
Resque: Stores with types
There was a moment in those early Ruby years when Redis really took off in a big way. It was extremely fast, and differentiated itself from a traditional key/value store like memcached by supporting a tasteful collection of high-level data types and operators (e.g. lists, sets, etc.) that made it flexible and fun to use. Like a conventional database, it could persist to disk and recover from crashes, but it was faster by virtue of the reduced bookkeeping it had to do.
Resque was a job queue that swapped out database persistence for Redis. It spun out of GitHub, and the original series of commits were from GitHub founder Chris Wanstrath.
Using Redis, Resque was able to pull off some neat tricks like atomically pushing and popping jobs onto Redis lists. Queueing a job is two simple operations, each a fast O(1):
- `SADD resque:queues <queue_name>`: Add a queue to a set of known queues.
- `RPUSH resque:queue:<queue_name> <json>`: Push a new job's args onto the end of a list of queue jobs.
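The whole enqueue/work cycle can be sketched in a few lines. This is illustrative rather than Resque's actual internals, and it uses a tiny in-memory stand-in for the Redis commands involved (a real deployment talks to an actual Redis server):

```ruby
require "json"
require "set"

# Minimal in-memory stand-in for the handful of Redis commands this
# sketch needs (a real worker talks to an actual Redis server).
class FakeRedis
  def initialize
    @sets  = Hash.new { |h, k| h[k] = Set.new }
    @lists = Hash.new { |h, k| h[k] = [] }
  end

  def sadd(key, member)
    @sets[key].add(member)
  end

  def rpush(key, value)
    @lists[key].push(value)
  end

  def lpop(key)
    @lists[key].shift
  end

  def smembers(key)
    @sets[key].to_a
  end
end

# Enqueue: register the queue in the known-queues set, then append the
# JSON-serialized job to the queue's list. Both operations are O(1).
def enqueue(redis, queue, klass, *args)
  redis.sadd("resque:queues", queue.to_s)
  redis.rpush("resque:queue:#{queue}", JSON.generate("class" => klass, "args" => args))
end

# Work: pop a job off the head of the queue's list.
def reserve(redis, queue)
  payload = redis.lpop("resque:queue:#{queue}")
  payload && JSON.parse(payload)
end

redis = FakeRedis.new
enqueue(redis, :email, "SendSignupEmail", 42)
reserve(redis, :email) # => {"class" => "SendSignupEmail", "args" => [42]}
```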
Getting out of the database could also be a major operational advantage around this time. It took a significant source of high-throughput churn out of your valuable, canonical source of truth for everything else (making recovery of your main database faster and easier), and it sidestepped the operational trouble a database's transactional model could cause, like slow locking and slow job discovery as bloat accumulated.
But this strength of Resque is also its biggest weakness: by enqueueing jobs outside of other database operations, we have no idea whether a job is actually ready to work or not. Consider the classic case: a service creates a new user, enqueueing a job to send them a signup email, but the really fast Resque worker tries to work that job before the service has committed its original transaction. The user to send the email to isn't found and the job fails.
The practical way to counteract this is to write jobs that tolerate an inconsistent data model that's the inherent result of lack of transactional guarantees. Here's a sample that reenqueues itself with a backoff until it successfully finds a user:
class SendSignupEmail @queue = :email
# Maximum number of times to retry when the user record isn't # visible yet (e.g. the creating transaction hasn't committed). MAX_RETRIES = 5
# Seconds to wait before each retry attempt. Increases with each # attempt to give the transaction time to commit. BACKOFF_SECONDS = [1, 2, 5, 10, 30]
def self.perform(user_id, attempt = 0) user = User.find_by_id(user_id)
if user.nil? if attempt < MAX_RETRIES sleep BACKOFF_SECONDS[attempt] || BACKOFF_SECONDS.last Resque.enqueue(self, user_id, attempt + 1) return else raise "User #{user_id} not found after #{MAX_RETRIES} retries, giving up" end end
SignupMailer.welcome(user).deliver endendIn a bit of an odd omission, like Delayed::Job before it, Resque also doesn't bundle in a built-in retry mechanism, instead suggesting the use of a separate resque-retry gem. Modularity of this sort was a popular design feature back then, but it does add some friction to the project's use, especially for core features like retries that more or less everyone can be expected to want to use.
Queue Classic: Postgres tailor-made
Although not quite as notable as the others on this list, Queue Classic was a homegrown approach that we used for a while at Heroku.
Postgres as a database is a pretty dominant solution these days, but back then it was a lot more novel. Queue Classic's innovation was to make job throughput faster by using Postgres primitives not available anywhere else:
- Listen/notify to immediately alert workers of a newly available job. This got jobs started faster and was easier on the database compared to a poll loop. Frameworks like River still use listen/notify techniques to this day.
- Postgres advisory locks as a fast way for contending workers (all looking for jobs to lock) to grab jobs instead of slower methods like `SELECT ... FOR UPDATE`.
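The advisory lock technique looks something like the following query. This is an illustrative sketch with a made-up table name, not Queue Classic's literal SQL:

```sql
-- Illustrative sketch (made-up table name, not Queue Classic's literal
-- SQL). pg_try_advisory_lock never blocks: it returns false right away
-- when another session already holds the lock, so a worker skips
-- contested rows instead of queueing up behind SELECT ... FOR UPDATE.
SELECT *
FROM queue_classic_jobs
WHERE q_name = 'default'
  AND pg_try_advisory_lock(id)
ORDER BY id
LIMIT 1;
```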
Heroku's API ran Queue Classic from 2012 to 2015 until we switched to Que.
Que: Postgres refined
Que was a nearly identical idea to Queue Classic, using Postgres-specific constructs to implement a better job queue. It came along a few years after Queue Classic (2013 compared to 2011), but was better assembled and better maintained, so we ended up moving over to it.
Though certainly advancements over the previous state of affairs, Que and Queue Classic had an Achilles heel that emerged from a not-fully-developed understanding of Postgres' MVCC model. In Postgres, table bloat occurs when dead rows are left in a table's heap because they're still visible to a transaction elsewhere in the system. QC/Que's capability to lock rows became severely diminished as table bloat increased because even though the dead rows weren't eligible to be worked, Postgres would still have to iterate through them to find live ones to lock. The problem was amplified by a model where every worker was locking its own rows rather than a single leader distributing work. One stray long-running query in the system could leave millions of dead rows in its wake and bring both these queues to their knees.
Database bloat putting pressure on database-backed job queues is a problem that's never fully been resolved, but Que improved its situation considerably by transitioning to a model where a single leader was responsible for locking jobs and sending them onto workers. Added locking difficulty from bloat was eaten only once per batch instead of by every worker in the system. Que would never go on to adopt SKIP LOCKED (I was a little surprised by this, but it's what the source code seems to show), but it'd continue to achieve something similar through the use of its Postgres advisory locks.
Sidekiq: Batteries included
Sidekiq is another Redis-based queue, but one that brought the right features at the right time and became a major hit in the wider Ruby community.
The product's key insights were that the feature set of background jobs could be substantially expanded (periodic jobs, unique jobs, rate limited jobs, etc.) and that a background jobs framework could be operationalized. In other words, users get a web UI right out of the box. Background jobs have a tendency of trending toward being one of the most important components in a system, so it was a big deal that not every separate user/company had to build out their own tooling.
GoodJob: Postgres for ActiveRecord
2020 saw the release of GoodJob, a Postgres-backed queue aimed squarely at Ruby on Rails. It calls itself a "second generation" ActiveJob backend compared to the first generation of Delayed::Job and Que.
Being Postgres centric, it uses a similar approach as Que in the use of listen/notify and advisory locks, but aimed for a simpler architecture by being only ActiveRecord compatible (though its line count has grown considerably since the 600 lines advertised in its original launch post). Interestingly, like Que it eschews SKIP LOCKED in favor of advisory locks only.
Solid Queue: The iPhone of queues
At Rails World 2023, DHH talked about how with the widespread transition to SSDs, they were having good luck replacing all their in-memory Redis caches with disk-backed MySQL instances. The switch incurred some performance cost at P50, but resulted in a massive 50% gain at P95 as their cache size capacity became effectively limitless (because it was disk instead of RAM).
Another major component would also be going back into the database. In the same keynote he announced Solid Queue, a universal job queue for Ruby on Rails.
Years earlier Rails had introduced Active Job, a unified queue API that worked with Sidekiq, Resque, Delayed::Job, etc., but Solid Queue was a major advancement in that it was a full queue implementation on its own. Given how fully-featured Rails is otherwise it was a fairly obvious next step -- Rails apps could now run background jobs with no third party software required.
Solid Queue didn't invent the genre, far from it. But like the iPhone or Google being relative latecomers to smartphones and search engines and going on to dominate their respective industries, Solid Queue took learnings from its predecessors and refined them into a polished, top-notch product:
- As fully featured as Sidekiq, but without the additional dependency and with full transactional consistency. Also, free.
- Ships its own UI (Mission Control).
- Learns the hard-won operational lessons of previous database-backed queues and uses newer database features like `SKIP LOCKED` to overcome them. Solid Queue uses a multi-worker, multi-threaded model where many workers are allowed, each locking its own jobs but able to distribute them to multiple worker threads inside its process. This is the gold standard for concurrency in Ruby right now.
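The `SKIP LOCKED` pattern that modern database-backed queues converged on looks something like this generic sketch (made-up table name, not Solid Queue's literal query):

```sql
-- Generic sketch of SKIP LOCKED job claiming (made-up table name).
-- Rows already locked by another worker are skipped rather than
-- waited on, so contending workers never line up behind each other.
SELECT id
FROM ready_jobs
ORDER BY priority, id
LIMIT 5
FOR UPDATE SKIP LOCKED;
```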
Heroku's API moved away from Rails and onto its own custom stack back around Rails 2.3, and over the years went from Delayed::Job to Queue Classic to Que. In retrospect, a better path would've been to stick to Rails (3.0 and 3.1 would fix a lot of the problems we had with it back in the early 2010s) and eventually have the async jobs make their way onto Solid Queue, but that would've been hard to predict at the time.
Where does River fit in?
Go came along later than Ruby and never quite developed the same vibrant ecosystem of async frameworks, but projects like River still take advantage of lessons learned in its sister language over the years:
- Single-dependency (i.e. a database, but no Redis), taking full advantage of transactional isolation so that jobs are worked only when they're ready.
- Uses `SKIP LOCKED` and a single-leader model where one actor is responsible for locking jobs and distributing them to workers, both of which reduce lock contention.
- Ships with a huge feature set (retry schedules, periodic jobs, unique jobs, etc.) and a full, first-party UI.
It's not directly comparable to the projects above, but you can use it to send performance-sensitive jobs from Ruby to Go, and feature-wise, it stands up well to any of them including Solid Queue.