Transactional enqueueing

Transactional enqueueing is a key benefit of River. This model avoids several common failure modes of background jobs in a distributed application. It reduces the time spent investigating or engineering around distributed systems edge cases, and results in a simpler architecture.

Failures with two primary stores

The alternative to River's approach of putting a job queue in your main database is the traditional model of having two data stores: a primary database and a secondary store like Redis where jobs are enqueued. While largely functional, the two-store approach can lead to data loss in edge cases that is nearly impossible to fully reconcile.

Enqueue after transaction

Imagine building a user signup flow. The frontend submits an email and password to the backend application's POST /users route. This request opens a database transaction to insert the user record into Postgres, which completes successfully. Then the backend application attempts to enqueue a job in Redis to send a welcome email to the user. ⚡ Zap! The server just lost power. ⚡ That user will never receive their welcome email.

This chain of events might sound familiar to any seasoned backend developer. If it's not a power loss, it could be a program panic, a network interruption, or any number of other failure modes that are possible when coordinating between two independent data stores (Postgres and Redis).

While such events may sound unlikely, in practice they turn out to be a regular frustration, especially at nontrivial scale.
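The failure described above can be sketched with a toy model in Go. This is a minimal illustration and not real Postgres or Redis code: the twoStores type, the signupEnqueueAfter function, and the crash flag are hypothetical names invented for the example, and two in-memory slices stand in for the two stores.

```go
package main

import (
	"errors"
	"fmt"
)

// errCrash stands in for a power loss, panic, or network interruption.
var errCrash = errors.New("simulated crash between commit and enqueue")

// twoStores is a hypothetical model of two independent data stores.
type twoStores struct {
	users []string // rows committed to the primary database
	jobs  []string // jobs enqueued in the secondary store
}

// signupEnqueueAfter commits the user first and enqueues the welcome email
// second. When crash is true, the process "dies" between the two steps.
func (s *twoStores) signupEnqueueAfter(email string, crash bool) error {
	s.users = append(s.users, email) // step 1: the database transaction commits
	if crash {
		return errCrash // step 2 never runs, so the job is lost
	}
	s.jobs = append(s.jobs, "welcome_email:"+email) // step 2: enqueue the job
	return nil
}

func main() {
	s := &twoStores{}
	err := s.signupEnqueueAfter("user@example.com", true)
	fmt.Printf("err=%v users=%d jobs=%d\n", err, len(s.users), len(s.jobs))
}
```

After the simulated crash the user row exists but no job does, and nothing in either store records that a job is missing. That is why this kind of loss is so hard to reconcile after the fact.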

Enqueue before transaction completes?

In the previous example, the developer enqueued the job after the primary database transaction completed. This ensured that the database changes were committed atomically (all at once or not at all), but it left open the possibility of the subsequent job never being enqueued. What if the developer tried the opposite approach and enqueued the job in Redis before the Postgres transaction commits?

Naturally, this developer also built their Redis job worker in Go. Their worker is so fast that it picks up the new job in only a couple of milliseconds. As the worker queries the database to load the user record by its ID, it hits an error: it seems the user does not exist in the database yet.

The diligent developer notices an error in their exception tracker and immediately digs in. They are puzzled to see that the POST /users request was successful, yet somehow their worker could not find the user record in the database. How could that be?

The answer is that the job was fetched from the Redis queue before the Postgres transaction committed the new user record, and thanks to the rules around transaction visibility, the worker could not yet see that row when it queried for it. Or maybe the API encountered a subsequent error which caused the transaction to roll back, so the user record was never actually committed. Or maybe the server encountered another power failure before it could commit.
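The visibility problem can be sketched the same way. The toy model below assumes a database whose uncommitted rows are invisible to other connections, which is how the worker experiences Postgres in this scenario. The database, transaction, and sendWelcomeEmail names are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// database is a toy model: rows staged in an open transaction stay invisible
// to other connections until commit.
type database struct {
	committed map[int]string // user ID -> email
}

// transaction holds rows that have not been committed yet.
type transaction struct {
	db     *database
	staged map[int]string
}

func newDatabase() *database { return &database{committed: map[int]string{}} }

func (d *database) begin() *transaction {
	return &transaction{db: d, staged: map[int]string{}}
}

func (t *transaction) insertUser(id int, email string) { t.staged[id] = email }

// commit makes the staged rows visible to every connection.
func (t *transaction) commit() {
	for id, email := range t.staged {
		t.db.committed[id] = email
	}
	t.staged = map[int]string{}
}

// sendWelcomeEmail is what a worker on a separate connection would do:
// load the user by ID, which only sees committed rows.
func sendWelcomeEmail(d *database, userID int) error {
	if _, ok := d.committed[userID]; !ok {
		return fmt.Errorf("user %d not found", userID)
	}
	return nil
}

func main() {
	db := newDatabase()
	tx := db.begin()
	tx.insertUser(1, "user@example.com")

	queue := []int{1} // job enqueued in the second store before commit

	fmt.Println("before commit:", sendWelcomeEmail(db, queue[0]))
	tx.commit()
	fmt.Println("after commit:", sendWelcomeEmail(db, queue[0]))
}
```

A worker that runs before the commit fails to find the user. If the transaction is rolled back or the server dies instead of committing, the job stays in the queue pointing at a user that will never exist.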

A simpler model

Transactional enqueueing solves all of the above problems, and it does so without needing to operate an additional service outside the primary Postgres database. When you enqueue a job in River, you can do so in the same transaction as any other changes you're making, such as inserting a user record or adding a corresponding profile record. This means that when a worker picks up a job, it can rely on the fact that any data it depends on was already committed along with the job itself.
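As a rough illustration of why this works, the sketch below extends the toy model so that one transaction stages both the user row and the job, and a single commit publishes both. It does not use River's API or a real database: store, transaction, insertUser, and insertJob are hypothetical names, and the in-memory commit only imitates the atomicity that Postgres provides.

```go
package main

import "fmt"

// store is a toy stand-in for a single Postgres database that holds both
// application rows and the job queue.
type store struct {
	users map[int]string // user ID -> email
	jobs  []int          // user IDs awaiting a welcome email
}

// transaction stages user rows and jobs together.
type transaction struct {
	store *store
	users map[int]string
	jobs  []int
}

func newStore() *store { return &store{users: map[int]string{}} }

func (s *store) begin() *transaction {
	return &transaction{store: s, users: map[int]string{}}
}

func (t *transaction) insertUser(id int, email string) { t.users[id] = email }

// insertJob stages a job in the same transaction as the data it depends on.
func (t *transaction) insertJob(userID int) { t.jobs = append(t.jobs, userID) }

// commit publishes the staged rows and the staged jobs in one step.
// A transaction that is rolled back, or lost in a crash, publishes neither.
func (t *transaction) commit() {
	for id, email := range t.users {
		t.store.users[id] = email
	}
	t.store.jobs = append(t.store.jobs, t.jobs...)
	t.users, t.jobs = map[int]string{}, nil
}

func main() {
	s := newStore()

	tx := s.begin()
	tx.insertUser(1, "user@example.com")
	tx.insertJob(1)
	fmt.Printf("before commit: users=%d jobs=%d\n", len(s.users), len(s.jobs))

	tx.commit()
	fmt.Printf("after commit: users=%d jobs=%d\n", len(s.users), len(s.jobs))
}
```

In this model there is no moment at which a worker can see the job without the user, or at which the user exists without its job. That is the guarantee the two-store approaches above cannot offer.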

When you build your system around transactional enqueueing, you spend less time tracking down and patching around distributed systems edge cases and more time focusing on building what matters. In the past this model was held back by poor implementations or Postgres limitations, but this is no longer the case: a modern Postgres job queue can easily scale to tens of thousands of jobs per second.

We believe this should be the default model for building reliable systems, appropriate for all but the very largest applications.