River includes a number of auxiliary services for features and queue maintenance. These perform functions like cleaning cancelled, completed, and discarded jobs from the database, periodically rebuilding indexes to optimize performance, and rescuing stuck jobs. One client at a time runs maintenance services, determined by leader election.
River leaves cancelled, completed, and discarded jobs in the database even though they're at the end of their lifecycle so that operators can introspect them, but if left to accumulate forever, they'd eventually grow the jobs table to a point where its size would consume excessive storage and impact performance.
To prevent the jobs table from growing without bound, River includes a job cleaner process that periodically prunes old jobs. Each time it wakes, it deletes completed, cancelled, and permanently failed (discarded) jobs whose last attempt is older than the retention period for their state.
The default retention periods vary by job state, and each is configurable:
- CancelledJobRetentionPeriod, defaults to 24 hours.
- CompletedJobRetentionPeriod, defaults to 24 hours.
- DiscardedJobRetentionPeriod, defaults to 7 days.
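The retention rule can be sketched in a few lines of Go. This is an illustration only, not River's implementation: the state strings, the `retentionFor` and `prunable` helpers, and the idea of passing in a job's age directly are all assumptions made for the example. Only the three default durations come from the text above.

```go
package main

import (
	"fmt"
	"time"
)

// Default retention periods as listed in the documentation.
const (
	cancelledRetention = 24 * time.Hour
	completedRetention = 24 * time.Hour
	discardedRetention = 7 * 24 * time.Hour
)

// retentionFor returns the retention period for a finalized job state, and
// false for states the cleaner never prunes. Hypothetical helper.
func retentionFor(state string) (time.Duration, bool) {
	switch state {
	case "cancelled":
		return cancelledRetention, true
	case "completed":
		return completedRetention, true
	case "discarded":
		return discardedRetention, true
	}
	return 0, false
}

// prunable reports whether a job in the given state, whose last attempt
// ended `age` ago, is past its retention period. Hypothetical helper.
func prunable(state string, age time.Duration) bool {
	retention, ok := retentionFor(state)
	return ok && age > retention
}

func main() {
	fmt.Println(prunable("completed", 25*time.Hour))    // true
	fmt.Println(prunable("discarded", 25*time.Hour))    // false
	fmt.Println(prunable("running", 30*24*time.Hour))   // false
}
```

A discarded job survives much longer than a completed one at the same age, which gives operators more time to inspect failures.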
- Sleeps until the next run time of the first job that'll come due.
- Runs the job and any others that'll come due within a small margin of now.
- Recalculates the next run time of all jobs that ran.
- Goes back to sleep and repeats the cycle.
The periodic enqueuer has no configuration aside from its assigned periodic jobs.
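The cycle above can be sketched as a small simulation. Everything here is hypothetical (the `periodicJob` type, the `margin` value, and the helper names are not River's), and the sleep is replaced by advancing a simulated clock so the example finishes instantly.

```go
package main

import (
	"fmt"
	"time"
)

// periodicJob is a hypothetical stand-in for a configured periodic job.
type periodicJob struct {
	name     string
	interval time.Duration
	nextRun  time.Time
}

// margin is an assumed window: jobs due within it run in the same pass.
const margin = 100 * time.Millisecond

// earliest returns the soonest next run time. Assumes at least one job.
func earliest(jobs []*periodicJob) time.Time {
	first := jobs[0].nextRun
	for _, j := range jobs[1:] {
		if j.nextRun.Before(first) {
			first = j.nextRun
		}
	}
	return first
}

// runDue "runs" every job due by now+margin, recalculates its next run
// time, and returns the names of the jobs that ran.
func runDue(jobs []*periodicJob, now time.Time) []string {
	var ran []string
	for _, j := range jobs {
		if !j.nextRun.After(now.Add(margin)) {
			ran = append(ran, j.name)
			j.nextRun = now.Add(j.interval)
		}
	}
	return ran
}

// simulate performs the given number of wake/run/recalculate passes.
func simulate(passes int) []string {
	now := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	jobs := []*periodicJob{
		{name: "hourly", interval: time.Hour, nextRun: now.Add(time.Hour)},
		{name: "every_15m", interval: 15 * time.Minute, nextRun: now.Add(15 * time.Minute)},
	}
	var lines []string
	for i := 0; i < passes; i++ {
		now = earliest(jobs) // a real loop would sleep until this time
		lines = append(lines, fmt.Sprintf("%s %v", now.Format("15:04"), runDue(jobs, now)))
	}
	return lines
}

func main() {
	for _, line := range simulate(4) {
		fmt.Println(line)
	}
}
```

On the fourth pass both jobs come due at the same moment, so a single wake-up runs them together.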
Reindexer is disabled
The reindexer service needs more vetting before it's distributed broadly, and is effectively disabled for the moment.
The reindexer works periodically to issue a REINDEX INDEX CONCURRENTLY to rebuild certain key job indexes. In most situations reindexing isn't expected to improve performance, but it can help in some degenerate cases like where a glut of jobs had at one point bloated the B-tree index and subsequently left it with many empty or nearly empty pages. In such situations Postgres' indexes will never "collapse" of their own accord, but a REINDEX to rebuild them from scratch produces a new index with the live rows and without the empty space.
The reindexer rebuilds one index at a time in order to not put an undue amount of stress on the database.
By default the reindexer runs every day at midnight UTC, but it can be customized through Config.ReindexerSchedule with a custom scheduling function. Like with periodic jobs, a cron package can be used to succinctly define a complex schedule.
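As a rough illustration of what a scheduling function computes, the sketch below reproduces the "every day at midnight UTC" default. It assumes the schedule is expressed as a type with a `Next(current time.Time) time.Time` method; that shape, the `midnightUTCSchedule` type, and the `nextString` helper are assumptions for the example and should be checked against the River API before use.

```go
package main

import (
	"fmt"
	"time"
)

// midnightUTCSchedule is a hypothetical schedule that fires at the next
// midnight UTC strictly after the given time.
type midnightUTCSchedule struct{}

// Next returns the start of the following UTC day. time.Date normalizes
// day overflow, so month and year boundaries are handled automatically.
func (midnightUTCSchedule) Next(current time.Time) time.Time {
	current = current.UTC()
	return time.Date(current.Year(), current.Month(), current.Day()+1, 0, 0, 0, 0, time.UTC)
}

// nextString parses an RFC 3339 timestamp and formats the next run time.
// Demo helper only.
func nextString(rfc3339 string) (string, error) {
	t, err := time.Parse(time.RFC3339, rfc3339)
	if err != nil {
		return "", err
	}
	return midnightUTCSchedule{}.Next(t).Format(time.RFC3339), nil
}

func main() {
	for _, in := range []string{"2024-02-29T13:45:00Z", "2024-12-31T23:30:00-05:00"} {
		out, err := nextString(in)
		if err != nil {
			fmt.Println("parse error:", err)
			continue
		}
		fmt.Println(in, "->", out)
	}
}
```

The second input shows why the conversion to UTC happens first: 23:30 at UTC-5 is already 04:30 on January 1 in UTC, so the next midnight is January 2.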
The reindexer has an early implementation, but is a prospective feature that's effectively disabled (it runs, but rebuilds zero indexes). There are some fine details in reindexing like making sure to leave only valid indexes behind and not reindexing too aggressively that need to be tested thoroughly before the feature's enabled broadly.
The rescuer looks for "stuck" jobs and either enqueues them to be reworked, or discards them if they've hit their maximum allowed attempts. A job may become stuck in situations like:
- A bug. Think of a job that waits on a channel that nothing will ever send to, and which isn't using a select to respect context cancellation. The job waits for something that will never happen. The client will eventually try to cancel it according to its Config.JobTimeout configuration, but because the job can't be cancelled, nothing happens, and it'll only end once its parent process is terminated. It's important to design jobs to be cancellable to avoid this problem.
- After the job ran, there was a problem persisting its new state to the database. This problem should be rare, and can be avoided completely with the use of transactional job completion.
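The first situation is avoidable by waiting with a select that also watches the context. The sketch below uses only the standard library; `waitForSignal`, `runWithTimeout`, and `demoTimeout` are hypothetical names, and the timeout context stands in for whatever cancellation the client applies.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// demoTimeout is an arbitrary short timeout for the demonstration.
const demoTimeout = 50 * time.Millisecond

// waitForSignal blocks until a value arrives on ready or ctx is cancelled,
// whichever happens first.
func waitForSignal(ctx context.Context, ready <-chan struct{}) error {
	select {
	case <-ready:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runWithTimeout simulates a timeout being enforced on work that waits on
// a channel nothing will ever send to.
func runWithTimeout(timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	never := make(chan struct{}) // no goroutine ever sends on this
	return waitForSignal(ctx, never)
}

func main() {
	// Returns after the timeout instead of hanging forever.
	fmt.Println(runWithTimeout(demoTimeout)) // context deadline exceeded
}
```

Had `waitForSignal` used a bare `<-ready`, the call would never return and the timeout would have no effect.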
The duration after which a job is considered stuck and eligible for rescue can be configured with Config.RescueStuckJobsAfter. Its value:
- Defaults to one hour, or JobTimeout plus one hour in case JobTimeout has been configured to be larger than one hour.
- Must be greater than JobTimeout if both it and JobTimeout are set.
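The defaulting rule can be written out as a small function. This is a sketch, not River's code: `effectiveRescueAfter` is a hypothetical helper, a zero duration is assumed to mean "not configured", and the validation is assumed to apply only when both values are explicitly set.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultRescueAfter is the documented one-hour default.
const defaultRescueAfter = time.Hour

// effectiveRescueAfter resolves the rescue threshold from the two settings.
// A zero value is treated as unset (an assumption of this sketch).
func effectiveRescueAfter(jobTimeout, rescueAfter time.Duration) (time.Duration, error) {
	if rescueAfter == 0 {
		// Unset: one hour, or JobTimeout plus one hour for long timeouts.
		if jobTimeout > defaultRescueAfter {
			return jobTimeout + defaultRescueAfter, nil
		}
		return defaultRescueAfter, nil
	}
	if jobTimeout > 0 && rescueAfter <= jobTimeout {
		return 0, errors.New("RescueStuckJobsAfter must be greater than JobTimeout")
	}
	return rescueAfter, nil
}

func main() {
	fmt.Println(effectiveRescueAfter(0, 0))                     // 1h0m0s <nil>
	fmt.Println(effectiveRescueAfter(2*time.Hour, 0))           // 3h0m0s <nil>
	fmt.Println(effectiveRescueAfter(2*time.Hour, time.Hour))   // 0s plus an error
}
```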
The rescuer bounds job duration
Config.RescueStuckJobsAfter is effectively an upper bound on how long jobs are allowed to run, because jobs that are still running after this duration will be rescheduled to run again, potentially alongside an existing execution attempt for the same job.
Jobs can be scheduled to run in the future for several reasons:
- At insertion time, a ScheduledAt time was specified for the job.
- A worker may have snoozed the job to run again in the future.
- The job may have errored on a previous execution and needs to be retried after a backoff duration.
The scheduler executes at a constant interval. Each time it runs, it queries for jobs that are ready to be attempted again and makes them available. The scheduler runs every 5 seconds and is not configurable.
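One scheduler pass can be modeled in memory as follows. This is a simplified sketch: the real service queries the database, while here the `job` type, the `promoteDue` and `simulate` helpers, and the "scheduled" and "retryable" state names are assumptions for the example. Only the 5 second interval comes from the text above, and the clock is simulated rather than driven by a ticker.

```go
package main

import (
	"fmt"
	"time"
)

// schedulerInterval is the fixed interval described in the documentation.
const schedulerInterval = 5 * time.Second

// job is a hypothetical in-memory stand-in for a row in the jobs table.
type job struct {
	id          int
	state       string
	scheduledAt time.Time
}

// promoteDue marks waiting jobs whose scheduled time has arrived as
// available and returns how many were promoted.
func promoteDue(jobs []job, now time.Time) int {
	promoted := 0
	for i := range jobs {
		j := &jobs[i]
		waiting := j.state == "scheduled" || j.state == "retryable"
		if waiting && !j.scheduledAt.After(now) {
			j.state = "available"
			promoted++
		}
	}
	return promoted
}

// simulate runs the given number of passes, one interval apart, and
// returns the number of jobs promoted on each pass.
func simulate(ticks int) []int {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	jobs := []job{
		{id: 1, state: "scheduled", scheduledAt: start.Add(3 * time.Second)},
		{id: 2, state: "retryable", scheduledAt: start.Add(8 * time.Second)},
		{id: 3, state: "completed", scheduledAt: start},
	}
	var counts []int
	for i := 1; i <= ticks; i++ {
		now := start.Add(time.Duration(i) * schedulerInterval)
		counts = append(counts, promoteDue(jobs, now))
	}
	return counts
}

func main() {
	fmt.Println(simulate(3)) // [1 1 0]
}
```

Because the pass runs on a fixed interval, a job can become available up to one interval after its scheduled time: job 1 is due at 3 seconds but is promoted at the 5 second pass.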