Graceful shutdown

While stopping, a River client tries to halt jobs as gracefully as possible so that no jobs are lost, and so that any jobs that have to be cancelled are eligible to be reworked as soon as possible. Applications using River need to take care that stop is initiated correctly, that jobs are cancellable in case of a hard stop, and that jobs return an error when cancelled.

Stopping River client

A River client can be stopped with either Client.Stop or Client.StopAndCancel. Once either type of stop is initiated, the client will stop fetching new jobs and wait for existing jobs to complete. After all jobs finish up, their results are persisted to the database, and the client does some final cleanup. Stop and StopAndCancel block until done, or until the provided context is cancelled or timed out.

// Stop fetching new work and wait for active jobs to finish.
if err := riverClient.Stop(ctx); err != nil {
    panic(err)
}

// Same as the above, but instead of waiting for jobs to finish of their own
// volition, cancels their work context so they finish more quickly.
if err := riverClient.StopAndCancel(ctx); err != nil {
    panic(err)
}

The difference between the two stop functions is that StopAndCancel immediately cancels the work context of running jobs. It still waits for jobs to return and still persists the result, but jobs are expected to terminate more quickly with their context cancelled.

Even in the event of a hard stop (StopAndCancel), it's still important for the client to persist results so that the cancelled jobs can be picked up by another client to be worked as soon as possible.

Designing cancellable jobs

The Go programming language is designed in such a way that no goroutine can kill another. Instead, concurrency constructs are used to pass messages to other goroutines that instruct them to terminate. One of those constructs is the context, which is inherited across all components in a Go app in a tree structure and can be used to pass information or a cancellation signal. If a context high up in the tree is cancelled, all contexts derived from it are cancelled as well, which gives a Go program a way of stopping: all goroutines throughout the process respond to cancellation by exiting cleanly.
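The tree behaviour described above can be illustrated with the standard library alone. This is a minimal sketch, not River code: cancelTree is a hypothetical helper that derives several child contexts from one parent, cancels only the parent, and counts how many goroutines observe the cancellation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// cancelTree (hypothetical helper) starts n goroutines, each holding its own
// context derived from a shared parent, then cancels the parent and reports
// how many goroutines exited with context.Canceled.
func cancelTree(n int) int {
	parent, cancel := context.WithCancel(context.Background())

	var (
		wg      sync.WaitGroup
		mu      sync.Mutex
		stopped int
	)
	for i := 0; i < n; i++ {
		// Each child is a separate node in the tree below parent.
		child, childCancel := context.WithCancel(parent)
		wg.Add(1)
		go func() {
			defer wg.Done()
			defer childCancel()
			<-child.Done() // unblocks when the child or any ancestor is cancelled
			if child.Err() == context.Canceled {
				mu.Lock()
				stopped++
				mu.Unlock()
			}
		}()
	}

	cancel()  // cancel only the parent
	wg.Wait() // every goroutine notices and exits on its own
	return stopped
}

func main() {
	fmt.Println(cancelTree(3)) // prints 3
}
```

No goroutine is killed here. Each one returns by itself once its context reports cancellation, which is the same cooperation a job is expected to show.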

River is built entirely around the idea of context cancellation. Each worker's Work function receives a context as its first argument, and is expected to thread that context into all subsequent invocations that it makes. In the event of a hard stop via StopAndCancel, the context is cancelled and active jobs are expected to notice and return.

Many of the low-level components in Go already respect context cancellation and will return an error naturally, so as long as user code respects returned errors, it doesn't need to do any additional work. For example, an HTTP request through net/http will return an error wrapping context.Canceled as long as the worker's context was threaded into the request (be careful to use NewRequestWithContext instead of NewRequest):

resp, err := http.DefaultClient.Do(req)
if err != nil {
    return err // wraps context.Canceled if ctx was cancelled
}

The same generally goes for database drivers, SDKs, and other types of network communication. Context cancellation is respected at a low level, and will bubble back through user code with minimal effort.
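To make the HTTP case concrete, here is a self-contained sketch that can be run without River. The helper names fetch and fetchCancelled are hypothetical, and the slow local server exists only to keep the request in flight long enough to cancel it. Note that net/http wraps the cancellation error, so errors.Is is the reliable way to detect it.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// fetch (hypothetical helper) issues a GET bound to ctx and returns any
// error unchanged, the way a job's Work function would.
func fetch(ctx context.Context, url string) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err // wraps context.Canceled when ctx was cancelled
	}
	defer resp.Body.Close()
	return nil
}

// fetchCancelled runs fetch against a deliberately slow local test server,
// cancels the context mid-request, and reports whether the returned error
// matches context.Canceled.
func fetchCancelled() bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done(): // client went away
		case <-time.After(2 * time.Second): // safety net so the handler always returns
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(50*time.Millisecond, cancel) // stand-in for a hard stop

	return errors.Is(fetch(ctx, srv.URL), context.Canceled)
}

func main() {
	fmt.Println("cancelled request reported context.Canceled:", fetchCancelled())
}
```

The user code contains no cancellation logic of its own. Passing ctx into the request and returning the error is enough.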

However, there are cases where user code needs to be careful to respect context cancellation in its own right, especially around sends and receives on channels. Take the simplest example, a channel receive:

item := <-myChan // WRONG

A send on myChan might eventually be received by this code, but in the interim if the job's context is cancelled, it won't stop the job. This can be corrected by rewriting the code with a select to handle both conditions:

select { // RIGHT
case item := <-myChan:
    _ = item // placeholder: work with item here
case <-ctx.Done():
    return ctx.Err()
}

To ensure jobs can be cancelled quickly, all receives or sends on blocking channels should be in a select statement alongside a receive on ctx.Done().

In the event of a cancelled context, the code block above returns context.Canceled. This ensures that in the case of job cancellation, an error is written to the database and the job isn't accidentally lost (returning nil counts as a success). The job will be picked up by another client, or whenever a client is next available.
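Since the same select shape is needed for every blocking channel operation, it can be wrapped once and reused. The following is one possible sketch, assuming Go 1.18 or later for generics. The names recvOrDone, sendOrDone, and demoChannels are hypothetical and not part of River.

```go
package main

import (
	"context"
	"fmt"
)

// recvOrDone (hypothetical helper) receives from ch unless ctx is cancelled
// first, in which case it returns the zero value and ctx.Err().
func recvOrDone[T any](ctx context.Context, ch <-chan T) (T, error) {
	select {
	case v := <-ch:
		return v, nil
	case <-ctx.Done():
		var zero T
		return zero, ctx.Err()
	}
}

// sendOrDone (hypothetical helper) sends v on ch unless ctx is cancelled
// first, in which case it returns ctx.Err().
func sendOrDone[T any](ctx context.Context, ch chan<- T, v T) error {
	select {
	case ch <- v:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demoChannels uses an unbuffered channel with no peer goroutine, so a bare
// send or receive would block forever. With a cancelled context both helpers
// return promptly instead.
func demoChannels() string {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()

	ch := make(chan int)
	_, recvErr := recvOrDone(ctx, ch)
	sendErr := sendOrDone(ctx, ch, 1)
	return fmt.Sprintf("%v; %v", recvErr, sendErr)
}

func main() {
	fmt.Println(demoChannels()) // prints "context canceled; context canceled"
}
```

Both helpers return ctx.Err() rather than nil, so a caller that propagates the error reports the job as failed and it stays eligible to be worked again.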

Cancelled jobs must return an error

In the event of cancellation, jobs must return ctx.Err() or another error. Failing to do so would cause their result to be marked as a success (even if the client is stopping), and the job wouldn't be worked again. An errored job can be picked up and worked again by another client, or whenever a client is next available. See retries.
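Work that loops over items without making any blocking call has no natural point at which a cancellation error surfaces, so it needs an explicit check. A minimal sketch of that pattern follows, assuming a job that processes a slice item by item. processAll and demoProcess are hypothetical names used only for illustration.

```go
package main

import (
	"context"
	"fmt"
)

// processAll (hypothetical helper) handles items one at a time and checks
// for cancellation before each item. It returns the number of items handled
// and a non-nil error if it stopped early.
func processAll(ctx context.Context, items []string, handle func(string)) (int, error) {
	for i, item := range items {
		if err := ctx.Err(); err != nil {
			// Return the error, never nil: nil would record the job as a
			// success even though items remain unprocessed.
			return i, err
		}
		handle(item)
	}
	return len(items), nil
}

// demoProcess cancels the context while the second item is being handled,
// standing in for a hard stop that arrives mid-job.
func demoProcess() (int, bool) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	handled, err := processAll(ctx, []string{"a", "b", "c"}, func(item string) {
		if item == "b" {
			cancel()
		}
	})
	return handled, err != nil
}

func main() {
	fmt.Println(demoProcess()) // prints "2 true"
}
```

The third item is never handled, and the non-nil error is what signals that the job should be retried.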

Stuck programs

A goroutine can't terminate another goroutine, so in the event of a job that doesn't respect context cancellation, calls to Stop and StopAndCancel may hang forever.

Robustly designed programs should either have a supervisor terminate a process stuck on Stop or StopAndCancel after an appropriate timeout, or stop waiting on them.
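Where no external supervisor is available, one way to approximate it is an in-process watchdog timer. This is only a sketch under assumptions: stopWithWatchdog is a hypothetical helper, the stop argument stands in for a call such as riverClient.Stop(ctx), and in a real program onStuck would typically log and call os.Exit.

```go
package main

import (
	"fmt"
	"time"
)

// stopWithWatchdog (hypothetical helper) arms a timer, calls stop, and
// disarms the timer once stop returns. If stop is still blocked after limit,
// onStuck runs on the timer's goroutine. The boolean reports whether stop
// returned before the watchdog fired.
func stopWithWatchdog(stop func() error, limit time.Duration, onStuck func()) (bool, error) {
	watchdog := time.AfterFunc(limit, onStuck)
	err := stop()
	// Timer.Stop returns false when the timer has already fired.
	return watchdog.Stop(), err
}

// demoWatchdog simulates a stop call lasting stopMillis against a watchdog
// limit of limitMillis and reports whether the watchdog fired.
func demoWatchdog(stopMillis, limitMillis int) bool {
	fakeStop := func() error {
		time.Sleep(time.Duration(stopMillis) * time.Millisecond)
		return nil
	}
	// A real onStuck would terminate the process; here it does nothing so
	// that the demo can report the outcome.
	inTime, _ := stopWithWatchdog(fakeStop, time.Duration(limitMillis)*time.Millisecond, func() {})
	return !inTime
}

func main() {
	fmt.Println("fast stop, watchdog fired:", demoWatchdog(1, 500))
	fmt.Println("slow stop, watchdog fired:", demoWatchdog(200, 20))
}
```

Exiting from the watchdog has the same cost as any other unclean exit, so it is a last resort rather than a substitute for jobs that respect cancellation.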

Care should be taken to prevent this from happening, because failing to wait on stop runs the risk of River exiting uncleanly. In that case it may not have been able to persist the results of running jobs as it shut down, leaving them in running state. These jobs will eventually be rescued so they can be reworked, but not for an hour (see Config.RescueStuckJobsAfter), and their work will be considerably delayed.

All effort should be made to wait on stop

Applications using River should make all efforts to wait on Stop or StopAndCancel. Not doing so may leave jobs in running state, which won't be rescued for an hour, thereby causing considerable delay.

Jobs that force termination by not respecting cancellation and blocking StopAndCancel should be diagnosed posthaste to correct the problem.

Realistic shutdown code

See River's graceful shutdown example for what a realistic shutdown procedure might look like.

1. SIGINT/SIGTERM initiates a soft stop, giving running jobs a chance to finish up.
2. After a second SIGINT/SIGTERM or a 10 second timeout, a hard stop is initiated, instructing jobs to terminate immediately by cancelling their work contexts.
3. After a third SIGINT/SIGTERM or another 10 second timeout, the program stops waiting and exits immediately.

Similar code would be appropriate both for local development, where a developer sending Ctrl+C (SIGINT) would start a soft stop and a second Ctrl+C would trigger a hard stop, and on a platform like Heroku, which sends a SIGTERM and gives programs 30 seconds to finish up (thus the 10 second timeouts for each phase).

sigintOrTerm := make(chan os.Signal, 1)
signal.Notify(sigintOrTerm, syscall.SIGINT, syscall.SIGTERM)

go func() {
    <-sigintOrTerm
    fmt.Printf("Received SIGINT/SIGTERM; initiating soft stop (try to wait for jobs to finish)\n")

    softStopCtx, softStopCtxCancel := context.WithTimeout(ctx, 10*time.Second)
    defer softStopCtxCancel()

    go func() {
        select {
        case <-sigintOrTerm:
            fmt.Printf("Received SIGINT/SIGTERM again; initiating hard stop (cancel everything)\n")
            softStopCtxCancel()
        case <-softStopCtx.Done():
            fmt.Printf("Soft stop timeout; initiating hard stop (cancel everything)\n")
        }
    }()

    err := riverClient.Stop(softStopCtx)
    if err != nil && !errors.Is(err, context.DeadlineExceeded) && !errors.Is(err, context.Canceled) {
        panic(err)
    }
    if err == nil {
        fmt.Printf("Soft stop succeeded\n")
        return
    }

    hardStopCtx, hardStopCtxCancel := context.WithTimeout(ctx, 10*time.Second)
    defer hardStopCtxCancel()

    // As long as all jobs respect context cancellation, StopAndCancel will
    // always work. However, in the case of a bug where a job blocks despite
    // being cancelled, it may be necessary to either ignore River's stop
    // result (what's shown here) or have a supervisor kill the process.
    err = riverClient.StopAndCancel(hardStopCtx)
    if err != nil && errors.Is(err, context.DeadlineExceeded) {
        fmt.Printf("Hard stop timeout; ignoring stop procedure and exiting unsafely\n")
    } else if err != nil {
        panic(err)
    }

    // hard stop succeeded
}()

<-riverClient.Stopped()