Concurrent repack: Vacuum full without the pain

Brandur Leach

Table bloat is an operational problem in Postgres as old as Postgres itself. There's a serious kind of bloat stemming from rows that are deleted but not yet vacuumed, which can lead to degraded performance in high throughput tables. It's manageable by monitoring for long-running queries (which hold back the horizon past which vacuum can remove dead rows) and letting vacuum do its job, but it takes vigilance to keep query performance from degrading.
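
As a hedged aside, the pg_stat_user_tables view makes this kind of bloat easy to watch for: it tracks an estimate of each table's dead rows along with when vacuum last visited. A quick check might look something like this sketch:

-- Tables with the most dead rows, and when autovacuum last visited them.
-- n_dead_tup is an estimate maintained by the statistics subsystem.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;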

There's another kind of bloat too. In Postgres, plain vacuum marks space as reusable for new inserts/updates, but aside from truncating empty pages from the very end of a table, it never returns that space to the operating system. Most of the time that's not a big problem, since tables tend to trend upward in disk requirements (so sooner or later the space gets reused), but where a table has gotten very large and since shrunk to be very small, it leaves considerable wasted space.

Here's a table with 0 rows that occupies 275 MB:

river_dev=# SELECT count(*) FROM river_job;
count
-------
0
river_dev=# SELECT pg_size_pretty(pg_total_relation_size('river_job'));
pg_size_pretty
----------------
275 MB

A VACUUM FULL can solve the problem through a full table rewrite, but VACUUM FULL is one of those blunt tools that has to be avoided in a real production database. It takes an ACCESS EXCLUSIVE lock, so for as long as it runs, all other operations on the table are blocked: all deletes, inserts, updates, and even basic reads, leading to total degradation and downtime.
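
If you've ever had to diagnose one of these pile-ups, pg_stat_activity will show who's stuck. Here's a rough sketch using pg_blocking_pids (available since Postgres 9.6) to list the sessions waiting on a lock and the backends blocking them:

-- Sessions currently waiting on a lock, and the PIDs blocking each one.
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'Lock';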

We've since seen some improvements in this situation, like the introduction of REINDEX CONCURRENTLY (not to be confused with REPACK (CONCURRENTLY), though the names look similar), which rebuilds an index without an exclusive lock. It's very helpful because indexes suffer from the same fragmentation problem that tables do, and in aggregate they can occupy as much space as the table itself.
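
For example, to check how much of a table's footprint is indexes and rebuild all of them without an exclusive lock (REINDEX TABLE CONCURRENTLY has been available since Postgres 12; river_job is the table from the examples below):

-- Combined size of all indexes on the table.
SELECT pg_size_pretty(pg_indexes_size('river_job'));

-- Rebuild every index on the table without blocking reads or writes.
REINDEX TABLE CONCURRENTLY river_job;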

But the core problem of the fragmented table has been with us for 20+ years now. When I was in the managed Postgres industry and a client asked how to reclaim space from a large table in a production database, it was always a little embarrassing to have to say, "well, you can't" (or perhaps more accurately, "if you value uptime, you shouldn't").

Concurrent repack

A very exciting upcoming Postgres 19 feature is the addition of REPACK (CONCURRENTLY) from Antonin Houska. Though accomplishing an outcome similar to VACUUM FULL, it's implemented in a considerably more production-friendly way:

  • Launches a background worker that sets up logical decoding to listen for table changes.
  • Gets an initial snapshot, creates a new heap with identical structure, and copies visible tuples from one to the other.
  • Replays logical decoding changes to the new heap until it's caught up.
  • Takes an ACCESS EXCLUSIVE lock, does a final catch-up pass, and swaps the old heap for the new.

Unlike VACUUM FULL, the ACCESS EXCLUSIVE lock is short-lived, allowing normal database operations to continue for all but a short period near the end while the repack is running.
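
To make the shape concrete, here's a hand-rolled sketch of the same copy-and-swap idea in plain SQL (an illustration only, not how the patch is implemented). The crucial difference is that this naive version has to block writers for the entire copy, while REPACK (CONCURRENTLY) lets logical decoding absorb concurrent changes and takes the exclusive lock only for the final swap:

BEGIN;
-- Block all writes up front so no changes are lost during the copy. This
-- long-held lock is exactly what REPACK avoids.
LOCK TABLE river_job IN ACCESS EXCLUSIVE MODE;
CREATE TABLE river_job_new (LIKE river_job INCLUDING ALL);
INSERT INTO river_job_new SELECT * FROM river_job;
ALTER TABLE river_job RENAME TO river_job_old;
ALTER TABLE river_job_new RENAME TO river_job;
DROP TABLE river_job_old;
COMMIT;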

[Diagram: three-step heap migration showing copy, replay, and cutover]

Example with River

Let's show a quick example with River. First we'll run a River benchmark that seeds its jobs table with a million rows, then we'll remove them with a DELETE FROM. Immediately afterwards, river_job is still enormous:

river_dev=# DELETE FROM river_job;
DELETE 1000000
river_dev=# SELECT pg_size_pretty(pg_total_relation_size('river_job'));
pg_size_pretty
----------------
775 MB

Running a normal VACUUM strips the table of dead rows and even reduces its size considerably (plain vacuum can truncate empty pages from the end of the heap), but it can't reclaim the free space scattered through the rest of the table:

river_dev=# VACUUM river_job;
VACUUM
river_dev=# SELECT pg_size_pretty(pg_total_relation_size('river_job'));
pg_size_pretty
----------------
275 MB

VACUUM FULL solves the problem completely, but as discussed above, it's never suitable to run in production due to the heavy-handed exclusive lock it holds for the duration of its run:

river_dev=# VACUUM FULL river_job;
VACUUM
river_dev=# SELECT pg_size_pretty(pg_total_relation_size('river_job'));
pg_size_pretty
----------------
80 kB

Now here's where we get to the good part. After resetting the table to its bloated state (seeding another million jobs and deleting them again), we can see that REPACK (CONCURRENTLY) produces the same effect, but safely:

river_dev=# REPACK (CONCURRENTLY, VERBOSE, ANALYZE) river_job;
INFO: repacking "public.river_job" in physical order
INFO: "public.river_job": found 0 removable, 0 nonremovable row versions in 64246 pages
DETAIL: 0 dead row versions cannot be removed yet.
CPU: user: 0.09 s, system: 0.07 s, elapsed: 0.17 s.
INFO: analyzing "public.river_job"
INFO: "river_job": scanned 0 of 0 pages, containing 0 live rows and 0 dead rows; 0 rows in sample, 0 estimated total rows
INFO: finished analyzing table "river_dev.public.river_job"
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
buffer usage: 77 hits, 0 reads, 0 dirtied
WAL usage: 0 records, 0 full page images, 0 bytes, 0 full page image bytes, 0 buffers full
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
REPACK
Time: 1515.646 ms (00:01.516)
river_dev=# SELECT pg_size_pretty(pg_total_relation_size('river_job'));
pg_size_pretty
----------------
80 kB

With the VERBOSE and ANALYZE options enabled, Postgres tells us that it iterated over 64,246 pages while repacking, all of them dead space since there were no live rows left in the table.
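
A heap page defaults to 8 kB, so 64,246 pages works out to roughly 500 MB of heap (pg_total_relation_size above also counts indexes and TOAST, which is where the rest of the earlier 775 MB went). A table's page count is easy to sanity check from pg_class:

-- relpages is the page count as of the last VACUUM/ANALYZE; multiplying
-- by the default 8 kB block size approximates the heap's size.
SELECT relname,
       relpages,
       pg_size_pretty(pg_relation_size(oid)) AS heap_size
FROM pg_class
WHERE relname = 'river_job';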

In a busy database

To demonstrate the difference more fully, I ran VACUUM FULL while River's benchmarks were running. You can see throughput dip to 0 jobs/sec while the vacuum held an exclusive lock:

bench: jobs worked [ 19216 ], job/sec [ 9608.0 ] [2s]
bench: jobs worked [ 0 ], job/sec [ 0.0 ] [2s] <-- VACUUM
bench: jobs worked [ 0 ], job/sec [ 0.0 ] [2s] running
bench: jobs worked [ 54478 ], job/sec [ 27239.0 ] [2s]
bench: jobs worked [ 110523 ], job/sec [ 55261.5 ] [2s]

The outage here only lasts ~4s because we're dealing with ~500 MB of data, which Postgres can rewrite quickly. But 500 MB is small for a production table; in a real-world situation the outage could last a lot longer and cause serious downtime.

Doing the same with REPACK (CONCURRENTLY) shows a dip in throughput as the repack is running, but it never goes to 0. All insert/update/delete/select operations are allowed to continue normally:

bench: jobs worked [ 62037 ], job/sec [ 31018.5 ] [2s]
bench: jobs worked [ 12011 ], job/sec [ 6005.5 ] [2s] <-- REPACK
bench: jobs worked [ 4005 ], job/sec [ 2002.5 ] [2s] running
bench: jobs worked [ 30017 ], job/sec [ 15008.5 ] [2s]
bench: jobs worked [ 80038 ], job/sec [ 40019.0 ] [2s]

The panacea of high churn?

I'm jumping the gun writing about REPACK (CONCURRENTLY) because it's not expected to ship until Postgres 19 later this year (at the soonest), but it's such a game-changing feature that I couldn't help but try it in advance. We plan to have River ready to make use of it on day one, with an opt-in repacker service that'll repack high throughput tables on a regular schedule.

Job queues aside, Postgres has had a problem for years in that any high throughput table (tasks, events, audit logs) can develop significant bloat, and short of VACUUM FULL there has never been a good answer for what to do about it. Concurrent repack will be a long-awaited first party solution at last, and it's exciting: the biggest operational upgrade to land in Postgres in quite some time.