What we learned chasing replication lag in Postgres 16
One of our larger customers had a Postgres 16.2 cluster with three async replicas. For two weeks they were chasing a phantom 8–12 second lag on one specific replica that always cleared itself within five minutes. Their on-call told us it “probably wasn’t important” but they wanted to understand it. So did we.
The temptation: blame disk
The instinct is always disk: “the replica is on slower NVMe, the WAL apply can’t keep up”. We checked. iostat showed <15% utilization on the replica during lag windows. pg_stat_wal_receiver showed flush_lag < 50 ms. The replica was receiving WAL just fine; it just wasn’t replaying it.
The actual cause
It was a large analytical query on the replica. A nightly job ran a 12-minute aggregation against a 1.2 TB orders table. Postgres’s hot_standby_feedback=on kept the replication slot open while that query ran, but the apply WAL replay was actually pausing on a DROP INDEX CONCURRENTLY issued earlier on the primary. The replay couldn’t take the AccessExclusiveLock because the long query was holding AccessShareLock on the same table.
So the timeline was:
- 00:00 — nightly aggregation starts on replica, takes AccessShareLock on orders.
- 00:04 — primary issues a DROP INDEX CONCURRENTLY on orders.
- 00:04 — WAL arrives at replica with the lock-acquire record. Replay blocks.
- 00:12 — aggregation finishes, lock released, replay catches up in < 5 s.
How we found it
Three queries pointed straight at the answer:
-- 1) is replay actually waiting on a lock? SELECT * FROM pg_stat_activity WHERE backend_type = 'startup' AND wait_event IS NOT NULL; -- 2) what's holding the conflicting lock? SELECT pid, mode, locktype, relation::regclass, granted FROM pg_locks WHERE NOT granted; -- 3) on the primary, what was the WAL doing at that timestamp? SELECT pg_walfile_name(pg_current_wal_lsn());
The fix was unglamorous. We split the analytical query into ten smaller transactions of about 70 seconds each (each one within max_standby_streaming_delay), and we now schedule schema changes for a window that doesn’t overlap nightly reporting. Lag has been < 200 ms ever since.
The general lesson
Replication lag in modern Postgres is almost never about hardware. The replay process is single-threaded, and any AccessExclusive lock the primary takes (CREATE INDEX, DROP INDEX, VACUUM FULL, CLUSTER, REINDEX) will pause replay if a replica session is holding even a low-grade lock on the same object.
If you’re seeing “mystery lag” that clears itself in 5–15 minutes, before you start tuning checkpoint_timeout or buying more NVMe, just look at pg_stat_activity on the replica during the lag window. Nine times out of ten the answer is one row away.
Got a Postgres mystery you can’t crack? Send it our way.