Blue-green schema migrations without breaking production
One of the things that comes up almost every week on customer calls is some variation of: “We need to add a column with a default value to a 400 GB table, and our last attempt locked the table for 23 minutes.” Here’s the playbook we use.
The rule
Every schema change goes through three deploys, never one:
- Expand — add the new shape, leave the old shape working.
- Migrate — backfill data, dual-write from the application.
- Contract — remove the old shape only after no reader / writer touches it.
The mistake teams make is doing all three in the same deploy because “the migration is just one ALTER”. It’s never just one ALTER.
Adding a column the boring way
Postgres 11+ made ADD COLUMN ... DEFAULT O(1) for non-volatile defaults. So that one’s easy now:
ALTER TABLE orders ADD COLUMN region text DEFAULT 'eu' NOT NULL;
This takes milliseconds even on a 400 GB table. The default is stored as table metadata until rows get written.
Adding an index without locking
The boring approach:
CREATE INDEX CONCURRENTLY idx_orders_region ON orders (region) WHERE region IS NOT NULL;
Two things to know:
- You cannot run CREATE INDEX CONCURRENTLY inside a transaction. Disable transactional DDL in your migration tool, or use a separate runner.
- If it fails partway through, you’re left with an INVALID index. REINDEX INDEX CONCURRENTLY or just DROP INDEX and retry.
Renaming a column without downtime
Old name: user_email. New name: contact_email. The migration is four deploys, not one:
- Add contact_email, populate via trigger from user_email.
- Backfill historic rows in 1000-row batches.
- Update application code to read+write contact_email; trigger now copies the other direction.
- Drop the trigger, drop user_email.
Yes, it’s tedious. It’s also the only way to keep your service available the whole time.
Bloat: pg_repack vs VACUUM FULL
For tables that have been hammered with UPDATEs and need a real cleanup, never use VACUUM FULL on production. It takes an AccessExclusiveLock and rewrites the whole table inline. Use pg_repack instead. It runs concurrently with reads and writes and only takes a brief lock at the end to swap the relfilenode.
The replica gotcha
If you’re replicating to async standbys, every AccessExclusive lock you take on the primary will pause replay on the replicas. We covered this in detail in a previous post. The TL;DR: schedule schema changes outside of long-read windows, and consider max_standby_streaming_delay values that match your tolerance.
We’ll do this for you on Pro tier — just say the word.