Schema Migrations: The Zero-Downtime Pattern

Schema migrations are among the most operationally dangerous deploys. The expand-contract pattern makes them safe by keeping every intermediate state compatible with the code that is running.

Why naive migrations break

Naive schema migrations treat the schema change and the code change as one deploy, but the two never land at the same instant. The gap between applying the schema and finishing the code rollout is the failure window: production breaks when old code runs against the new schema, or new code against the old one.

Naive pattern. Drop the column, deploy code that no longer uses it, and hope nothing reads or writes it in between; the bet is that schema and code stay aligned.
Failure window. Between applying the schema migration and completing the code rollout; mid-rollout, half the pods still run old code against the new schema.
Cannot roll back. A destructive schema change is mostly irreversible; rollback requires a backup restore, which usually means downtime.
The fix. Expand-contract: every intermediate state is backward-compatible, so users see no change at any point.

Four-stage expand-contract

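Stage 1: add the new column as nullable. Code writes to both columns.
Stage 2: backfill the new column from the old one for existing rows.
Stage 3: code reads from the new column; it still writes both.
Stage 4: code stops writing the old column, then the old column is dropped.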
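
The four stages can be sketched against an in-memory SQLite database. This is a minimal illustration, not a production migration: the `users` table, the `fullname` (old) and `display_name` (new) columns, and every function name are hypothetical, and a real database would need its own locking and batching considerations. The demo stops after Stage 3 because `DROP COLUMN` requires SQLite 3.35 or newer.

```python
import sqlite3

def stage1_expand(conn):
    """Stage 1: add the new column as nullable; existing rows get NULL."""
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def write_user(conn, user_id, name):
    """Dual write, shipped with Stage 1: every write fills both columns."""
    conn.execute(
        "INSERT INTO users (id, fullname, display_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )

def stage2_backfill(conn, batch_size=1000):
    """Stage 2: copy old to new in small batches; returns rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = fullname "
            "WHERE id IN (SELECT id FROM users "
            "WHERE display_name IS NULL AND fullname IS NOT NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so locks are held briefly
        if cur.rowcount == 0:
            return total
        total += cur.rowcount

def read_user(conn, user_id):
    """Stage 3: the read path uses only the new column."""
    row = conn.execute(
        "SELECT display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

def stage4_contract(conn):
    """Stage 4: drop the old column, after code has stopped writing it.
    Needs SQLite 3.35+; not exercised in demo() below."""
    conn.execute("ALTER TABLE users DROP COLUMN fullname")

def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
    conn.executemany(
        "INSERT INTO users (id, fullname) VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    stage1_expand(conn)
    write_user(conn, 3, "Linus")          # new row arrives via dual write
    backfilled = stage2_backfill(conn, batch_size=1)
    return backfilled, [read_user(conn, i) for i in (1, 2, 3)]
```

Only the two pre-existing rows need backfilling here; the row written after Stage 1 already carries both values, which is why the dual write has to ship before the backfill starts.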

Per-stage failure modes

Each stage has a distinct failure mode and recovery path. The asymmetry matters: early stages are cheap to revert, late stages are expensive. Plan the soak time accordingly.

Stage 1 (expand) fails. Easy revert; the new column is nullable and nothing reads it yet; redeploy the old code to stop the dual writes, then drop the column.
Stage 2 (backfill) fails. Rerun the backfill; usually a transient issue (lock timeout, throughput); the column stays in place.
Stage 3 (read new) fails. Code revert; switch the read path back to the old column; dual writes keep both columns populated, so no data is lost.
Stage 4 (contract) fails. Rare but expensive; the column is already dropped, so recovery means restoring from backup; the soak time is your insurance.
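
The Stage 2 recovery path ("rerun the backfill") can be automated with a small retry loop. The sketch below assumes a hypothetical `run_batch` callable that backfills one batch and returns the number of rows it updated, and it treats `sqlite3.OperationalError` (for example "database is locked") as the transient error; another driver would raise a different exception type. Rerunning is only safe if each batch touches just the rows whose new column is still empty.

```python
import sqlite3
import time

def backfill_with_retry(run_batch, max_attempts=5, base_delay=0.1,
                        sleep=time.sleep):
    """Call run_batch() until it reports 0 rows, retrying transient errors.

    max_attempts counts consecutive failures; a successful batch resets it.
    Returns the total number of rows backfilled.
    """
    total = 0
    attempt = 0
    while True:
        try:
            updated = run_batch()
        except sqlite3.OperationalError:
            attempt += 1
            if attempt >= max_attempts:
                raise  # no longer looks transient; surface it to the operator
            sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
            continue
        attempt = 0
        if updated == 0:
            return total
        total += updated
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays.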

Rollback

Each stage is its own deploy with its own rollback path. Run Stage 4 only after Stage 3 has soaked for at least a week: rolling back Stage 4 means restoring data from backup, which is expensive enough that you want to avoid it.

Stage-by-stage deploys. Never combine stages; each gets its own PR, deploy, and observation window.
Soak time before Stage 4. Minimum one week between Stage 3 and Stage 4; production reads from the new column the whole time.
Backup before Stage 4. Snapshot the database before dropping the old column; the rollback path of last resort.
Document the migration. Per-migration timeline committed to the runbook; the next migration learns from this one.
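
The soak and backup rules above are easy to encode as a pre-flight check that a deploy script runs before Stage 4. This is a sketch under stated assumptions: the function name and the 24-hour freshness limit on the snapshot are illustrative choices, not part of the pattern itself; only the one-week soak comes from the rules above.

```python
from datetime import datetime, timedelta

MIN_SOAK = timedelta(days=7)          # one-week minimum between Stage 3 and Stage 4
MAX_BACKUP_AGE = timedelta(hours=24)  # assumption: how fresh the snapshot must be

def stage4_gate(stage3_deployed_at, backup_taken_at, now):
    """Return (allowed, reason) for dropping the old column at time `now`."""
    if now - stage3_deployed_at < MIN_SOAK:
        return False, "Stage 3 has soaked for less than one week"
    if backup_taken_at is None:
        return False, "no backup snapshot recorded"
    if now - backup_taken_at > MAX_BACKUP_AGE:
        return False, "backup snapshot is stale"
    return True, "ok"
```

Returning a reason alongside the boolean lets the same check feed the per-migration timeline in the runbook.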

Antipatterns

One-step migration with downtime. The outage is user-visible.
Stage 4 same day as Stage 3. No soak, so problems with the new read path surface only after the old column is gone.
No backfill verification. Stage 3 reads bad data.
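
A backfill can be verified before Stage 3 by counting rows where the two columns disagree. The sketch below is SQLite-specific: it relies on SQLite's `IS NOT` operator, which compares like `!=` but treats two NULLs as equal (PostgreSQL spells the same idea `IS DISTINCT FROM`). The function name and the table and column names passed to it are hypothetical.

```python
import sqlite3

def count_backfill_mismatches(conn, table, old_col, new_col):
    """Count rows where the new column differs from the old one.

    Table and column names cannot be bound as SQL parameters, so they are
    interpolated; pass only trusted, hard-coded identifiers.
    """
    sql = (
        f'SELECT COUNT(*) FROM "{table}" '
        f'WHERE "{new_col}" IS NOT "{old_col}"'
    )
    return conn.execute(sql).fetchone()[0]
```

A result of zero is the condition for moving on to Stage 3; anything else means the backfill should be rerun or investigated first.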

What to do this week

Three moves. (1) Apply this pattern to the next migration on your most-loaded table. (2) Measure query latency and write throughput before and after each stage. (3) Document the win and the constraint so the next refactor inherits the knowledge.