Schema Migrations: The Zero-Downtime Pattern
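Schema migrations are among the most operationally dangerous deploys, because the schema change and the code that depends on it rarely roll out at the same instant. Expand-contract makes them safe by splitting one risky change into several backward-compatible ones.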
Why naive migrations break
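Naive: drop the column, deploy code that doesn’t use it, and pray. Between the two steps, any instance still running the old code queries a column that no longer exists and fails.
Expand-contract: every step is backward-compatible with the code running before and after it, so users see no change.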
Four-stage expand-contract
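- Stage 1 (expand): add the new column as nullable. Deploy code that writes to both columns and still reads from the old one.
- Stage 2 (backfill): copy existing values from the old column into the new one for rows written before Stage 1.
- Stage 3 (switch reads): deploy code that reads from the new column. Keep writing both columns so a code revert stays safe.
- Stage 4 (contract): deploy code that stops writing the old column, then drop it.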
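The four stages above can be sketched end to end against an in-memory SQLite database. This is a minimal illustration, not a production migration: the `users` table and the `fullname` (old) and `display_name` (new) columns are hypothetical names chosen for the example, and a real system would run each stage as a separate deploy with its own migration tooling.

```python
import sqlite3

def stage1_expand(conn):
    # Stage 1: add the new column as nullable so existing rows and old code keep working.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

def write_user(conn, user_id, name):
    # Dual write (Stages 1-3): the same value goes to the old and the new column.
    conn.execute(
        "INSERT INTO users (id, fullname, display_name) VALUES (?, ?, ?)",
        (user_id, name, name),
    )

def stage2_backfill(conn):
    # Stage 2: fill the new column for rows written before Stage 1.
    conn.execute(
        "UPDATE users SET display_name = fullname WHERE display_name IS NULL"
    )

def read_user(conn, user_id):
    # Stage 3: reads come from the new column.
    row = conn.execute(
        "SELECT display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None

def stage4_contract(conn):
    # Stage 4: drop the old column. Code must have stopped writing it first.
    # ALTER TABLE ... DROP COLUMN needs SQLite 3.35 or newer.
    conn.execute("ALTER TABLE users DROP COLUMN fullname")

def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
    conn.execute("INSERT INTO users (id, fullname) VALUES (1, 'Ada')")  # pre-migration row
    stage1_expand(conn)
    write_user(conn, 2, "Grace")  # row written by dual-writing code
    stage2_backfill(conn)
    names = [read_user(conn, 1), read_user(conn, 2)]
    if sqlite3.sqlite_version_info >= (3, 35, 0):
        stage4_contract(conn)
    columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
    return names, columns
```

Note that `write_user` would fail after `stage4_contract`, which is why the code that stops writing the old column has to ship before the drop.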
Per-stage failure modes
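Stage 1 fails: easy revert. The new column is nullable and nothing reads it yet, so it can be dropped or left in place.
Stage 2 fails: rerun the backfill. Write it to be idempotent so a rerun only touches rows that are still missing a value.
Stage 3 fails: revert the code. The old column is still being written, so reads from it remain correct.
Stage 4 fails: rare, but hard to undo. Once the column is dropped, its data is gone from the live database.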
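A backfill that is safe to rerun after a Stage 2 failure might look like the sketch below. It assumes the same hypothetical `users` table with `fullname` and `display_name` columns, uses SQLite for the sake of a runnable example, and picks the batch size arbitrarily. Each batch commits on its own, so a crash loses at most one batch of work and a rerun picks up only the rows that are still empty.

```python
import sqlite3

def backfill_in_batches(conn, batch_size=1000):
    """Copy fullname into display_name in small batches; return rows updated."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET display_name = fullname "
            "WHERE id IN ("
            "  SELECT id FROM users "
            # Skip rows with no source value, otherwise the loop would never end.
            "  WHERE display_name IS NULL AND fullname IS NOT NULL "
            "  LIMIT ?"
            ")",
            (batch_size,),
        )
        conn.commit()  # commit per batch to keep each transaction short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total
```

Small batches keep lock times short on a loaded table; the right size depends on the database and the workload.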
Rollback
Deploy each stage separately, so each has its own rollback. Run Stage 4 only after Stage 3 has soaked in production for a week.
Rolling back Stage 4 means restoring the dropped column’s data from backup, so take and verify a backup before running it.
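The one-week soak can be enforced by a small gate in the deploy tooling rather than by memory. The sketch below is one way to express it; the function name and the idea of recording a Stage 3 deploy timestamp are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

SOAK_PERIOD = timedelta(days=7)  # the one-week soak before Stage 4

def ready_for_contract(stage3_deployed_at, now=None, soak=SOAK_PERIOD):
    """Return True once Stage 3 has been live for at least the soak period."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - stage3_deployed_at >= soak
```

A Stage 4 migration script could call this first and refuse to run when it returns False.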
Antipatterns
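- One-step migration with downtime. The outage is user-visible.
- Stage 4 on the same day as Stage 3. With no soak, problems with the new column surface after the old one is gone.
- No backfill verification. Stage 3 then reads missing or wrong data.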
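Backfill verification can be as simple as counting rows where the two columns disagree before Stage 3 ships. The sketch below again assumes the hypothetical `users` table with `fullname` and `display_name`. It relies on SQLite's null-safe `IS NOT` comparison; PostgreSQL spells the same idea `IS DISTINCT FROM`.

```python
import sqlite3

def count_backfill_mismatches(conn):
    """Count rows whose new column does not match the old one (NULL-safe)."""
    row = conn.execute(
        # In SQLite, IS NOT treats two NULLs as equal, so rows where both
        # columns are NULL are not reported as mismatches.
        "SELECT COUNT(*) FROM users WHERE display_name IS NOT fullname"
    ).fetchone()
    return row[0]
```

A count of zero is the signal that Stage 3 can proceed; anything else means the backfill or the dual-write path needs another look.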
What to do this week
Three moves. (1) Apply this pattern to your most-loaded table. (2) Measure query latency and write throughput before and after each stage. (3) Document the result and any constraints you hit so the next refactor inherits the knowledge.
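For the measurement step, a small timing harness is enough to compare before and after. The sketch below is an assumption about how one might do it, using only the Python standard library; `operation` stands for any zero-argument callable that runs the query or write being measured.

```python
import statistics
import time

def measure_latency(operation, runs=100):
    """Time a callable `runs` times and report median and p95 in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "median_ms": statistics.median(samples),
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        "p95_ms": statistics.quantiles(samples, n=20)[18],
    }
```

Run it with the same operation and the same data volume before and after each stage so the numbers are comparable.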