Schema Migrations: The Zero-Downtime Pattern

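Schema migrations are among the most operationally dangerous deploys. The expand-contract pattern makes them safe by splitting one risky change into stages, each shipped as its own deploy with its own rollback path.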

Why naive migrations break

Naive schema migrations couple the schema change and the code change into one release. The two never land atomically, though: the schema changes at one moment and the new code finishes rolling out at another. That gap is the failure window. Production breaks in it because old code reads the new schema, or new code reads the old one.
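The failure window can be reproduced in a few lines. This is a minimal sketch using Python's built-in sqlite3 module. The `users` table, the `fullname` and `display_name` columns, and the function names are hypothetical and chosen only for illustration.

```python
import sqlite3


def old_code_read(conn):
    # Old application code: still expects the `fullname` column.
    return [row[0] for row in conn.execute("SELECT fullname FROM users ORDER BY id")]


def naive_rename(conn):
    # Naive migration: replace `fullname` with `display_name` in one step.
    # (Table rebuild is used so the sketch runs on any SQLite version.)
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name) SELECT id, fullname FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
    conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

    before = old_code_read(conn)  # works against the old schema

    naive_rename(conn)  # schema lands before the new code does

    try:
        old_code_read(conn)
        broke = False
    except sqlite3.OperationalError:
        # Old code is still running, but the column it reads no longer exists.
        broke = True
    return before, broke
```

Calling `demo()` shows the old read succeeding before the migration and raising `sqlite3.OperationalError` after it, which is the situation any instance still running old code is in during the window.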

Four-stage expand-contract
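
One common way to split a column rename into four stages is sketched below. The stage breakdown is an assumption: (1) expand by adding the new column, (2) dual-write and backfill, (3) switch reads to the new column, (4) contract by dropping the old column. The `users` table, the column names, and the function names are hypothetical, and SQLite stands in for a production database.

```python
import sqlite3


def stage1_expand(conn):
    # Additive change only: old code simply ignores the new nullable column.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")


def dual_write(conn, name):
    # Stage 2 application code: every write populates both columns.
    conn.execute(
        "INSERT INTO users (fullname, display_name) VALUES (?, ?)", (name, name)
    )


def stage2_backfill(conn):
    # Copy rows written before dual-writing started.
    conn.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")


def stage3_read(conn):
    # Stage 3 application code: reads come from the new column only.
    return [r[0] for r in conn.execute("SELECT display_name FROM users ORDER BY id")]


def stage4_contract(conn):
    # Destructive step: remove the old column. A table rebuild is used so the
    # sketch runs on any SQLite version; other databases would typically use
    # ALTER TABLE ... DROP COLUMN instead.
    conn.executescript(
        """
        CREATE TABLE users_new (id INTEGER PRIMARY KEY, display_name TEXT);
        INSERT INTO users_new (id, display_name) SELECT id, display_name FROM users;
        DROP TABLE users;
        ALTER TABLE users_new RENAME TO users;
        """
    )


def run_all():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
    conn.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

    stage1_expand(conn)
    dual_write(conn, "Grace Hopper")
    stage2_backfill(conn)
    names_before = stage3_read(conn)
    stage4_contract(conn)

    columns = [row[1] for row in conn.execute("PRAGMA table_info(users)")]
    return names_before, stage3_read(conn), columns
```

In this sketch, code written for the previous stage still works against the schema of the current one, so each stage can ship as a separate deploy. Only the last stage destroys data.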

Per-stage failure modes

Each stage has a distinct failure mode and recovery path. The asymmetry matters: early stages are cheap to revert, late stages are expensive. Plan soak time accordingly, and give a stage more time to soak when the stage that follows it is harder to undo.

Rollback

Each stage is its own deploy with its own rollback path. Run Stage 4 only after Stage 3 has soaked for at least a week. Rolling back Stage 4 means restoring data from backup, which is expensive enough that you want to avoid it.
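
The soak rule can be enforced with a small gate in the deploy tooling. This is a minimal sketch: the function name `may_run_stage4` and the idea of recording a Stage 3 deploy timestamp are assumptions, and the seven-day minimum comes from the "at least a week" guidance above.

```python
from datetime import datetime, timedelta, timezone

# Minimum soak for Stage 3 before the destructive Stage 4 is allowed.
MIN_SOAK = timedelta(days=7)


def may_run_stage4(stage3_deployed_at, now=None, min_soak=MIN_SOAK):
    """Return True only if Stage 3 has soaked for at least `min_soak`."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now - stage3_deployed_at >= min_soak
```

For example, with Stage 3 deployed on 1 January, the gate stays closed on 7 January and opens on 8 January. Pass timezone-aware timestamps so the subtraction is well defined.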

Antipatterns

What to do this week

Three moves. (1) Apply this pattern to your most-loaded table. (2) Measure query latency and write throughput before and after. (3) Document the win and the constraint so the next refactor inherits the knowledge.
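
For the measurement step, a small timing harness is often enough to get a first before/after comparison. This is a minimal sketch using only the standard library. The function names are hypothetical, and `before_op` and `after_op` stand for whatever query or write you run against the old and new schema.

```python
import statistics
import time


def median_latency_ms(operation, runs=50):
    # Time `operation` repeatedly and return the median in milliseconds.
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def compare(before_op, after_op, runs=50):
    # Measure both operations with the same number of runs.
    before = median_latency_ms(before_op, runs)
    after = median_latency_ms(after_op, runs)
    return {"before_ms": before, "after_ms": after, "delta_ms": after - before}
```

The median is used rather than the mean so that a single slow run does not dominate the result. For production decisions, measurements taken from real traffic are more trustworthy than a local harness like this one.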