Best Practices · Advanced · By Samson Tanimawo, PhD · Published Mar 10, 2026 · 7 min read

Database Migrations Without Downtime: The Patterns That Hold

Most schema changes can be zero-downtime if you decompose them into a sequence of compatible steps. The hard part is the discipline to actually do the sequence instead of one big migration during a maintenance window.

The cost of a downtime window

The midnight maintenance window is cheap until you count the engineers who stayed up, the customers in other time zones who were affected, and the inevitable thing that goes wrong because you only do this twice a year so nobody is fluent. Zero-downtime migrations cost more upfront and less every time after.

The math at scale. A team that migrates 10 services per year via maintenance windows runs 10 late-night sessions, each requiring 3-4 engineers and a Saturday. That's 120-160 engineer-hours plus customer disruption. The same team that does zero-downtime migrations runs them during business hours; ~40 engineer-hours total. The first ZDM is more expensive (you have to learn); migrations 2-10 are dramatically cheaper.

The cultural benefit. Teams that do ZDM stop fearing migrations. The fear of migrations is what blocks needed schema changes; teams accumulate years of "we should rename this column" tech debt because each rename requires a maintenance window. ZDM unlocks the schema work the team has been putting off.

Expand-contract

The pattern under almost every zero-downtime change. Step 1 (expand): add the new shape alongside the old one. Step 2: dual-write. Step 3: backfill. Step 4: dual-read with the new shape preferred. Step 5 (contract): remove the old shape. Each step is independently deployable and reversible.

The five-step pattern is what makes ZDM tractable. Each step is a small deploy that's easy to verify and easy to roll back. Compared with the alternative (a single big migration with the application code changed at the same time), expand-contract trades complexity-per-step for simplicity-of-each-step. The total work is similar; the risk is dramatically lower.

The discipline of separation. Each step is its own deploy. Each step is its own pull request. Each step is verified before moving to the next. Teams that try to combine steps to "save time" reintroduce the risk that expand-contract was designed to eliminate.
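As a sketch, here is what the five steps might look like for moving an `orders.status` column to a new `state` column (table and column names are hypothetical; each statement ships as its own deploy alongside the matching application change):

```sql
-- Step 1 (expand): add the new shape alongside the old one.
ALTER TABLE orders ADD COLUMN state text;   -- nullable: metadata-only

-- Step 2 (dual-write): application code writes both status and state.

-- Step 3 (backfill): copy existing rows (batched in practice; see the
-- backfill section below).
UPDATE orders SET state = status WHERE state IS NULL;

-- Step 4 (dual-read): application prefers state, falls back to status.

-- Step 5 (contract): once reads from state are stable, drop the old shape.
ALTER TABLE orders DROP COLUMN status;
```

Note that steps 2 and 4 are application deploys with no DDL at all; the schema only changes at the edges of the sequence.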

Adding a new column

The simplest case. Add the column nullable, deploy code that writes to it, backfill existing rows, then deploy code that reads from it. Postgres and most modern databases handle a nullable add as a metadata-only change.

The metadata-only optimisation is critical. On Postgres 10 and earlier (or with a volatile default such as now() on any version), adding a NOT NULL column with a default rewrites every row of the table; on a 100M-row table this can take hours and blocks writes. Adding a nullable column is metadata-only — milliseconds, no lock. Always add nullable first; promote to NOT NULL later via the technique below.
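A minimal illustration of the difference, on a hypothetical `events` table (assuming Postgres):

```sql
-- Metadata-only: milliseconds, no table rewrite, safe on a hot table.
ALTER TABLE events ADD COLUMN region text;

-- Risky on Postgres 10 and earlier, or with a volatile default on any
-- version: rewrites every row under an exclusive lock.
ALTER TABLE events ADD COLUMN region text NOT NULL DEFAULT 'eu-west-1';
```

On Postgres 11+ a constant default is stored as metadata and the second form is also fast, but the nullable-first sequence is the version-independent habit worth keeping.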

The backfill step is where teams cut corners. Bulk UPDATE on a hot table can lock for minutes. The right pattern is batched backfill — UPDATE 1000 rows at a time, sleep 100ms between batches, run for hours if needed. The backfill is slow on purpose; the slowness is what keeps the production load humming during the migration.
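A minimal sketch of one batch, assuming Postgres and a hypothetical `users.full_name` column being backfilled; a driver script repeats this statement, sleeping between batches, until it reports zero rows affected:

```sql
-- One batch: fill up to 1000 unpopulated rows, lowest ids first.
-- Batching by primary key keeps each batch's scan cheap.
WITH batch AS (
  SELECT id
  FROM users
  WHERE full_name IS NULL
  ORDER BY id
  LIMIT 1000
  FOR UPDATE SKIP LOCKED   -- don't stall behind concurrent writers
)
UPDATE users u
SET full_name = u.first_name || ' ' || u.last_name
FROM batch
WHERE u.id = batch.id;

-- Driver loop (outside SQL): run batch, pg_sleep(0.1), repeat until
-- the UPDATE affects 0 rows.
```

SKIP LOCKED means the backfill never queues behind application writes; rows it skips are picked up on a later pass.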

Renaming a column

Never rename in place on a hot table. Add the new column, dual-write to both, backfill the new column, switch reads to the new column, then drop the old. Five deploys. Worth it.

The reason in-place rename fails. ALTER TABLE ... RENAME COLUMN takes an ACCESS EXCLUSIVE lock on the table; on a busy table this means thousands of queries queuing or timing out for the duration. Even Postgres's metadata-only rename blocks all access while it waits for that lock, and application-side prepared statements (cached query plans) that reference the old name fail once the rename lands. The five-deploy sequence avoids both problems.

The dual-write pattern in detail. Application writes to BOTH old and new column. Reading prefers the new column when populated, falls back to the old. After backfill completes, switch read order. After confirming reads from new are stable, stop dual-writing. Finally, drop the old column. The sequence is mechanical; deviating from it usually reintroduces a race.
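Sketched for a hypothetical rename of `users.surname` to `users.last_name`, the five deploys look like this (the middle deploys are application changes, annotated as comments):

```sql
-- Deploy 1: add the new column (nullable, metadata-only).
ALTER TABLE users ADD COLUMN last_name text;

-- Deploy 2: application dual-writes both columns; reads prefer
-- last_name when populated, falling back to surname.

-- Deploy 3: backfill rows written before deploy 2 (batched in practice).
UPDATE users SET last_name = surname WHERE last_name IS NULL;

-- Deploy 4: reads use last_name only; after confirming stability,
-- stop writing surname.

-- Deploy 5: drop the old column once the new code has been stable.
ALTER TABLE users DROP COLUMN surname;
```

The fallback read in deploy 2 is what closes the race: rows not yet backfilled are still served correctly from the old column.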

Adding a NOT NULL constraint

The trap. Adding NOT NULL on an existing nullable column locks the table in many databases. Pattern: add a CHECK constraint with NOT VALID, validate it later (this is online), then promote to NOT NULL when validation completes. The whole sequence stays online.

The mechanism. CHECK ... NOT VALID adds the constraint but doesn't immediately scan all rows; new writes are validated against it. ALTER TABLE ... VALIDATE CONSTRAINT scans the rows online (it allows concurrent reads and writes), confirming each row passes. Once the constraint is validated, ALTER COLUMN SET NOT NULL (on Postgres 12+) uses the validated constraint as proof and skips the scan, making it metadata-only. The whole sequence avoids the table lock.
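In Postgres the sequence is three statements (table and column names hypothetical); none of them scans the table under an exclusive lock:

```sql
-- 1. Add the constraint without scanning existing rows; new writes
--    are checked against it immediately.
ALTER TABLE accounts
  ADD CONSTRAINT accounts_email_not_null
  CHECK (email IS NOT NULL) NOT VALID;

-- 2. Scan existing rows online; concurrent reads and writes proceed.
ALTER TABLE accounts VALIDATE CONSTRAINT accounts_email_not_null;

-- 3. Postgres 12+ sees the validated constraint as proof and skips
--    the scan, so this is metadata-only. The CHECK is then redundant.
ALTER TABLE accounts ALTER COLUMN email SET NOT NULL;
ALTER TABLE accounts DROP CONSTRAINT accounts_email_not_null;
```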

The discipline. Don't skip the NOT VALID step. The temptation is "I know all rows are populated, just SET NOT NULL directly." But Postgres scans the table to verify; on a large table that scan is a multi-minute lock. Always go through CHECK NOT VALID + VALIDATE CONSTRAINT first.

Splitting a wide table

Hardest of the common cases. Add the new tables, write to both, backfill, switch reads, remove the columns from the old table. Plan on a 2-4 week sequence for a busy table; do not try to compress it into a sprint.

Why splits take weeks. Each step requires production verification before moving to the next. The dual-write step requires the application to write consistently to both stores; a single missed write breaks invariants the next step depends on. The backfill of a 100M-row table runs for days. The read-switch requires careful monitoring. Compression of these steps is where outages live.

The verification discipline. After each step, write a verification query that confirms the invariant the step established. "After dual-write, every new row exists in both tables" is a SQL query the team can run continuously. The query should return zero discrepancies; if it returns any, the step isn't done.
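A hypothetical shape for such an invariant query, for a split that moved shipping fields from `orders` into a new `order_details` table; it should return zero rows, and any row it returns is a dual-write bug:

```sql
-- Rows created since dual-write began that are missing from the new
-- table, or that disagree on a migrated field.
SELECT o.id
FROM orders o
LEFT JOIN order_details d ON d.order_id = o.id
WHERE o.created_at >= '2026-03-01'   -- dual-write start date (example)
  AND (d.order_id IS NULL
       OR d.shipping_address IS DISTINCT FROM o.shipping_address);
```

IS DISTINCT FROM rather than <> matters here: it treats two NULLs as equal, so legitimately-empty fields don't show up as false discrepancies.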

When downtime is genuinely required

Two cases. Reorganising primary keys with foreign-key cascades. And database-engine swaps where the data shape simply does not map cleanly. Everything else has a path.

The primary-key change is genuinely hard because every foreign-key reference must update simultaneously. Online tools exist (pt-online-schema-change, gh-ost) but they have their own tradeoffs and don't handle all cases. For these, a maintenance window is honest.

The engine swap (Postgres → MySQL, MongoDB → DynamoDB) usually requires a custom migration tool. Even with dual-write, validating that the two stores agree on millions of rows is non-trivial. Plan a 3-6 month effort, including a parallel-run period where both engines serve reads to confirm equivalence.

Common antipatterns

The "rename in code, alias in DB" shortcut. An engineer creates a view that aliases the old name to the new name and treats the migration as done. The view is now part of the schema; removing it later is its own migration. Just do the five deploys.

Skipping the dual-read verification. Team backfills the new column and switches reads in the same deploy. The first time the new column is wrong, production is broken. Always have an interval where both are read and discrepancies are logged.

Migrations without rollback. The migration changes the schema irrevocably (e.g., DROP COLUMN). If the new code has a bug, rollback isn't possible. Drop columns LAST, after the new code has been stable for at least a sprint.

The big-bang weekend migration. Team plans to do all five steps over a weekend. Step 3 (backfill) takes longer than expected; team rushes to step 4; step 4 has a subtle bug; production is broken Monday morning. Steps must be paced; don't compress the schedule.

What to do this week

Three moves. (1) Identify your most-overdue schema change (the one your team has been putting off). Apply expand-contract; commit to the five-deploy sequence over the next month. (2) Document the team's ZDM playbook in your wiki. The first ZDM is hard; the second is easy when the playbook exists. (3) Add a query that monitors invariants for the in-progress migration. The query is the safety net that catches "step N is broken" before it becomes an outage.