Rollback vs Roll-Forward
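When a deploy fails in production there are two recovery strategies: roll back to the previous version, or roll forward with a fix.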
Rollback brings the old version back
Rollback restores the last known-good state. It is the fastest, most predictable recovery path when the failure is real and the fix is not obvious.
- Speed. The previous version is in the registry; deploy is mechanical; minutes to recovery.
- Predictability. The old version was serving production five minutes ago; behaviour is known.
- Required when. Bug is severe, the fix is non-trivial, or the team does not yet understand the failure.
- Cost. Forward progress lost; if the schema changed, data written under the new version may be unreadable by the old one.
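As a minimal sketch of the mechanical part, the snippet below picks a rollback target from a deploy history and flags the schema cost case. The `Deploy` record, its fields, and the version tags are hypothetical; a real pipeline would read this from its own deploy log or registry.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Deploy:
    version: str          # image tag in the registry (hypothetical naming)
    healthy: bool         # True if this release served production without incident
    schema_version: int   # schema the release expects (hypothetical field)


def pick_rollback_target(history: Sequence[Deploy]) -> Optional[Deploy]:
    """Return the most recent healthy deploy before the current one.

    `history` is ordered oldest to newest; the last entry is the failing
    release. Returns None when no known-good release exists.
    """
    for deploy in reversed(history[:-1]):
        if deploy.healthy:
            return deploy
    return None


def schema_changed(current: Deploy, target: Deploy) -> bool:
    """Flag the cost case: rolling back across a schema change."""
    return current.schema_version != target.schema_version


history = [
    Deploy("v1.4.0", healthy=True, schema_version=7),
    Deploy("v1.4.1", healthy=True, schema_version=7),
    Deploy("v1.5.0", healthy=False, schema_version=8),
]
target = pick_rollback_target(history)
if target is not None and schema_changed(history[-1], target):
    print(f"rollback to {target.version} crosses a schema change; check data written since")
```

Here the target is the last healthy release, and the schema check surfaces the one case where rollback is not purely mechanical.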
Roll-forward fixes the bug live
Roll-forward ships a fix on top of the failing version. It is the right call when rollback is harder than fixing, but it concentrates risk during an active incident.
- Best for. Trivial bugs with obvious fixes, schema-breaking changes that cannot easily revert, time-sensitive features.
- Risk. A second change goes through the deploy pipeline mid-incident; the second change can fail too.
- Stress factor. Under pressure, engineers ship buggy fixes; 'roll forward' becomes 'roll forward into a worse state'.
- Required reviewers. Pair-program the fix or get a second reviewer to sign off (four-eyes review), enforced more strictly than normal; do not skip CI.
How to decide
The default is rollback. Roll-forward is the exception, and the exception needs to be argued for explicitly in the incident channel.
- Default to rollback. The known-good state is safer than the unknown new state.
- Roll-forward when. Rollback would lose user data, schema cannot be reverted, or fix is genuinely trivial and reviewed.
- Document the decision. Note in the incident timeline why rollback was rejected; postmortems will reference it.
- Time-box. Set a 15-minute roll-forward window; if the fix is not landing cleanly, fall back to rollback.
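The rules above can be written down as a small decision function, which also yields the reason to record in the incident timeline. This is an illustrative sketch only: the `Incident` fields and function names are assumptions, and the real decision still belongs to the people in the incident channel.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Incident:
    rollback_loses_user_data: bool
    schema_revertible: bool
    fix_is_trivial: bool
    fix_is_reviewed: bool


ROLL_FORWARD_WINDOW_MIN = 15  # the time-box from the list above


def choose_strategy(incident: Incident) -> Tuple[str, str]:
    """Return (strategy, reason); the reason goes into the incident timeline."""
    if incident.rollback_loses_user_data:
        return "roll-forward", "rollback would lose user data"
    if not incident.schema_revertible:
        return "roll-forward", "schema cannot be reverted"
    if incident.fix_is_trivial and incident.fix_is_reviewed:
        return "roll-forward", "fix is trivial and reviewed"
    return "rollback", "default: known-good state is safer"


def still_in_window(elapsed_min: float) -> bool:
    """False once the roll-forward time-box has expired; fall back to rollback."""
    return elapsed_min < ROLL_FORWARD_WINDOW_MIN


strategy, reason = choose_strategy(
    Incident(rollback_loses_user_data=False, schema_revertible=True,
             fix_is_trivial=True, fix_is_reviewed=False)
)
print(strategy, "-", reason)
```

Note that a trivial but unreviewed fix still resolves to rollback: every roll-forward branch has to be satisfied explicitly, which mirrors the "argue for the exception" rule.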
Schema-aware rollback
Schema changes break the rollback story. Design migrations so code can roll back independently of data; the upfront cost pays back the first time you need it.
- Forward-compatible migrations. Add columns; never remove or rename them in the same release; old code still reads the new schema.
- Backward-compatible code. Do not drop columns until all readers have moved; deprecate first, delete later.
- Two-phase deploys. Schema change ships separately from code change; rollback of code does not require rollback of data.
- Cost. More PRs, longer migration windows; the rollback freedom is worth the friction.
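A small demonstration of the two-phase idea, using Python's built-in `sqlite3` module and an in-memory database. The `users` table, the `email` column, and the helper functions are invented for illustration; the point is only that an additive, nullable column lets old code keep reading and writing after the code is rolled back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")


def old_read(conn):
    """Old code: selects only the columns it knows about."""
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def old_write(conn, name):
    """Old code: inserts without the new column."""
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


def new_write(conn, name, email):
    """New code: uses the column added in phase 1."""
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


# Phase 1 (schema only): additive change. The column is nullable, so old
# writers that do not set it keep working.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Phase 2 (code only): the new release starts writing the column.
new_write(conn, "grace", "grace@example.com")

# Code rollback: the old release runs against the new schema, with no
# data rollback required.
old_write(conn, "linus")
rows = old_read(conn)
print(rows)
```

The old reader names its columns explicitly; code that relies on `SELECT *` and positional unpacking is one common way an otherwise additive migration still breaks old readers.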
Operational rules
Rollback only works if the team has done it before. Practice and automation turn it from a 3am scramble into a one-button operation.
- Practice quarterly. Rollback drill in staging; the first real rollback should not be the first attempt.
- Auto-rollback. SLO-regression-triggered rollback for low-blast-radius services; manual decision for high-blast-radius services.
- Runbook entry. Rollback procedure documented per service; on-call should not invent it during a sev1.
- Rollback budget. Track how often rollback fires; recurring rollbacks are a CI/test-quality signal, not a rollback problem.
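The auto-rollback and rollback-budget rules could be sketched roughly as follows. Everything here is an assumption made for illustration: the `Service` record, the "low"/"high" blast-radius labels, the use of error rate as the SLO signal, and the 0.01 tolerance; real thresholds depend on each service's SLO.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    blast_radius: str  # "low" or "high" (hypothetical classification)


def slo_regressed(baseline_error_rate: float, current_error_rate: float,
                  tolerance: float = 0.01) -> bool:
    """True when the error rate rose by more than `tolerance` (absolute)."""
    return current_error_rate - baseline_error_rate > tolerance


def next_action(service: Service, baseline: float, current: float) -> str:
    """Auto-rollback only for low-blast-radius services; otherwise page a human."""
    if not slo_regressed(baseline, current):
        return "none"
    if service.blast_radius == "low":
        return "auto-rollback"
    return "page-on-call"


def rollback_rate(rollbacks: int, deploys: int) -> float:
    """Share of deploys that ended in a rollback; a rising trend points at CI/test quality."""
    return rollbacks / deploys if deploys else 0.0


print(next_action(Service("thumbnails", "low"), baseline=0.002, current=0.05))
print(next_action(Service("payments", "high"), baseline=0.002, current=0.05))
```

Keeping the trigger and the policy separate makes it easy to tune the regression check per service without changing who gets to decide.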