Rollback vs Roll-Forward

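Two strategies for recovering from a bad deploy: redeploy the last known-good version, or ship a fix on top of the failing one.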

Rollback brings the old version back

Rollback restores the last known-good state. It is the fastest, most predictable recovery path when the failure is real and the fix is not obvious.

Speed. The previous version is in the registry; deploy is mechanical; minutes to recovery.
Predictability. The old version was serving production right up until the deploy; its behaviour is known.
Required when. Bug is severe, the fix is non-trivial, or the team does not yet understand the failure.
Cost. Forward progress is lost, and if the schema changed, data written under the new version may not be readable by the old one.
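
As a minimal sketch of the "mechanical" part, the rollback target can be picked from the deploy history rather than from memory. The `Deploy` record, its `healthy` flag, and the version strings are hypothetical; a real pipeline would read this from its own registry or deploy log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deploy:
    version: str
    healthy: bool  # hypothetical flag: this version served production without incident


def last_known_good(history: list[Deploy]) -> Optional[Deploy]:
    """Return the most recent healthy deploy before the current one.

    `history` is ordered oldest to newest; the last entry is the version
    that is live (and failing) right now, so it is never a candidate.
    """
    for deploy in reversed(history[:-1]):
        if deploy.healthy:
            return deploy
    return None  # nothing to roll back to


history = [
    Deploy("v1.4.0", healthy=True),
    Deploy("v1.4.1", healthy=True),
    Deploy("v1.5.0", healthy=False),  # current, failing version
]
target = last_known_good(history)
```

Here `target` is the `v1.4.1` record. A `None` result means there is no known-good version to return to, which is itself worth knowing before the incident rather than during it.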

Roll-forward fixes the bug live

Roll-forward ships a fix on top of the failing version. It is the right call when rollback is harder than fixing, but it concentrates risk during an active incident.

Best for. Trivial bugs with obvious fixes, schema-breaking changes that cannot easily be reverted, time-sensitive features.
Risk. A second change goes through the deploy pipeline mid-incident; the second change can fail too.
Stress factor. Under pressure, engineers ship buggy fixes; 'roll forward' becomes 'roll forward into a worse state'.
Required reviewers. Pair-program the fix or require a second reviewer, enforced more strictly than normal; do not skip CI.

How to decide

The default is rollback. Roll-forward is the exception, and the exception needs to be argued for explicitly in the incident channel.

Default to rollback. The known-good state is safer than the unknown new state.
Roll-forward when. Rollback would lose user data, the schema cannot be reverted, or the fix is genuinely trivial and has been reviewed.
Document the decision. Note in the incident timeline why rollback was rejected; postmortems will reference it.
Time-box. Set a 15-minute roll-forward window; if the fix is not landing cleanly, fall back to rollback.
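
The rules above can be sketched as a small decision helper. This is an illustration of the reasoning, not a replacement for the discussion in the incident channel; the `Incident` fields and function names are hypothetical, and the returned reason string is what would be copied into the incident timeline.

```python
from dataclasses import dataclass

ROLL_FORWARD_WINDOW_S = 15 * 60  # the 15-minute time-box, in seconds


@dataclass
class Incident:
    # Hypothetical fields; each maps to one criterion from the list above.
    rollback_loses_user_data: bool
    schema_revertible: bool
    fix_is_trivial: bool
    fix_is_reviewed: bool


def choose_strategy(incident: Incident) -> tuple[str, str]:
    """Return (strategy, reason). Rollback is the default."""
    if incident.rollback_loses_user_data:
        return "roll-forward", "rollback would lose user data"
    if not incident.schema_revertible:
        return "roll-forward", "schema cannot be reverted"
    if incident.fix_is_trivial and incident.fix_is_reviewed:
        return "roll-forward", "fix is trivial and reviewed"
    return "rollback", "default: known-good state is safer"


def should_abandon_roll_forward(elapsed_s: float, fix_landed: bool) -> bool:
    """True once the time-box has expired without the fix landing cleanly."""
    return not fix_landed and elapsed_s >= ROLL_FORWARD_WINDOW_S
```

Note that a trivial but unreviewed fix still resolves to rollback: `choose_strategy(Incident(False, True, True, False))` returns the default.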

Schema-aware rollback

Schema changes break the rollback story. Design migrations so code can roll back independently of data; the upfront cost pays back the first time you need it.

Forward-compatible migrations. Add columns rather than removing or renaming them, so old code can still read the new schema.
Backward-compatible code. Do not drop columns until all readers have moved; deprecate first, delete later.
Two-phase deploys. Schema change ships separately from code change; rollback of code does not require rollback of data.
Cost. More PRs, longer migration windows; the rollback freedom is worth the friction.
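
A small sketch of the two-phase idea, using Python's built-in `sqlite3` with an in-memory database. The `users` table, the `email` column, and the old/new functions are made up for illustration; the point is that after an additive migration, the old code keeps working against data written by the new code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (name) VALUES ('ada')")


def old_read(conn):
    # Old code names its columns explicitly and ignores anything added later.
    return conn.execute("SELECT id, name FROM users ORDER BY id").fetchall()


def old_write(conn, name):
    conn.execute("INSERT INTO users (name) VALUES (?)", (name,))


# Phase 1: schema change ships alone. The column is additive and nullable,
# so rows written by old code remain valid.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")


# Phase 2: new code ships separately and starts using the column.
def new_write(conn, name, email):
    conn.execute("INSERT INTO users (name, email) VALUES (?, ?)", (name, email))


new_write(conn, "grace", "grace@example.com")

# Simulated code rollback: the old functions run against the new schema
# without any data rollback.
old_write(conn, "linus")
rows_after_rollback = old_read(conn)
```

After the simulated rollback, the old reader sees all three rows, including the one written by the new code. Had the migration renamed or dropped `name` instead, `old_read` would fail, and rolling back the code would also require rolling back the data.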

Operational rules

Rollback only works if the team has done it before. Practice and automation turn it from a 3am scramble into a one-button operation.

Practice quarterly. Rollback drill in staging; the first real rollback should not be the first attempt.
Auto-rollback. SLO-regression-triggered rollback for low-blast-radius services; manual decision for high-blast-radius ones.
Runbook entry. Rollback procedure documented per service; on-call should not invent it during a sev1.
Rollback budget. Track how often rollback fires; recurring rollbacks are a CI/test-quality signal, not a rollback problem.
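
The auto-rollback rule can be sketched as a policy check. Everything here is an assumption for illustration: the `ServicePolicy` fields, the use of error rate as the SLO signal, and the requirement of three consecutive bad samples before acting. A real trigger would sit on top of whatever the monitoring system already exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServicePolicy:
    name: str
    low_blast_radius: bool  # hypothetical classification, set per service
    max_error_rate: float   # SLO threshold, e.g. 0.01 for 1% errors


def rollback_action(policy: ServicePolicy, error_rates: list[float],
                    min_samples: int = 3) -> str:
    """Decide what to do after a deploy, given post-deploy error-rate samples.

    Returns "none", "auto-rollback", or "page-on-call".
    """
    recent = error_rates[-min_samples:]
    if len(recent) < min_samples:
        return "none"  # not enough data to call it a regression
    # Require every recent sample to breach the SLO, to avoid reacting to one spike.
    regressed = all(rate > policy.max_error_rate for rate in recent)
    if not regressed:
        return "none"
    # Low blast radius: roll back automatically. Otherwise a human decides.
    return "auto-rollback" if policy.low_blast_radius else "page-on-call"


thumbnails = ServicePolicy("thumbnails", low_blast_radius=True, max_error_rate=0.01)
payments = ServicePolicy("payments", low_blast_radius=False, max_error_rate=0.01)
```

With sustained errors of 5%, `rollback_action(thumbnails, [0.05, 0.05, 0.05])` yields an automatic rollback, while the same samples for `payments` page the on-call engineer instead. Counting how often the `"auto-rollback"` branch fires is one simple way to feed the rollback budget.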