Rollback as the Default Incident Response
Most incidents tied to deploys. Rollback first, investigate after. The policy and the cases where it does not apply.
The rule
Default to rollback first when a deploy correlates with the incident. Investigation runs in parallel; restoration runs first. Waiting for full understanding before acting trades customer experience for analysis comfort, and customers do not benefit from the analysis until the system is back.
- Deploy correlation triggers rollback. Deploy-window check per incident. Recent deploy plus correlated symptoms equals immediate rollback.
- Investigation in parallel. Dual-track work per incident. Do not wait for full understanding to act.
- Production restored faster. Prioritised customer experience per incident. Drives MTTR reductions directly.
- Documented decision per rollback. Explicit "we rolled back because deploy X correlated" note. Supports postmortem clarity.
When not
Two cases break the default. Rollback would cause data loss (forward-only migrations, schema changes that drop columns), or the deploy contains a non-revertable security fix that cannot be undone safely. In both cases, document the alternative forward-fix path so the on-call has a documented path forward.
- Rollback would cause data loss. Schema-or-data-regression check per rollback. Forward-only migrations break the default.
- Security fix not revertable. Security-content check per rollback. Do not undo a vulnerability fix.
- Rollback-safety tag per deploy. Documented "safe to roll back" flag per deploy. Catches unsafe cases at decision time.
- Named alternative per rollback. Documented forward-fix path when rollback is unsafe. Catches "we could not rollback so we did nothing."
Test rollback regularly
Untested rollback is theatre. The moment of crisis is the wrong time to discover the rollback procedure does not actually work. Quarterly non-prod drills verify the procedure, capture timing, surface degrading performance early.
- Quarterly non-prod rollback. Synthetic rollback exercise per quarter. Verify the procedure works.
- Untested rollback is theatre. No-untested-rollback rule per team. Crisis is the wrong time to discover it does not work.
- Documented test runs. Captured timing and outcome per test. Catches degrading rollback performance over time.
- Named test owner per quarter. Responsible engineer per quarter. Catches "we forgot to drill this quarter."