Rollback as the Default Incident Response

Most incidents tied to deploys. Rollback first, investigate after. The policy and the cases where it does not apply.

The rule

Default to rollback first when a deploy correlates with the incident. Investigation runs in parallel; restoration runs first. Waiting for full understanding before acting trades customer experience for analysis comfort, and customers do not benefit from the analysis until the system is back.

When not

Two cases break the default. Rollback would cause data loss (forward-only migrations, schema changes that drop columns), or the deploy contains a non-revertable security fix that cannot be undone safely. In both cases, document the alternative forward-fix path so the on-call has a documented path forward.

Test rollback regularly

Untested rollback is theatre. The moment of crisis is the wrong time to discover the rollback procedure does not actually work. Quarterly non-prod drills verify the procedure, capture timing, surface degrading performance early.