Rollback Validation: Did It Work?
A rollback isn't done until it's verified. Validation has three parts: metrics, smoke tests, and a monitoring window.
Metrics
Validation starts with the metric that triggered the rollback. Confirm it has returned to baseline within minutes, not hours, before declaring resolution. Slow recovery suggests the rollback did not address the actual cause; declaring "resolved" on partial recovery is how incidents come back ten minutes later with the same shape.
- Affected metric to baseline. Confirm on every rollback that the trigger metric is back at pre-incident levels. This keeps resolution data-driven.
- Within minutes, not hours. Expect fast recovery after every rollback. Slow recovery signals the rollback did not fix the cause.
- Dashboard link in the incident channel. Post the metric chart for every rollback. Everyone then works from the same data.
- SLO impact closed. Recalculate the error budget after every rollback. This keeps SLO accounting honest.
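The baseline check above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the function name `check_recovery`, the 5% tolerance band, and the 10-minute limit are all assumptions chosen for the example, and the samples are expected to come from whatever metrics system is already in use.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class RecoveryResult:
    recovered: bool                      # metric ended the series inside the baseline band
    minutes_to_recover: Optional[float]  # start of the final in-band run, if any
    fast_enough: bool                    # recovered within the allowed number of minutes


def check_recovery(
    samples: Sequence[Tuple[float, float]],
    baseline: float,
    tolerance: float = 0.05,    # assumed: within 5% of baseline counts as recovered
    max_minutes: float = 10.0,  # assumed: "minutes, not hours" cutoff
) -> RecoveryResult:
    """Check (minutes_since_rollback, value) samples, given in time order."""
    band = abs(baseline) * tolerance
    low, high = baseline - band, baseline + band

    recovered_at: Optional[float] = None
    for minutes, value in samples:
        if low <= value <= high:
            if recovered_at is None:
                recovered_at = minutes  # start of a candidate recovery
        else:
            recovered_at = None  # left the band again: partial recovery does not count

    if recovered_at is None:
        return RecoveryResult(False, None, False)
    return RecoveryResult(True, recovered_at, recovered_at <= max_minutes)


if __name__ == "__main__":
    # Error rate (percent) sampled each minute after a rollback; baseline is 1.0.
    series = [(0, 9.0), (1, 4.0), (2, 1.02), (3, 1.0), (4, 0.99)]
    print(check_recovery(series, baseline=1.0))
```

Resetting `recovered_at` whenever the metric leaves the band is deliberate: a series that touches baseline and then climbs again is reported as not recovered, which matches the warning about declaring resolution on partial recovery.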
Smoke tests
Smoke tests catch what the metric misses. The trigger metric returning to baseline does not prove the user-facing flow works; some bugs stay below the alerting threshold yet still break a flow such as checkout. After every rollback, run the affected user flows, synthetic before real where possible.
- Run affected user flows. Exercise them manually or synthetically after every rollback, using real users or a synthetic check that exercises the same path.
- Catches subtle regressions. Check the user experience for each flow. The metric can return to baseline while the flow is still subtly broken.
- Named flow set per service. Document the critical-path tests for each service. This catches "I forgot to check checkout" mistakes.
- Synthetic before real. Canary on internal traffic after every rollback. This reduces customer impact during validation.
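One way to make the named flow set and the synthetic-before-real ordering concrete is a small runner like the one below. It is a sketch under stated assumptions: the `storefront` service, its flow names, and the `run_flow` callback are hypothetical placeholders for whatever test harness a team already has.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical named flow set per service; real names would come from the
# service's own documentation.
CRITICAL_FLOWS: Dict[str, List[str]] = {
    "storefront": ["login", "search", "checkout"],
}

# Synthetic traffic is exercised first; real traffic only if synthetic passes.
STAGES = ("synthetic", "real")

# A flow runner takes (flow_name, stage) and returns True when the flow works.
FlowRunner = Callable[[str, str], bool]


def run_smoke_tests(
    service: str,
    run_flow: FlowRunner,
    flows: Dict[str, List[str]] = CRITICAL_FLOWS,
) -> Tuple[bool, Dict[Tuple[str, str], bool]]:
    """Run every named flow for a service, one stage at a time."""
    results: Dict[Tuple[str, str], bool] = {}
    for stage in STAGES:
        stage_ok = True
        for flow in flows[service]:
            ok = run_flow(flow, stage)
            results[(flow, stage)] = ok
            stage_ok = stage_ok and ok
        if not stage_ok:
            # Stop before the next stage so a broken flow never reaches real users.
            return False, results
    return True, results


if __name__ == "__main__":
    # Stand-in runner that pretends checkout is still broken after the rollback.
    passed, report = run_smoke_tests("storefront", lambda flow, stage: flow != "checkout")
    print(passed, report)
```

Every flow in a stage runs even after one fails, so the report shows the full picture for that stage; only the move to the next stage is blocked.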
Monitor
Monitoring after rollback is its own discipline. Some regressions emerge slowly via cache effects, queue drains, or downstream impact, so the watch must extend past metric recovery. A named monitor owner plus temporarily tighter post-rollback alarms catch bugs that resurface in the first 30 minutes.
- 30+ minutes after rollback. Keep an extended watch window on every rollback. Do not declare resolved on the first minute of recovery.
- Slow regressions watched. Watch for latent issues after every rollback. Cache effects, queue drains, and downstream impact emerge over time.
- Named monitor owner per rollback. Assign one responsible engineer to each rollback. This catches "I assumed you were watching" gaps.
- Post-rollback alarms tightened. Temporarily tighten alarms after every rollback. This catches resurfacing bugs early.
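The watch window, named owner, and tightened alarms can be combined into one small decision function. The sketch below is illustrative only: `WatchPlan`, `evaluate_watch`, and the 0.5 tightening factor are assumptions made for the example, and it assumes a metric where higher is worse, such as an error rate.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class WatchPlan:
    owner: str                    # the named engineer watching this rollback
    normal_threshold: float       # the alarm threshold used day to day
    window_minutes: float = 30.0  # minimum watch window after the rollback
    tighten_factor: float = 0.5   # assumed: alarm at half the normal threshold

    @property
    def tightened_threshold(self) -> float:
        return self.normal_threshold * self.tighten_factor


def evaluate_watch(plan: WatchPlan, samples: Sequence[Tuple[float, float]]) -> str:
    """Classify (minutes_since_rollback, value) samples for a higher-is-worse metric."""
    if not plan.owner:
        return "no-owner"  # nobody is accountable for watching
    if any(value > plan.tightened_threshold for _, value in samples):
        return "resurfaced"  # the tightened alarm would have fired
    last_seen = max((minutes for minutes, _ in samples), default=0.0)
    if last_seen < plan.window_minutes:
        return "keep-watching"  # too early to declare resolved
    return "resolved"


if __name__ == "__main__":
    plan = WatchPlan(owner="on-call engineer", normal_threshold=2.0)
    # 1.5 is under the normal threshold (2.0) but over the tightened one (1.0).
    print(evaluate_watch(plan, [(1, 0.4), (12, 1.5), (30, 0.3)]))
```

The example input shows why the temporary tightening matters: a value of 1.5 would pass the everyday alarm unnoticed, but under the tightened threshold it is flagged as a resurfacing problem.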