Deploy Rollback Policy: The 30-Second Test
If your rollback takes longer than 30 seconds, your incidents are larger than they need to be. The fix is mechanical.
Why 30 seconds
Rollback time sets the floor on incident duration: an incident mitigated by rollback lasts at least as long as detection plus rollback. A 30-second rollback keeps incidents to minutes; a 30-minute rollback stretches them to hours. Most teams test rollback once a year, and the measured time is almost always longer than expected.
- Rollback bounds incident duration. The math is mechanical: if rollback takes T, incident duration ≥ T plus detection time.
- A 30-second rollback keeps incidents to minutes. Detection plus rollback fits inside a single SLO burn window; the user barely notices.
- A 30-minute rollback stretches incidents to hours. Detection plus rollback plus recovery exceeds most fast-burn SLO windows; the user notices.
- Most teams test once a year. The number is always larger than expected; testing more often shrinks it.
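The bound above can be worked through with a few lines of Python. This is a minimal sketch, not a model of any real incident: the function name `min_incident_seconds` and the 5-minute detection delay are assumptions chosen for illustration, and the result is only a floor, since recovery time comes on top of it.

```python
def min_incident_seconds(detection_s: float, rollback_s: float) -> float:
    """Lower bound on incident duration when rollback is the mitigation."""
    if detection_s < 0 or rollback_s < 0:
        raise ValueError("durations must be non-negative")
    return detection_s + rollback_s


# Assumed for illustration: the same 5-minute detection delay in both cases.
DETECTION_S = 5 * 60

fast = min_incident_seconds(DETECTION_S, 30)       # 30-second rollback
slow = min_incident_seconds(DETECTION_S, 30 * 60)  # 30-minute rollback

print(f"30-second rollback: at least {fast / 60:.1f} min")
print(f"30-minute rollback: at least {slow / 60:.1f} min")
```

With identical detection time, the rollback term alone accounts for the whole gap between the two floors.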
Four properties of fast rollback
1. Single command. One memorable invocation, not a multi-step procedure.
2. No human judgment. The rollback step requires no decisions mid-incident.
3. Idempotent. Safe to re-run.
4. Verifiable. Confirms recovery automatically.
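The four properties can be sketched as a single function. This is an illustrative sketch under stated assumptions: `Service` is a hypothetical in-memory stand-in for a deployed service, and a real rollback would call your deploy tooling and a real health check instead.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical in-memory stand-in for a deployed service."""
    current: str
    previous: str
    healthy_versions: frozenset

    def is_healthy(self) -> bool:
        # Stand-in for a real health check.
        return self.current in self.healthy_versions


def rollback(service: Service) -> bool:
    """One call (property 1) that returns True if the service recovered."""
    target = service.previous      # fixed target, no decision to make (property 2)
    if service.current != target:  # already rolled back means no-op (property 3)
        service.current = target
    return service.is_healthy()    # recovery is checked automatically (property 4)


svc = Service(current="v2", previous="v1", healthy_versions=frozenset({"v1"}))
first = rollback(svc)
second = rollback(svc)  # re-running changes nothing
print(svc.current, first, second)
```

The design choice worth noting is that verification is part of the return value, so the caller never has to check health as a separate manual step.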
Rehearsal cadence
Quarterly: run a real rollback in staging from a representative state. Time it.
Annually: run a real rollback in production during a low-traffic window.
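Timing a rehearsal can be as simple as wrapping the rollback call in a stopwatch and comparing it to the budget. The sketch below assumes a hypothetical `run_rollback` callable that performs the rollback and returns when it is done; the `time.sleep` call stands in for a real rollback.

```python
import time
from typing import Callable, Tuple

ROLLBACK_BUDGET_S = 30.0  # the 30-second target from this policy


def time_rollback(
    run_rollback: Callable[[], None],
    budget_s: float = ROLLBACK_BUDGET_S,
) -> Tuple[float, bool]:
    """Run the rollback once; return (elapsed seconds, within budget?)."""
    start = time.perf_counter()
    run_rollback()
    elapsed = time.perf_counter() - start
    return elapsed, elapsed <= budget_s


# Stand-in for a real rollback: sleeps briefly instead of touching anything.
elapsed, within_budget = time_rollback(lambda: time.sleep(0.01))
print(f"rollback took {elapsed:.3f}s, within budget: {within_budget}")
```

Recording the elapsed time from each rehearsal gives you a trend line, which is what shows drift between rehearsals.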
Policy that prevents drift
Policy: any deploy without a defined rollback path requires explicit approval before it ships.
Slow rollbacks signal complex deploys; fix the complexity upstream in the deploy itself, not in the rollback path.
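A policy like this is easiest to keep when a pipeline step enforces it. The following is a minimal sketch of such a gate; the `Deploy` record and its fields (`rollback_command`, `approved_by`) are hypothetical names, not part of any particular CI system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deploy:
    """Hypothetical deploy record; field names are illustrative."""
    name: str
    rollback_command: Optional[str] = None
    approved_by: Optional[str] = None


def may_ship(deploy: Deploy) -> bool:
    """Allow a deploy if it has a rollback path or an explicit approval."""
    has_rollback = bool(deploy.rollback_command and deploy.rollback_command.strip())
    has_approval = bool(deploy.approved_by and deploy.approved_by.strip())
    return has_rollback or has_approval


print(may_ship(Deploy("api", rollback_command="deploy rollback api")))  # True
print(may_ship(Deploy("batch")))                                        # False
print(may_ship(Deploy("batch", approved_by="release-manager")))         # True
```

Logging every approval-based exception makes it visible how often deploys ship without a rollback path.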
Antipatterns
- Manual rollback steps. Slow; error-prone at 3am.
- Untested rollback. Discovered broken on incident day.
- Rollback that requires a senior engineer. Knowledge concentration risk.
What to do this week
Three moves. (1) Apply the 30-second test to one pipeline first. (2) Measure deploy frequency and MTTR before and after. (3) Document the outcome so the next team starts from data.
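For the measurement step, MTTR is just the mean of recovery durations over a period. The sketch below assumes you can export recovery times in minutes from your incident tracker; the numbers are placeholders for illustration, not real results.

```python
from statistics import mean
from typing import Sequence


def mttr_minutes(recovery_minutes: Sequence[float]) -> float:
    """Mean time to recovery over a set of incidents, in minutes."""
    if not recovery_minutes:
        raise ValueError("need at least one incident to compute MTTR")
    return mean(recovery_minutes)


# Placeholder data for illustration only, not measurements.
before = [40, 60, 50]
after = [5, 10, 6]

print(f"MTTR before: {mttr_minutes(before)} min")
print(f"MTTR after:  {mttr_minutes(after)} min")
```

Comparing the same pipeline over equal-length periods keeps the before and after numbers comparable.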