Deploy Rollback Discipline
Rollback should be one command.
Speed targets
Rollback speed is customer-impact duration. Every minute the bad version is live is a minute of customer pain; the speed bar should be a published number, not a vibe.
- Under 60 seconds for application deploys. Anything slower turns a small regression into an extended outage; treat 60s as the bar, not the goal.
- Under 5 minutes for infrastructure rollback. Database migrations and config changes roll back more slowly; bound the target so the team plans for it.
- Untested rollback is theatre. The first rollback should not happen in production. Drills validate the runbook before incidents need it.
- Published speed targets per team. Visible numbers force honest progress tracking; targets that nobody reads do not improve.
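One way to turn the targets above into a published number is to encode them and check every measured rollback against them. The following is a minimal sketch: the 60-second and 5-minute values come from the list above, while the `TARGETS_SECONDS` keys, `RollbackResult`, and `meets_target` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Targets from the list above, in seconds.
# The keys are hypothetical labels, not an established taxonomy.
TARGETS_SECONDS = {
    "application": 60,      # under 60 seconds for application deploys
    "infrastructure": 300,  # under 5 minutes for infrastructure rollback
}


@dataclass(frozen=True)
class RollbackResult:
    deploy_type: str
    duration_seconds: float


def meets_target(result: RollbackResult) -> bool:
    """Return True if the rollback finished strictly under its target."""
    return result.duration_seconds < TARGETS_SECONDS[result.deploy_type]


if __name__ == "__main__":
    drill = RollbackResult(deploy_type="application", duration_seconds=45.0)
    print(meets_target(drill))
```

The comparison is strict because the targets are phrased as "under" a bound; a rollback that takes exactly 60 seconds misses the application bar.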
Automation patterns
Three automation patterns cover almost every rollback case: one-command rollback, auto-rollback on metric breach, and human override always available alongside both.
- One-command rollback. kubectl rollout undo, vendor API, or scripted equivalent; the team should not need to remember syntax during an incident.
- Auto-rollback on metric breach. Argo Rollouts, Flagger, or vendor canary tools watching SLO metrics; catches regressions before humans see them.
- Human override always available. Auto-rollback can be wrong; the on-call IC needs a manual escape valve that beats automation.
- Documented automation per deploy. Named rules, named thresholds, named owners; "we have auto-rollback" without the spec is half a feature.
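The patterns above can be sketched together in a few lines. This is an illustration under stated assumptions, not how Argo Rollouts or Flagger work internally: `RollbackRule`, `decide`, and `rollback` are hypothetical names, and the decision logic is a plain threshold check. The only external command used is `kubectl rollout undo`, wrapped so nobody has to recall the syntax during an incident.

```python
import subprocess
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Action(Enum):
    HOLD = "hold"
    ROLLBACK = "rollback"


@dataclass(frozen=True)
class RollbackRule:
    """A documented rule: named rule, named threshold, named owner."""
    name: str
    metric: str
    threshold: float
    owner: str


def rollback_command(deployment: str, namespace: str = "default") -> List[str]:
    """Build the one-command rollback for a Kubernetes Deployment."""
    return ["kubectl", "rollout", "undo",
            f"deployment/{deployment}", "--namespace", namespace]


def decide(rule: RollbackRule, observed: float,
           override: Optional[Action] = None) -> Action:
    """Pick an action; a human override always beats the automated rule."""
    if override is not None:
        return override
    return Action.ROLLBACK if observed > rule.threshold else Action.HOLD


def rollback(deployment: str, namespace: str = "default") -> None:
    """Run the rollback; raises CalledProcessError if kubectl fails."""
    subprocess.run(rollback_command(deployment, namespace), check=True)
```

Checking the override first is the point of the design: automation can be wrong, so the on-call IC's decision wins in both directions, whether that means forcing a rollback or holding one back.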
Testing rollback
Quarterly drills are how rollback stays real. The procedure decays without exercise; the first time you run it cannot be the time the customer is paged.
- Quarterly drills. Non-prod rollback exercise on a fixed cadence; verify the procedure works and the team remembers it.
- Documented procedure per deploy type. Step-by-step copy-pasteable runbook tested in the drill, kept current as deploys evolve.
- Rollback failure is an incident. If a rollback fails in production, the postmortem is mandatory; broken rollbacks are higher severity than the original deploy bug.
- Captured timing per drill. Duration measurement on every drill; degrading rollback performance is an early warning.
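Captured drill timings are only an early warning if something compares them. A minimal sketch, assuming durations are recorded in seconds in drill order: `is_degrading` is a hypothetical helper, and the 20% tolerance is an arbitrary illustrative value that each team would choose for itself.

```python
from typing import Sequence


def is_degrading(durations_seconds: Sequence[float],
                 tolerance: float = 1.2) -> bool:
    """Flag the latest drill if it exceeds the mean of earlier drills
    by more than the tolerance factor (1.2 means 20% slower)."""
    if len(durations_seconds) < 2:
        return False  # no baseline to compare against yet
    earlier = durations_seconds[:-1]
    latest = durations_seconds[-1]
    baseline = sum(earlier) / len(earlier)
    return latest > baseline * tolerance


if __name__ == "__main__":
    # Four quarterly drills; the last one is noticeably slower.
    print(is_degrading([40.0, 42.0, 41.0, 58.0]))
```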
Constraints and trade-offs
Three constraints limit rollback in real systems: schema migrations, external API contracts, and stateful services. Plan accordingly so the deploy that needs to roll back can.
- Database migrations limit rollback. Forward-compatible migrations (add column, dual-read/write, enforce, remove old) preserve rollback ability; destructive migrations end it.
- External API contracts limit rollback. A breaking change to a public API cannot be un-broken; downstream consumers that have moved to the new contract cannot simply return to the old one.
- Stateful services may not roll back cleanly. Once state has migrated, roll-forward is sometimes the only path; acknowledge it in the deploy plan.
- Documented rollback path per deploy. Explicit "can we roll back this change" check at deploy time; surprise during incidents is the failure mode to avoid.
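The "can we roll back this change" check can be made explicit at deploy time. The sketch below is a simplification under stated assumptions: `DeployPlan`, its three flags, and `rollback_path` are hypothetical, and real deploys will need finer-grained answers than three booleans. It maps each constraint from the list above to a reason the deploy may be roll-forward only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class RollbackPath(Enum):
    ROLL_BACK = "roll back"
    ROLL_FORWARD_ONLY = "roll forward only"


@dataclass(frozen=True)
class DeployPlan:
    destructive_migration: bool = False       # e.g. drops a column still in use
    breaking_public_api_change: bool = False  # consumers cannot return to the old contract
    state_migrated: bool = False              # stateful service has moved its state


def rollback_path(plan: DeployPlan) -> Tuple[RollbackPath, List[str]]:
    """Return the documented path plus the reasons that block a rollback."""
    reasons = []
    if plan.destructive_migration:
        reasons.append("destructive database migration")
    if plan.breaking_public_api_change:
        reasons.append("breaking change to a public API contract")
    if plan.state_migrated:
        reasons.append("service state already migrated")
    path = RollbackPath.ROLL_FORWARD_ONLY if reasons else RollbackPath.ROLL_BACK
    return path, reasons
```

Returning the reasons alongside the path keeps the answer out of the surprise category: whoever is on call sees why a rollback is unavailable before the incident, not during it.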
Operating rollback discipline
Rollback discipline is operational, not theoretical. Track the metrics, run the drills, and treat regression rollbacks as inputs to deploy hardening.
- Time-to-rollback metric. Per-deploy timer; track p95 alongside MTTR. Slow rollback is its own problem to solve.
- Rollback rate. Weekly rollback-deploy count; healthy is low and predictable, trending up means deploy quality is degrading.
- Per-rollback postmortem when caused by regression. Each regression rollback teaches the deploy gate where the test was missing.
- Quarterly rollback retro. Cause review across the quarter; patterns drive deploy-pipeline improvements.
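The two headline metrics above are cheap to compute once rollbacks are recorded. A minimal sketch, assuming each rollback has a duration in seconds and a calendar date: `p95` uses the nearest-rank definition (other percentile definitions give slightly different values), and both function names are hypothetical.

```python
from collections import Counter
from datetime import date
from typing import Dict, Iterable, Sequence, Tuple


def p95(durations_seconds: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of time-to-rollback."""
    if not durations_seconds:
        raise ValueError("need at least one rollback duration")
    ordered = sorted(durations_seconds)
    # Integer form of ceil(0.95 * n), which avoids float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def weekly_rollback_counts(
        rollback_dates: Iterable[date]) -> Dict[Tuple[int, int], int]:
    """Count rollbacks per (ISO year, ISO week)."""
    counts: Counter = Counter()
    for day in rollback_dates:
        iso = day.isocalendar()
        counts[(iso[0], iso[1])] += 1
    return dict(counts)


if __name__ == "__main__":
    print(p95([30.0, 45.0, 38.0, 120.0, 41.0]))
    print(weekly_rollback_counts([date(2024, 1, 1), date(2024, 1, 8)]))
```

Grouping by ISO week keeps the count comparable from week to week, which is what makes an upward trend in rollback rate visible.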