Rollback Strategies: What Actually Reverts a Bad Deploy
The rollback button you have not tested in six months is not a rollback button. It is a hope. Real rollback is a specific sequence the team has done in production this quarter.
The rollback drill nobody does
Most teams have a rollback procedure on paper and have not executed it in production in months. The first real attempt under pressure inevitably reveals that the procedure depended on a credential that expired, a snapshot that was deleted, or a script that was rewritten without testing. The procedure was theatre. The actual rollback is improvisation.
The cost of improvisation under pressure. A planned rollback takes 2-5 minutes; an improvised one takes 20-60. The difference is multiplied by the customer impact during those extra minutes. A single botched rollback during a busy outage can cost tens of thousands of dollars in lost customer trust beyond what a tested procedure would have.
The fix is a quarterly rollback drill. Pick a non-critical service, deploy a tagged "rollback me" version, page yourselves, and execute the procedure. The cost is one quiet afternoon. The savings are every real outage you handled in 8 minutes instead of 40.
Binary revert
Re-deploy the previous version. The simplest rollback, and the only one that does not need a separate playbook. The limit: it does not undo data changes. If the new version migrated the database, re-deploying the previous version leaves last release's code talking to this release's schema. Often broken.
The schema-mismatch failure. Most teams discover this during their first rollback drill. The new code added a column the old code doesn't write; the old code's queries now leave the column NULL, breaking invariants. Or the new code dropped a column the old code reads; the old code now errors. Both are real, both have happened in production at most teams.
The mitigation. Database migrations stay backward-compatible for at least one version. You don't drop the old column when you add the new; you drop it three releases later. The discipline is what makes binary revert work safely; without it, binary revert is a foot-gun.
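This backward-compatibility discipline can be demonstrated concretely. The sketch below runs an "expand" migration against an in-memory SQLite database; the table and column names are hypothetical, and the point is that the old code's INSERT keeps working after a binary revert because the new column is nullable with a default.

```python
# Sketch: a backward-compatible "expand" migration, shown against an
# in-memory SQLite database. Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)")

# Release N ships the expand step: add the new column with a default,
# so the previous version's INSERT (which never mentions it) still
# works after a binary revert.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")

# Old code path, unchanged: still valid against the expanded schema.
conn.execute("INSERT INTO orders (total_cents) VALUES (1999)")

row = conn.execute("SELECT total_cents, currency FROM orders").fetchone()
print(row)  # → (1999, 'USD'): the default fills the column old code never writes
```

Dropping the old column is the "contract" step, and it ships releases later, once no deployable version still reads it.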
Traffic shift
You kept the old version running, and the rollback is a load-balancer change to send traffic back. Faster than re-deploy, narrower blast radius. Works when you have blue/green capacity. Cost: you are paying for two environments.
The blue/green pattern in detail. Two complete environments (blue and green); active environment serves traffic; inactive one is standby. Deploy lands in the inactive one; traffic flips when ready. Rollback is the same flip in reverse — instant, no rebuild required.
The cost of running two full-size environments is large. Many teams run blue/green only for the application tier and share the data layer; the rollback then requires the data layer to support both versions. Others keep a scaled-down standby (a smaller fleet) and scale up before the flip, which adds a 5-10 minute warmup. Choose based on your RTO target and infrastructure budget.
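The flip itself can be sketched in a few lines. The "load balancer" here is a plain dict standing in for whatever your infrastructure uses (a target group, a DNS weight, a router rule); the version strings are made up. The property worth noticing is that cutover and rollback are the same atomic operation, run in opposite directions.

```python
# Sketch of the blue/green flip as a single pointer swap.
environments = {"blue": "v42", "green": "v43"}  # hypothetical versions
lb = {"active": "blue"}

def flip(lb):
    # Deploy lands in the inactive side; cutover and rollback are the
    # same one-step operation.
    lb["active"] = "green" if lb["active"] == "blue" else "blue"
    return lb["active"]

flip(lb)                                   # cutover: traffic hits green (v43)
assert environments[lb["active"]] == "v43"
flip(lb)                                   # rollback: same flip, back to blue (v42)
assert environments[lb["active"]] == "v42"
```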
Feature-flag flip
The fix is one config change away. Works because you wrapped the new behaviour in a flag. The fastest rollback when it applies; when it does not (e.g., the bug is in code that runs regardless of flag state), it does not help.
The discipline. Wrap risky changes in flags by default; not just product features but also performance optimisations, third-party-library swaps, infrastructure migrations. Each flag is a future rollback option. The flag inventory grows; that's fine — flags are cheap, the protection is valuable.
The flag's cleanup discipline. After a successful 100% rollout that's been stable for a sprint, the flag should be removed and the old code path deleted. Without cleanup, flags accumulate as technical debt; with cleanup, they stay as a tool that's used and then put away.
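The flag-guarded code path looks like this in miniature. The dict stands in for whatever config service you use, and the function names are hypothetical; what matters is that the known-good path stays alive in the binary, so the rollback is one config write with no deploy.

```python
# Sketch of a flag-guarded code path. The flag store is a dict standing
# in for a real config service; names are hypothetical.
flags = {"use_new_pricing": True}

def old_pricing(cents):
    return cents              # known-good path, kept alive behind the flag

def new_pricing(cents):
    return cents + 50         # imagine this turns out to be the bug

def price(cents):
    if flags.get("use_new_pricing", False):
        return new_pricing(cents)   # risky new behaviour
    return old_pricing(cents)

# Rollback is one config change, no deploy:
flags["use_new_pricing"] = False
assert price(100) == 100      # traffic is back on the old path immediately
```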
Schema reverse
The hard one. The migration that rolled forward needs to roll backward. The expand-contract pattern lets you stop at any step; rollback is "undo the last step." Pure forward-only migrations have no clean rollback path; you are choosing between data loss and forward fixes.
The expand-contract advantage. At each step, the previous version's code still works against the current schema (you maintained backward compatibility). Rollback is just deploying the previous code; the schema's expanded shape supports both old and new code. Without expand-contract, the new code's schema changes are entangled with code changes, and rollback breaks one or the other.
The hard cases. A migration that destroyed data (dropped a column, deleted rows). Restoring the data requires a backup; the backup has cost (time to restore) and may have lag (last backup was 4 hours ago, you'll lose 4 hours of writes). Plan for these specifically; they're the migrations where rollback is genuinely impossible without data loss.
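The expand-contract property — old code working against the current schema at every step — can be shown with the dual-write step. The sketch below "renames" a column in SQLite without ever breaking the old reader; the names are hypothetical.

```python
# Sketch: the dual-write step of expand-contract, in SQLite. We rename
# total_cents to amount_cents without breaking the old reader at any step.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER)")

# Expand: add the new column alongside the old one.
conn.execute("ALTER TABLE orders ADD COLUMN amount_cents INTEGER")

# Dual-write: new code writes both columns, so old code that still reads
# total_cents keeps working — rollback is just redeploying the old code.
conn.execute("INSERT INTO orders (total_cents, amount_cents) VALUES (1999, 1999)")

old_reader = conn.execute("SELECT total_cents FROM orders").fetchone()[0]
new_reader = conn.execute("SELECT amount_cents FROM orders").fetchone()[0]
assert old_reader == new_reader == 1999
# Contract (dropping total_cents) only ships releases later, after the
# rollback window for every version that reads it has closed.
```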
When the rollback window closes
Most rollbacks are easy in the first hour and progressively harder. After 6 hours, customers have done writes against the new version that the old version cannot understand. After 24 hours, you are usually rolling forward to fix bugs, not rolling back to dodge them. Decide early.
The decision criteria. In the first hour, default to rollback for any unexpected behaviour. After 6 hours, weigh: how much data has been written that the old version cannot read? How long would the forward fix take? At that point rollback usually loses, because discarding six hours of new-version writes costs more than fixing forward.
The mistake. Engineers in the first hour debate forward-fix versus rollback. The right move is rollback — it's faster, the impact is bounded, and the team can come back tomorrow with a fix. Forward-fix in the first hour means the bug stays in production while the team is debugging.
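The decision rule above can be written down as a crude helper. The thresholds are the article's rules of thumb, not universal constants, and the comparison in the middle band is a deliberate simplification: it only weighs elapsed time (a proxy for accumulated incompatible writes) against the estimated forward-fix time.

```python
# Sketch of the rollback-vs-forward-fix decision rule. Thresholds are
# rules of thumb from the text, not universal constants.
def decide(hours_since_deploy, forward_fix_hours):
    if hours_since_deploy < 1:
        return "rollback"       # impact bounded, revert is fast: don't debate
    if hours_since_deploy >= 24:
        return "fix forward"    # window closed: too many incompatible writes
    # In between: compare the cost of discarding new-version writes
    # (proxied by elapsed hours) against the forward-fix estimate.
    return "rollback" if forward_fix_hours > hours_since_deploy else "fix forward"

assert decide(0.5, 3) == "rollback"     # first hour: always roll back
assert decide(30, 2) == "fix forward"   # day later: the window has closed
```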
Quarterly drills
Do a real rollback in production every quarter. The drill costs one quiet afternoon and pays for itself the first time a real outage takes 8 minutes instead of 40.
The drill protocol. Pick a service. Deploy a known-bad version that triggers a specific alert. Wait for the on-call to receive the page. The on-call executes the documented rollback procedure. Time the steps. Document what worked and what didn't.
What you learn from a drill. The credentials your runbook references have rotated. The git tag you reference doesn't exist anymore. The Terraform state your rollback depends on has drifted. The deploy tool's UI changed. Each is a small thing that becomes a blocker during a real incident; each is something the drill catches before the incident does.
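A minimal scorecard for the drill might look like this. The step names and the simulated failure are hypothetical; the point is to time each documented step and record exactly where the procedure breaks, so the finding is written down rather than remembered.

```python
# Sketch of a drill scorecard: time each documented rollback step and
# record what broke. Step names and failures here are hypothetical.
import time

def run_drill(steps):
    results = []
    for name, action in steps:
        start = time.monotonic()
        try:
            action()
            outcome = "ok"
        except Exception as exc:
            outcome = f"BLOCKED: {exc}"   # this is the finding you drill for
        results.append((name, outcome, time.monotonic() - start))
    return results

def expired_credential():
    raise RuntimeError("credential expired")  # the classic drill discovery

steps = [
    ("fetch previous image tag", lambda: None),
    ("auth to deploy tool", expired_credential),
]
for name, outcome, secs in run_drill(steps):
    print(f"{name}: {outcome} ({secs:.1f}s)")
```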
Common antipatterns
"We rolled forward instead." The team encounters a regression, decides to fix forward instead of rolling back. Sometimes correct; often a way to avoid admitting the rollback procedure is broken. If you rarely roll back, the procedure has rotted.
The rollback procedure that requires the original deploy engineer. The procedure says "Sara has the credentials and knows the steps." Sara is on vacation. The procedure must be executable by any on-call engineer; if it requires specific knowledge, write that knowledge down.
Forward migrations without backward compatibility. Schema changes that drop or rename in the same version that uses the new shape. No clean rollback. Always preserve backward compatibility for at least one version.
The "small" deploy that's actually a migration. Engineer ships a "small fix" that includes a schema change; treats it as a normal deploy. The fix needs rollback — which now requires the schema reverse procedure. Migrations and code changes should be separate deploys.
What to do this week
Three moves. (1) Schedule the next quarterly rollback drill on the team calendar. The recurring event is what makes it happen; ad-hoc good intentions don't survive Q3. (2) Audit your last 10 deploys: for each, what was the rollback path? Most teams find 1-2 deploys had no clear rollback — those are the next migrations to fix. (3) Document the rollback procedure for your most critical service. Two pages max. Test it during the next drill.