Real Outage: A Database Failover That Failed Over
A planned failover at a large code-hosting platform took 12 seconds longer than the application-side timeout. For a brief window, the cluster manager and a slice of application clients disagreed on which side was primary. 24 hours of manual split-brain reconciliation followed.
Timeline
Anonymised composite, drawing on the multi-region MySQL/Vitess failover patterns common at large code-hosting platforms. Times in UTC.
Day 1, 23:00, Maintenance window opens. Plan: fail over the primary metadata cluster from US-East to US-West to enable network maintenance in US-East. Cluster manager (Orchestrator-style) targets a 30-second cutover.
23:00:08, Failover initiated. Old primary stops accepting writes; replica in US-West is promoted; new primary advertised in service discovery.
23:00:20, Service-discovery propagation delay. Most application clients reconnect; about 4% of clients still have stale primary addresses cached.
23:00:42, Failover declared complete by the cluster manager. Total elapsed since the window opened: 42 seconds. The application-side failover timeout was set to 30 seconds.
23:00:43, Application instances that hit their 30-second timeout executed their fallback: assume the previous primary is healthy and resume writes (the fallback path is sketched after this timeline). Those writes hit the old primary, which was now read-only. The applications saw the read-only error, retried, and a subset were routed back to the new primary via service discovery.
23:01:15, Result: a 23-second window in which a small fraction of writes succeeded against US-East (the demoted primary, its buffer pool still warm, kept accepting writes because the read-only flag had a brief propagation gap). The same writes were also re-issued and accepted at US-West.
23:08, Drift detection fires. Two regions have writes the other doesn’t. ~3,400 affected metadata rows.
Day 2, 23:08, All writes reconciled by hand. Total impact window for split data: 24 hours from detection to verified reconciliation.
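For concreteness, the dangerous fallback looked something like the sketch below. This is a reconstruction under assumptions, not the platform’s actual client code; `resolve_primary` and `last_known_primary` are hypothetical names.

```python
import time

FAILOVER_TIMEOUT_S = 30  # the application-side timeout treated as a hard ceiling

def write_with_failover(query, resolve_primary, last_known_primary):
    """Reconstruction of the dangerous fallback path (names are hypothetical)."""
    deadline = time.monotonic() + FAILOVER_TIMEOUT_S
    while time.monotonic() < deadline:
        primary = resolve_primary()        # service-discovery lookup
        if primary is not None:
            return primary.execute(query)  # normal path: write to the advertised primary
        time.sleep(1)
    # BUG: after 30 seconds, assume the *previous* primary is still healthy and
    # resume writing to it. During a slow failover this re-opens a write path
    # to the demoted primary: the dual-write window in the timeline above.
    return last_known_primary.execute(query)
```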
The detection lag
The detection itself was good: 8 minutes from drift to alarm, on a system that was supposed to never drift. The problem was upstream: during those 42 seconds, nobody noticed that the failover had blown past the application-side timeout. The cluster manager’s “30-second target” was an SLO, not a guarantee. The application team had assumed it was a hard ceiling and configured fallback behaviour that was actively dangerous when the SLO was missed.
What was missing: an alarm on “failover taking longer than expected”. The cluster manager had the data; nothing surfaced it during the operation. The on-call watching the failover saw “in progress” for 42 seconds and assumed it was normal.
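A minimal version of that alarm, as a sketch: a watchdog that pages the moment the operation overruns its target rather than after it finishes. The `is_complete` and `page_oncall` hooks are assumptions, not the cluster manager’s real API.

```python
import time

FAILOVER_TARGET_S = 30  # the cluster manager's SLO; a target, not a guarantee

def watch_failover(is_complete, page_oncall, target_s=FAILOVER_TARGET_S):
    """Page as soon as a failover overruns its target, not after it completes.

    is_complete: callable returning True once cutover is done (assumed API).
    page_oncall: callable that fires the alert (assumed hook).
    """
    start = time.monotonic()
    while not is_complete():
        elapsed = time.monotonic() - start
        if elapsed > target_s:
            page_oncall(f"failover at {elapsed:.0f}s, target {target_s}s: "
                        "application clients may already be past their timeout")
            return
        time.sleep(1)
```

Run alongside the cutover, this turns “in progress for 42 seconds” into a page at second 31.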
The cascade
The cascade was almost a non-cascade: only ~3,400 rows out of millions ended up split. But the math was scary. The window where dual-write was possible was about 23 seconds (per the timeline above), and application throughput on the affected metadata tables was around 4,000 writes per second. Only a fraction of writes were affected because most clients had reconnected cleanly; the 4% with stale connections concentrated all the damage.
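The back-of-envelope exposure math, using the numbers above:

```python
window_s = 23          # dual-write window from the timeline
writes_per_s = 4_000   # throughput on the affected metadata tables
stale_fraction = 0.04  # clients still holding the old primary's address

exposed = window_s * writes_per_s * stale_fraction
print(f"{exposed:,.0f}")  # ~3,680: the same order as the ~3,400 rows found
```

Had every client held a stale connection, the same window could have split roughly 92,000 rows.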
The deeper cascade was the manual recovery. Reconciling split-brain writes in a metadata system that includes things like permission grants, repository visibility, and access tokens is not something you can script generically. Each of the ~3,400 rows had to be evaluated: which version is right? What was the user’s intent? Are there downstream effects? Nine engineers spent 24 hours doing this by hand. That’s the real impact.
What the runbook said
The failover runbook was 14 pages and well-written. It had a checklist, a rollback procedure, a contact list, and a verification step at the end. What it didn’t have was: “If the failover takes more than 30 seconds, application clients will start dual-writing. Stop and reconcile before resuming.”
The runbook authors had assumed the application timeout was a config detail the application team owned. The application team had assumed the 30-second figure in the runbook was a guarantee. Neither team had ever sat down and verified the contract.
What actually fixed it
Recovery was three things. First, freeze writes to the affected table prefixes until reconciliation could begin (enacting the freeze took 4 minutes). Second, run a scripted comparison between the two regions to identify drifted rows. Third, manually reconcile each row.
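The comparison was conceptually simple. A sketch of the approach, assuming each metadata row carries an id, an update timestamp, and a content hash; the table and column names are hypothetical, and the real script used the platform’s own schema.

```python
# Sketch: find rows that diverged between the two regions after the failover.
# Assumes standard DB-API connections and hypothetical column names.

def fetch_rows(conn, table):
    """Return {id: (updated_at, content_hash)} for one region's copy of the table."""
    cur = conn.cursor()
    cur.execute(f"SELECT id, updated_at, content_hash FROM {table}")
    return {row_id: (ts, h) for row_id, ts, h in cur.fetchall()}

def drifted_rows(conn_east, conn_west, table):
    east = fetch_rows(conn_east, table)
    west = fetch_rows(conn_west, table)
    drift = []
    for row_id in east.keys() | west.keys():
        if east.get(row_id) != west.get(row_id):
            # Present on one side only, or same id with different content:
            # both count as drift and go to the manual reconciliation queue.
            drift.append((row_id, east.get(row_id), west.get(row_id)))
    return drift
```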
The script was the easy part. The reconciliation was the slog. The team built a per-row UI that showed both versions, the user who initiated the change, and the timestamps. Engineers walked through it row by row, choosing a winner, marking the loser, and writing an entry to an audit log. 24 hours of focused work across 9 people.
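The audit entry behind that UI can be pictured as something like the record below; the field names are illustrative, not the team’s actual schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ReconciliationDecision:
    """One audit-log entry per reconciled row (illustrative fields)."""
    row_id: int
    winner_region: str     # "us-east" or "us-west"
    loser_archived: bool   # losing version is archived, never silently dropped
    decided_by: str        # the reconciling engineer, never the end user
    decided_at: datetime
    rationale: str         # why this version reflects the user's intent
```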
The team avoided naming any specific user as “at fault” in the audit. Every drift was a system failure, not a user choice.
Action items
- Fence-before-failover. The cluster manager now fences the old primary at the network layer (drops all client connections via the load balancer) before promoting the replica. No more dual-write window, full stop.
- Application timeout raised to 5 minutes. The 30-second timeout was always a foot-gun. Raising it doesn’t make failovers slower; it just removes the dangerous fallback behaviour. The user-visible impact during a 90-second failover is the same as during a 30-second one: errors with retries.
- No fallback-to-old-primary path. Application code no longer assumes the previous primary is healthy if the new one is unreachable. The new fallback is “return 503 to the user; don’t write anywhere” (sketched after this list, together with the raised timeout). Saying no is safer than dual-writing.
- Game-day on slow failovers. Quarterly drill where the team intentionally introduces a 90-second failover and verifies the new fence + raised timeout work correctly.
- Drift detection runs continuously. The 8-minute detection was already good; tightening it to 60 seconds is incremental work and meaningfully shrinks any future event.
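A sketch of the new client write path, combining the raised deadline with the fail-closed fallback. The names are hypothetical; the real change lived in the platform’s database client library.

```python
import time

FAILOVER_DEADLINE_S = 300  # raised from 30s: wait out a slow failover instead of guessing

class PrimaryUnavailable(Exception):
    """Surfaced to the HTTP layer as a 503. The client never writes anywhere else."""

def write(query, resolve_primary):
    deadline = time.monotonic() + FAILOVER_DEADLINE_S
    while time.monotonic() < deadline:
        primary = resolve_primary()        # service discovery is the only source of a primary
        if primary is not None:
            return primary.execute(query)
        time.sleep(1)
    # No fallback-to-old-primary path: refusing the write beats dual-writing it.
    raise PrimaryUnavailable("no primary advertised within deadline")
```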
The architectural change
The architectural answer was: fencing is mandatory. Every failover, planned or unplanned, now starts by fencing the old primary at the network layer before any promotion happens. Application clients that were writing to the old primary get connection-reset errors and have to reconnect through service discovery. They can’t accidentally keep writing because there’s no longer a path to the old primary at all.
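The ordering that makes this safe, as a sketch; all the callables stand in for real load-balancer and cluster-manager operations.

```python
def failover(old_primary, replica, fence, confirm_no_writes, promote, advertise):
    """Fence-before-failover: close the old write path before opening the new one.

    Every callable here is a stand-in for a real infrastructure operation.
    """
    fence(old_primary)              # drop client connections at the load balancer
    confirm_no_writes(old_primary)  # block until in-flight writes have drained
    promote(replica)                # only now does a second writable primary exist
    advertise(replica)              # point service discovery at the new primary
    # At no point do two writable primaries coexist with a client path to both.
```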
The pushback was “fencing adds 5-10 seconds to every failover”. The response was: those 10 seconds are the difference between a clean failover and 24 hours of manual reconciliation. The trade is obvious in retrospect.
The deeper architectural lesson, written into the team’s design-review checklist: “If a system has multiple sources of truth about which node is primary, the question is not whether they will disagree but when. Design for the disagreement.” Cluster managers, service discovery, application caches, DNS, load balancers, and individual client connection pools are all separate sources of truth about the primary. They will not agree at the same nanosecond. Fencing is what forces the disagreement to resolve safely.