The Multi-Region Failover Runbook
Multi-region failover is a high-stakes operation. This is the runbook structure that produces consistent, safe failovers.
Failover decision
Multi-region failover is the discipline of moving traffic from a degraded region to a healthy one. The procedure must be both fast (incidents do not wait) and reliable (a failover that itself fails compounds the problem). The runbook is the structured procedure that makes failover repeatable. The decision layer is what triggers the runbook.
What good failover decision criteria look like:
- Documented criteria: The runbook specifies what triggers failover: latency above threshold for a window; error rate above threshold for a window; region health check failures; AWS-reported regional incidents. The criteria are explicit, because ambiguity in the moment of an incident leads to slow or wrong decisions.
- Latency, error rate, region health: The signals are specific. Customer-facing latency above 5 seconds for 10 minutes; error rate above 5% for 5 minutes; the AWS service health dashboard showing severe issues. Each is a defined metric with a defined threshold (see the sketch after this list).
- Decision authority: A bounded set of people can declare failover: the on-call engineer, the incident commander, security or operations leadership. The set is small enough that the decision can be made quickly, and large enough that one person's unavailability does not block it.
- Who can declare: The list is documented and current. New on-call engineers know they are on the list; people who have left are removed. The clarity prevents the "I thought you were going to call it" failure mode.
- Bounded set: The bounded set prevents both indecision and overreaction. A single person can declare; the team does not need to assemble a committee mid-incident.
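To make the criteria concrete, here is a minimal sketch of how they might be encoded as a check, assuming CloudWatch holds the telemetry. The namespace and metric names (MyApp/Frontend, CustomerLatencySeconds, ErrorRate) are hypothetical placeholders; the thresholds are the ones documented above.

```python
"""Evaluate the runbook's documented failover criteria.

A sketch only: the CloudWatch namespace, metric names, and
region are hypothetical placeholders for your own telemetry.
"""
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Thresholds copied from the documented criteria above.
LATENCY_THRESHOLD_SECONDS = 5.0   # customer-facing latency
LATENCY_WINDOW_MINUTES = 10
ERROR_RATE_THRESHOLD = 0.05       # 5% error rate
ERROR_WINDOW_MINUTES = 5


def window_average(metric_name: str, window_minutes: int) -> float:
    """Average of one metric over its documented window."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="MyApp/Frontend",   # hypothetical namespace
        MetricName=metric_name,
        StartTime=now - timedelta(minutes=window_minutes),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else 0.0


def breached_criteria() -> list[str]:
    """Return the criteria currently breached; empty means no trigger."""
    breached = []
    if window_average("CustomerLatencySeconds", LATENCY_WINDOW_MINUTES) > LATENCY_THRESHOLD_SECONDS:
        breached.append("latency above 5s for 10 minutes")
    if window_average("ErrorRate", ERROR_WINDOW_MINUTES) > ERROR_RATE_THRESHOLD:
        breached.append("error rate above 5% for 5 minutes")
    return breached
```

The check surfaces evidence for the humans who hold declaration authority; it is not an automatic trigger.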
Decision criteria and authority are what make the runbook usable in real time. Without them, the runbook is reference documentation; with them, it is operational guidance.
The steps
The failover steps are the mechanical procedure. Each step has a specific action, an expected outcome, and an expected duration. Following the steps in order produces the failover; deviations are themselves signals worth investigating.
- Promote standby database: The standby region's database is promoted to primary. The promotion procedure depends on the database (RDS failover, manual primary promotion, application-level reconfiguration). The team has practiced this step; the duration is known.
- Update DNS or load balancer: Traffic routing is updated to point at the new primary region: Route 53 weighted records, load balancer reconfiguration, or an application-level routing change. The change is fast; the propagation is bounded by DNS TTLs. (Steps 1 and 2 are sketched in code after this list.)
- Verify traffic shift: The team verifies traffic is actually flowing to the new region. Real-time metrics show the shift; latency and error rates from the new region match expectations. Without verification, the team does not know whether the failover succeeded.
- Drain old region: Active connections to the old region are drained. New connections go to the new region; old connections complete and disconnect. The drain is graceful; users do not see abrupt disconnects. (Steps 3 and 4 are sketched below as well.)
- Each step has an expected duration: The runbook documents how long each step should take. Deviations from the expected duration are signals: the step is harder than expected, something is wrong, or the runbook needs updating. The deviations are themselves data.
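A sketch of the first two steps, assuming the standby is an RDS read replica and traffic is routed with Route 53 weighted records. Every identifier here (instance name, hosted zone ID, record names) is a hypothetical placeholder for the values in your runbook.

```python
"""Failover steps 1-2: promote the standby, then shift DNS.

All identifiers below are hypothetical; substitute the values
documented in your runbook.
"""
import boto3

rds = boto3.client("rds", region_name="us-west-2")  # standby region
route53 = boto3.client("route53")

# Step 1: promote the standby database. Record the actual duration;
# a deviation from the expected duration is itself a signal.
rds.promote_read_replica(DBInstanceIdentifier="standby-db")
rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier="standby-db")

# Step 2: shift the Route 53 weighted records so all traffic goes to
# the standby region. Propagation is bounded by the record TTL.
route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Comment": "failover: shift traffic to us-west-2",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "CNAME",
                    "SetIdentifier": "us-west-2",
                    "Weight": 100,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "lb.us-west-2.example.com"}],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com.",
                    "Type": "CNAME",
                    "SetIdentifier": "us-east-1",
                    "Weight": 0,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": "lb.us-east-1.example.com"}],
                },
            },
        ],
    },
)
```

Weighted records are one routing choice among several; a failover record type or an application-level router would change the step's mechanics but not its place in the sequence.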
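And a sketch of verification and drain, assuming Application Load Balancers in both regions. The load balancer names, target group ARN, instance ID, and the 95%-shifted heuristic are illustrative assumptions, not prescribed values.

```python
"""Failover steps 3-4: verify the shift, then drain the old region.

Load balancer names, the target group ARN, and the instance ID
are hypothetical placeholders.
"""
import time
from datetime import datetime, timedelta, timezone

import boto3


def requests_per_minute(load_balancer: str, region: str) -> float:
    """Most recent one-minute request count for one region's ALB."""
    cw = boto3.client("cloudwatch", region_name=region)
    now = datetime.now(timezone.utc)
    resp = cw.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName="RequestCount",
        Dimensions=[{"Name": "LoadBalancer", "Value": load_balancer}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Sum"] if points else 0.0


# Step 3: verify traffic is actually flowing to the new region before
# declaring the cutover complete.
for _ in range(30):  # poll for up to ~5 minutes
    new = requests_per_minute("app/new-lb/abc123", "us-west-2")
    old = requests_per_minute("app/old-lb/def456", "us-east-1")
    print(f"new region: {new:.0f} req/min, old region: {old:.0f} req/min")
    if new > 0 and old < new * 0.05:  # assumption: >95% shifted means done
        break
    time.sleep(10)

# Step 4: drain the old region. Deregistration honors the target
# group's deregistration delay, so in-flight requests complete
# gracefully instead of being cut off.
boto3.client("elbv2", region_name="us-east-1").deregister_targets(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/old/0123456789abcdef",
    Targets=[{"Id": "i-0123456789abcdef0"}],
)
```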
The steps are the mechanical execution. Practice produces familiarity; familiarity produces fast, accurate execution under stress.
Rollback
Failover is reversible. After the original incident is resolved, traffic can return to the original region. The rollback procedure is sometimes more complex than the original failover because data has accumulated in the failed-over region during the incident.
- Steps to fail back: The runbook documents the rollback: re-establish replication from the new primary back to the original; verify replication catches up (the catch-up gate is sketched after this list); reverse the failover steps. The procedure is the mirror image of the original failover.
- Often more complex than the original failover: The original failover is "promote the cold standby"; the rollback is "merge the data accumulated in the new primary back into the original primary, then reverse the cutover". The complexity comes from data reconciliation.
- Practice both directions in drills: Drills exercise both the failover and the rollback, so the team is comfortable with both directions and either is operationally feasible. Without practice, the rollback is the failure point that turns a recovered incident into a longer outage.
- Document data reconciliation: If reconciliation is required (because the failed-over region wrote data the original did not see), the procedure is documented. The team knows which data flows need reconciliation and how to run it.
- Postmortem after rollback: The full incident, including failover and rollback, gets a postmortem. What worked? What was harder than expected? What runbook updates are needed? The postmortem feeds the next iteration.
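One piece of the rollback that rewards being scripted is the catch-up gate. A sketch, assuming the original primary was rebuilt as a replica of the failed-over primary and that the RDS ReplicaLag metric is the relevant signal; the instance identifier and lag threshold are hypothetical.

```python
"""Gate the fail-back on replication catch-up.

Assumes the original primary was rebuilt as a replica of the
failed-over primary; the instance ID and threshold are hypothetical.
"""
import time
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

MAX_LAG_SECONDS = 1.0  # the runbook's documented catch-up threshold


def replica_lag(instance_id: str) -> float:
    """Most recent ReplicaLag datapoint for the rebuilt replica."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="ReplicaLag",
        Dimensions=[{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        StartTime=now - timedelta(minutes=5),
        EndTime=now,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else float("inf")


# Do not reverse the cutover until the original region has caught up;
# failing back against a lagging replica is how data gets lost.
while replica_lag("original-primary-db") > MAX_LAG_SECONDS:
    time.sleep(30)
print("replication caught up; safe to reverse the failover steps")
```

Only after this gate passes do the DNS and promotion steps run in reverse.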
The multi-region failover runbook is the discipline that distinguishes regions that can actually fail over from regions that just look like they can. Nova AI Ops integrates with regional health data, runs failover drills, and produces the runbook-backed playbook that incident commanders reference during real events.