Cloud Provider Outage Playbook: Twelve Hours, Four Stages
Cloud outages are inevitable. The teams that handle them well work from a playbook; the teams that improvise burn out.
Stage 1: Confirm
First task: confirm the outage is real and that the cause is the cloud provider, not you. Check the provider's status page; check your monitoring from outside the affected region; rule out an internal cause.
This step is short but critical. Treating ‘a cloud outage’ as the diagnosis without confirming it wastes the first hour.
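A minimal confirmation sketch in Python, assuming a hypothetical provider status feed and an externally hosted probe; every URL and field name below is a placeholder for whatever your provider and monitoring actually expose.

```python
"""Quick confirmation: is it the cloud, or is it us?"""
import json
import urllib.request

# Hypothetical placeholders: substitute your provider's status feed
# and a health check probed from OUTSIDE the affected region.
PROVIDER_STATUS_URL = "https://status.example-cloud.com/api/status.json"
EXTERNAL_PROBE_URL = "https://probe.example.com/health?region=us-east-1"

def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch and parse a JSON endpoint with a short timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def confirm_outage() -> None:
    try:
        status = fetch_json(PROVIDER_STATUS_URL)
        print("provider says:", status.get("description", "unknown"))
    except OSError as exc:
        print("provider status feed unreachable:", exc)
    try:
        probe = fetch_json(EXTERNAL_PROBE_URL)
        print("external probe:", probe.get("state", "unknown"))
    except OSError as exc:
        # If the external probe also fails, the evidence points at the
        # provider or the region, not just your own network path.
        print("external probe failed:", exc)

if __name__ == "__main__":
    confirm_outage()
```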
Stage 2: Communicate
- Internal: post in #incidents immediately. External: update the status page within 15 minutes; be honest, with no false claims of normal service (a posting sketch follows this list).
- Email customers if the outage is SLA-relevant or exceeds 30 minutes. Have templates ready in advance, not written at midnight.
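A sketch of posting that first update from a prepared template, assuming a Statuspage-style REST API; the endpoint, page ID, token variable, and incident fields are placeholders for your actual status provider.

```python
"""Post the first public update from a template written in advance."""
import copy
import json
import os
import urllib.request

# Placeholders for a Statuspage-style API; adapt to your provider.
STATUS_API = "https://api.statuspage.example.com/v1/pages/PAGE_ID/incidents"
API_TOKEN = os.environ["STATUS_API_TOKEN"]  # never hardcode credentials

# Written in calm daylight: acknowledge impact, promise an update
# cadence, and claim nothing you cannot verify.
TEMPLATE = {
    "incident": {
        "name": "Elevated errors in {region}",
        "status": "investigating",
        "body": (
            "We are seeing elevated error rates in {region} tied to an "
            "upstream cloud provider incident. Next update in 30 minutes."
        ),
    }
}

def post_incident(region: str) -> None:
    """Fill the template and POST it to the status page."""
    payload = copy.deepcopy(TEMPLATE)
    for key in ("name", "body"):
        payload["incident"][key] = payload["incident"][key].format(region=region)
    req = urllib.request.Request(
        STATUS_API,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"OAuth {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("status page updated:", resp.status)

if __name__ == "__main__":
    post_incident("us-east-1")
```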
Stage 3: Mitigate
What can you actually do during a region outage? Fail over to a healthy region (if your architecture supports it), drain traffic to a standby, scale up surviving capacity.
The actions are determined by your architecture. The runbook should match what you actually have, not what you wish you had.
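One concrete shape for the drain, assuming weighted DNS records in AWS Route 53 managed with boto3; the hosted zone ID, record name, and IPs are placeholders. If your failover lives in a load balancer or service mesh instead, the move is the same: impaired pool to weight zero, healthy pool serving.

```python
"""Drain traffic from an impaired region using weighted DNS."""
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # placeholder zone ID
RECORD_NAME = "api.example.com."

route53 = boto3.client("route53")

def set_region_weight(set_identifier: str, ip: str, weight: int) -> None:
    """UPSERT one weighted record; weight 0 drains it from rotation."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"outage mitigation: {set_identifier} -> weight {weight}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,  # short TTL so the drain takes effect quickly
                    "ResourceRecords": [{"Value": ip}],
                },
            }],
        },
    )

# Drain the impaired region, keep the standby at full weight.
set_region_weight("us-east-1", "192.0.2.10", 0)
set_region_weight("us-west-2", "192.0.2.20", 100)
```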
Stage 4: Recover
As the cloud recovers, ramp traffic back gradually. Bringing everything back at once will overload the still-recovering services and trigger a second cascade of failures.
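A sketch of that ramp, stepping weights up with a bake period and a health gate between steps; it imports the hypothetical set_region_weight() helper from the mitigation sketch above, and the step sizes, bake time, and health check are illustrative.

```python
"""Ramp a recovered region back in steps, not all at once."""
import time
import urllib.request

# Hypothetical helper from the mitigation sketch above.
from mitigation_sketch import set_region_weight

RAMP_STEPS = [10, 25, 50, 100]   # percent of full weight
BAKE_SECONDS = 600               # watch each step before taking the next
HEALTH_URL = "https://probe.example.com/health?region=us-east-1"

def region_healthy() -> bool:
    """One coarse gate; in practice, watch error rates and saturation."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def ramp_recovered_region() -> None:
    for weight in RAMP_STEPS:
        set_region_weight("us-east-1", "192.0.2.10", weight)
        print(f"us-east-1 at {weight}%, baking {BAKE_SECONDS}s")
        time.sleep(BAKE_SECONDS)
        if not region_healthy():
            # The recovering region is telling you it is not ready:
            # back off instead of pushing through to the next step.
            set_region_weight("us-east-1", "192.0.2.10", 0)
            raise RuntimeError(f"health gate failed at {weight}%; drained")
```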
Post-incident: own your impact even though the root cause was the cloud. Customers do not care where the line between your systems and your provider's runs.
Antipatterns
- No status page. Customers find out from Twitter; trust craters.
- Improvising the playbook. The first 30 minutes are the most expensive time to design a process.
- Skipping the postmortem because ‘not our fault.’ The actions you took (or did not) are still yours.
What to do this week
Three moves. (1) Pick the stage of this playbook where your environment is most exposed. (2) Apply the lightest fix and measure for one week. (3) Schedule a quarterly review so the discipline does not rot.