Cloud Provider Outage Playbook: Twelve Hours, Four Stages
Cloud outages are inevitable. The teams that handle them well work from a playbook; the teams that improvise burn out.
Stage 1: Confirm
First task: confirm the outage is real and that the cause is the cloud provider, not you. Check the provider's status page; check your monitoring from outside the affected region; rule out an internal cause.
This step is short but critical. Treating ‘a cloud outage’ as the diagnosis without confirming it wastes the first hour.
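A minimal confirmation sketch in Python, assuming a hypothetical provider status feed and an externally hosted probe; every URL and field name below is a placeholder for whatever your provider and monitoring actually expose.

```python
"""Quick confirmation: is it the cloud, or is it us?"""
import json
import urllib.request

# Hypothetical placeholders: substitute your provider's status feed
# and a health check probed from OUTSIDE the affected region.
PROVIDER_STATUS_URL = "https://status.example-cloud.com/api/status.json"
EXTERNAL_PROBE_URL = "https://probe.example.com/health?region=us-east-1"

def fetch_json(url: str, timeout: float = 5.0) -> dict:
    """Fetch and parse a JSON endpoint with a short timeout."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)

def confirm_outage() -> None:
    try:
        status = fetch_json(PROVIDER_STATUS_URL)
        print("provider says:", status.get("description", "unknown"))
    except OSError as exc:
        print("provider status feed unreachable:", exc)
    try:
        probe = fetch_json(EXTERNAL_PROBE_URL)
        print("external probe:", probe.get("state", "unknown"))
    except OSError as exc:
        # If the external probe also fails, the evidence points at the
        # provider or the region, not just your own network path.
        print("external probe failed:", exc)

if __name__ == "__main__":
    confirm_outage()
```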
Stage 2: Communicate
- Internal: post in #incidents immediately. External: update the status page within 15 minutes; be honest, with no false claims of normal service (a posting sketch follows this list).
- Email customers if the outage is SLA-relevant or exceeds 30 minutes. Have templates ready in advance, not written at midnight.
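A sketch of posting that first update from a prepared template, assuming a Statuspage-style REST API; the endpoint, page ID, token variable, and incident fields are placeholders for your actual status provider.

```python
"""Post the first public update from a template written in advance."""
import copy
import json
import os
import urllib.request

# Placeholders for a Statuspage-style API; adapt to your provider.
STATUS_API = "https://api.statuspage.example.com/v1/pages/PAGE_ID/incidents"
API_TOKEN = os.environ["STATUS_API_TOKEN"]  # never hardcode credentials

# Written in calm daylight: acknowledge impact, promise an update
# cadence, and claim nothing you cannot verify.
TEMPLATE = {
    "incident": {
        "name": "Elevated errors in {region}",
        "status": "investigating",
        "body": (
            "We are seeing elevated error rates in {region} tied to an "
            "upstream cloud provider incident. Next update in 30 minutes."
        ),
    }
}

def post_incident(region: str) -> None:
    """Fill the template and POST it to the status page."""
    payload = copy.deepcopy(TEMPLATE)
    for key in ("name", "body"):
        payload["incident"][key] = payload["incident"][key].format(region=region)
    req = urllib.request.Request(
        STATUS_API,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"OAuth {API_TOKEN}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("status page updated:", resp.status)

if __name__ == "__main__":
    post_incident("us-east-1")
```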
Stage 3: Mitigate
What can you actually do during a region outage? Fail over to a healthy region (if your architecture supports it), drain traffic to a standby, scale up surviving capacity.
The actions are determined by your architecture. The runbook should match what you actually have, not what you wish you had.
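One concrete shape for the drain, assuming weighted DNS records in AWS Route 53 managed with boto3; the hosted zone ID, record name, and IPs are placeholders. If your failover lives in a load balancer or service mesh instead, the move is the same: impaired pool to weight zero, healthy pool serving.

```python
"""Drain traffic from an impaired region using weighted DNS."""
import boto3

HOSTED_ZONE_ID = "Z0000000EXAMPLE"   # placeholder zone ID
RECORD_NAME = "api.example.com."

route53 = boto3.client("route53")

def set_region_weight(set_identifier: str, ip: str, weight: int) -> None:
    """UPSERT one weighted record; weight 0 drains it from rotation."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": f"outage mitigation: {set_identifier} -> weight {weight}",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "A",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,  # short TTL so the drain takes effect quickly
                    "ResourceRecords": [{"Value": ip}],
                },
            }],
        },
    )

# Drain the impaired region, keep the standby at full weight.
set_region_weight("us-east-1", "192.0.2.10", 0)
set_region_weight("us-west-2", "192.0.2.20", 100)
```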
Stage 4: Recover
As the cloud recovers, ramp traffic back gradually. Bringing everything back at once will overload the still-recovering services and trigger a second cascade of failures.
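A sketch of that ramp, stepping weights up with a bake period and a health gate between steps; it imports the hypothetical set_region_weight() helper from the mitigation sketch above, and the step sizes, bake time, and health check are illustrative.

```python
"""Ramp a recovered region back in steps, not all at once."""
import time
import urllib.request

# Hypothetical helper from the mitigation sketch above.
from mitigation_sketch import set_region_weight

RAMP_STEPS = [10, 25, 50, 100]   # percent of full weight
BAKE_SECONDS = 600               # watch each step before taking the next
HEALTH_URL = "https://probe.example.com/health?region=us-east-1"

def region_healthy() -> bool:
    """One coarse gate; in practice, watch error rates and saturation."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def ramp_recovered_region() -> None:
    for weight in RAMP_STEPS:
        set_region_weight("us-east-1", "192.0.2.10", weight)
        print(f"us-east-1 at {weight}%, baking {BAKE_SECONDS}s")
        time.sleep(BAKE_SECONDS)
        if not region_healthy():
            # The recovering region is telling you it is not ready:
            # back off instead of pushing through to the next step.
            set_region_weight("us-east-1", "192.0.2.10", 0)
            raise RuntimeError(f"health gate failed at {weight}%; drained")
```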
Post-incident: own your impact even though the root cause was the cloud. Customers do not care where the line between your systems and your provider's runs.
Antipatterns
- No status page. Customers find out from Twitter; trust craters.
- Improvising the playbook. The first 30 minutes are the most expensive time to design a process.
- Skipping the postmortem because ‘not our fault.’ The actions you took (or did not) are still yours.
What to do this week
Three moves. (1) Pick the stage of this playbook where your environment is most exposed. (2) Apply the lightest fix and measure for one week. (3) Schedule a quarterly review so the discipline does not rot.