AWS us-east-1 2021
Multi-service outage.
Overview
The us-east-1 outage of December 7, 2021 was a multi-hour, multi-service AWS incident that degraded core services in the region and many third-party products that depended on it. Many AWS internal services themselves depended on us-east-1, and even the console became unreachable. The case study reshaped how teams design for regional resilience: multi-AZ is not enough when the failure domain is the region itself.
- Single-region single point of failure. Many AWS internal services depended on us-east-1. The customer architecture lesson is the same: do not put a tier-0 dependency in one region.
- Console dependency. The AWS console itself became unreachable, so incident response could not rely on it.
- Cascading service failure. A network impairment took down services with implicit network dependencies. Transitive failures spread further than the direct dependency list suggests.
- Multi-AZ is insufficient, and SLAs carry financial impact. The failure domain was the region, not the AZ, and the outage triggered SLA credits for major customers. Resilience requirements escalated industry-wide.
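The point about transitive failures can be made concrete with a small sketch. It assumes a hypothetical service graph and region placement (the names `DEPENDS_ON`, `REGIONS`, and every service name are invented for illustration, not taken from the incident) and computes which services are affected when one region is lost.

```python
from collections import deque

# Hypothetical service graph: each service maps to the services it calls directly.
DEPENDS_ON = {
    "checkout": ["payments", "catalog"],
    "payments": ["auth"],
    "catalog": ["search"],
    "auth": ["token-store"],
    "search": [],
    "token-store": [],
}

# Hypothetical placement: the set of regions each service runs in.
REGIONS = {
    "checkout": {"us-east-1", "us-west-2"},
    "payments": {"us-east-1", "us-west-2"},
    "catalog": {"us-east-1", "us-west-2"},
    "auth": {"us-east-1", "us-west-2"},
    "search": {"us-east-1", "us-west-2"},
    "token-store": {"us-east-1"},  # the only single-region service
}


def transitive_dependencies(service, graph):
    """Return every service reachable from `service`, directly or indirectly."""
    seen = set()
    queue = deque(graph.get(service, []))
    while queue:
        dep = queue.popleft()
        if dep in seen:
            continue  # already visited; also guards against cycles
        seen.add(dep)
        queue.extend(graph.get(dep, []))
    return seen


def impacted_by_region_loss(region, graph, regions):
    """Services that run only in `region`, plus everything that depends on them."""
    single_homed = {svc for svc, placed in regions.items() if placed == {region}}
    impacted = set(single_homed)
    for service in graph:
        if transitive_dependencies(service, graph) & single_homed:
            impacted.add(service)
    return impacted


if __name__ == "__main__":
    print(sorted(impacted_by_region_loss("us-east-1", DEPENDS_ON, REGIONS)))
```

In this made-up graph only `auth` calls `token-store` directly, yet `payments` and `checkout` are also affected by losing us-east-1, even though both are deployed in two regions. A review that checks only direct dependencies would miss them.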
The approach
The approach combines five practices: multi-region deployment for critical services (active-active or hot-standby), alternative consoles via the CLI run from outside the affected region, dependency mapping that catches transitive failures, region-aware monitoring that runs from a different region, and game-day exercises that test regional failure before it happens.
- Multi-region for critical services. Active-active or hot-standby. Survives regional outages by construction.
- Alternative consoles. CLI from outside the affected region. Incident response continues even when the primary console is down.
- Dependency mapping. Know what depends on what across regions. Catches transitive failures before they cascade.
- Region-aware monitoring plus game-days. Monitoring runs from a different region; quarterly game-days test region failure. Real validation, not theoretical resilience.
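Region-aware monitoring and failover can be sketched as a quorum decision over probes that run outside the region they check. This is a minimal illustration under stated assumptions: the `Probe` type, the quorum of 2, the sample data, and the function names are all hypothetical, and a real setup would take probe results from an actual monitoring system rather than a hard-coded list.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence


@dataclass(frozen=True)
class Probe:
    vantage_region: str  # region the probe ran from
    target_region: str   # region being checked
    healthy: bool


def region_is_down(target: str, probes: Iterable[Probe], quorum: int = 2) -> bool:
    """Declare `target` down only on evidence gathered from outside it.

    In-region probes are ignored: during a regional outage they may be
    unable to report at all, or may report stale results.
    A region with no external probes is treated as up.
    """
    external = [
        p for p in probes
        if p.target_region == target and p.vantage_region != target
    ]
    failures = sum(1 for p in external if not p.healthy)
    successes = len(external) - failures
    return failures >= quorum and failures > successes


def choose_active_region(
    preferred: Sequence[str], probes: Sequence[Probe], quorum: int = 2
) -> Optional[str]:
    """Return the first preferred region not judged down, or None if all are."""
    for region in preferred:
        if not region_is_down(region, probes, quorum):
            return region
    return None


# Hypothetical probe results, shaped like a game-day that simulates losing us-east-1.
PROBES = [
    Probe("us-east-1", "us-east-1", True),   # in-region probe, ignored
    Probe("us-west-2", "us-east-1", False),
    Probe("eu-west-1", "us-east-1", False),
    Probe("eu-west-1", "us-west-2", True),
]

if __name__ == "__main__":
    print(choose_active_region(["us-east-1", "us-west-2"], PROBES))
```

Requiring a quorum of external failures is one possible design choice. It trades slower detection for fewer false failovers caused by a single bad vantage point. A game-day can feed simulated probe results like the ones above into the same decision logic to check that failover triggers as expected.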
Why this compounds
Each architecture review that applies the lesson reduces regional-dependency risk. Multi-region designs also help meet enterprise compliance requirements, which can open up regulated markets. The team's resilience muscle grows from "we hope us-east-1 stays up" to deliberate failure-domain design.
- Reduced incident impact. Multi-region designs survive regional outages. Real uptime even when an entire region fails.
- Better incident response. Alternative consoles keep response moving. MTTR drops on the incidents that matter most.
- Compliance readiness. Multi-region matches enterprise requirements. Regulated markets open up.
- Year-one investment, year-two habit. First multi-region service is the investment; subsequent services inherit the patterns.