AWS us-east-1 EBS Stuck Volumes: Postmortem of a Region-Wide Pause
EBS is invisible until it becomes the choke point of the entire cluster. Every postmortem of a ‘us-east-1 EBS event’ reads the same way.
What ‘stuck volumes’ means in practice
EBS control-plane calls (CreateVolume, AttachVolume, DetachVolume, CreateSnapshot) start returning slowly or queuing. Volumes that are already attached stay attached and writable; new volume creates and reattaches stall.
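Client-side timeouts matter here, because the default SDK behavior is to wait a long time on a stalled call. A minimal boto3 sketch, with illustrative (not recommended) timeout and retry values, that makes a degraded control plane fail fast instead of hanging the caller:

```python
# Sketch: bound EBS control-plane calls so a degraded API fails fast.
# Timeout and retry values here are illustrative placeholders.
import boto3
from botocore.config import Config

ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",
    config=Config(
        connect_timeout=5,   # seconds to establish the connection
        read_timeout=15,     # seconds to wait for a response
        retries={"max_attempts": 3, "mode": "standard"},
    ),
)

def attach_with_deadline(volume_id, instance_id, device):
    """Attach a volume; raise quickly if the control plane is stalled."""
    return ec2.attach_volume(
        VolumeId=volume_id, InstanceId=instance_id, Device=device
    )
```

Failing fast does not fix the event; it just keeps your own workers from piling up behind a call that will not return.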
The user-visible result depends on what your fleet does in steady state. If you autoscale frequently, your scale-out is broken. If you cycle nodes for upgrades, your upgrades stall. If you rely on snapshot-based backups, you miss your backup window.
The cascade through autoscaling
- Autoscaling groups try to add capacity in response to demand. The new instances need EBS volumes that cannot be created or attached. The ASG retries, backs off, eventually marks the instances as failed, and launches replacements. None come up.
- Within 20 minutes, services that depend on burst capacity start tipping over. The original incident is ‘EBS slow’; the impact is ‘half the platform is overloaded.’ You can catch this phase early in the ASG activity log, as in the sketch after this list.
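A detection sketch, assuming a single ASG name and a made-up failure threshold, that counts recent failed scaling activities via boto3:

```python
# Sketch: spot a stuck scale-out by counting failed ASG activities.
# The ASG name and the failure threshold are placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

def scale_out_is_stuck(asg_name, failure_threshold=3):
    resp = asg.describe_scaling_activities(
        AutoScalingGroupName=asg_name, MaxRecords=20
    )
    failed = [
        a for a in resp["Activities"]
        if a["StatusCode"] in ("Failed", "Cancelled")
    ]
    return len(failed) >= failure_threshold

if scale_out_is_stuck("web-tier-asg"):
    print("recent launches are failing; suspect the EBS control plane")
```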
Snapshot pipelines as the second-order victim
Snapshot pipelines often run as a job that does ‘take snapshot, wait for completion, prune old.’ When CreateSnapshot returns slowly, the job times out; the prune step assumes the snapshot never completed; the next run is skipped. The result is a backup gap. A hardened version of this loop is sketched below.
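The fix is to make the job verify the snapshot’s actual state instead of inferring it from its own timeout. A boto3 sketch of that shape, where the polling budget and intervals are placeholders:

```python
# Sketch: take the snapshot, then verify its real state before pruning.
# A client-side timeout means "unknown", never "failed".
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def snapshot_then_prune_safely(volume_id, budget_s=600):
    snap_id = ec2.create_snapshot(
        VolumeId=volume_id, Description="nightly"
    )["SnapshotId"]
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        snap = ec2.describe_snapshots(SnapshotIds=[snap_id])["Snapshots"][0]
        if snap["State"] == "completed":
            return snap_id   # verified: now it is safe to prune older snapshots
        if snap["State"] == "error":
            raise RuntimeError(f"{snap_id} genuinely failed")  # alert; do not prune
        time.sleep(30)       # still pending; a slow API is not a failure
    # Budget exhausted: the snapshot may still complete later. Record
    # snap_id, re-check on the next run, and skip pruning for now.
    return None
```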
Compounding the damage: many platforms’ cross-region snapshot replication runs from us-east-1. If the source region is degraded, the cross-region copy never starts, and the DR posture of every other region weakens. A pull-based alternative is sketched below.
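One structural fix is to drive replication from the destination region. CopySnapshot is invoked against the destination region’s endpoint anyway, so the scheduler can live there too. A sketch with hypothetical region names:

```python
# Sketch: pull-based cross-region copy, driven from the DR region.
# If us-east-1's own schedulers are degraded, this job still starts.
import boto3

# The client lives in the destination region; CopySnapshot runs there.
dr_ec2 = boto3.client("ec2", region_name="us-west-2")

def pull_copy(source_snapshot_id):
    resp = dr_ec2.copy_snapshot(
        SourceRegion="us-east-1",
        SourceSnapshotId=source_snapshot_id,
        Description=f"DR copy of {source_snapshot_id}",
    )
    return resp["SnapshotId"]  # the new snapshot's ID in us-west-2
```

The design point is blast-radius separation: the job that protects you from a us-east-1 event should not itself depend on us-east-1 to start.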
Region-risk patterns that survive the next event
- Multi-region for at least one capacity-critical service. Not the whole platform, but enough that ‘us-east-1 is degraded’ is not the same as ‘we are down.’
- Snapshot pipelines that handle a slow API gracefully. Don’t assume completion; verify, as in the sketch above.
- Capacity headroom that does not require launching new instances. The pods you already have keep running. A rough way to measure that headroom is sketched after this list.
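Headroom is only real if you measure it. A rough sketch, assuming a Kubernetes fleet, the official `kubernetes` Python client, and an existing kubeconfig; the 30% floor is an arbitrary example threshold:

```python
# Sketch: rough cluster CPU headroom = allocatable - requested.
# Assumes the official `kubernetes` Python client; the 0.30 floor
# is an arbitrary example threshold, not a recommendation.
from kubernetes import client, config

def cpu_cores(quantity):
    """Parse a Kubernetes CPU quantity like '250m' or '2'."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)

config.load_kube_config()
v1 = client.CoreV1Api()

allocatable = sum(
    cpu_cores(n.status.allocatable["cpu"]) for n in v1.list_node().items
)
requested = 0.0
pods = v1.list_pod_for_all_namespaces(field_selector="status.phase=Running")
for pod in pods.items:
    for c in pod.spec.containers:
        req = (c.resources.requests or {}).get("cpu")
        if req:
            requested += cpu_cores(req)

headroom = (allocatable - requested) / allocatable
if headroom < 0.30:
    print(f"only {headroom:.0%} CPU headroom; a scale-out freeze will hurt")
```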
Antipatterns
- Single-region everything. Cheap until it isn’t.
- Backup verification by job-success-flag only. The flag lies during these events.
What to do this week
Three moves:
- Identify the one most-critical service that should survive a us-east-1 EBS event, and build a multi-region story for it.
- Add backup-completeness verification (read back snapshot metadata) to your nightly job; a sketch follows.
- Tabletop the EBS-degraded scenario at your next incident-response drill.
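For the second move, a sketch of what ‘read back snapshot metadata’ can look like: for every volume tagged for backup (the `backup=true` tag key is hypothetical), require a completed snapshot newer than the backup interval.

```python
# Sketch: verify backups by reading snapshot metadata back, not by
# trusting the job's exit code. The backup=true tag is hypothetical.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
MAX_AGE = timedelta(hours=26)  # nightly job plus slack; adjust to taste

def unprotected_volumes():
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    missing = []
    vols = ec2.describe_volumes(
        Filters=[{"Name": "tag:backup", "Values": ["true"]}]
    )["Volumes"]
    for vol in vols:
        snaps = ec2.describe_snapshots(
            OwnerIds=["self"],
            Filters=[
                {"Name": "volume-id", "Values": [vol["VolumeId"]]},
                {"Name": "status", "Values": ["completed"]},
            ],
        )["Snapshots"]
        if not any(s["StartTime"] >= cutoff for s in snaps):
            missing.append(vol["VolumeId"])
    return missing  # page on anything in this list

print(unprotected_volumes())
```

Run it as a separate job from the one that takes the snapshots, so the two cannot fail together for the same reason.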