Postmortems Intermediate By Samson Tanimawo, PhD Published Dec 14, 2026 10 min read

AWS us-east-1 EBS Stuck Volumes: Postmortem of a Region-Wide Pause

EBS is invisible until it is the choke point of the entire cluster. The postmortem of any ‘us-east-1 EBS event’ reads the same way.

What ‘stuck volumes’ means in practice

EBS control-plane API calls (CreateVolume, AttachVolume, DetachVolume, CreateSnapshot) start returning slowly or queue up. Volumes that are already attached stay attached and writable; new volume creations and reattachments stall.

The user-visible result depends on what your fleet does in steady state. If you autoscale frequently, your scale-out is broken. If you cycle nodes for upgrades, your upgrades stall. If you rely on snapshot-based backups, your backup window misses.
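One way to make that steady-state dependency visible is to time the EBS calls your own tooling already makes and treat sustained slowness as a signal to pause scale-out or node cycling. The sketch below is a minimal illustration, not an AWS API: the class name `ControlPlaneHealth`, the window size, and the 10-second threshold are all hypothetical and would need tuning against your own baseline.

```python
import math
from collections import deque


class ControlPlaneHealth:
    """Rolling view of EBS API call latency (hypothetical helper)."""

    def __init__(self, window=50, slow_seconds=10.0, min_samples=5):
        self.samples = deque(maxlen=window)   # keeps only the newest calls
        self.slow_seconds = slow_seconds      # assumed threshold, tune it
        self.min_samples = min_samples        # avoid judging on one call

    def record(self, latency_seconds):
        """Record how long one CreateVolume/AttachVolume/etc. call took."""
        self.samples.append(latency_seconds)

    def p95(self):
        """Nearest-rank 95th percentile, or None with too few samples."""
        n = len(self.samples)
        if n < self.min_samples:
            return None
        ordered = sorted(self.samples)
        return ordered[math.ceil(0.95 * n) - 1]

    def degraded(self):
        """True when recent calls are slow enough to pause new work."""
        p = self.p95()
        return p is not None and p > self.slow_seconds
```

An autoscaler or upgrade controller could consult `degraded()` before starting work that needs a new volume or a reattach, and hold that work instead of letting it queue behind a slow API.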

The cascade through autoscaling

Snapshot pipelines as the second-order victim

Snapshot pipelines often run as a job that does ‘take snapshot, wait for completion, prune old.’ When CreateSnapshot returns slowly, the job times out, the prune step assumes the snapshot did not complete, and the next run skips. The result is a backup gap.
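The failure here comes from collapsing "timed out while waiting" into "did not complete". A minimal sketch of a job that keeps those outcomes separate follows. It assumes the caller supplies four hypothetical callables (`create_snapshot`, `get_state`, `prune_old`, `record_pending`) wrapping whatever SDK the pipeline uses, and that `get_state` returns the EC2 snapshot states `pending`, `completed`, or `error`.

```python
import time
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    FAILED = "failed"
    UNKNOWN = "unknown"  # deadline passed while the snapshot was still pending


def wait_for_snapshot(get_state, deadline_seconds, poll_seconds=30.0,
                      clock=time.monotonic, sleep=time.sleep):
    """Poll until the snapshot completes, errors, or the deadline passes."""
    start = clock()
    while True:
        state = get_state()
        if state == "completed":
            return Outcome.COMPLETED
        if state == "error":
            return Outcome.FAILED
        if clock() - start >= deadline_seconds:
            return Outcome.UNKNOWN  # not a failure: we simply do not know yet
        sleep(poll_seconds)


def run_backup(create_snapshot, get_state, prune_old, record_pending,
               deadline_seconds, **wait_kwargs):
    """Take a snapshot; prune only on confirmed completion."""
    snapshot_id = create_snapshot()
    outcome = wait_for_snapshot(lambda: get_state(snapshot_id),
                                deadline_seconds, **wait_kwargs)
    if outcome is Outcome.COMPLETED:
        prune_old(keep=snapshot_id)
    elif outcome is Outcome.UNKNOWN:
        # Hand the ID to the next run so it re-checks instead of skipping.
        record_pending(snapshot_id)
    return snapshot_id, outcome
```

The design choice is that a timeout never triggers pruning and never counts as a failed backup; the snapshot ID is carried forward so a later run can confirm whether it eventually completed.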

This compounds: many platforms run their cross-region snapshot replication from us-east-1. If the source region is degraded, the cross-region copy never starts, and the DR posture for other regions weakens.

Region-risk patterns that survive the next event

Antipatterns

What to do this week

Three moves. (1) Identify the one most-critical service that should survive a us-east-1 EBS event; build a multi-region story for it. (2) Add a backup-completeness verification (read back snapshot metadata) to your nightly job. (3) Tabletop the EBS-degraded scenario at your next incident-response drill.
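For move (2), a backup-completeness check can be as small as reading back snapshot metadata and flagging any volume without a recent completed snapshot. The sketch below assumes the metadata has already been fetched into dicts carrying the `VolumeId`, `State`, and `StartTime` fields that EC2 DescribeSnapshots returns; the function name `find_backup_gaps` and the 26-hour freshness window are hypothetical choices for a nightly job.

```python
from datetime import datetime, timedelta, timezone


def find_backup_gaps(snapshots, volume_ids, now, max_age=timedelta(hours=26)):
    """Return {volume_id: reason} for volumes lacking a fresh completed snapshot."""
    newest = {}
    for snap in snapshots:
        # Pending or errored snapshots do not count as a backup.
        if snap.get("State") != "completed":
            continue
        vol = snap["VolumeId"]
        started = snap["StartTime"]
        if vol not in newest or started > newest[vol]:
            newest[vol] = started

    gaps = {}
    for vol in volume_ids:
        last = newest.get(vol)
        if last is None:
            gaps[vol] = "no completed snapshot found"
        elif now - last > max_age:
            gaps[vol] = f"newest completed snapshot is {now - last} old"
    return gaps


if __name__ == "__main__":
    now = datetime(2026, 12, 14, 6, 0, tzinfo=timezone.utc)
    snaps = [
        {"VolumeId": "vol-a", "State": "completed",
         "StartTime": now - timedelta(hours=5)},
        {"VolumeId": "vol-b", "State": "pending",
         "StartTime": now - timedelta(hours=5)},
    ]
    # vol-b is reported: its only snapshot never reached "completed".
    print(find_backup_gaps(snaps, ["vol-a", "vol-b"], now))
```

Running this check after the nightly job, and alerting on a non-empty result, turns a silent backup gap into a visible one.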