AWS us-east-1 EBS Stuck Volumes: Postmortem of a Region-Wide Pause
EBS is invisible until it becomes the choke point of the entire cluster. Every postmortem of a ‘us-east-1 EBS event’ reads the same way.
What ‘stuck volumes’ means in practice
EBS control-plane calls (CreateVolume, AttachVolume, DetachVolume, CreateSnapshot) start returning slowly or queuing. Volumes that are already attached stay attached and writable; new volume creates and reattaches stall.
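Client-side timeouts matter here, because the default SDK behavior is to wait a long time on a stalled call. A minimal boto3 sketch, with illustrative (not recommended) timeout and retry values, that makes a degraded control plane fail fast instead of hanging the caller:

```python
# Sketch: bound EBS control-plane calls so a degraded API fails fast.
# Timeout and retry values here are illustrative placeholders.
import boto3
from botocore.config import Config

ec2 = boto3.client(
    "ec2",
    region_name="us-east-1",
    config=Config(
        connect_timeout=5,   # seconds to establish the connection
        read_timeout=15,     # seconds to wait for a response
        retries={"max_attempts": 3, "mode": "standard"},
    ),
)

def attach_with_deadline(volume_id, instance_id, device):
    """Attach a volume; raise quickly if the control plane is stalled."""
    return ec2.attach_volume(
        VolumeId=volume_id, InstanceId=instance_id, Device=device
    )
```

Failing fast does not fix the event; it just keeps your own workers from piling up behind a call that will not return.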
The user-visible result depends on what your fleet does in steady state. If you autoscale frequently, your scale-out is broken. If you cycle nodes for upgrades, your upgrades stall. If you rely on snapshot-based backups, you miss your backup window.
The cascade through autoscaling
- Autoscaling groups try to add capacity in response to demand. The new instances need EBS volumes that cannot be created or attached. The ASG retries, backs off, eventually marks the instances as failed, and launches replacements. None come up.
- Within 20 minutes, services that depend on burst capacity start tipping over. The original incident is ‘EBS slow’; the impact is ‘half the platform is overloaded.’ You can catch this phase early in the ASG activity log, as in the sketch after this list.
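A detection sketch, assuming a single ASG name and a made-up failure threshold, that counts recent failed scaling activities via boto3:

```python
# Sketch: spot a stuck scale-out by counting failed ASG activities.
# The ASG name and the failure threshold are placeholders.
import boto3

asg = boto3.client("autoscaling", region_name="us-east-1")

def scale_out_is_stuck(asg_name, failure_threshold=3):
    resp = asg.describe_scaling_activities(
        AutoScalingGroupName=asg_name, MaxRecords=20
    )
    failed = [
        a for a in resp["Activities"]
        if a["StatusCode"] in ("Failed", "Cancelled")
    ]
    return len(failed) >= failure_threshold

if scale_out_is_stuck("web-tier-asg"):
    print("recent launches are failing; suspect the EBS control plane")
```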
Snapshot pipelines as the second-order victim
Snapshot pipelines often run as a job that does ‘take snapshot, wait for completion, prune old.’ When CreateSnapshot returns slowly, the job times out; the prune step assumes the snapshot never completed; the next run is skipped. The result is a backup gap. A hardened version of this loop is sketched below.
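The fix is to make the job verify the snapshot’s actual state instead of inferring it from its own timeout. A boto3 sketch of that shape, where the polling budget and intervals are placeholders:

```python
# Sketch: take the snapshot, then verify its real state before pruning.
# A client-side timeout means "unknown", never "failed".
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

def snapshot_then_prune_safely(volume_id, budget_s=600):
    snap_id = ec2.create_snapshot(
        VolumeId=volume_id, Description="nightly"
    )["SnapshotId"]
    deadline = time.monotonic() + budget_s
    while time.monotonic() < deadline:
        snap = ec2.describe_snapshots(SnapshotIds=[snap_id])["Snapshots"][0]
        if snap["State"] == "completed":
            return snap_id   # verified: now it is safe to prune older snapshots
        if snap["State"] == "error":
            raise RuntimeError(f"{snap_id} genuinely failed")  # alert; do not prune
        time.sleep(30)       # still pending; a slow API is not a failure
    # Budget exhausted: the snapshot may still complete later. Record
    # snap_id, re-check on the next run, and skip pruning for now.
    return None
```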
Compounding the damage: many platforms’ cross-region snapshot replication runs from us-east-1. If the source region is degraded, the cross-region copy never starts, and the DR posture of every other region weakens. A pull-based alternative is sketched below.
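One structural fix is to drive replication from the destination region. CopySnapshot is invoked against the destination region’s endpoint anyway, so the scheduler can live there too. A sketch with hypothetical region names:

```python
# Sketch: pull-based cross-region copy, driven from the DR region.
# If us-east-1's own schedulers are degraded, this job still starts.
import boto3

# The client lives in the destination region; CopySnapshot runs there.
dr_ec2 = boto3.client("ec2", region_name="us-west-2")

def pull_copy(source_snapshot_id):
    resp = dr_ec2.copy_snapshot(
        SourceRegion="us-east-1",
        SourceSnapshotId=source_snapshot_id,
        Description=f"DR copy of {source_snapshot_id}",
    )
    return resp["SnapshotId"]  # the new snapshot's ID in us-west-2
```

The design point is blast-radius separation: the job that protects you from a us-east-1 event should not itself depend on us-east-1 to start.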
Region-risk patterns that survive the next event
- Multi-region for at least one capacity-critical service. Not the whole platform, but enough that ‘us-east-1 is degraded’ is not the same as ‘we are down.’
- Snapshot pipelines that handle a slow API gracefully. Don’t assume completion; verify, as in the sketch above.
- Capacity headroom that does not require launching new instances. The pods you already have keep running. A rough way to measure that headroom is sketched after this list.
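Headroom is only real if you measure it. A rough sketch, assuming a Kubernetes fleet, the official `kubernetes` Python client, and an existing kubeconfig; the 30% floor is an arbitrary example threshold:

```python
# Sketch: rough cluster CPU headroom = allocatable - requested.
# Assumes the official `kubernetes` Python client; the 0.30 floor
# is an arbitrary example threshold, not a recommendation.
from kubernetes import client, config

def cpu_cores(quantity):
    """Parse a Kubernetes CPU quantity like '250m' or '2'."""
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)

config.load_kube_config()
v1 = client.CoreV1Api()

allocatable = sum(
    cpu_cores(n.status.allocatable["cpu"]) for n in v1.list_node().items
)
requested = 0.0
pods = v1.list_pod_for_all_namespaces(field_selector="status.phase=Running")
for pod in pods.items:
    for c in pod.spec.containers:
        req = (c.resources.requests or {}).get("cpu")
        if req:
            requested += cpu_cores(req)

headroom = (allocatable - requested) / allocatable
if headroom < 0.30:
    print(f"only {headroom:.0%} CPU headroom; a scale-out freeze will hurt")
```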
Antipatterns
- Single-region everything. Cheap until it isn’t.
- Backup verification by job-success-flag only. The flag lies during these events.
What to do this week
Three moves:
- Identify the one most-critical service that should survive a us-east-1 EBS event, and build a multi-region story for it.
- Add backup-completeness verification (read back snapshot metadata) to your nightly job; a sketch follows.
- Tabletop the EBS-degraded scenario at your next incident-response drill.
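For the second move, a sketch of what ‘read back snapshot metadata’ can look like: for every volume tagged for backup (the `backup=true` tag key is hypothetical), require a completed snapshot newer than the backup interval.

```python
# Sketch: verify backups by reading snapshot metadata back, not by
# trusting the job's exit code. The backup=true tag is hypothetical.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
MAX_AGE = timedelta(hours=26)  # nightly job plus slack; adjust to taste

def unprotected_volumes():
    cutoff = datetime.now(timezone.utc) - MAX_AGE
    missing = []
    vols = ec2.describe_volumes(
        Filters=[{"Name": "tag:backup", "Values": ["true"]}]
    )["Volumes"]
    for vol in vols:
        snaps = ec2.describe_snapshots(
            OwnerIds=["self"],
            Filters=[
                {"Name": "volume-id", "Values": [vol["VolumeId"]]},
                {"Name": "status", "Values": ["completed"]},
            ],
        )["Snapshots"]
        if not any(s["StartTime"] >= cutoff for s in snaps):
            missing.append(vol["VolumeId"])
    return missing  # page on anything in this list

print(unprotected_volumes())
```

Run it as a separate job from the one that takes the snapshots, so the two cannot fail together for the same reason.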