
DynamoDB Throttling Cascade: A Postmortem Pattern

The cascade is well-documented and avoidable. Most teams discover it after the incident, not before.

The hot partition

DynamoDB distributes items across partitions by hashing the partition key, and each partition has a fixed ceiling (roughly 3,000 read capacity units and 1,000 write capacity units) no matter what the table is provisioned for. If one key gets 10x the traffic of the others (a ‘hot partition’), that single partition’s ceiling is the bottleneck: throttling fires for that key while the rest of the table looks healthy.

Common causes: a feature flag enabled for a single big customer, a celebrity user’s row, a poorly distributed key (keying on customer ID when 1% of customers drive 90% of the traffic).
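
If one tenant dominates traffic, the usual fix is write sharding: append a small suffix to the hot partition key so that tenant’s items spread across several partitions, then fan reads out across the suffixes and merge. A minimal sketch; the shard count, key format, and helper names here are illustrative, not a library API.

```python
import random

NUM_SHARDS = 10  # illustrative: spreads one hot tenant across 10 partitions

def sharded_pk(tenant_id: str) -> str:
    """Partition key for a write: tenant id plus a random shard suffix."""
    return f"{tenant_id}#{random.randrange(NUM_SHARDS)}"

def all_shard_pks(tenant_id: str) -> list[str]:
    """Every partition key to query when reading the tenant's data back."""
    return [f"{tenant_id}#{shard}" for shard in range(NUM_SHARDS)]

print(sharded_pk("tenant-42"))     # e.g. "tenant-42#7"
print(all_shard_pks("tenant-42"))  # ["tenant-42#0", ..., "tenant-42#9"]
```

The trade is explicit: write throughput scales by NUM_SHARDS, reads cost NUM_SHARDS queries. That is usually a good deal for the handful of tenants that need it.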

The retry storm

A throttled request fails fast, and the default reaction at every layer is to retry. Each retry targets the same hot key, so it lands on exactly the partition that has no capacity to spare. SDK retries stack with application retries, one user request fans out into several backend calls, and offered load on the hot partition rises just as its effective capacity collapses.
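
The client-side mitigation is to bound the retry budget and randomize the backoff so callers don’t synchronize. A minimal sketch of capped retries with full jitter around a DynamoDB read; the table and key are placeholders, and real code would usually lean on botocore’s built-in retry modes instead.

```python
import random
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb")

MAX_ATTEMPTS = 3     # hard cap: a storm needs unbounded retries to feed it
BASE_DELAY_S = 0.05  # first backoff window
MAX_DELAY_S = 1.0    # ceiling keeps tail latency bounded

def get_item_with_backoff(table: str, key: dict) -> dict:
    """GetItem that retries only throttles, with a cap and full jitter."""
    for attempt in range(MAX_ATTEMPTS):
        try:
            return dynamodb.get_item(TableName=table, Key=key)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            if code != "ProvisionedThroughputExceededException":
                raise  # non-throttle errors surface immediately
            if attempt == MAX_ATTEMPTS - 1:
                raise  # budget spent: fail fast rather than feed the storm
            # Full jitter: sleep a uniform random slice of the backoff window.
            window = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
            time.sleep(random.uniform(0, window))
```

In practice you would configure this once at client construction, e.g. boto3.client("dynamodb", config=Config(retries={"max_attempts": 3, "mode": "standard"})); the hand-rolled loop just makes the cap and the jitter visible.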

How the cascade widens

Adjacent services that depend on the throttled one start timing out. Their clients retry. Now the cascade has spread. By minute four, half your service mesh is in retry-storm against half-broken backends.

The fingerprint: an error-rate spike, a sharp rise in retry-related metrics, and DynamoDB throttle metrics (ReadThrottleEvents / WriteThrottleEvents) concentrated on one or two partition keys. If you have all three, this is the pattern.
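
To check the third signal, pull the table’s throttle metrics from CloudWatch. A sketch against the real AWS/DynamoDB namespace; the table name and window are placeholders, and attributing throttles to specific partition keys additionally requires Contributor Insights enabled on the table.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def throttle_events(table: str, minutes: int = 15) -> dict:
    """Sum read/write throttle events for one table over a recent window."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    totals = {}
    for metric in ("ReadThrottleEvents", "WriteThrottleEvents"):
        resp = cloudwatch.get_metric_statistics(
            Namespace="AWS/DynamoDB",
            MetricName=metric,
            Dimensions=[{"Name": "TableName", "Value": table}],
            StartTime=start,
            EndTime=end,
            Period=60,           # one-minute buckets
            Statistics=["Sum"],
        )
        totals[metric] = sum(p["Sum"] for p in resp["Datapoints"])
    return totals

print(throttle_events("orders"))  # e.g. {'ReadThrottleEvents': 1840.0, ...}
```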

Three guardrails that stop it

(1) Spread the keyspace: shard hot partition keys (as in the sketch above) so no single tenant maps to a single partition. (2) Bound the retries: cap attempts, back off exponentially with full jitter, and make sure only one layer of the stack retries. (3) Watch the partition, not just the table: enable CloudWatch Contributor Insights so hot keys are visible, and alert on throttle events so the hot partition shows up before the cascade does.

Antipatterns

Retrying throttles immediately and without a cap. Stacking retries at every layer, so SDK defaults, application loops, and mesh policies turn one request into dozens. And the reflexive one: raising provisioned capacity. A single partition tops out around 3,000 RCU and 1,000 WCU regardless of what the table is provisioned for, so adding table capacity does not fix a hot key.

What to do this week

Three moves. (1) Add a per-partition throttle alert to your most-trafficked DynamoDB tables. (2) Audit retry config across services; cap retries at 3 with jitter. (3) Identify your top-3 hot tenants; design a sharding scheme for them.
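
For move (1), the alert can be a CloudWatch alarm on the table’s throttle metric. A sketch with placeholder table name, threshold, and SNS topic; the table-level metric tells you throttling is happening, while pinning it to a partition key still takes Contributor Insights.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="orders-read-throttle-events",       # placeholder name
    Namespace="AWS/DynamoDB",
    MetricName="ReadThrottleEvents",
    Dimensions=[{"Name": "TableName", "Value": "orders"}],
    Statistic="Sum",
    Period=60,                        # one-minute buckets
    EvaluationPeriods=3,              # three consecutive bad minutes
    Threshold=100.0,                  # tune to the table's normal noise floor
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",  # no throttle datapoints = healthy
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:oncall"],  # placeholder
)
```

Repeat for WriteThrottleEvents and for each GSI that carries real traffic.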