DynamoDB Throttling Cascade: A Postmortem Pattern
The cascade is well-documented and avoidable. Most teams discover it after the incident, not before.
The hot partition
DynamoDB partitions data by hash key. If one hash key gets 10x the traffic of others (a ‘hot partition’), that single partition’s capacity is the bottleneck. Throttling fires for that key while the rest of the table looks healthy.
Common causes: a feature flag enabled for a single big customer, a celebrity user’s row, a poorly distributed key (using customer-id when 1% of customers are 90% of traffic).
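To see how badly a skewed key space concentrates load, here is a small simulation. The partition count, customer ids, and traffic split are illustrative, and the hash is a stand-in for DynamoDB's internal partitioning, not the real thing:

```python
import hashlib
from collections import Counter

PARTITIONS = 10

def partition_for(key: str) -> int:
    # Stand-in for DynamoDB's internal hash: md5 modulo partition count.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % PARTITIONS

# One "whale" customer sends 9000 requests; 99 others send ~10 each,
# mirroring the "1% of customers are 90% of traffic" case above.
requests = ["customer-0"] * 9000
for i in range(1, 100):
    requests += [f"customer-{i}"] * 10

load = Counter(partition_for(k) for k in requests)
hottest = max(load.values())
print(f"hottest partition handles {hottest / len(requests):.0%} of traffic")
```

One partition absorbs roughly the whale's entire volume while the other nine sit nearly idle, which is exactly the table-looks-healthy, one-partition-throttles picture described above.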
The retry storm
- Throttled requests retry. Most clients back off exponentially, but with a cap that is too low. After throttling, the client retries within seconds; the partition is still over capacity; throttling continues; clients retry harder.
- The amplification is multiplicative across clients. 100 pods all retrying a throttled request multiply the load on the already-overloaded partition. Latency climbs; throughput falls; users see errors.
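A retry loop that avoids both failure modes above looks something like this sketch: exponential backoff with full jitter and a hard retry limit. The base delay, cap, and limit are illustrative values to tune per service:

```python
import random
import time

BASE_DELAY = 0.1   # seconds; illustrative
MAX_DELAY = 5.0    # the backoff cap; illustrative
MAX_RETRIES = 3

def backoff_delay(attempt: int) -> float:
    # Full jitter: sleep a random amount in [0, min(cap, base * 2^attempt)],
    # so a fleet of clients does not retry in lockstep.
    return random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt))

def call_with_retries(operation):
    for attempt in range(MAX_RETRIES + 1):
        try:
            return operation()
        except Exception:
            if attempt == MAX_RETRIES:
                raise  # fail fast past the limit instead of amplifying
            time.sleep(backoff_delay(attempt))
```

The jitter de-synchronizes the fleet; the hard limit bounds how much each client can amplify load on an already-throttled partition.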
How the cascade widens
Adjacent services that depend on the throttled one start timing out. Their clients retry. Now the cascade has spread. By minute four, half your service mesh is in retry-storm against half-broken backends.
The fingerprint: error rate spike + dramatic increase in retry-related metrics + DynamoDB throttle metric concentrated on one or two partitions. If you have all three, this is the pattern.
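The three-signal check can be automated. A hypothetical helper, with illustrative thresholds, that takes the error rate, the ratio of retries to first attempts, and per-key throttle counts (e.g. from CloudWatch Contributor Insights for DynamoDB) and decides whether the fingerprint matches:

```python
def matches_cascade_fingerprint(error_rate: float,
                                retry_ratio: float,
                                throttles_by_key: dict) -> bool:
    # Signal 3: throttles concentrated on one or two keys.
    total = sum(throttles_by_key.values()) or 1
    top_two = sum(sorted(throttles_by_key.values(), reverse=True)[:2])
    concentrated = top_two / total >= 0.8  # illustrative threshold

    # Signal 1: elevated errors; signal 2: retries dominating traffic.
    return error_rate > 0.05 and retry_ratio > 2.0 and concentrated
```

If only one or two of the signals fire, you are likely looking at a different failure mode and should keep diagnosing.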
Three guardrails that stop it
- 1. Adaptive capacity (already on by default). But verify it on tables that pre-date the feature.
- 2. Client-side rate limiting before retry. Better to fail fast than to amplify.
- 3. Per-tenant rate limits at the application layer. Stop one customer from poisoning the partition.
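Guardrails 2 and 3 both reduce to admission control in the client. A minimal token-bucket sketch, with illustrative rates, that a caller checks before issuing (or retrying) a request so a throttled dependency fails fast instead of being hammered:

```python
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate              # tokens refilled per second
        self.capacity = capacity      # burst allowance
        self.tokens = capacity
        self.last = time.monotonic()

    def try_acquire(self) -> bool:
        # Refill proportionally to elapsed time, then spend one token if
        # available; callers that get False should fail fast, not retry.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

For guardrail 3, keep one bucket per tenant id; the hot tenant exhausts its own bucket without starving the partition for everyone else.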
Antipatterns
- Aggressive retry without jitter. Synchronizes the storm.
- Single hash-key per tenant for hot tenants. Pre-shard if one tenant is >5% of traffic.
- No per-partition throttle alerting. Table-level metrics hide the hot spot.
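The pre-sharding fix for hot tenants can be sketched as write sharding: append a random shard suffix to the hot tenant's hash key on writes, and fan reads out across all suffixes. The shard count and tenant ids here are illustrative:

```python
import random

SHARD_COUNT = 8                  # illustrative; size to the tenant's traffic
HOT_TENANTS = {"tenant-big"}     # hypothetical hot-tenant registry

def shard_key(tenant_id: str) -> str:
    # Writes: spread a hot tenant's items across N suffixed hash keys
    # so no single partition absorbs all of its traffic.
    if tenant_id in HOT_TENANTS:
        return f"{tenant_id}#{random.randrange(SHARD_COUNT)}"
    return tenant_id

def all_shard_keys(tenant_id: str) -> list:
    # Reads: query every shard and merge results client-side.
    if tenant_id in HOT_TENANTS:
        return [f"{tenant_id}#{i}" for i in range(SHARD_COUNT)]
    return [tenant_id]
```

The trade-off is the read fan-out: queries for a sharded tenant cost N requests instead of one, which is why this is worth doing only for tenants above the >5% threshold.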
What to do this week
Three moves. (1) Add a per-partition throttle alert to your most-trafficked DynamoDB tables. (2) Audit retry config across services; cap retries at 3 with jitter. (3) Identify your top-3 hot tenants; design a sharding scheme for them.