Real Outage: A DynamoDB Throttling Cascade
A major payments provider lost 47 minutes of authorisation traffic to a single hot-key partition. The auto-scaling alarm lied. The runbook didn’t mention partitions. Here’s the write-up.
Timeline
All times in UTC. Dates anonymised; this is a composite write-up of the patterns we see in DynamoDB hot-partition incidents at scale.
14:02, Marketing pushes a promo email to 4.1M customers. Open rate spikes within 30 seconds; checkout traffic follows.
14:06, Authorisation latency p99 climbs from 38ms to 210ms on the auth_attempts table. Provisioned WCU on the table is 80,000; consumed WCU is 11,200. Auto-scaling sees no reason to act.
14:09, First customer complaint in #payments-support. Auth failure rate is now 4.7%. The on-call dashboard still shows green; the per-table CloudWatch metric is averaged across partitions.
14:14, On-call acks a synthetic-monitor alert. Latency p99 is 890ms. ProvisionedThroughputExceededException rate is 38% on a single partition key prefix used for one promo cohort.
14:23, Engineer manually doubles the table WCU to 160,000. Throttle rate doesn't move. The hot partition is capped at ~1,000 WCU regardless of table-level provisioning.
14:38, Second engineer recognises the partition shape. Ships a code change that prepends a 4-bit shard key to writes for that cohort. Deploy takes 6 minutes.
14:46, Auth failure rate drops below 0.5%. Recovery confirmed at 14:49. Incident closed at 14:53. Total customer-impact window: 47 minutes.
The detection lag
Eight minutes from first customer pain (latency climbing at 14:06) to an acknowledged page at 14:14, and five minutes from the first complaint in #payments-support. That's the number that should embarrass everyone in the room. The alert that finally fired was a synthetic monitor, not the database, not the application, not the load balancer. Everything closer to the actual data path stayed green because the metrics were all averaged.
The deeper failure: the team had a single "DynamoDB latency" SLO computed from per-table p99. A hot partition that spikes one cohort to 2.4 seconds while the other 99.7% of traffic runs at 38ms doesn't move the table-wide p99 at all: the slow 0.3% of requests sits entirely above the 99th percentile, so the percentile never budges. The SLO was satisfied while customers were getting timeouts.
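The arithmetic is worth seeing once. A minimal sketch, using only the numbers from this write-up and the incident's 99.7/0.3 request mix:

// Worked example: 3 slow requests in 1,000 cannot move a table-wide p99.
const latencies = [
  ...Array(997).fill(38),     // healthy cohorts at 38ms
  ...Array(3).fill(2_400),    // the hot cohort timing out at 2.4s
].sort((a, b) => a - b);

const p99 = latencies[Math.floor(latencies.length * 0.99)]; // index 990 of 1,000
console.log(p99); // 38 -- the table-wide SLO stays green throughout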
Lesson: if your DynamoDB table has any concept of “cohort”, “tenant”, or “customer segment” in the partition key, your SLO needs to be computed per-cohort, not per-table. The math is the same; the alerting is what changes.
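One concrete shape that alerting can take: emit each latency sample with a cohort dimension, so every partition-key prefix gets its own p99 series and its own alarm. The sketch below uses the standard AWS SDK for JavaScript v3 CloudWatch client; the namespace, metric, and dimension names are illustrative, not this team's actual telemetry.

import { CloudWatchClient, PutMetricDataCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});

// Record one auth latency sample tagged with its cohort, so each
// partition-key prefix gets its own p99 series instead of vanishing
// into the table-wide average.
async function recordAuthLatency(cohort: string, ms: number): Promise<void> {
  await cw.send(new PutMetricDataCommand({
    Namespace: "Payments/Auth",                        // illustrative namespace
    MetricData: [{
      MetricName: "AuthLatency",
      Dimensions: [{ Name: "Cohort", Value: cohort }], // e.g. "promo_a"
      Unit: "Milliseconds",
      Value: ms,
    }],
  }));
}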
The cascade
The throttled writes didn’t fail cleanly. They retried with the AWS SDK’s default exponential backoff, which on a busy auth path meant the calling service started piling up in-flight requests. The connection pool to DynamoDB filled at 14:11. By 14:14 the upstream HTTP server was shedding load with 503s, not because anything was wrong with that server, but because every worker thread was blocked in an SDK retry loop.
The downstream services (fraud-scoring, ledger writes) saw the 503 wave and started their own retries. By 14:18 we had three independent retry storms: SDK against DynamoDB, HTTP against the auth service, and the fraud team’s circuit breaker flapping open and closed every 12 seconds. Effective throughput dropped to ~30% of nominal.
The hot partition took a single promo email to create. The cascade was three layers of well-meaning retry logic compounding.
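The SDK layer of that stack is the easiest one to bound. A minimal sketch of a client with capped attempts and tight timeouts, using AWS SDK for JavaScript v3; the specific values are illustrative, not the team's production settings:

import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { NodeHttpHandler } from "@smithy/node-http-handler";

// Fewer attempts and tight timeouts keep a throttled partition from
// pinning every worker thread in the SDK's backoff loop.
const ddb = new DynamoDBClient({
  maxAttempts: 2,                  // one request plus one retry (SDK default is 3)
  requestHandler: new NodeHttpHandler({
    connectionTimeout: 250,        // ms to establish a connection
    requestTimeout: 1_000,         // ms budget for the whole request
  }),
});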
What the runbook said
The DynamoDB runbook was four years old and had three steps:
- If latency is high, increase provisioned capacity.
- If errors persist, switch to on-demand mode.
- If still bad, page the database team lead.
None of those steps mentioned partitions. The first one was the trap: an engineer doubled the WCU at 14:23 and watched nothing happen. The runbook gave no indication that table-level provisioning is meaningless when a single partition is saturated. The on-demand switch (step 2) would have made it worse: on-demand tables partition writes the same way, and switching modes under load takes minutes.
Step 3 was actually the correct move, and it came 15 minutes later than it should have (14:23 to 14:38) because steps 1 and 2 looked authoritative.
What actually fixed it
Write sharding. The cohort that hit the hot partition was identified by a single string prefix: call it promo_a/. Every write for that cohort hashed to one partition because DynamoDB hashes the entire partition key value to pick a partition. The fix was to rewrite that one code path:
// before: every write for the cohort lands on one partition
pk = "promo_a/" + customer_id

// after: a 16-way shard prefix spreads the cohort across partitions
shard = hash(customer_id) % 16   // any stable hash; it must never change once data is written
pk = "promo_a/" + shard + "/" + customer_id
16 logical shards spread the cohort across up to 16 physical partitions, raising the cohort's write ceiling by up to 16×. The change was eight lines including the read path. It took 6 minutes to deploy because the team had a working CI/CD pipeline; without one, that single deploy is a 45-minute step at minimum.
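The snippet above leaves hash unspecified, and the choice matters more than it looks: the shard for a given customer must be stable across deploys and languages, or reads will look in the wrong partition. A self-contained sketch in TypeScript; FNV-1a and the helper names are assumptions for illustration, not the team's actual code:

// Stable 32-bit FNV-1a hash. The shard derived from it must never change
// once data is written, or reads will miss earlier writes.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

const SHARDS = 16;

// buildKey is a hypothetical helper: writes and reads both go through it,
// so point lookups recompute the same shard and stay single-partition.
function buildKey(cohort: string, customerId: string): string {
  const shard = fnv1a(customerId) % SHARDS;
  return `${cohort}/${shard}/${customerId}`;
}

// buildKey("promo_a", "c-1042") -> e.g. "promo_a/7/c-1042", same value every call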
Action items
Four came out of the postmortem and all four shipped within two weeks:
- Per-partition metrics via CloudWatch Contributor Insights for DynamoDB. The alarm now fires on partition-key heat, not table-average WCU. Detection lag in a follow-up game-day was 90 seconds.
- SLO disaggregation by cohort. Every customer-segment partition prefix gets its own latency SLO. Hot-cohort incidents now show as red on the dashboard within one minute.
- Runbook rewrite. The first step in the new runbook is "identify the partition key shape and check Contributor Insights for skew". Doubling capacity is no longer step 1.
- SDK retry budget. The auth service now caps in-flight retries at 200 per worker; anything over the cap fails fast while the circuit breaker is open (a sketch follows this list). The cascade pattern can't happen the same way again.
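A minimal sketch of what a per-worker retry budget can look like; the cap, the counter, and the error type are illustrative names, not the team's actual code:

const MAX_INFLIGHT_RETRIES = 200;   // per worker, matching the action item
let inflightRetries = 0;

class RetryBudgetExceeded extends Error {}

// Wrap a call so it gets at most one retry, and only while the worker's
// retry budget has room. Shedding one request beats blocking a worker
// thread in a backoff loop.
async function withRetryBudget<T>(attempt: () => Promise<T>): Promise<T> {
  try {
    return await attempt();
  } catch {
    if (inflightRetries >= MAX_INFLIGHT_RETRIES) {
      throw new RetryBudgetExceeded("retry budget exhausted; failing fast");
    }
    inflightRetries++;
    try {
      return await attempt();       // single bounded retry
    } finally {
      inflightRetries--;
    }
  }
}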
The architectural change
The bigger change took six weeks: every new DynamoDB table at this org now requires a write-sharding review at design time. The review is a one-page doc that asks four questions: what’s the partition key, what’s the cardinality, what’s the worst-case skew, and how will you re-shard if the answer to question three is wrong. No table goes to production without the doc.
It sounds bureaucratic. It is. It’s also the cheapest way to prevent the next 47-minute outage. The hot-partition pattern accounts for ~30% of DynamoDB-related incidents we’ve seen at scale; almost all of them are catchable at design time with five minutes of thinking about the partition shape.
The postmortem closed with a line that ended up on the team wiki: “Provisioned capacity is a table-level lie. Partitions are the unit of throughput. Design accordingly.”