Real Outage: An API Rate-Limit Misconfiguration
A typo in a rate-limit config silently capped paying customers at one request every six seconds for 38 minutes. The internal dashboard stayed green the whole time.
Timeline
Anonymised composite of a rate-limit misconfiguration incident at a major payments API. Times in UTC.
13:17, A platform engineer pushes a config change tightening rate limits for a category of free-tier endpoints. Intent: change the limit from 100 RPS to 10 RPS for that category. Actual change: the YAML key was misnamed and applied the new value to a different category that included paying-customer endpoints.
13:17:30, Config rolls out. The rate-limit service starts enforcing the new value on a category that includes ~14% of paying-customer traffic. But there’s a second bug: the typo also dropped the unit suffix, so the value was parsed as “10 per minute” instead of “10 per second”. Effective limit: one request every six seconds, 60x tighter than the intended 10 RPS.
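To make the failure concrete, here is a minimal sketch of how a permissive parser turns a dropped unit suffix into a 60x tighter limit. The config keys, the value format, and the per-minute default are illustrative assumptions, not the real service’s schema.

```python
import re

# Intended edit: tighten free-tier search from 100/s to 10/s.
# Actual edit: the value landed under a key whose category includes
# paying-customer endpoints, and the "/s" suffix was lost.
config = {
    "free_tier.search": "100/s",   # the key the engineer meant to change
    "standard.payments": "10",     # typo'd key, unit suffix dropped
}

UNIT_SECONDS = {"s": 1, "m": 60}

def parse_limit(value: str) -> float:
    """Return the limit in requests per second. A bare number silently
    falls back to per-minute -- the 60x footgun."""
    match = re.fullmatch(r"(\d+)(?:/([sm]))?", value)
    count, unit = int(match.group(1)), match.group(2) or "m"
    return count / UNIT_SECONDS[unit]

print(parse_limit("10/s"))                       # 10.0 RPS, what was intended
print(parse_limit(config["standard.payments"]))  # ~0.17 RPS, one request every 6 s
```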
13:18, Customer integrations start seeing 429 responses. The 429 includes a Retry-After header so the SDKs back off; from the customer side this looks like “the API is slow”, not “the API is broken”.
13:30, First customer support ticket. Customer reports their checkout integration is timing out. The support engineer triages it as a customer-side rate-limit issue (because, ironically, the 429s look exactly like normal rate-limiting) and asks them to check their request volume.
13:39, Three more support tickets, all from large customers. Pattern recognition kicks in. Support escalates.
13:41, Engineering identifies the rate-limit config push from 13:17. Detection: 24 minutes.
13:55, Config rollback completed and recovery confirmed. Total customer impact: 38 minutes for ~14% of traffic.
The detection lag
23 minutes from the first 429 at 13:18 to engineering identifying the cause at 13:41. The dashboards were entirely green throughout: latency was healthy (the API responded in 4ms with a 429), error rate was low (429 isn’t counted as an error in most internal metrics), and throughput was healthy on a per-server basis (the rate limiter was doing its job).
The signal that should have fired: customer success rate. Not server success rate, not 5xx rate, not latency: the percentage of customer requests that resulted in a useful response. That number dropped by ~14 points at 13:17 and stayed down for 38 minutes. No alert fired on it because the team didn’t track it as a metric.
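A minimal sketch of the gap between the two views, assuming a window of (status_code, tier) response records; the shapes and the 86/14 split are illustrative:

```python
def server_error_rate(responses):
    # what the green dashboard watched: only 5xx counts as a failure
    return sum(status >= 500 for status, _ in responses) / len(responses)

def customer_success_rate(responses):
    # what should have paged: did the customer get a useful answer?
    return sum(200 <= status < 300 for status, _ in responses) / len(responses)

# 14% of the window rate-limited, everything else fine
window = [(200, "paid")] * 86 + [(429, "paid")] * 14

print(server_error_rate(window))      # 0.0  -> dashboard stays green
print(customer_success_rate(window))  # 0.86 -> a 14-point drop, page-worthy
```

The action items below turn the second number into a paged SLO.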
The deeper failure: 429s are first-class API responses to a rate-limiter team and second-class to everyone else. Engineering treated 429 as “working as designed”; customers treated it as “the API doesn’t work”. Both views were correct from their own angle, and the org didn’t have a unified definition of “does the API work”.
The cascade
The cascade was customer-visible, not internal. Customer integrations that hit the limit started piling up retries. Some used exponential backoff (good); others retried immediately (bad). The immediate retriers hammered the limiter continuously, got one successful response every six seconds, and had 60+ requests queued at any moment. Their integrations were effectively stalled.
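For contrast, a sketch of what the well-behaved integrations did: exponential backoff that honours Retry-After. The requests-style client and the constants are illustrative, not any customer’s actual SDK.

```python
import random
import time

import requests  # assumption: a requests-style HTTP client is available

def call_with_backoff(url, max_attempts=6):
    """Retry on 429 with exponential backoff, honouring Retry-After.
    Assumes Retry-After carries seconds, not an HTTP date."""
    for attempt in range(max_attempts):
        resp = requests.get(url)
        if resp.status_code != 429:
            return resp
        # prefer the server's hint; otherwise back off with jitter
        delay = float(resp.headers.get("Retry-After",
                                       2 ** attempt + random.random()))
        time.sleep(delay)
    raise RuntimeError("gave up after repeated 429s")
```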
One large customer’s checkout flow timed out at 30 seconds. They process ~400 checkouts per minute, roughly 6.7 requests per second, against an effective limit of one request every six seconds, so their backlog grew at ~6.5 requests per second; checkouts started timing out within the first few minutes. By minute 30, the customer was visibly down to their end users despite the API being technically “up”.
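The backlog arithmetic, as a back-of-the-envelope model; the single shared FIFO queue is an assumption, the rates are the incident’s:

```python
arrival_rps = 400 / 60   # ~6.7 checkout requests per second
service_rps = 10 / 60    # effective limit: one request every 6 seconds
growth_rps = arrival_rps - service_rps

print(f"backlog grows at ~{growth_rps:.1f} req/s")                   # ~6.5 req/s
print(f"backlog after 5 minutes: ~{growth_rps * 300:.0f} requests")  # ~1950
# at ~0.17 req/s of service, even a 5-request backlog is a >30 s wait,
# so the 30-second checkout timeout had no chance once the queue formed
```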
What the runbook said
The rate-limiter runbook covered one specific failure mode: the rate-limiter service itself going down. Step 1: fail over to the backup region. Step 2: page the rate-limiter team. It had nothing for “rate limits are working, but wrong”.
The on-call who eventually picked up the incident bounced through three runbooks (API gateway, rate-limiter, billing-API) before working out what was happening. None of them had a step like “check the most recent rate-limit config push”. The deploy log existed; nothing pointed at it.
What actually fixed it
Roll back the config push from 13:17. The mechanics took 30 seconds. The 38-minute incident was almost entirely about not realising the config push was the cause.
The team also published a customer-facing status update at 13:50 acknowledging “degraded API performance for some endpoints”; it did not mention rate limits. The internal postmortem revisited that decision and concluded the comms should have been more specific: customers were left wondering whether they had done something wrong, and a single sentence saying “our internal rate-limit configuration was incorrect” would have spared them hours of debugging their own systems.
Action items
- Customer-success-rate SLO. New SLO measuring “percentage of customer requests that returned 200”, separated by customer tier. Drops more than 2% in 60 seconds: page. Detection floor for a future event: under 90 seconds.
- Rate-limit config schema enforcement. The YAML schema now requires explicit unit suffixes (10/s, not 10). Configs without units fail CI, which kills the class of bug where a missing unit changes the meaning by 60x. (A CI sketch follows this list.)
- Staged rollout for limit changes. Rate-limit config now rolls out to 1% of traffic for 5 minutes, then 10% for 5 more, then 100%. Bad config gets caught at 1% with minimal impact. (Also sketched below.)
- Recent-config link in runbooks. Every API-side runbook now links to the deploy log filtered to the past hour. The on-call sees recent changes immediately.
- Customer-facing status comms template. New template for “internal config issue affecting customer integrations”. Specific enough to stop customers debugging their own systems.
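A sketch of what the unit-suffix CI gate could look like; the limits key, the accepted units, and the file layout are assumptions, since the postmortem doesn’t show the real schema tooling.

```python
# ci_check_limits.py -- fail the build if any rate-limit value lacks a unit
import re
import sys

import yaml  # assumes PyYAML is available in the CI image

LIMIT_RE = re.compile(r"\d+/(?:s|m|h)")  # 10/s, 600/m, ... unit is mandatory

def check(path: str) -> bool:
    """Return True if the file has offending limits, printing each one."""
    doc = yaml.safe_load(open(path)) or {}
    bad = [key for key, value in doc.get("limits", {}).items()
           if not LIMIT_RE.fullmatch(str(value))]
    for key in bad:
        print(f"{path}: '{key}' needs an explicit unit, e.g. 10/s, not 10")
    return bool(bad)

if __name__ == "__main__":
    # list comprehension (not a generator) so every file gets reported
    sys.exit(1 if any([check(p) for p in sys.argv[1:]]) else 0)
```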
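And a minimal sketch of the staged-rollout gate, assuming a stable per-customer hash decides who sees the new config; the stage fractions and dwell times come from the action item, the rest is illustrative.

```python
import hashlib

# (fraction of traffic, dwell time in seconds), from the action item above
STAGES = [(0.01, 300), (0.10, 300), (1.00, 0)]

def bucket(customer_id: str) -> float:
    """Stable value in [0, 1) so the same customers stay in each stage."""
    digest = hashlib.sha256(customer_id.encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000

def sees_new_config(customer_id: str, seconds_since_push: float) -> bool:
    elapsed = 0.0
    for fraction, dwell in STAGES:
        if fraction == 1.0 or seconds_since_push < elapsed + dwell:
            return bucket(customer_id) < fraction
        elapsed += dwell
    return True
```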
The architectural change
The architectural answer was: rate limits are part of the product surface, not an implementation detail. Any change to a rate limit on a customer-facing endpoint now requires a product-management sign-off and a customer-comms plan, even if the change is “just” an internal config push. The platform-engineering team can’t ship the change unilaterally.
The pushback was “this slows us down for routine quota tuning”. The response: routine quota tuning broke the API for paying customers for 38 minutes. The quota changes that need to be fast are the ones that loosen limits during incidents, and those have a separate path.
The deeper lesson, written into the team’s API-design doc: “A 429 is a customer-visible response. It belongs in the SLO conversation, not the rate-limiter conversation.” The internal/external boundary on response codes was wrong; this incident drew it correctly.