API Rate Limiting Patterns
The patterns for rate limiting APIs: tier the limits, enforce with a token bucket, monitor the results.
Tiered
Rate limiting protects the API from being overrun by a single consumer at the expense of everyone else. The simplest implementation is one limit for everyone, but that produces a worst-of-both-worlds outcome: aggressive enough to throttle paying customers, lenient enough that abusers can still cause damage. The right model is limits keyed to the consumer's tier.
What tiered rate limiting looks like:
- Per-consumer tiers: Free-tier customers, paid-tier customers, internal services, and trusted partners each get a different limit. Free might be 100 requests per minute; paid might be 1,000; partner integrations might have explicit per-contract limits. The tier is identified from the auth token at the gateway.
- Different rates by use case: A single account might have multiple limits applied depending on the operation. Reads are cheaper than writes. Bulk endpoints have lower limits than transactional ones. Discovery endpoints are unlimited; mutation endpoints are tightly capped. Each route has a per-tier limit.
- Standard pattern, well understood: Most API gateways (Kong, Apigee, AWS API Gateway, Cloudflare) ship tiered rate limiting as a built-in feature. The work is configuration, not custom code. The cost of adopting it is small and the value is high.
- Tier upgrade as a sales motion: When a free-tier customer hits limits, the response includes a clear upgrade path. The 429 message is informative, not adversarial: "you have hit the free-tier limit, paid plans start at $X and offer 10x the rate, see [link]." Rate limiting becomes a sales channel, not just a defense.
- Per-customer overrides: Some customers need higher limits for legitimate reasons (large enterprise integrations, batch jobs, migration tooling). The tier system supports per-customer overrides documented in the customer record. Overrides are deliberate and auditable, not magic configurations only one engineer remembers.
Tiered limits make rate limiting a product feature instead of a security cudgel. Customers understand them, sales can monetize them, support can troubleshoot them.
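The tier table, route-class rates, and per-customer overrides can be sketched as a small lookup, with an informative 429 body for the free tier. This is an illustrative sketch, not any gateway's actual API; the tier names, limit numbers, customer ids, and pricing URL are all hypothetical.

```python
# Requests per minute per (tier, route class). Mutation routes are capped
# more tightly than reads; all numbers here are made-up examples.
TIER_LIMITS = {
    ("free", "read"): 100,
    ("free", "write"): 20,
    ("paid", "read"): 1_000,
    ("paid", "write"): 200,
    ("internal", "read"): 10_000,
    ("internal", "write"): 10_000,
}

# Per-customer overrides, keyed by a (hypothetical) customer id.
# Kept in the customer record so they are deliberate and auditable.
CUSTOMER_OVERRIDES = {
    "cust_enterprise_42": {("paid", "write"): 2_000},  # e.g. migration tooling
}

def resolve_limit(customer_id: str, tier: str, route_class: str) -> int:
    """Override wins if present; otherwise fall back to the tier default."""
    override = CUSTOMER_OVERRIDES.get(customer_id, {})
    return override.get((tier, route_class), TIER_LIMITS[(tier, route_class)])

def too_many_requests(tier: str) -> dict:
    """Informative 429 body: state the limit hit and the upgrade path."""
    body = {"error": "rate_limit_exceeded", "tier": tier}
    if tier == "free":
        body["detail"] = ("You have hit the free-tier limit. Paid plans offer "
                          "10x the rate: https://example.com/pricing")
    return body

print(resolve_limit("cust_enterprise_42", "paid", "write"))  # → 2000 (override)
print(resolve_limit("cust_other", "paid", "write"))          # → 200 (tier default)
```

In a real deployment this lookup lives in gateway configuration rather than application code; the sketch only shows the resolution order (override, then tier default) and the shape of a non-adversarial 429.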
Token bucket
The mechanism that actually enforces the rate is the token bucket algorithm. It is the standard implementation across nearly every modern rate limiter because it gets the trade-off right: smooth average rate enforcement, with tolerance for short bursts that real workloads need.
- Tokens refill at a constant rate: The bucket holds a maximum of N tokens. Tokens are added at R per second up to the cap. Each request consumes one token (or more, for expensive operations). When the bucket is empty, requests are rejected or queued.
- Bursts tolerated up to bucket size: A consumer that has been quiet for 60 seconds has accumulated a full bucket of tokens. They can now spike to the bucket size at once, then resume sustained rate. This matches how real workloads behave: idle most of the time, with bursts at user actions.
- Sustained rate enforced strictly: Once the bucket is exhausted, the consumer is throttled to the refill rate. They cannot sustain more than R requests per second on average, regardless of how they batch the bursts. This is the property that makes the limit actually work.
- Smooth, not stepwise: Compared to fixed-window rate limiting (which lets a consumer fire 2X requests at the window boundary), token bucket smooths the rate evenly. The result is a more predictable load on the backend, which is half the reason rate limiting exists in the first place.
- Implementations are standard: Redis (typically an atomic Lua script that refills and decrements in one step), in-memory accounting at the gateway, or a shared store such as Memcached. The choice depends on consistency requirements: do all gateway instances need to share a counter, or can each instance enforce its own share of the quota with local accounting? Both approaches are valid and well-trodden.
The token bucket is the right default. The cases where it is wrong (specific traffic shapes, hard quotas with no burst, fairness-weighted scheduling) are rare enough that picking token bucket and adjusting the parameters is the right starting point for most teams.
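The algorithm above can be sketched in a few lines: lazy refill on each request, a cap at the bucket size, and a per-request cost so expensive operations can consume more than one token. This is a minimal single-process sketch; a production limiter would also need atomic shared state across gateway instances.

```python
import time

class TokenBucket:
    """Minimal token bucket: at most `capacity` tokens, refilled at `rate`/sec.

    The clock is injectable so behavior can be tested deterministically.
    """

    def __init__(self, capacity: float, rate: float, clock=time.monotonic):
        self.capacity = capacity   # burst size N
        self.rate = rate           # refill rate R, tokens per second
        self.tokens = capacity     # start full: an idle consumer can burst
        self.clock = clock
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Lazy refill: add rate * elapsed tokens, capped at the bucket size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost    # expensive operations pass a higher cost
            return True
        return False               # bucket empty: reject (or queue) the request

# Demo with a fake clock so the output is deterministic.
now = [0.0]
bucket = TokenBucket(capacity=5, rate=1.0, clock=lambda: now[0])
print([bucket.allow() for _ in range(6)])
# → [True, True, True, True, True, False]: a full burst of 5, then throttled
```

Note the two properties from the list above falling out of the same mechanism: the burst is bounded by `capacity`, and the sustained average is bounded by `rate` no matter how requests are batched.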
Monitor
Rate limiting that is not monitored is rate limiting you do not know is working. Several metrics have to be tracked continuously to keep the practice honest.
- Limit-hit rate per tier: The percentage of requests that hit the rate limit, broken down by consumer tier. A free tier with a 30% hit rate is probably appropriately tuned. A paid tier with a 15% hit rate is probably under-provisioned and customers are about to complain.
- Limit-hit rate per customer: Within a tier, individual customers can be outliers. A single free-tier customer hitting the limit 1,000 times a day is probably a misconfigured client (worth helping them fix) or a scraper (worth blocking).
- 429 rate as an SLO signal: If 429 responses are correlated with customer churn or with support tickets, the rate limits are too aggressive. Tune them based on the impact, not based on architectural intuition.
- Anomalies investigated: Sudden spikes in 429s from a previously well-behaved consumer point to a deployment they made (their client started misbehaving), a third-party service they integrated (which is amplifying their request volume), or attempted abuse from a compromised account. Each requires different follow-up; the trigger is the anomaly detection.
- Catches abuse early: Rate limit telemetry is one of the leading indicators of abuse. A free-tier account suddenly maxing the limit every minute, an API key being used from new IP ranges, or a sudden spike in unique user agents all show up in the rate limit logs before they show up anywhere else.
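The per-tier hit rate above is a simple ratio over the request log. A minimal sketch, assuming log records reduce to (tier, status) pairs; the record shape and the tier names are illustrative, not a real log format.

```python
from collections import Counter

def hit_rates(records):
    """Fraction of requests per tier that were rejected with a 429."""
    total, limited = Counter(), Counter()
    for tier, status in records:
        total[tier] += 1
        if status == 429:
            limited[tier] += 1
    return {tier: limited[tier] / total[tier] for tier in total}

# Synthetic log: free tier rejected 30 of 100, paid tier 5 of 100.
logs = ([("free", 200)] * 70 + [("free", 429)] * 30 +
        [("paid", 200)] * 95 + [("paid", 429)] * 5)
print(hit_rates(logs))  # → {'free': 0.3, 'paid': 0.05}
```

By the rules of thumb above, this synthetic free tier is tuned about right, while a paid tier creeping toward 15% would warrant raising the limit before the support tickets arrive.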
Tiered limits with token bucket enforcement and continuous monitoring is the rate limiting pattern that scales from a startup's first API to a billion-request-per-day platform. Nova AI Ops integrates with API gateway rate-limit telemetry, surfaces per-tier and per-customer hit rates as first-class metrics, and flags the anomalies that distinguish a legitimate traffic spike from an abuse pattern that needs attention.