Cardinality Explosion: How to Detect It Before Your Bill
Cardinality explosion is the silent killer of observability budgets. The detection is easy; most teams just have not added it.
What cardinality is, and why it explodes
Cardinality is the number of unique label combinations a metric has. http_requests_total{method="GET", status="200", route="/api/users"} with 4 methods, 5 statuses, and 30 routes can produce up to 4 × 5 × 30 = 600 series. A per-user-id label would push that into the millions.
Cardinality scales multiplicatively across labels. In the worst case, adding one new label with 100 values multiplies your series count, and the storage behind it, by 100. The bill arrives at the end of the month.
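The multiplication is easy to check by hand. A minimal sketch, assuming every label combination actually occurs (the worst case); the function name and the extra "region" label are made up for illustration:

```python
from math import prod

def worst_case_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of the
    number of unique values each label can take."""
    return prod(label_value_counts.values())

# The http_requests_total example: 4 methods, 5 statuses, 30 routes.
labels = {"method": 4, "status": 5, "route": 30}
base = worst_case_series(labels)

# Hypothetical new label with 100 values added on top.
with_new_label = worst_case_series({**labels, "region": 100})

print(base, with_new_label)
```

Real metrics rarely hit the full product, since not every combination occurs, but the upper bound is the number to budget against.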
The four metrics-of-metrics that catch it
1. Active-series count, per metric. Spikes flag the explosion before the bill.
2. Series-creation rate. A burst of new series usually means a label leaked.
3. Per-tenant cardinality. One bad customer or one bad pod often dominates.
4. Top-N labels by unique value count. The label about to break the system is in this list.
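Two of these, the per-metric active-series count and the top-N labels, can be computed from any dump of current series. A rough sketch, assuming each series is a dict of labels with the metric name under "__name__"; the sample data and function names are illustrative, not any particular backend's API:

```python
from collections import Counter, defaultdict

def active_series_by_metric(series):
    """Count distinct label sets per metric name."""
    seen = defaultdict(set)
    for s in series:
        labels = frozenset((k, v) for k, v in s.items() if k != "__name__")
        seen[s["__name__"]].add(labels)
    return {name: len(label_sets) for name, label_sets in seen.items()}

def top_labels_by_unique_values(series, n=10):
    """Rank label names by how many distinct values they carry."""
    values = defaultdict(set)
    for s in series:
        for k, v in s.items():
            if k != "__name__":
                values[k].add(v)
    counts = Counter({k: len(v) for k, v in values.items()})
    return counts.most_common(n)

# Illustrative data: a leaked user_id label on one metric.
sample = [
    {"__name__": "http_requests_total", "method": "GET", "user_id": "1"},
    {"__name__": "http_requests_total", "method": "GET", "user_id": "2"},
    {"__name__": "http_requests_total", "method": "POST", "user_id": "3"},
    {"__name__": "up", "job": "api"},
]

print(active_series_by_metric(sample))
print(top_labels_by_unique_values(sample, n=2))
```

Run on a schedule and diffed against the previous run, the same counts approximate the series-creation rate.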
Per-metric label budgets
Every metric should have a documented cardinality budget, e.g., ‘http_requests_total max 10,000 series.’ Enforced at ingest. New series above the budget get dropped or flagged.
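One way to picture ingest-side enforcement is a gate that remembers which series it has already admitted per metric and refuses new ones past the budget. This is a sketch under stated assumptions: the SeriesBudget class, its default budget, and the in-memory sets are hypothetical and not how any specific backend implements limits.

```python
class SeriesBudget:
    """Admit samples for known series; reject new series over budget."""

    def __init__(self, budgets: dict[str, int], default: int = 1000):
        self.budgets = budgets          # metric name -> max series
        self.default = default          # budget for undocumented metrics
        self.seen: dict[str, set] = {}  # metric name -> admitted label sets
        self.rejected: dict[str, int] = {}  # metric name -> rejected samples

    def admit(self, metric: str, labels: dict[str, str]) -> bool:
        key = tuple(sorted(labels.items()))
        known = self.seen.setdefault(metric, set())
        if key in known:
            return True  # existing series always pass
        if len(known) >= self.budgets.get(metric, self.default):
            # New series above the budget: drop it and count the rejection
            # so the overflow can be flagged on a dashboard.
            self.rejected[metric] = self.rejected.get(metric, 0) + 1
            return False
        known.add(key)
        return True

gate = SeriesBudget({"http_requests_total": 2})
results = [
    gate.admit("http_requests_total", {"route": "/a"}),
    gate.admit("http_requests_total", {"route": "/b"}),
    gate.admit("http_requests_total", {"route": "/c"}),  # over budget
    gate.admit("http_requests_total", {"route": "/a"}),  # already known
]
print(results, gate.rejected)
```

Existing series keep flowing after the budget is hit, so dashboards built on them stay intact while the leak is being fixed.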
Prometheus has scrape-level guards such as sample_limit and label_limit in scrape_config, which fail any scrape that exceeds them, but the real protection is application-side: do not emit per-user-id labels; bucket continuous values; sanitize URL paths to template form before emitting.
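The application-side rules can be sketched in a few lines. The regexes, the ":id" placeholder, and the bucket bounds below are assumptions chosen for illustration; where the web framework exposes the matched route template, using that directly is usually more reliable than pattern matching.

```python
import re
from bisect import bisect_left

_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)

def template_path(path: str) -> str:
    """Replace ID-like path segments with a placeholder and drop the query."""
    path = path.split("?", 1)[0]
    parts = [
        ":id" if _NUMERIC.match(p) or _UUID.match(p) else p
        for p in path.split("/")
    ]
    return "/".join(parts)

# Hypothetical latency bounds in seconds; pick ones that match your SLOs.
BOUNDS = [0.1, 0.5, 1.0, 5.0]

def bucket_label(seconds: float) -> str:
    """Map a continuous value to the smallest bound that is >= the value."""
    i = bisect_left(BOUNDS, seconds)
    return str(BOUNDS[i]) if i < len(BOUNDS) else "+Inf"

print(template_path("/api/users/12345"))
print(bucket_label(0.3))
```

With these in place, the route label takes one value per endpoint and the latency label at most five values, regardless of traffic.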
Cleanup for already-exploded metrics
When a metric has already exploded, the temptation is to delete it. Resist for a week: queries against the old data are still useful for triage.
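The actual cleanup: stop emitting the bad label first; then add a relabel rule to the scrape config (metric_relabel_configs with a labeldrop action) so any stragglers lose the offending label at ingest, checking that the remaining labels still identify each series uniquely. Relabeling is not retroactive: for data already stored, aggregate the label away at query time (for example, sum without (user_id)) and let retention expire the bad series naturally over 30-90 days.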
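Before writing the relabel rule, it can help to estimate how far the series count falls once the label is gone. A small sketch, assuming series are available as label dicts; the function name and sample data are illustrative:

```python
def series_after_dropping(series, label):
    """Number of distinct label sets left once `label` is removed."""
    return len({
        frozenset((k, v) for k, v in s.items() if k != label)
        for s in series
    })

# Illustrative data: three series that differ mostly by user_id.
exploded = [
    {"method": "GET", "user_id": "1"},
    {"method": "GET", "user_id": "2"},
    {"method": "POST", "user_id": "3"},
]

before = len(exploded)
after = series_after_dropping(exploded, "user_id")
print(before, after)
```

A result lower than the original count also means several series would collapse into one, which is the uniqueness problem to watch for when dropping a label at ingest.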
Antipatterns
- Per-request-id labels. Always wrong. Use traces for per-request data.
- URL paths without templating. /api/users/12345 creates one series per user.
- No cardinality dashboard. The first sign should not be the bill.
What to do this week
Three moves. (1) Add an active-series-by-metric panel to your platform dashboard. (2) Audit your top-10 metrics by series count; document the budget. (3) Add a relabel rule to drop the worst high-cardinality label.