Cardinality Explosion Alert
Cardinality spikes are among the most expensive monitoring problems. Alert on them.
Why cardinality matters
Time-series backends typically bill per unique label combination (one active series each), so a single high-cardinality label like user_id multiplies both the series count and the bill. Above 1M active series Prometheus and Cortex slow down; above 10M queries time out and ingestion drops. The first warning is usually the bill, not the alert.
- Per-series billing. Time-series databases bill on unique label combinations; one bad label drives cost.
- 1M active series threshold. Prometheus and Cortex slow down; the first observability problem appears.
- 10M active series threshold. Queries time out, ingestion drops; the stack is no longer reliable.
- Silent killer. The first warning is the bill, not the alert; the discipline is to alert before the bill arrives.
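A rough worked example of why one label matters: the worst-case series count for a metric is the product of the value counts of its labels. The label names and counts below are illustrative assumptions, not measurements.

```python
from math import prod

# Hypothetical per-label value counts for a single metric.
label_values = {
    "method": 5,
    "status": 6,
    "route": 40,
}


def series_count(label_values):
    """Upper bound on series for one metric: product of per-label value counts."""
    return prod(label_values.values())


baseline = series_count(label_values)
# Adding a user_id label with 10,000 distinct users multiplies the bound.
with_user_id = series_count({**label_values, "user_id": 10_000})

print(baseline)      # 1200
print(with_user_id)  # 12000000
```

Under these assumptions a metric that sat comfortably at 1,200 series lands above the 10M threshold after a single label is added.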
The cardinality alert
The cardinality alert lives at the observability layer. Prometheus: alert when prometheus_tsdb_head_series grows more than 30% week-over-week, or alert per-metric when cardinality crosses 100k. Datadog and Honeycomb expose cardinality dashboards with their own alert thresholds.
- Prometheus growth alert. prometheus_tsdb_head_series growing > 30% week-over-week is the canonical signal.
- Per-metric cardinality. count by (__name__) ({__name__=~".+"}) > 100000 flags each metric over the limit; wrapping it in an outer count() would return the number of metric names instead.
- Datadog and Honeycomb. Cardinality dashboards expose a per-metric column; alert when it crosses a budget.
- Per-tool alert wiring. The alert lives in the tool that owns the metrics, so investigation happens in the same UI.
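The two checks above can be sketched outside any particular tool. This is a minimal illustration, assuming the series counts have already been fetched from the monitoring backend; the function names and sample numbers are hypothetical.

```python
def growth_ratio(current, week_ago):
    """Fractional week-over-week growth, e.g. 0.4 for +40%."""
    if week_ago <= 0:
        raise ValueError("week_ago must be positive")
    return (current - week_ago) / week_ago


def should_alert(current, week_ago, max_growth=0.30):
    """True when total active series grew more than max_growth in a week."""
    return growth_ratio(current, week_ago) > max_growth


def metrics_over_budget(series_per_metric, budget=100_000):
    """Names of metrics whose series count exceeds the per-metric budget."""
    return sorted(name for name, n in series_per_metric.items() if n > budget)


print(should_alert(1_400_000, 1_000_000))  # True
print(should_alert(1_200_000, 1_000_000))  # False
print(metrics_over_budget({"http_requests_total": 250_000, "queue_depth": 1_200}))
# ['http_requests_total']
```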
Cardinality budgets
Cardinality budgets make the discipline explicit. Per-team budget of 1M active series; per-metric budget of 100k unique combinations; above budget the team drops a label or aggregates, and CI fails the deploy that adds a high-cardinality label. Publish budget and usage in a dashboard.
- Per-team budget. 1M active series; the headline number for the team.
- Per-metric budget. 100k unique combinations; the per-metric ceiling.
- CI deploy gate. Above budget, the deploy fails until the team drops a label or aggregates.
- Dashboard for visibility. Budget and current usage published; visibility is half the discipline.
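One way the CI gate could look, as a sketch: it assumes an earlier pipeline step has produced a projected series count per metric, and it returns a non-zero exit code when either budget is exceeded. The function names and the input mapping are assumptions for illustration.

```python
import sys

TEAM_BUDGET = 1_000_000   # active series per team
METRIC_BUDGET = 100_000   # unique label combinations per metric


def budget_violations(series_per_metric,
                      team_budget=TEAM_BUDGET,
                      metric_budget=METRIC_BUDGET):
    """Return human-readable violations; an empty list means within budget."""
    violations = []
    for name, n in sorted(series_per_metric.items()):
        if n > metric_budget:
            violations.append(
                f"{name}: {n} series exceeds per-metric budget {metric_budget}"
            )
    total = sum(series_per_metric.values())
    if total > team_budget:
        violations.append(
            f"team total: {total} series exceeds team budget {team_budget}"
        )
    return violations


def gate(series_per_metric):
    """Print violations to stderr and return a CI exit code (0 = pass)."""
    violations = budget_violations(series_per_metric)
    for line in violations:
        print(line, file=sys.stderr)
    return 1 if violations else 0


# Hypothetical projection for a deploy that adds a high-cardinality label.
projected = {"http_requests_total": 250_000, "queue_depth": 1_200}
print(gate(projected))  # 1
```

In a real pipeline the return value would be passed to sys.exit so the deploy step fails.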
Common cardinality offenders
The offenders are predictable. User IDs, request IDs, and full URLs in labels are the canonical mistakes, followed by container IDs with random suffixes from Kubernetes and customer IDs at the metric level. Each has a specific remediation: bucketed hashes, route patterns, deployment names, and per-tenant aggregates with traces for detail.
- User IDs and request IDs. Replace with a fixed number of hash buckets or with route patterns; never raw IDs in labels. An unbucketed hash has the same cardinality as the raw ID.
- Full URLs in labels. Replace with route patterns or bucketed values; the URL diversity drives cardinality.
- Container IDs. Random suffixes from Kubernetes; use deployment name instead.
- Customer IDs at metric level. Move to per-tenant aggregates; use traces for per-tenant detail.
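Two of these remediations are easy to sketch in application code before the label is ever emitted. The helpers below are hypothetical: route_pattern collapses numeric and UUID path segments into placeholders, and hash_bucket maps a raw ID onto a fixed number of buckets so the label stays bounded. The placeholder names and the bucket count of 64 are arbitrary choices.

```python
import hashlib
import re

_UUID = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
_NUMERIC = re.compile(r"\d+")


def route_pattern(url_path):
    """Collapse ID-like path segments so the label has a bounded value set."""
    path = url_path.split("?", 1)[0]  # drop the query string
    segments = []
    for seg in path.split("/"):
        if _UUID.fullmatch(seg):
            segments.append(":uuid")
        elif _NUMERIC.fullmatch(seg):
            segments.append(":id")
        else:
            segments.append(seg)
    return "/".join(segments)


def hash_bucket(raw_id, buckets=64):
    """Map a raw ID to one of `buckets` stable bucket numbers."""
    digest = hashlib.sha256(raw_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


print(route_pattern("/users/12345/orders/987?page=2"))  # /users/:id/orders/:id
print(len({hash_bucket(f"user-{i}") for i in range(10_000)}) <= 64)  # True
```

Bucketing trades per-user detail for a bounded label; the per-user view then belongs in traces rather than metrics.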
How to fix an explosion
An active explosion needs immediate action. Drop the offending label at the scrape config or OTel collector; aggregate up so per-pod metrics become per-deployment; add a metric_relabel_configs rule that drops the worst series and document what got dropped so debugging stays possible.
- Drop the label at scrape. Prometheus relabel_config, OTel attribute processor; the surgical fix.
- Aggregate up. sum by (deployment) (…) replaces per-pod with per-deployment.
- Drop worst series. metric_relabel_configs rule; document what got dropped so debugging is possible.
- Per-explosion postmortem. Each cardinality incident produces a documented cause and fix; supports prevention.
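To see what aggregating up does to the series count, here is a small in-memory sketch that mimics the grouping behaviour of a PromQL sum by (...) aggregation. The sample data and the sum_by helper are illustrative assumptions; real aggregation would run in the query layer or a recording rule.

```python
from collections import defaultdict


def sum_by(samples, keep):
    """Group samples by the labels in `keep` and sum their values."""
    totals = defaultdict(float)
    for labels, value in samples:
        # A missing label is treated as an empty value.
        key = tuple((name, labels.get(name, "")) for name in keep)
        totals[key] += value
    return dict(totals)


# Hypothetical per-pod samples: (labels, value).
samples = [
    ({"deployment": "checkout", "pod": "checkout-7d9f-abcde"}, 3.0),
    ({"deployment": "checkout", "pod": "checkout-7d9f-fghij"}, 5.0),
    ({"deployment": "search", "pod": "search-5c2b-klmno"}, 2.0),
]

per_deployment = sum_by(samples, ["deployment"])
print(len(samples), "->", len(per_deployment))  # 3 -> 2
```

The pod label disappears from the result, so the series count now scales with the number of deployments rather than the number of pods.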