Cardinality Explosion Alert

Cardinality spikes are among the most expensive monitoring problems. Alert on them.

Why cardinality matters

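Many hosted time-series backends price by active series, meaning unique combinations of metric name and label values, so a single high-cardinality label such as user_id multiplies both the series count and the bill. As a rough guide (exact limits depend on hardware and configuration), Prometheus and Cortex start to slow down above about 1M active series, and above about 10M queries time out and ingestion drops samples. The first warning is usually the bill, not the alert.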
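The multiplication is easy to underestimate, so here is a minimal sketch of the arithmetic. The label names and value counts are invented for illustration, and the product is an upper bound: real series counts are lower when not every combination occurs.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: the product of the number
    of distinct values each label can take."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a single HTTP request counter.
base = {"method": 5, "status": 8, "route": 40}
with_user = {**base, "user_id": 10_000}

print(worst_case_series(base))       # 1600
print(worst_case_series(with_user))  # 16000000
```

Under these assumed numbers, one extra label takes a single metric from well under any budget to past the 10M mark on its own.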

The cardinality alert

The alert belongs in the observability layer itself. In Prometheus, alert when prometheus_tsdb_head_series grows more than 30% week-over-week, or alert per metric when a single metric's series count crosses 100k. Datadog and Honeycomb expose their own cardinality dashboards and alert thresholds.
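The two conditions can be sketched as a plain function, assuming the series counts have already been fetched from somewhere. The function name, argument names, and sample figures are hypothetical; only the 30% and 100k thresholds come from the text.

```python
def cardinality_alerts(current_total, week_ago_total, per_metric_series,
                       growth_threshold=0.30, per_metric_limit=100_000):
    """Return a list of human-readable reasons to alert (empty if none)."""
    reasons = []

    # Condition 1: total head series grew more than the threshold in a week.
    if week_ago_total > 0:
        growth = (current_total - week_ago_total) / week_ago_total
        if growth > growth_threshold:
            reasons.append(f"total series grew {growth:.0%} week-over-week")

    # Condition 2: any single metric exceeds the per-metric limit.
    for metric, series in sorted(per_metric_series.items()):
        if series > per_metric_limit:
            reasons.append(f"{metric} has {series} series")

    return reasons


# Hypothetical counts.
alerts = cardinality_alerts(
    current_total=1_400_000,
    week_ago_total=1_000_000,
    per_metric_series={"http_requests_total": 120_000, "up": 50},
)
for reason in alerts:
    print(reason)
```

In Prometheus itself the week-over-week comparison would normally live in an alerting rule that compares prometheus_tsdb_head_series with the same selector under the offset modifier (offset 1w); the function above only illustrates the logic.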

Cardinality budgets

Budgets make the discipline explicit. Give each team a budget of 1M active series and each metric a budget of 100k unique label combinations. A team that goes over budget drops a label or aggregates, and CI fails any deploy that adds a high-cardinality label. Publish both budget and usage on a dashboard.
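A budget check of this kind might look like the following sketch. The input shape (team, then metric, then active series) and the team and metric names are assumptions; how the usage numbers are collected is left out.

```python
TEAM_BUDGET = 1_000_000    # active series per team
METRIC_BUDGET = 100_000    # unique label combinations per metric


def budget_violations(usage):
    """usage maps team -> {metric name -> active series count}."""
    violations = []
    for team, metrics in sorted(usage.items()):
        total = sum(metrics.values())
        if total > TEAM_BUDGET:
            violations.append(f"{team}: {total} series exceeds team budget")
        for metric, series in sorted(metrics.items()):
            if series > METRIC_BUDGET:
                violations.append(
                    f"{team}/{metric}: {series} series exceeds metric budget"
                )
    return violations


# Hypothetical usage snapshot.
usage = {
    "checkout": {"http_requests_total": 150_000, "queue_depth": 200},
    "search": {"latency_bucket": 90_000},
}
violations = budget_violations(usage)
for line in violations:
    print(line)
```

A CI job could run a check like this and exit non-zero when the list is not empty; the same output can feed the published dashboard.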

Common cardinality offenders

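The offenders are predictable. User IDs, request IDs, and full URLs in labels are the canonical mistakes, followed by Kubernetes pod or container names with random suffixes and customer IDs attached to every metric. Each has a specific remediation: bucket user IDs by hash into a fixed number of groups, replace full URLs with route patterns, label by deployment name instead of pod or container, and keep per-tenant aggregates while moving per-request and per-customer detail into traces.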
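Three of these remediations can be sketched as small label-normalizing helpers. All three function names are hypothetical, the ":id" placeholder is an arbitrary choice, and the pod-name rule assumes the usual Deployment naming of name, ReplicaSet hash, and pod suffix; where the platform already exposes a deployment label, use that instead of parsing names.

```python
import hashlib
import re

# Path segments that look like numeric IDs or UUIDs.
_ID_SEGMENT = re.compile(
    r"^(\d+|[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}"
    r"-[0-9a-fA-F]{4}-[0-9a-fA-F]{12})$"
)


def route_pattern(url):
    """Strip the query string and replace ID-like segments with ':id'."""
    path = url.split("?", 1)[0]
    parts = [":id" if _ID_SEGMENT.match(p) else p for p in path.split("/")]
    return "/".join(parts)


def user_bucket(user_id, buckets=16):
    """Map a user ID to one of a fixed number of buckets via a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return str(int.from_bytes(digest[:8], "big") % buckets)


def deployment_from_pod(pod_name):
    """Heuristic: drop the last two dash-separated parts of a pod name."""
    parts = pod_name.rsplit("-", 2)
    return parts[0] if len(parts) == 3 else pod_name


print(route_pattern("/users/12345/orders/987?page=2"))  # /users/:id/orders/:id
print(deployment_from_pod("api-gateway-7d9f8b6c5-x2k4q"))  # api-gateway
print(user_bucket("user-42"))
```

A plain hash of the user ID would keep one series per user; it is the fixed bucket count that caps the label at 16 values here.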

How to fix an explosion

An active explosion needs immediate action. Drop the offending label in the scrape config or at the OTel collector; aggregate up so that per-pod metrics become per-deployment metrics; or add a metric_relabel_configs rule that drops the worst series. Document what was dropped so that debugging stays possible.
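To make the "aggregate up" step concrete, the sketch below removes one label from a set of samples and sums the values that collapse onto the same remaining labels, which is roughly what a PromQL sum without (pod) does at query time. The sample data and the function name are invented, and summing is only appropriate for metrics where a sum is meaningful, such as counters.

```python
from collections import defaultdict


def aggregate_without(samples, label):
    """Drop `label` from every sample and sum values that now share labels.

    samples is a list of (labels dict, value) pairs.
    """
    totals = defaultdict(float)
    for labels, value in samples:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != label))
        totals[key] += value
    return [(dict(key), total) for key, total in totals.items()]


# Hypothetical per-pod request counts.
per_pod = [
    ({"deployment": "checkout", "pod": "checkout-a"}, 3.0),
    ({"deployment": "checkout", "pod": "checkout-b"}, 4.0),
    ({"deployment": "search", "pod": "search-a"}, 1.0),
]

per_deployment = aggregate_without(per_pod, "pod")
for labels, value in per_deployment:
    print(labels, value)
```

Three per-pod series become two per-deployment series, and the series count stops growing with every pod restart.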