Cardinality Explosion Alert
Cardinality spikes are the most expensive monitoring problem. Alert on them.
Why cardinality matters
Time-series databases charge, in dollars or in memory, per unique label combination: every new combination is a new series. A label like user_id mints a series per user and explodes both cardinality and the bill.
Above 1M active series, Prometheus and Cortex slow down. Above 10M, queries time out and ingestion drops.
Cardinality is the silent killer of observability stacks. The first warning is the bill, not the alert.
The cardinality alert
Prometheus: alert when `prometheus_tsdb_head_series` grows more than 30% week-over-week.
Or alert on per-metric cardinality: `count by (__name__) ({__name__=~".+"}) > 100000`.
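A minimal alerting-rule sketch covering both checks. The thresholds, `for` durations, and group interval are assumptions; the second rule copies the metric name into a plain `metric_name` label so two runaway metrics produce distinct alerts.

```yaml
groups:
  - name: cardinality
    interval: 5m   # the per-metric count touches every series; don't evaluate it aggressively
    rules:
      # Head series grew more than 30% versus the same time last week.
      - alert: HeadSeriesGrowth
        expr: prometheus_tsdb_head_series > 1.3 * (prometheus_tsdb_head_series offset 1w)
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Active series up >30% week-over-week on {{ $labels.instance }}"
      # A single metric name exceeds the 100k-series budget.
      - alert: MetricCardinalityOverBudget
        expr: |
          count by (metric_name) (
            label_replace({__name__=~".+"}, "metric_name", "$1", "__name__", "(.+)")
          ) > 100000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.metric_name }} has more than 100k active series"
```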
Datadog and Honeycomb expose cardinality dashboards; alert when a metric's cardinality crosses its budget.
Cardinality budgets
Per-team budget: 1M active series. Per-metric budget: 100k unique label combinations.
Above budget, the team must drop a label or aggregate. CI fails the deploy that adds a high-cardinality label.
Publish the budget and current usage in a dashboard. Visibility is half the discipline.
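To put current usage next to the budget, a recording-rule sketch keeps the dashboard query cheap. The `team` label is an assumption about how your scrape configs tag ownership, and the rule name is a placeholder.

```yaml
groups:
  - name: cardinality_budget
    interval: 5m   # the count touches every series; evaluate it infrequently
    rules:
      # Active series per team, charted against the 1M per-team budget.
      # Assumes scrape configs attach a `team` label; substitute your own.
      - record: team:active_series:count
        expr: count by (team) ({__name__=~".+"})
```

Chart `team:active_series:count` against the 1M line; the per-metric panel can reuse the `count by (__name__)` expression from the alert above.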
Common cardinality offenders
User IDs, request IDs, and full URLs in labels. Replace them with route patterns, bucketed values, or a small fixed set of hash buckets.
Kubernetes pod and container IDs carry random suffixes, so every restart mints new series. Use the deployment name instead (see the relabel sketch below).
Customer IDs at the metric level. Move to per-tenant aggregates and use traces for per-tenant detail.
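For the Kubernetes case above, a scrape-config sketch that derives a `deployment` label from the pod name and drops the per-pod label, so restarts stop minting fresh series. The job name is hypothetical and the regex assumes Deployment-managed pods named `<deployment>-<replicaset-hash>-<pod-hash>`.

```yaml
scrape_configs:
  - job_name: kubernetes-pods   # hypothetical; assumes relabeling already sets a `pod` label
    metric_relabel_configs:
      # Recover the deployment name by stripping the two random suffixes.
      - source_labels: [pod]
        regex: "(.+)-[^-]+-[^-]+"
        target_label: deployment
        replacement: "$1"
      # Then drop the high-cardinality pod label.
      - action: labeldrop
        regex: pod
```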
How to fix an explosion
Drop the offending label at the scrape config or in the OTel Collector: Prometheus `metric_relabel_configs`, OTel `attributes` processor.
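On the OTel side, a minimal Collector sketch; the attribute keys and the remote-write endpoint are placeholders, not a claim about your schema.

```yaml
receivers:
  otlp:
    protocols:
      grpc: {}

processors:
  # Delete high-cardinality attributes from metric data points before export.
  attributes/drop_high_cardinality:
    actions:
      - key: user_id     # per-user attribute (placeholder key)
        action: delete
      - key: http.url    # full URL; keep the lower-cardinality route instead
        action: delete

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus.example.internal/api/v1/write   # placeholder

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [attributes/drop_high_cardinality]
      exporters: [prometheusremotewrite]
```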
Aggregate up: `sum by (deployment) (...)` replaces per-pod metrics with per-deployment.
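If per-pod series must keep flowing while consumers migrate, a recording rule can provide the deployment-level view for dashboards and alerts; `http_requests_total` is a placeholder metric.

```yaml
groups:
  - name: deployment_aggregates
    rules:
      # Deployment-level request rate; point dashboards here instead of at per-pod series.
      - record: deployment:http_requests:rate5m
        expr: sum by (deployment) (rate(http_requests_total[5m]))
```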
Add a metric_relabel_configs rule that drops the worst series. Document what got dropped so debugging is possible.
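And a sketch of the emergency drop itself, assuming the offender is a hypothetical histogram with a runaway `path` label.

```yaml
scrape_configs:
  - job_name: api   # hypothetical job; other scrape settings elided
    metric_relabel_configs:
      # Emergency drop of the runaway series. Record the rule and the date in
      # the runbook so the missing data is explainable later.
      - source_labels: [__name__, path]
        regex: "http_request_duration_seconds_bucket;/api/v1/users/.*"
        action: drop
```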