By Samson Tanimawo, PhD · Published Sep 27, 2026

Cardinality Explosion: The Hidden Killer of Observability Bills

One developer adds a label to a metric. The vendor bill triples next month. The label set was unbounded: every user_id became its own time series. Cardinality is the silent force that breaks observability budgets.

What cardinality is

Cardinality is the number of unique time series your metrics emit. A counter http_requests_total with labels {method, status} has 4 methods × 5 status classes = 20 series. Add {user_id} with 100k users and each of those 20 series splits 100,000 ways: 2 million series. The TSDB groans.
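To make the counting concrete, here is a sketch of the series list behind that counter; the label values are illustrative, not taken from any real system:

    http_requests_total{method="GET", status="2xx"}
    http_requests_total{method="GET", status="4xx"}
    http_requests_total{method="POST", status="2xx"}
    ...17 more method/status combinations...
    http_requests_total{method="GET", status="2xx", user_id="84721"}
    http_requests_total{method="GET", status="2xx", user_id="84722"}
    ...one new series for every user who ever sends a request...

Every distinct label combination is a separate series the TSDB has to index, store, and scan at query time.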

The math is multiplicative, not additive. Each label dimension multiplies the series count by the number of distinct values in that label. A metric with 5 labels of 10 values each is 100,000 series. Adding one more 10-value label takes you to 1 million. The growth is exponential in the number of labels.

Why it matters. Time-series databases are optimised for moderate cardinality; push past their comfort zone and queries slow to a crawl, retention costs balloon, and the storage layer hits its limits. Hosted observability vendors charge per series, so cardinality directly drives cost.

Your real cardinality budget

Self-hosted Prometheus comfortably handles 5-10 million active series per instance. Hosted vendors price per series; typical contracts allow 1-5 million before surcharges hit. Either way, you have a budget; treat it like one.
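One way to see where you stand, assuming self-hosted Prometheus, which reports its own series count (the 5-million divisor below is just the illustrative budget from above):

    # active series currently held in the head block
    prometheus_tsdb_head_series

    # rough fraction of a 5M-series budget already spent
    prometheus_tsdb_head_series / 5e6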

The discovery moment. Most teams have no idea what their cardinality budget is until they hit it. The signals: queries that used to take 200ms now take 30 seconds. Vendor bills with surprising line items. Retention dropped from 90 days to 30 because storage filled up. Each is the cardinality budget telling you it has been exceeded.

The right framing. Treat cardinality as a budget you're spending. Each new label you add draws from it. Each user_id you push as a label is a withdrawal. The team that thinks of it as a budget makes deliberate choices; the team that doesn't gets surprised.

Three ways it explodes accidentally

The unbounded-label trap. An engineer adds user_id as a label "for debugging." They don't realise that each user creates a new series. Three months later, the metric has 500k series and the dashboard query times out. The engineer who added it has moved on; nobody remembers why.
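If you inherit such a metric, one stopgap is to strip the label at the scrape layer instead of waiting for a code change. A minimal sketch, assuming Prometheus and assuming the offending label really is user_id:

    # inside the relevant scrape_configs entry in prometheus.yml
    metric_relabel_configs:
      - action: labeldrop
        regex: user_id    # remove the unbounded label before ingestion

Series that differed only by user_id will collapse into one (and can collide within a single scrape), so the durable fix is still deleting the label from the instrumentation.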

The pod_name accumulation. Kubernetes deployments rotate pod names; old pod names persist in the TSDB until the series age out. A service with 20 replicas deploying daily produces roughly 600 new pod-name values per month, and every metric carrying the pod label grows with them. Over a year, those metrics balloon.

The histogram blow-up. A latency histogram with 10 buckets, partitioned by 5 endpoints and 4 status codes, is 10 × 5 × 4 = 200 series before counting the _sum and _count series. Add 30-day percentile recording, and you've got 200 × 30 = 6,000 series for one metric.

Containing it

Three techniques. Recording rules aggregate high-cardinality metrics into low-cardinality summaries. Exemplars let you keep one representative trace per bucket without storing every request as a metric. Sampling at the agent drops series by ratio, a gentler cut than dropping the metric entirely.

Recording rules in detail. Take the high-cardinality metric (e.g., per-user request count); precompute the aggregation (e.g., per-tier request count); store only the aggregation. The original metric can have short retention (1 day); the aggregation has full retention. The team queries the aggregation; cardinality is bounded.
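A minimal sketch of that pattern as a Prometheus rule file; the metric and label names (http_requests_total, tier) are illustrative, not the author's:

    groups:
      - name: cardinality_rollups
        rules:
          # precompute the low-cardinality rollup so dashboards never touch per-user series
          - record: tier:http_requests:rate5m
            expr: sum by (tier) (rate(http_requests_total[5m]))

Dashboards and alerts query tier:http_requests:rate5m; the raw high-cardinality metric can then sit on short retention or be dropped at the scrape layer entirely.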

Exemplars in detail. Instead of encoding every individual request in labels, the metric stores aggregate bucket counts and a single representative trace ID per bucket. Click on the metric chart, jump to a real trace from that bucket. You get the granularity benefit of high cardinality without the cost.
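In the Prometheus/OpenMetrics exposition format the exemplar rides along on the bucket sample; the count, value, and trace ID below are invented, and Prometheus only retains exemplars when started with --enable-feature=exemplar-storage:

    # one bucket's count, with one representative trace attached as an exemplar
    http_request_duration_seconds_bucket{le="0.25"} 1027 # {trace_id="4bf92f3577b34da6"} 0.21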

Agent sampling. The Prometheus agent (or OpenTelemetry Collector) can drop, say, 90% of a noisy metric's series before they reach the database. Sampling preserves the shape of the data: query results stay within statistical bounds, but storage cost drops 10x. The trade-off: rare events may not be captured.
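Prometheus has no single sampling knob, but relabeling can approximate one: hash a high-cardinality label into buckets and keep only one bucket. A sketch in which the noisy metric (per_user_requests_total) and label (user_id) are hypothetical:

    metric_relabel_configs:
      # hash each user_id into 10 buckets; __-prefixed scratch labels are not stored
      - source_labels: [user_id]
        action: hashmod
        modulus: 10
        target_label: __tmp_bucket
      # drop 9 of the 10 buckets, but only for the noisy metric; everything else passes through
      - source_labels: [__name__, __tmp_bucket]
        regex: per_user_requests_total;[1-9]
        action: drop

Deterministic hashing keeps the same users from scrape to scrape, so shapes stay comparable; absolute sums become estimates and need scaling by the sampling factor.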

Measuring before it bites

Any Prometheus-compatible TSDB can answer topk(10, count by (__name__) ({__name__=~".+"})), which lists the top-cardinality metrics. Run it weekly. The metric that just spiked is the next thing you need to limit.

The diagnostic queries. Group the series count by __name__ for the top metrics, or by job for the top services; both are written out below. The output reveals which metrics or services dominate. Most teams find that 80% of cardinality comes from 3-5 metrics.
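Written out so they can be pasted into the expression browser:

    # top 10 metrics by active series count
    topk(10, count by (__name__) ({__name__=~".+"}))

    # top 10 jobs by active series count
    topk(10, count by (job) ({__name__=~".+"}))

Recent Prometheus versions expose the same breakdown without PromQL at /api/v1/status/tsdb, including seriesCountByMetricName and labelValueCountByLabelName.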

The action workflow. Find the high-cardinality metric. Identify the unbounded label. Decide: aggregate the label out, sample down, or accept it (some high-cardinality metrics are legitimate). Most resolve to aggregation. The exercise takes about half a day per metric and produces sustainable storage costs.

Common antipatterns

The "we'll fix it next quarter" deferral. Cardinality grows during the deferral. The next quarter, the problem is 2x harder. Fix at the first sign.

Adding labels for "debugging." Engineer adds request_id as a label to debug a specific issue. Forgets to remove it. The label persists; cardinality balloons. Use traces (or logs) for high-cardinality debugging; metrics are for aggregates.

Histograms with too many buckets. 20-bucket histograms feel "more accurate." They're 4x the cardinality of 5-bucket histograms. Use the smallest bucket count that captures the distribution shape.

Per-customer dashboards built on per-customer metrics. Easy to want; expensive to support. Aggregate at the metric level, then derive per-customer views by joining against a sparse customer-tag table.

What to do this week

Three moves. (1) Run the topk-by-name query against your TSDB. The list reveals where to start. (2) For each metric in the top 10, identify the unbounded label and decide how to bound it (aggregate, sample, or remove). (3) Add a cardinality monitoring dashboard and review the top metrics weekly; a sketch of the panels follows. The visible trend prevents the next surprise.
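A sketch of the two panels that dashboard needs, assuming Prometheus; draw the budget from earlier as a threshold line on the first:

    # panel 1: total active series over time
    prometheus_tsdb_head_series

    # panel 2: the usual suspects, so a new offender is visible the week it appears
    topk(10, count by (__name__) ({__name__=~".+"}))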