Observability Cost Engineering: Cutting Spend Without Losing Signal
Most observability bills can be cut roughly in half without losing diagnostic capability. The work is mechanical, and the savings are large.
Where the bill actually goes
Observability bills break down into four predictable line items: ingest (per-event cost), retention (per-GB-day), query (per-query or per-seat), and platform fees. Most teams cannot tell which one dominates.
First step: get the breakdown. Ask your vendor for a cost report by metric, by service, by retention tier. Without the breakdown, optimisation is guessing.
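As a rough illustration of what the breakdown buys you, the sketch below sums a cost report by category and by service. The line items, field names (`service`, `category`, `usd`) and amounts are all hypothetical; a real vendor export will have its own schema.

```python
from collections import defaultdict

# Hypothetical line items; a real vendor export will use different columns.
LINE_ITEMS = [
    {"service": "checkout", "category": "ingest", "usd": 4200.0},
    {"service": "checkout", "category": "retention", "usd": 1800.0},
    {"service": "search", "category": "ingest", "usd": 2500.0},
    {"service": "search", "category": "query", "usd": 500.0},
    {"service": "platform", "category": "platform_fee", "usd": 1000.0},
]


def breakdown(items, key):
    """Sum spend by `key` and return (name, usd, share) rows, largest first."""
    totals = defaultdict(float)
    for item in items:
        totals[item[key]] += item["usd"]
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, usd, usd / grand_total) for name, usd in ranked]


for dimension in ("category", "service"):
    print(f"by {dimension}:")
    for name, usd, share in breakdown(LINE_ITEMS, dimension):
        print(f"  {name:<14} ${usd:>9,.2f}  {share:6.1%}")
```

Whichever row comes out on top is where the rest of the work should start.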
Cardinality reduction
- Cardinality is usually the biggest single lever. Every per-user, per-request, or per-URL label multiplies the number of time series, and storage grows with it. The cardinality-explosion playbook (covered separately) can cut the bill by 30-50% on its own.
- The discipline: a series budget per metric, a weekly review of those budgets, and a named owner for each metric.
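A per-metric series budget can be checked with very little code. The sketch below is one possible shape for the weekly review, assuming you can export active series counts per metric. The metric names, budgets, owners and the default limit are illustrative, not recommendations.

```python
# Hypothetical inputs: active series per metric, plus budgets and owners.
SERIES_COUNTS = {
    "http_requests_total": 180_000,
    "queue_depth": 1_200,
    "db_query_duration_seconds": 45_000,
}
BUDGETS = {
    "http_requests_total": {"max_series": 50_000, "owner": "team-edge"},
    "queue_depth": {"max_series": 5_000, "owner": "team-async"},
}
DEFAULT_MAX_SERIES = 10_000  # assumed fallback for metrics with no explicit budget


def over_budget(series_counts, budgets, default_max=DEFAULT_MAX_SERIES):
    """Return metrics that exceed their series budget, worst overshoot first."""
    findings = []
    for metric, count in series_counts.items():
        budget = budgets.get(metric, {})
        limit = budget.get("max_series", default_max)
        if count > limit:
            findings.append({
                "metric": metric,
                "owner": budget.get("owner", "unowned"),
                "series": count,
                "limit": limit,
                "ratio": count / limit,
            })
    return sorted(findings, key=lambda f: f["ratio"], reverse=True)


for finding in over_budget(SERIES_COUNTS, BUDGETS):
    print(f"{finding['metric']}: {finding['series']} series, "
          f"limit {finding['limit']}, owner {finding['owner']}")
```

Metrics with no budget entry show up as "unowned", which makes the ownership gap visible in the same report.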
Sampling strategy
Sampling cuts span volume directly. Tail-based sampling with rules such as ‘keep all errors, keep traces slower than p99’ typically gives an 80%+ reduction with little diagnostic loss.
Apply the same idea to logs: in production, sample structured INFO logs at 10% and retain ERROR logs at 100%.
Retention tiering
- Hot tier (queryable in seconds, expensive): 7-14 days.
- Warm tier (queryable in minutes, cheaper): 30-90 days.
- Cold tier (S3, queryable in hours, cheapest): 1+ year.
Most queries hit hot data. Tier accordingly. The savings on retention are usually the second-biggest after cardinality.
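To see why tiering matters, it helps to put numbers on it. The sketch below assigns data to a tier by age and estimates steady-state daily retention cost. The tier boundaries use the upper ends of the ranges above with cold capped at 365 days, and the per-GB-day prices are made-up placeholders; substitute your own contract rates.

```python
# (name, first_day, end_day_exclusive, assumed USD per GB-day)
# Prices are placeholders, not real vendor rates.
TIERS = [
    ("hot", 0, 14, 0.010),
    ("warm", 14, 90, 0.002),
    ("cold", 90, 365, 0.0002),
]


def tier_for_age(age_days, tiers=TIERS):
    """Return the tier that holds data of the given age, or 'expired'."""
    for name, start, end, _price in tiers:
        if start <= age_days < end:
            return name
    return "expired"


def daily_retention_cost(ingest_gb_per_day, tiers=TIERS):
    """Steady-state daily cost per tier: each tier holds (end - start) days of ingest."""
    costs = {}
    for name, start, end, usd_per_gb_day in tiers:
        stored_gb = ingest_gb_per_day * (end - start)
        costs[name] = stored_gb * usd_per_gb_day
    return costs


tiered = sum(daily_retention_cost(100).values())
all_hot = sum(daily_retention_cost(100, [("hot", 0, 365, 0.010)]).values())
print(f"tiered: ${tiered:.2f}/day, everything hot: ${all_hot:.2f}/day")
```

The comparison at the end is the useful part: it shows how much of the retention bill comes from keeping rarely queried data in the most expensive tier.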
Cost-aware dashboards
Cost cutting fails in three predictable ways:
- Optimising spend without measuring query usage. You delete the metric the team needed.
- Sampling aggressively without rules. The trace you need is gone.
- Treating the cut as a one-time project. Spend creeps back without the ongoing discipline.
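One way to avoid deleting a metric the team still needs is to check usage before cutting. The sketch below assumes you can obtain, per metric, its cost, how often it was queried over some window, and whether any dashboard or alert references it. All names, figures and the cost threshold are hypothetical.

```python
def deletion_candidates(metric_cost_usd, query_counts, referenced, min_cost_usd=100.0):
    """Return (metric, cost) pairs that are costly, unqueried, and unreferenced."""
    candidates = [
        (metric, cost)
        for metric, cost in metric_cost_usd.items()
        if cost >= min_cost_usd
        and query_counts.get(metric, 0) == 0
        and metric not in referenced
    ]
    return sorted(candidates, key=lambda mc: mc[1], reverse=True)


# Hypothetical data for illustration only.
COSTS = {"legacy_cache_hits": 900.0, "http_requests_total": 2400.0,
         "debug_gc_pause": 350.0, "tiny_metric": 4.0}
QUERIES_LAST_90D = {"http_requests_total": 5120, "debug_gc_pause": 0}
REFERENCED = {"http_requests_total", "debug_gc_pause"}  # used by dashboards or alerts

for metric, cost in deletion_candidates(COSTS, QUERIES_LAST_90D, REFERENCED):
    print(f"candidate: {metric} (${cost:,.2f})")
```

A metric that backs an alert may never be queried by a human, which is why the reference check is separate from the query count.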
What to do this week
Three moves. (1) Get the per-metric / per-tier cost breakdown from your vendor. (2) Apply the cardinality playbook to your top-3 most expensive metrics. (3) Schedule the quarterly cost review.