Buyer's Guide · By Samson Tanimawo, PhD · Published Jul 2, 2025 · 11 min read

The Hidden Cost of Observability: Why Your Datadog Bill Grows Faster Than Your Team

Most observability cost overruns come from the same four patterns. Each one is easy to fix once you see it, and expensive if you don't.

Pattern 1: unbounded custom metrics

Datadog (and most metered observability vendors) charge per custom metric. “Custom” means anything with a unique name+tag-set combination.

Every engineer who adds a “let's track this thing” counter adds to the bill. A team with 50 engineers and no policy will hit tens of thousands of custom metrics within a year. Policy: require a dashboard link before any new custom metric merges.
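What counts as "one custom metric" is worth making concrete. A minimal sketch, with hypothetical emission records, of how a unique name + tag-set combination becomes a distinct billable metric:

```python
# Sketch: count billable "custom metrics" as unique name + tag-set combos.
# The emission records below are made up; metered vendors such as Datadog
# bill per unique combination, not per data point.
def count_custom_metrics(emissions):
    """emissions: iterable of (metric_name, tags_dict) pairs."""
    seen = set()
    for name, tags in emissions:
        seen.add((name, frozenset(tags.items())))
    return len(seen)

emissions = [
    ("checkout.attempts", {"env": "prod", "region": "us-east-1"}),
    ("checkout.attempts", {"env": "prod", "region": "eu-west-1"}),  # new tag set -> new billable metric
    ("checkout.attempts", {"env": "prod", "region": "us-east-1"}),  # repeat data point, not billed again
]
print(count_custom_metrics(emissions))  # -> 2
```

Note that the same metric name appears three times but only two combinations are billed; the bill scales with distinct tag sets, not with call volume.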

Pattern 2: high-cardinality labels

The real killer. One label carrying user_id turns a metric with 10 tag combinations into a metric with 10 × (number of users) combinations. A service with 100k active users just became a 1M-custom-metric budget sink on a single counter.

Audit every tag on every custom metric. Anything unbounded (user_id, session_id, request_id, trace_id) should not be a metric tag; it should be a log field or a trace attribute.
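One way to enforce this mechanically is a small helper at the emission boundary that strips unbounded identifiers from tags and routes them into structured logs instead. A sketch, with a hypothetical `split_tags` helper and an assumed deny-list:

```python
import logging

# Hypothetical deny-list of identifiers that must never become metric tags.
UNBOUNDED_KEYS = {"user_id", "session_id", "request_id", "trace_id"}

def split_tags(tags):
    """Partition tags into (bounded metric tags, unbounded log fields)."""
    metric_tags = {k: v for k, v in tags.items() if k not in UNBOUNDED_KEYS}
    log_fields = {k: v for k, v in tags.items() if k in UNBOUNDED_KEYS}
    return metric_tags, log_fields

tags = {"env": "prod", "endpoint": "/cart", "user_id": "u-8841"}
metric_tags, log_fields = split_tags(tags)
# emit_counter("cart.adds", tags=metric_tags)  # bounded cardinality on the metric
logging.getLogger("app").info("cart.adds", extra=log_fields)  # user_id stays queryable in logs
print(metric_tags)  # -> {'env': 'prod', 'endpoint': '/cart'}
```

The user_id is not lost; it moves to a billing model (per-GB log ingest) where one more field costs bytes, not a new metric.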

Pattern 3: verbose logs for debug, always

Teams write DEBUG-level logs during incident investigation and then never remove them. Logs are charged per GB ingested; a single chatty service can add $5k/month that nobody notices until an audit.

Rule: DEBUG logs default to off in prod; turn them on via a feature flag for 15 minutes when you need them, then off. This alone cuts log ingest 50-70% for most teams.
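The 15-minute window can be enforced in code so nobody has to remember to turn DEBUG back off. A minimal sketch using the standard `logging` module; the flag store is hypothetical (in practice it would be LaunchDarkly, a config service, etc.), and the auto-revert check is assumed to run from a periodic hook:

```python
import logging
import time

class TimeBoxedDebug:
    """Enable DEBUG logging, then auto-revert to INFO after a fixed window."""

    def __init__(self, logger, window_seconds=15 * 60):
        self.logger = logger
        self.window = window_seconds
        self.enabled_at = None

    def enable(self):
        self.enabled_at = time.monotonic()
        self.logger.setLevel(logging.DEBUG)

    def check(self):
        # Call periodically (e.g. from a request hook or scheduler)
        # to flip DEBUG back off once the window expires.
        if self.enabled_at and time.monotonic() - self.enabled_at > self.window:
            self.logger.setLevel(logging.INFO)
            self.enabled_at = None

logger = logging.getLogger("svc")
logger.setLevel(logging.INFO)                      # prod default: DEBUG off
dbg = TimeBoxedDebug(logger, window_seconds=0.1)   # short window for the demo
dbg.enable()                                       # flip on during an incident
time.sleep(0.2)
dbg.check()                                        # window expired -> back to INFO
print(logger.level == logging.INFO)                # -> True
```

The key property is that forgetting to act does nothing: the expensive state is the one that times out.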

Pattern 4: traces that are never sampled down

Trace ingest is the fastest-growing line item for teams adopting OpenTelemetry without a sampling strategy. 100% sampling feels free in staging and catastrophic at prod volume.

Head-based sample at the SDK: a 1-5% random sample of non-errored, non-slow traces. Tail-sample at the collector for errors and p99 latencies if you need them. The difference is 10-20× on spend.
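The two-stage strategy can be sketched in plain Python to make the decision logic concrete. This is an illustration of the head-vs-tail split, not a real collector; the 2% rate and 750 ms slow threshold are made-up numbers:

```python
import random

HEAD_RATE = 0.02   # illustrative 2% random sample at the SDK
SLOW_MS = 750      # hypothetical p99 threshold for the tail keep-rule

def head_sample(seed):
    # Head-based: decided before the trace completes, purely at random.
    return random.Random(seed).random() < HEAD_RATE

def tail_keep(span):
    # Tail-based: the collector sees the finished trace and keeps
    # errors and slow requests regardless of the head decision.
    return span["error"] or span["duration_ms"] >= SLOW_MS

spans = [
    {"error": False, "duration_ms": 40},   # survives only the 2% lottery
    {"error": True,  "duration_ms": 120},  # always kept: error
    {"error": False, "duration_ms": 900},  # always kept: slow
]
kept = [s for s in spans if tail_keep(s) or head_sample(id(s))]
print(len(kept) >= 2)  # -> True: errors and slow traces always survive
```

In a real OpenTelemetry deployment the head decision lives in the SDK's sampler configuration and the tail rules live in the collector, but the shape of the logic is the same.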

How to audit your spend in a week

  1. Monday: pull the top 20 custom metrics by volume. For each, ask the owner what decision it drives.
  2. Tuesday: list every tag on the top 20. Circle anything unbounded.
  3. Wednesday: pull the log volume by service. The top 3 usually account for 60% of the bill.
  4. Thursday: review your trace sampling config. If there isn't one, that's the finding.
  5. Friday: one-page writeup, five concrete recommendations, ranked by dollar impact.
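Wednesday's check is a one-liner once you have a per-service usage export. A sketch with made-up numbers (pull the real figures from a CSV export or your vendor's usage API):

```python
from collections import Counter

# Hypothetical monthly log ingest per service, in GB.
log_gb = {"checkout": 420, "search": 310, "gateway": 290,
          "billing": 80, "auth": 60, "emailer": 40}

top3 = Counter(log_gb).most_common(3)
share = sum(gb for _, gb in top3) / sum(log_gb.values())
print(f"top 3 services: {share:.0%} of log ingest")  # here: 85%
```

If the top three come in anywhere near the 60% mark, that is where Thursday's and Friday's effort should concentrate.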

Teams that do this audit quarterly stay within 20% of their observability budget. Teams that don't overrun it by 3-5× within two years.



The conversation to have with finance

Finance does not want to cap the spend. Finance wants to understand why it grew 80% year over year while headcount grew 20%.

Bring the four patterns with numbers: unbounded custom metrics, high-cardinality labels, always-on DEBUG logs, unsampled traces. For each, quote the monthly dollar impact and the fix.

The conversation goes from 'we need to cut observability' to 'we need to fix cardinality on these three services'. That is a solvable engineering problem, not a budget fight.