Traces Cost Optimization

Sampling drives trace cost.

Overview

Trace cost is dominated by sampling rate. Full sampling is rarely affordable at scale; selective sampling that favours errors and slow requests preserves most of the investigation value at a fraction of the cost. The discipline is picking the right sampling strategy for each workload.
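As a rough illustration of why the sampling rate dominates, the sketch below estimates the monthly cost of retained traces. The function name, the flat per-gigabyte price, and every number in the usage lines are hypothetical assumptions chosen for the arithmetic, not figures from any vendor or from this document.

```python
def monthly_trace_cost(traces_per_day: float, sampling_rate: float,
                       avg_trace_kb: float, price_per_gb: float,
                       days: int = 30) -> float:
    """Estimate the monthly cost of retained traces.

    Assumes a single flat price per retained gigabyte (ingest plus storage),
    which is a simplification; real pricing models vary.
    """
    if not 0.0 <= sampling_rate <= 1.0:
        raise ValueError("sampling_rate must be between 0 and 1")
    retained_traces = traces_per_day * sampling_rate * days
    retained_gb = retained_traces * avg_trace_kb / 1_000_000  # KB -> GB (decimal)
    return retained_gb * price_per_gb


# Hypothetical workload: 10 million traces/day, 50 KB each, 0.50 per GB.
full_cost = monthly_trace_cost(10_000_000, 1.0, 50, 0.5)
sampled_cost = monthly_trace_cost(10_000_000, 0.1, 50, 0.5)
print(f"full sampling: {full_cost:.2f}, 10 percent sampling: {sampled_cost:.2f}")
```

Under these assumptions cost scales linearly with the rate, so a tenfold cut in the rate is a tenfold cut in spend.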

Sampling drives trace cost. Every retained trace is paid for at ingest and again in storage, so the sampling rate is the dominant cost lever.
Head-based sampling. The keep-or-drop decision is made up front, when the trace starts. Cheap to implement, but it can miss interesting traces because the outcome is not yet known.
Tail-based sampling. The decision is deferred until the trace completes, so errors and slow traces can always be kept. Widely regarded as modern best practice.
Per-tier sampling plus error prioritisation. Different rates per service tier; errors and slow traces always retained regardless of base rate.
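The head-based strategy described above can be sketched as a deterministic decision on the trace ID. This is a minimal illustration, not any particular tracing library's API: the function name `head_sample` and the choice of SHA-256 for bucketing are assumptions.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at trace start whether to keep a trace.

    The decision depends only on the trace ID, so every service that sees
    the same ID reaches the same answer. Nothing about the outcome (errors,
    latency) is known yet, which is why this approach can drop bad traces.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical usage: keep roughly 10 percent of traces.
kept = sum(head_sample(f"trace-{i}", 0.10) for i in range(10_000))
print(f"kept {kept} of 10000 traces")
```

Hashing the ID rather than calling a random number generator keeps the decision consistent across services without any coordination.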

The approach

Four habits keep trace cost matched to investigation value: tail-based sampling as the default, an always-keep policy for error traces, per-tier sampling rates, and quarterly cost audits backed by a documented policy.

Tail-based sampling. Decide what to keep after the trace completes. Errors and high-latency traces survive; the rest are sampled at the base rate.
Error prioritisation. Always keep traces with errors or unusual latency. Investigations start from the bad traces, not from a random sample.
Per-tier sampling rate. Critical services sample at 10 percent; internal batch jobs sample at 1 percent. Match the rate to the investigation need.
Quarterly audit plus documented policy. A quarterly trace-cost review catches drift; each team's sampling policy lives in the wiki.
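The first three habits can be combined into a single keep-or-drop function evaluated once a trace has completed. The sketch below is illustrative only: the `CompletedTrace` type, the tier names, the fallback rate, and the 2000 ms latency threshold are hypothetical; only the 10 percent and 1 percent rates come from the text above.

```python
import hashlib
from dataclasses import dataclass

# Base rates per tier; 10 percent and 1 percent are taken from the text above.
TIER_BASE_RATE = {"critical": 0.10, "internal-batch": 0.01}
DEFAULT_RATE = 0.01            # assumption: unknown tiers get the sparse rate
LATENCY_THRESHOLD_MS = 2000.0  # hypothetical cut-off for a "slow" trace


@dataclass(frozen=True)
class CompletedTrace:
    trace_id: str
    tier: str
    has_error: bool
    duration_ms: float


def _bucket(trace_id: str) -> float:
    """Map a trace ID to a stable number in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_trace(trace: CompletedTrace) -> bool:
    """Tail-based decision: runs only after the whole trace is known."""
    if trace.has_error:
        return True  # errors are always kept, regardless of the base rate
    if trace.duration_ms >= LATENCY_THRESHOLD_MS:
        return True  # slow traces are always kept
    rate = TIER_BASE_RATE.get(trace.tier, DEFAULT_RATE)
    return _bucket(trace.trace_id) < rate


# Hypothetical usage: a failed batch trace survives despite the 1 percent rate.
failed = CompletedTrace("t-1", "internal-batch", has_error=True, duration_ms=40.0)
print(keep_trace(failed))
```

Keeping the rates in one table also gives the quarterly audit a single place to compare against the documented policy.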

Why this compounds

Each correctly sampled trace produces investigation value at controlled cost. The team learns sampling through repeated review; new services ship with rates that match their tier from day one.

Cost efficiency. Right sampling matches workload. Critical services keep what matters; long-tail services do not pay for noise.
Investigation quality. Error traces preserved. The traces operators actually need during incidents are the ones that survive sampling.
Operational fit. Right policy per tier matches priorities. Customer-facing services get richer traces; internal jobs get sparse ones.
Year-one investment, year-two habit. Writing the first policy is an investment. By year two, every new service ships with a sampling decision.