Distributed Tracing Sampling Strategies That Don't Lie
Tracing every request is expensive. Tracing too few requests gives you statistically misleading data. The sampling decision is one of the most consequential observability calls you make.
The sampling decision
Tracing 100% of requests gives perfect data and infinite cost. Tracing 1% gives 100× cheaper data and statistical lies (rare endpoints get no traces). Your sampling strategy decides where on that spectrum you live.
The cost reality. At a service handling 10k requests/second, 100% tracing produces 864 million traces per day. Storing them costs thousands per month at any vendor. Most companies can't justify it. The sampling decision is therefore mandatory; the only question is which strategy.
The information loss. Sampling drops data; some questions become unanswerable. "Show me every trace from customer X yesterday" might fail if no traces from X were sampled. The sampling strategy determines which questions remain answerable.
Head-based sampling
Decide at the request entry whether to trace this one. Cheap (no buffering), simple. Loses information: you don't yet know if a request will fail, so failures get sampled at the same rate as successes. Defaults: 1-10%.
The implementation. The first service in the request path generates a trace ID with a sampling decision encoded (e.g., "this trace will be sampled"). The decision propagates downstream via trace context headers. Either every span in the request is captured or none are. No buffering required.
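The head-based decision described above can be sketched as a deterministic sampler: hash the trace ID into a bucket and compare against the rate. This is a minimal illustration, not any particular SDK's implementation; the function name and details are assumptions.

```python
import hashlib

def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head-based decision: hash the trace ID into [0, 1)
    and compare against the sample rate. Every service that hashes the
    same trace ID reaches the same verdict, and in practice the entry
    service's verdict is carried downstream in the trace context."""
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16)
    return (bucket / 0xFFFFFFFF) < rate

# The entry service decides once; downstream services inherit the flag.
decision = head_sample("4bf92f3577b34da6a3ce929d0e0e4736", rate=0.05)
```

Because the decision is made before the outcome is known, errors and successes land in the sample at exactly the same rate, which is the weakness discussed next.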
The trade-off. Simple to implement, no memory cost. Fails on rare important traces: because failures and successes are sampled at the same rate, the 1% of requests that error get no special priority. That leaves you with proportionally fewer error traces, which are the most valuable ones, than you'd want.
When head-based works. High-volume services where statistical sampling is fine. Most read-heavy paths. Internal services. The 1-10% sample rate captures the common cases; rare events are out of scope.
Tail-based sampling
Buffer all traces in the Collector, decide which to keep AFTER the trace completes. Lets you keep all errors and slow traces; sample successes at low rate. Cost: requires a stateful Collector that can hold seconds of traces in memory.
The implementation. Collector buffers all spans for a trace until the trace is complete (or a timeout). Then evaluates: was this trace fast and successful? Sample at low rate (1%). Was it slow or errored? Keep with high rate (50-100%). The decision is informed by trace outcome.
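The buffer-then-decide loop above can be sketched as a small in-memory structure. This is an illustrative sketch, not the OpenTelemetry Collector's actual implementation; span shape and names are assumptions.

```python
import time
from collections import defaultdict

class TailBuffer:
    """Minimal sketch of tail-based buffering, assuming spans arrive as
    dicts carrying a `trace_id`. Spans are held until the trace completes
    (or a timeout expires), then handed to a decision function."""

    def __init__(self, decide, timeout_s=10.0):
        self.decide = decide          # callable: list[span] -> bool (keep?)
        self.timeout_s = timeout_s
        self.buffers = defaultdict(list)
        self.first_seen = {}

    def add_span(self, span):
        tid = span["trace_id"]
        self.first_seen.setdefault(tid, time.monotonic())
        self.buffers[tid].append(span)

    def flush(self, trace_id):
        """Trace complete (or timed out): decide, then release memory."""
        spans = self.buffers.pop(trace_id, [])
        self.first_seen.pop(trace_id, None)
        return spans if self.decide(spans) else []

    def expire(self):
        """Flush traces whose timeout passed without completing."""
        now = time.monotonic()
        stale = [t for t, ts in self.first_seen.items()
                 if now - ts > self.timeout_s]
        return [s for t in stale for s in self.flush(t)]
```

The `decide` callable is where the outcome-aware policy lives: it sees the whole completed trace, so it can check status codes and total latency before choosing.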
The cost. Memory in the Collector to buffer in-flight traces. Buffering 10 seconds of traffic at 100k traces/sec means ~1M traces held in memory, roughly 1GB+ depending on span size. A stateful Collector also makes horizontal scaling harder: every span of a trace must land on the same Collector instance to be assembled. The complexity is real.
When tail-based wins. Mid-to-high volume services where you care about errors and slow paths but can't afford 100% storage. Most production setups outside very-high-throughput cases. The complexity is worth it for the data quality.
Adaptive sampling
Sample more from rare endpoints, less from common ones. Each endpoint gets roughly the same number of traces per minute regardless of traffic share. Excellent for catching issues on low-traffic paths the head-based approach would never trace.
The mechanism. The sampler tracks per-endpoint rates. A high-volume endpoint at 10k QPS is sampled at 0.1%; a low-volume endpoint at 10 QPS is sampled at 100%. Both produce ~10 traces per second. Per-endpoint visibility is uniform.
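The per-endpoint budget can be sketched as a windowed counter. This is a single-process illustration with assumed names; a production sampler would share these counters across instances.

```python
import time
from collections import defaultdict

class AdaptiveSampler:
    """Sketch of adaptive sampling: cap each endpoint at roughly
    `target_per_min` kept traces per minute, regardless of its
    traffic share."""

    def __init__(self, target_per_min=10):
        self.target = target_per_min
        self.window_start = time.monotonic()
        self.kept = defaultdict(int)

    def should_sample(self, endpoint: str) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:        # new one-minute window
            self.window_start = now
            self.kept.clear()
        if self.kept[endpoint] < self.target:    # under budget: keep
            self.kept[endpoint] += 1
            return True
        return False                             # over budget: drop
```

A busy endpoint exhausts its budget in the first fraction of a second of each window; a rare endpoint never hits the cap, so every one of its requests is traced.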
The benefit for rare paths. The 5-times-a-day admin endpoint sees 100% sampling under adaptive; under head-based at 1%, you'd expect one trace from it roughly every 20 days. When that admin endpoint breaks, adaptive has plenty of traces; head-based has none. The visibility for rare paths is dramatically better.
The implementation cost. The sampler needs to track per-endpoint rates, which requires shared state. Higher complexity than head-based; lower than tail-based. Sweet spot for some teams.
Error-priority sampling
Always trace errors. Always trace requests over a latency threshold. Sample successes at low rate. Common pattern as part of tail-based.
The mechanism. The sampling decision happens on trace completion (so requires tail-based infrastructure). Logic: if HTTP status >= 400, keep at 100%. If latency > 1 second, keep at 100%. Otherwise sample at 1%. The result: every error and slow trace, plus a statistical sample of successes.
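The logic above is short enough to write out directly. The 1-second threshold and 1% baseline are the illustrative figures from the text, not universal defaults.

```python
import random

def keep_trace(status_code: int, latency_s: float,
               success_rate: float = 0.01) -> bool:
    """Error-priority tail decision: keep every error and every slow
    trace, plus a statistical sample of everything else. Runs on trace
    completion, so it needs tail-based infrastructure."""
    if status_code >= 400:                   # every error, always
        return True
    if latency_s > 1.0:                      # every slow request, always
        return True
    return random.random() < success_rate    # 1% baseline for successes
```

Because errors and slow requests are typically a small fraction of traffic, total storage stays close to the baseline rate while the valuable traces are kept in full.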
The benefit. The traces engineers care about most (errors, slow requests) are kept fully. Storage cost is similar to head-based at low rate; data quality is dramatically better.
The hybrid most teams adopt
Head-sample at 5% as a baseline, tail-prioritise errors and slow requests, adaptive-cap so no endpoint exceeds 100 traces/minute. The Collector runs the tail logic; the cost stays bounded; the data stays useful.
The hybrid's strengths. 5% baseline gives statistical visibility into normal traffic. Error/slow priority captures the interesting cases. Adaptive cap prevents one runaway endpoint (a bot hammering /api/v1/foo) from dominating storage.
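The three layers of the hybrid compose into one decision function. A sketch, assuming a tail-based Collector that evaluates completed traces; the per-minute window reset is omitted for brevity.

```python
import random

class HybridPolicy:
    """Sketch of the hybrid: a 5% statistical baseline, errors and slow
    traces always kept, and a per-endpoint cap so no single route
    exceeds `cap_per_min` stored traces per minute."""

    def __init__(self, baseline=0.05, cap_per_min=100):
        self.baseline = baseline
        self.cap = cap_per_min
        self.kept_this_min = {}   # window reset omitted for brevity

    def decide(self, endpoint, status_code, latency_s):
        kept = self.kept_this_min.get(endpoint, 0)
        if kept >= self.cap:                         # adaptive cap wins
            return False
        keep = (status_code >= 400                   # tail: errors
                or latency_s > 1.0                   # tail: slow
                or random.random() < self.baseline)  # head-style baseline
        if keep:
            self.kept_this_min[endpoint] = kept + 1
        return keep
```

The ordering matters: the cap is checked first, so even a flood of errors from one endpoint cannot blow through the storage budget.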
The trade-offs. Most complex of the strategies. Requires sophisticated Collector configuration. Higher operational overhead. Pays back at any meaningful scale; not worth the complexity for small teams.
Common antipatterns
Sampling forever at 1%. Team set a 1% rate years ago and never revisited it. Traffic grew 10x; the rate stayed the same, so stored volume and cost grew 10x without anyone deciding they should. Revisit the rate whenever volume shifts materially.
Sampling decisions without backend awareness. Team picks 50% sampling; vendor charges by trace count; bill triples. Always cost-model before changing rates.
The "trace everything" defense. "We need every trace for debugging." Cost makes this untenable above moderate scale. Better data quality (errors at 100%) trumps quantity (everything at 1%).
No sampling and reactive cost management. No sampling configured; production traffic hits the backend at 100%; team gets a $50k surprise bill. Configure sampling on day one.
What to do this week
Three moves. (1) Document your current sampling strategy. Most teams don't know what theirs is; they default to whatever the SDK ships with. (2) Compute your trace volume and storage cost. Often surprising; it informs the strategy choice. (3) If on head-based at 1-5%, evaluate moving to tail-based with error priority. The data quality improvement is worth the complexity.
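Move (2) is back-of-envelope arithmetic. Using the figures from earlier in the piece (10k requests/second):

```python
def stored_per_day(rps: int, rate: float) -> int:
    """Traces stored per day at a given request rate and sample rate."""
    return int(rps * 86_400 * rate)   # 86,400 seconds in a day

stored_per_day(10_000, 1.0)    # 100% sampling: 864,000,000 traces/day
stored_per_day(10_000, 0.01)   # 1% sampling: 8,640,000 traces/day
```

Multiply by your vendor's per-trace (or per-GB) price before touching the rate; that single number usually settles the head-vs-tail debate.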