Trace Sampling Strategy by Service Tier
Critical services sample at 100%; low-stakes services sample at 1%. The tier model and how to apply it without losing debugging value.
Define the tiers
Trace sampling strategy by service tier is the discipline of matching sampling rates to service criticality. Customer-critical services warrant high sampling rates; low-impact internal services tolerate low ones. The tiered approach balances cost against coverage at each tier.
What the tiers look like:
- Tier 0: customer-critical paths.: Login, checkout, payment, account changes. The customer's primary interactions; failures here directly affect the business.
- 100% sampling.: Tier 0 services capture every trace. The cost is real but justified; the visibility into customer-critical paths is comprehensive.
- Tier 1: customer-facing but secondary.: Search, recommendations, profile management. Customer-facing but lower-stakes; some sampling is acceptable.
- Tier 2: internal services.: Backend processing, batch jobs, internal APIs. Less customer-visible; lower sampling rates fit.
- Tier 3: internal services with low impact.: Logging pipelines, internal tooling, low-priority workloads. Aggressive sampling produces dramatic cost savings.
- 1% head sampling.: Tier 3 services capture 1% of traces. The sampled subset provides baseline visibility; the cost is bounded.
The tiers match cost to value. The more critical tiers (tier 0 first) get more visibility; the less critical tiers get more cost savings.
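The tier model above can be written down as a small policy table. The following is a minimal sketch, not a reference implementation: the service names, the `SamplingPolicy` type, and the hash-based `head_sample` helper are all hypothetical. The rates come from the text (100% for tier 0, 1% head sampling for tier 3, and tail sampling with 1% of healthy traces for tiers 1 and 2).

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingPolicy:
    strategy: str        # "head" or "tail"
    healthy_rate: float  # fraction of routine traces kept


# Rates taken from the tier descriptions in this article.
POLICIES = {
    0: SamplingPolicy("head", 1.0),   # customer-critical: keep everything
    1: SamplingPolicy("tail", 0.01),  # keep errors/slow traces, 1% of the rest
    2: SamplingPolicy("tail", 0.01),
    3: SamplingPolicy("head", 0.01),  # low impact: 1% head sampling
}

# Hypothetical service-to-tier assignments, for illustration only.
SERVICE_TIERS = {
    "checkout": 0,
    "search": 1,
    "batch-export": 2,
    "log-pipeline": 3,
}


def policy_for(service: str) -> SamplingPolicy:
    """Look up a service's policy; an unassigned service raises KeyError."""
    return POLICIES[SERVICE_TIERS[service]]


def head_sample(trace_id: str, rate: float) -> bool:
    """Deterministic head sampling: the same trace ID always gets the same answer."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Hashing the trace ID, rather than drawing a random number per service, is one way to make every service that sees the same trace reach the same keep-or-drop decision. Raising on an unknown service is a deliberate choice in this sketch: it forces each new service to receive a tier assignment before it is sampled.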
Tail sampling for tier 1-2
The middle tiers benefit from tail sampling. Tail sampling decides after a trace completes, so it can keep every high-value trace (errors, slow requests) while keeping only a small sample of routine traces. The combination produces good signal at moderate cost.
- Sample 100% of slow or error traces.: Traces with errors or above latency thresholds are kept. The valuable traces are preserved; investigation has the data it needs.
- 1% of healthy ones.: Healthy traces are sampled at 1%. The sample provides representative visibility; the cost is bounded.
- Captures the bugs without paying for the boring traces.: The strategy optimizes the value-cost ratio. Bug investigation has the data; healthy throughput does not consume excess storage.
- Implementation: tail-sampling processor in the OTel collector.: The OpenTelemetry collector includes a tail-sampling processor. The processor buffers traces, evaluates rules, and decides retention. The pattern is well-supported.
- Not free; budget for it.: Tail sampling requires collector resources (memory, CPU). The collector sizing must account for the buffering; the cost is real but bounded.
Tail sampling is the right strategy for the middle tiers. The combination of cost reduction and signal preservation is favorable.
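The decision rule described above (keep every error or slow trace, keep 1% of healthy ones) can be sketched in a few lines. This is an illustration of the rule only, not the OpenTelemetry collector's implementation: the `Span` type, the `keep_trace` function, and the 1000 ms latency threshold are assumptions, and the sketch checks individual span durations rather than end-to-end trace duration.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    duration_ms: float
    is_error: bool


LATENCY_THRESHOLD_MS = 1000.0  # assumed threshold; tune per service
HEALTHY_KEEP_RATE = 0.01       # 1% of healthy traces, as in the text


def keep_trace(trace_id: str, spans: list[Span]) -> bool:
    """Tail-sampling decision made once all spans of a trace are available."""
    # Rule 1: always keep traces that contain an error.
    if any(span.is_error for span in spans):
        return True
    # Rule 2: always keep traces with a span at or above the latency threshold.
    if any(span.duration_ms >= LATENCY_THRESHOLD_MS for span in spans):
        return True
    # Rule 3: keep a deterministic 1% of the remaining healthy traces.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < HEALTHY_KEEP_RATE
```

The function needs the complete span list before it can answer, which is why a tail sampler has to buffer traces in memory and why the collector resources mentioned above need to be budgeted.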
Review quarterly
The tier assignments are not static. New services launch; existing services change in importance; the team's review keeps the strategy current.
- Service tiers shift.: A service that was tier 3 might become tier 1 once it becomes customer-facing. A service that was tier 0 might become tier 2 as the company's priorities shift. The tier assignments evolve.
- New services launch.: Each new service receives a tier assignment. The assignment determines the sampling rate; new services are configured according to their tier.
- Quarterly review keeps the policy current.: The review revisits tier assignments. Has anything changed? Should any service move tiers? The review produces a current tier list.
- Cost dashboard per tier.: The cost is tracked per tier. Tier 0's cost reflects 100% sampling; tier 3's cost reflects 1%; the team sees the tier-by-tier economics.
- If tier 0 cost is exploding, the implementation has a leak.: Tier 0's cost should grow with traffic, not faster. A sudden cost spike that outpaces traffic indicates something is wrong, such as a cardinality leak or misconfigured sampling. A spike that tracks an unexpected traffic increase still warrants investigation.
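The "cost should grow with traffic, not faster" check can be expressed as a comparison of growth ratios between two review periods. The sketch below is one possible way to frame it; the function name, its parameters, and the 10% tolerance are hypothetical, and the inputs would come from whatever cost dashboard and request counters the team already has.

```python
def cost_outpaces_traffic(
    prev_cost: float,
    curr_cost: float,
    prev_requests: float,
    curr_requests: float,
    tolerance: float = 0.10,  # assumed slack before flagging
) -> bool:
    """Return True when tracing cost grew noticeably faster than traffic."""
    if prev_cost <= 0 or prev_requests <= 0:
        raise ValueError("previous cost and request count must be positive")
    cost_growth = curr_cost / prev_cost
    traffic_growth = curr_requests / prev_requests
    # At a fixed sampling rate, cost should scale roughly with traffic.
    return cost_growth > traffic_growth * (1 + tolerance)
```

For example, `cost_outpaces_traffic(100, 150, 1_000_000, 1_500_000)` is not flagged because cost and traffic both grew by 50%, while `cost_outpaces_traffic(100, 300, 1_000_000, 1_200_000)` is flagged because cost tripled on a 20% traffic increase.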
Trace sampling strategy by service tier is one of those observability cost disciplines that pays off in proportion to trace volume. Nova AI Ops integrates with collectors and tracing backends, supports the tier-based approach, and produces the per-tier cost visibility that the team uses to manage the strategy.