Alerts Practical By Samson Tanimawo, PhD Published Sep 20, 2025 4 min read

Alerts From Distributed Traces

Trace-based alerts catch issues metrics miss.

What trace alerts catch

Trace-based alerts fire on patterns inside a single request: a slow span in the auth service, a retry storm against a downstream dependency, or an error chain across services.

Metrics see aggregates; traces see the path. A 1% aggregate error rate can hide a 20% error rate inside a single user segment, and a trace alert scoped to that segment finds it.
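The arithmetic behind that claim can be sketched with made-up numbers. The segment names and request counts below are assumptions chosen for illustration, not measurements.

```python
# Hypothetical traffic split: request and error counts per user segment.
segments = {
    "free": {"requests": 9_500, "errors": 0},
    "enterprise": {"requests": 500, "errors": 100},
}


def error_rate(requests: int, errors: int) -> float:
    """Fraction of requests that failed; 0.0 when there is no traffic."""
    return errors / requests if requests else 0.0


total_requests = sum(s["requests"] for s in segments.values())
total_errors = sum(s["errors"] for s in segments.values())

# What a fleet-wide metric reports.
aggregate_rate = error_rate(total_requests, total_errors)

# What a per-segment view built from trace attributes reports.
per_segment = {
    name: error_rate(s["requests"], s["errors"]) for name, s in segments.items()
}

print(f"aggregate: {aggregate_rate:.1%}")
for name, rate in per_segment.items():
    print(f"{name}: {rate:.1%}")
```

With these numbers the aggregate stays at 1% while the smaller segment sits at 20%, which is the gap a segment-scoped trace alert is meant to surface.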

Use cases: latency regressions in a specific endpoint, error budgets per customer tier, dependency-chain failures.

How to build them

Run the OpenTelemetry Collector with the tail-sampling processor. Keep 100% of slow or errored traces and about 1% of normal traces.
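In the collector this policy lives in configuration, but the decision it makes can be sketched in a few lines. The code below is only an illustration of the logic, not the collector's implementation; TraceSummary, SLOW_THRESHOLD_MS, and the hashing scheme are assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical summary of a completed trace."""
    trace_id: str
    duration_ms: float
    has_error: bool


SLOW_THRESHOLD_MS = 2000.0  # assumed definition of "slow"
BASELINE_RATE = 0.01        # keep roughly 1% of normal traces


def _bucket(trace_id: str) -> float:
    """Map a trace ID to a stable, roughly uniform value between 0 and 1."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def keep_trace(trace: TraceSummary) -> bool:
    """Decide after the trace has finished, which is what makes it tail sampling."""
    if trace.has_error or trace.duration_ms >= SLOW_THRESHOLD_MS:
        return True  # always keep slow or errored traces
    # Hash-based sampling gives the same answer for the same trace ID every time.
    return _bucket(trace.trace_id) < BASELINE_RATE
```

Hashing the trace ID rather than drawing a random number is a design choice for this sketch: it keeps the decision consistent if the same trace is evaluated more than once.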

Datadog APM, Honeycomb, and Lightstep all expose trace queries that can drive alerts. Define the alert as a query: count of traces matching a pattern over 5 minutes.

Example alert rule: more than 10 traces with span.duration > 2s on /checkout in 5 minutes. It fires when latency regresses on the path that matters.
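Each vendor has its own query syntax for this, so the sketch below expresses the same rule as plain Python over an in-memory list of spans. CheckoutSpan and its fields are hypothetical stand-ins for whatever the trace backend returns.

```python
from dataclasses import dataclass


@dataclass
class CheckoutSpan:
    """Hypothetical span record returned by a trace query."""
    trace_id: str
    route: str
    duration_s: float
    end_time_s: float  # Unix timestamp in seconds


WINDOW_S = 300         # 5-minute evaluation window
SLOW_S = 2.0           # span.duration threshold
THRESHOLD_TRACES = 10  # fire when more than this many traces match


def checkout_latency_alert(spans: list[CheckoutSpan], now_s: float) -> bool:
    """True when more than 10 distinct traces had a slow /checkout span in the window."""
    slow_traces = {
        s.trace_id
        for s in spans
        if s.route == "/checkout"
        and s.duration_s > SLOW_S
        and now_s - WINDOW_S <= s.end_time_s <= now_s
    }
    return len(slow_traces) > THRESHOLD_TRACES
```

Counting distinct trace IDs rather than spans keeps a single trace with several slow /checkout spans from inflating the count.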

What makes a trace alert sharper

Pattern-matching across spans. "Auth span failed and was retried 3 times in the same trace" cannot be expressed in metric terms.
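A minimal sketch of that pattern check, assuming each span carries a retry_attempt attribute (0 for the first call). AuthSpan and its fields are hypothetical; real instrumentation may record retries differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class AuthSpan:
    """Hypothetical span record; retry_attempt is an assumed attribute."""
    trace_id: str
    service: str
    is_error: bool
    retry_attempt: int  # 0 for the first attempt, 1+ for retries


def traces_with_auth_retry_pattern(spans: list[AuthSpan], min_retries: int = 3) -> set[str]:
    """Trace IDs where the first auth call failed and was retried at least min_retries times."""
    failed_first = set()
    retries = defaultdict(int)
    for s in spans:
        if s.service != "auth":
            continue
        if s.retry_attempt == 0:
            if s.is_error:
                failed_first.add(s.trace_id)
        else:
            retries[s.trace_id] += 1
    return {t for t in failed_first if retries[t] >= min_retries}
```

The grouping by trace ID is the part a metric cannot do: a counter of auth errors and a counter of retries do not say whether they happened in the same request.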

Per-customer-tier alerts. Filter traces by tenant ID; fire only on enterprise tenants.
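As a rough illustration, the tier filter might look like the following. The TENANT_TIERS lookup and the dictionary keys are assumptions; in practice the tier may already be a span attribute or come from a billing system.

```python
# Hypothetical tenant-to-tier lookup.
TENANT_TIERS = {
    "acme": "enterprise",
    "globex": "enterprise",
    "hobby-42": "free",
}


def enterprise_error_traces(traces: list[dict]) -> list[dict]:
    """Keep only errored traces that belong to enterprise tenants."""
    return [
        t
        for t in traces
        if t["has_error"] and TENANT_TIERS.get(t["tenant_id"]) == "enterprise"
    ]
```

Unknown tenants fall through the lookup and are excluded, so a missing mapping fails quiet rather than paging on the wrong tier.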

Cross-service cascades. If service A errors and service B retries 5x, fire a single alert at the cascade level.
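One way to sketch the cascade rule is to group spans by trace and emit a single alert summarizing every affected trace. The service names, the is_retry flag, and the alert payload shape are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class CascadeSpan:
    """Hypothetical span record; is_retry is an assumed attribute."""
    trace_id: str
    service: str
    is_error: bool
    is_retry: bool


def detect_cascade(
    spans: list[CascadeSpan],
    failing: str = "service-a",
    retrier: str = "service-b",
    min_retries: int = 5,
) -> Optional[dict]:
    """Return one alert covering all traces where `failing` errored and `retrier` retried enough."""
    failed = {s.trace_id for s in spans if s.service == failing and s.is_error}
    retry_counts = Counter(
        s.trace_id for s in spans if s.service == retrier and s.is_retry
    )
    affected = sorted(t for t in failed if retry_counts[t] >= min_retries)
    if not affected:
        return None
    # A single alert for the cascade, not one per service or per trace.
    return {
        "alert": "retry-cascade",
        "trace_count": len(affected),
        "example_trace_ids": affected[:3],
    }
```

Returning one payload with a count and a few example trace IDs is what keeps the cascade from paging once per service involved.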

The cost of trace alerts

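Trace storage is expensive. Tail sampling reduces volume but adds operational complexity, because the collector has to buffer the spans of each trace until it can make a keep-or-drop decision. Budget roughly 10-20% of observability spend on traces.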

Cardinality matters. Per-tenant labels on spans are useful, but the number of distinct values grows quickly; cap the labels you index for alerting.
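A minimal sketch of such a cap, assuming an allowlist of indexed keys plus a limit on distinct values per key. CappedIndex is a hypothetical helper written for this example, not part of any tracing library.

```python
class CappedIndex:
    """Index only allowlisted span attributes and bound distinct values per key."""

    def __init__(self, allowed_keys: set[str], max_values_per_key: int) -> None:
        self.max_values_per_key = max_values_per_key
        self._seen: dict[str, set[str]] = {key: set() for key in allowed_keys}

    def index(self, attributes: dict[str, str]) -> dict[str, str]:
        """Return the attributes to index; overflow values collapse to 'other'."""
        indexed = {}
        for key, value in attributes.items():
            if key not in self._seen:
                continue  # not allowlisted, so never indexed
            seen = self._seen[key]
            if value in seen or len(seen) < self.max_values_per_key:
                seen.add(value)
                indexed[key] = value
            else:
                indexed[key] = "other"  # cap reached for this key
        return indexed
```

For example, with a cap of 2 on tenant.id, the third distinct tenant is indexed as "other" while the first two keep their own values.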

Trace alerts evaluate more slowly than metric alerts. Expect a lag of 1-3 minutes, versus 30-60 seconds for metrics.

When to add trace alerts

Add trace alerts once your stack has more than 10 services or you have per-tenant SLOs. Below that, metric alerts are usually enough.

Start with 3 alerts: one for the most painful regression you have seen, one for the highest-value endpoint, one for cross-service cascades.

Don't replicate metric alerts in traces. Use traces for what metrics cannot express.