Alerts From Distributed Traces
Trace-based alerts catch issues metrics miss.
What trace alerts catch
Trace-based alerts fire on patterns inside a single request: a slow span in the auth service, a retry storm against a downstream dependency, an error chain that crosses services.
Metrics see aggregates; traces see the path. A 1% overall error rate can hide a 20% error rate inside a single user segment: 100 errors across 10,000 requests is 1% in aggregate, but if all 100 land on the 500 requests from one tenant, that tenant is failing 20% of the time. A trace alert scoped to that segment finds it.
Use cases: latency regressions in a specific endpoint, error budgets per customer tier, dependency-chain failures.
How to build them
Start with the OpenTelemetry Collector and its tail-sampling processor: sample slow or errored traces at 100% and normal traces at 1%.
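A minimal sketch of that sampling decision in Python rather than Collector config; the 2 s threshold and the 1% baseline are assumed values, and in the real Collector the same policy is written declaratively as latency, status_code, and probabilistic policies in the tail_sampling processor.

```python
import random
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    duration_ms: float
    is_error: bool

SLOW_THRESHOLD_MS = 2000   # assumed latency policy: keep anything slower than 2 s
BASELINE_RATE = 0.01       # assumed probabilistic policy for unremarkable traces

def keep_trace(spans: list[Span]) -> bool:
    """Decide whether to keep a trace once all of its spans have arrived."""
    if any(s.is_error for s in spans):
        return True                          # errored traces: sample at 100%
    if any(s.duration_ms > SLOW_THRESHOLD_MS for s in spans):
        return True                          # slow traces: sample at 100%
    return random.random() < BASELINE_RATE   # everything else: keep ~1%
```

The property that matters is that the keep/drop decision happens per trace, after all of its spans have arrived; that is what makes it tail sampling, and why the Collector has to buffer traces before deciding.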
Datadog APM, Honeycomb, and Lightstep all expose trace queries that can drive alerts. Define the alert as a query: count of traces matching a pattern over 5 minutes.
Example alert rule: more than 10 traces with span.duration > 2s on /checkout within 5 minutes. It fires when latency regresses on the path that matters.
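A sketch of that rule as a scheduled check, assuming the trace query has already been reduced to one summary row per trace; the TraceSummary shape is an assumption, and in Datadog, Honeycomb, or Lightstep the same rule is a saved trace query with a threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class TraceSummary:
    trace_id: str
    endpoint: str
    max_span_duration_s: float   # slowest span observed in this trace
    start_time: datetime

def checkout_latency_alert(traces: list[TraceSummary],
                           now: datetime | None = None) -> bool:
    """Fire when more than 10 traces have a span > 2s on /checkout in the last 5 minutes."""
    now = now or datetime.now(timezone.utc)
    window_start = now - timedelta(minutes=5)
    slow = [t for t in traces
            if t.endpoint == "/checkout"
            and t.max_span_duration_s > 2.0
            and t.start_time >= window_start]
    return len(slow) > 10
```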
What makes a trace alert sharper
Pattern matching across spans. "Auth span failed and was retried 3 times in the same trace" cannot be expressed as a metric; see the sketch after these three patterns.
Per-customer-tier alerts. Filter traces by tenant ID; fire only on enterprise tenants.
Cross-service cascades. If service A errors and service B retries 5x, fire a single alert at the cascade level.
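A sketch of all three patterns evaluated over one assembled trace; the attribute names (retry.count, tenant.tier) and the service names are assumptions, stand-ins for whatever your instrumentation emits.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    service: str
    is_error: bool
    attributes: dict = field(default_factory=dict)

def auth_retry_storm(trace: list[Span]) -> bool:
    """Auth span failed and was retried at least 3 times in the same trace."""
    return any(s.service == "auth" and s.is_error
               and s.attributes.get("retry.count", 0) >= 3
               for s in trace)

def enterprise_only(trace: list[Span]) -> bool:
    """Per-customer-tier filter: fire only for enterprise tenants."""
    return any(s.attributes.get("tenant.tier") == "enterprise" for s in trace)

def cascade(trace: list[Span], upstream: str = "service-a",
            downstream: str = "service-b") -> bool:
    """Upstream service errored and the downstream service retried 5+ times."""
    upstream_errored = any(s.service == upstream and s.is_error for s in trace)
    downstream_retries = sum(s.attributes.get("retry.count", 0)
                             for s in trace if s.service == downstream)
    return upstream_errored and downstream_retries >= 5

def should_alert(trace: list[Span]) -> bool:
    # Example composition: restrict the two patterns to enterprise tenants
    # and emit one decision per trace, so a cascade produces a single alert
    # rather than one alert per failing span.
    return enterprise_only(trace) and (auth_retry_storm(trace) or cascade(trace))
```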
The cost of trace alerts
Trace storage is expensive. Tail sampling reduces volume but adds operational complexity. Budget 10-20% of observability spend on traces.
Cardinality matters. Per-tenant labels on spans are useful but explode quickly; cap the set of labels you index for alerting.
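One way to enforce that cap is an explicit allowlist of span attributes eligible for the alerting index; a sketch, with the attribute names as assumptions, where everything outside the allowlist stays on the stored span but never becomes an alert dimension.

```python
# Assumed allowlist: low-cardinality attributes that alert rules may filter on.
INDEXED_ATTRIBUTES = {"service.name", "http.route", "tenant.tier", "status.code"}

def indexable(attributes: dict) -> dict:
    """Keep only allowlisted attributes for the alerting index.

    High-cardinality keys (tenant ID, user ID, request ID) stay on the
    stored span for debugging but are dropped here so they never become
    alert dimensions.
    """
    return {k: v for k, v in attributes.items() if k in INDEXED_ATTRIBUTES}
```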
Trace alerts evaluate more slowly than metric alerts. Expect a 1-3 minute lag versus 30-60 seconds for metrics.
When to add trace alerts
Add them once your stack has more than 10 services or per-tenant SLOs. Below that, metric alerts are enough.
Start with 3 alerts: one for the most painful regression you have seen, one for the highest-value endpoint, one for cross-service cascades.
Don't replicate metric alerts in traces. Use traces for what metrics cannot express.