Alerts From Distributed Traces

Trace-based alerts catch issues metrics miss.

What trace alerts catch

Trace-based alerts fire on patterns inside a request that metrics cannot see. Slow span in auth service, retry-storm against downstream, error chain across services; metrics see aggregates while traces see the path. A 1% error rate hides a 20% error rate inside a single user segment, and a trace alert finds it. Use cases: latency regressions in specific endpoints, error budgets per customer tier, dependency-chain failures.

How to build them

The build path is well-understood. OpenTelemetry collector with tail-sampling processor (sample slow or errored traces at 100%, normal traces at 1%); Datadog APM, Honeycomb, Lightstep all expose trace queries that can drive alerts (define the alert as a query: count of traces matching a pattern over 5 minutes); alert rule example: more than 10 traces with span.duration > 2s on /checkout in 5 minutes.

What makes a trace alert sharper

Three patterns push trace alerts past metrics. Pattern-matching across spans (“auth span failed and was retried 3 times in the same trace” cannot be expressed in metric terms); per-customer-tier alerts (filter traces by tenant ID, fire only on enterprise tenants); cross-service cascades (if service A errors and service B retries 5x, fire a single alert at the cascade level).

The cost of trace alerts

Trace alerts are not free. Trace storage is expensive (tail sampling reduces volume but adds operational complexity, budget 10-20% of observability spend on traces); cardinality matters (per-tenant labels on spans are useful but explode quickly, cap labels you index for alerting); trace alerts evaluate slower than metric alerts (expect 1-3 minute lag versus 30-60 seconds for metrics).

When to add trace alerts

The threshold is scale and tier-based SLOs. Once your stack has more than 10 services or per-tenant SLOs (below that, metric alerts are enough); start with 3 alerts (one for the most painful regression you have seen, one for the highest-value endpoint, one for cross-service cascades); don’t replicate metric alerts in traces because traces are for what metrics cannot express.