Alerts From Distributed Traces
Trace-based alerts catch issues metrics miss.
What trace alerts catch
Trace-based alerts fire on patterns inside a request that metrics cannot see: a slow span in the auth service, a retry storm against a downstream dependency, or an error chain across services. Metrics see aggregates, while traces see the path. A 1% global error rate can hide a 20% error rate inside a single user segment, and a trace alert can surface it. Typical use cases are latency regressions in specific endpoints, error budgets per customer tier, and dependency-chain failures.
- In-request patterns. Slow span, retry-storm, error chain; metrics aggregate over them.
- Path-aware vs aggregate. Traces see the path; metrics see the totals.
- Hidden subset error rate. 1% global hides 20% in a segment; trace alert finds it.
- Use cases. Endpoint regressions, per-tier error budgets, dependency-chain failures.
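The hidden-subset arithmetic can be checked with a few lines of Python. This is a minimal sketch on made-up data: the segment names and request counts are assumptions chosen so that the global rate lands at 1% while one segment sits at 20%.

```python
from collections import defaultdict

# Hypothetical root-span records as (segment, is_error) pairs.
# One segment has 400 requests with 80 errors; the rest has 9,600 with 20.
traces = (
    [("enterprise-eu", True)] * 80
    + [("enterprise-eu", False)] * 320
    + [("everyone-else", True)] * 20
    + [("everyone-else", False)] * 9580
)


def error_rates(records):
    """Return (global_rate, {segment: rate}) from (segment, is_error) pairs."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for segment, is_error in records:
        totals[segment] += 1
        errors[segment] += int(is_error)
    global_rate = sum(errors.values()) / sum(totals.values())
    per_segment = {s: errors[s] / totals[s] for s in totals}
    return global_rate, per_segment


global_rate, per_segment = error_rates(traces)
print(f"global: {global_rate:.1%}")          # global: 1.0%
for segment, rate in sorted(per_segment.items()):
    print(f"{segment}: {rate:.1%}")          # enterprise-eu: 20.0%, everyone-else: 0.2%
```

A metric alert on the global rate sees only the first number. A rule that groups by segment sees the second.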
How to build them
The build path is well understood. Run an OpenTelemetry Collector with the tail-sampling processor, keeping slow or errored traces at 100% and normal traces at 1%. Datadog APM, Honeycomb, and Lightstep all expose trace queries that can drive alerts. Define the alert as a query: a count of traces matching a pattern over 5 minutes. For example, fire when more than 10 traces on /checkout have span.duration > 2s within 5 minutes.
- OTel collector with tail sampling. Slow/errored 100%, normal 1%; the sampling primitive.
- Vendor query support. Datadog APM, Honeycomb, Lightstep all expose trace queries.
- Alert as count query. Count of matching traces over 5 minutes; the rule shape.
- Per-rule example. > 10 traces with span.duration > 2s on /checkout in 5 minutes; concrete pattern.
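The sampling decision and the count rule above can be sketched in plain Python. This is not the OpenTelemetry Collector's or any vendor's API: `TraceSummary`, `keep_trace`, and `checkout_latency_alert` are hypothetical names, and the hash-bucket sampling is an assumed stand-in for whatever probabilistic policy the collector applies.

```python
import zlib
from dataclasses import dataclass


@dataclass
class TraceSummary:
    """Hypothetical per-trace summary available once the trace is complete."""
    trace_id: str
    endpoint: str
    end_time_s: float            # completion time, seconds since epoch
    max_span_duration_s: float   # slowest span in the trace
    has_error: bool


def keep_trace(trace, slow_threshold_s=2.0, normal_rate=0.01):
    """Tail-sampling decision: keep all slow or errored traces, ~1% of the rest."""
    if trace.has_error or trace.max_span_duration_s > slow_threshold_s:
        return True
    # Deterministic hash bucket, so a given trace ID always gets the same decision.
    bucket = zlib.crc32(trace.trace_id.encode()) % 10_000
    return bucket < normal_rate * 10_000


def checkout_latency_alert(traces, now_s, window_s=300, threshold_s=2.0, max_count=10):
    """Fire when more than max_count /checkout traces in the window have a slow span."""
    matching = [
        t for t in traces
        if t.endpoint == "/checkout"
        and now_s - window_s < t.end_time_s <= now_s
        and t.max_span_duration_s > threshold_s
    ]
    return len(matching) > max_count


now = 1_000_000.0
slow = [TraceSummary(f"t{i}", "/checkout", now - 10 * i, 2.5, False) for i in range(11)]
stored = [t for t in slow if keep_trace(t)]
print(checkout_latency_alert(stored, now))  # True: 11 slow traces in the last 5 minutes
```

Because slow and errored traces are kept at 100%, the count in the rule does not need to be rescaled for sampling. A rule that counts normal traces would have to account for the 1% rate.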
What makes a trace alert sharper
Three patterns push trace alerts past what metrics can do. The first is pattern matching across spans: “auth span failed and was retried 3 times in the same trace” cannot be expressed in metric terms. The second is per-customer-tier alerting: filter traces by tenant ID and fire only on enterprise tenants. The third is cross-service cascades: if service A errors and service B retries 5x, fire a single alert at the cascade level.
- Cross-span patterns. “Auth failed, retried 3x in same trace”; metrics can’t express this.
- Per-tier filtering. Fire only on enterprise tenants; per-customer SLO enforcement.
- Cross-service cascades. Service A errors plus service B retries 5x: single cascade alert.
- Per-pattern reusable rule. Each pattern shape becomes a rule template; supports growing the catalog.
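One way to picture the rule-template idea is a small predicate factory evaluated over the spans of a single trace. The sketch below is illustrative only: the `Span` shape, the `retry.attempt` and `tenant.tier` attribute names, and the service names are all assumptions, not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """Hypothetical minimal span: just what the rules below need."""
    service: str
    is_error: bool = False
    attributes: dict = field(default_factory=dict)


def retry_pattern(failing_service, retrying_service, min_retries):
    """Rule template: failing_service errored and retrying_service retried
    at least min_retries times, all within one trace."""
    def matches(spans):
        failed = any(s.service == failing_service and s.is_error for s in spans)
        retries = sum(
            1 for s in spans
            if s.service == retrying_service
            and s.attributes.get("retry.attempt", 0) > 0  # assumed attribute
        )
        return failed and retries >= min_retries
    return matches


def tenant_tier(spans):
    """Return the first tenant tier found on any span, or None."""
    for s in spans:
        if "tenant.tier" in s.attributes:  # assumed attribute
            return s.attributes["tenant.tier"]
    return None


def should_alert(spans, rule, tiers=("enterprise",)):
    """Fire only when the trace belongs to a watched tier and the rule matches."""
    return tenant_tier(spans) in tiers and rule(spans)


# Two rules built from the same template.
auth_retried_3x = retry_pattern("auth", "auth", 3)
a_to_b_cascade = retry_pattern("service-a", "service-b", 5)

trace = [
    Span("gateway", attributes={"tenant.tier": "enterprise"}),
    Span("auth", is_error=True),
    Span("auth", is_error=True, attributes={"retry.attempt": 1}),
    Span("auth", is_error=True, attributes={"retry.attempt": 2}),
    Span("auth", attributes={"retry.attempt": 3}),
]
print(should_alert(trace, auth_retried_3x))  # True
print(should_alert(trace, a_to_b_cascade))   # False
```

Each rule is evaluated once per trace, so a cascade that produces many error spans still yields a single match rather than one alert per span.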
The cost of trace alerts
Trace alerts are not free. Trace storage is expensive: tail sampling reduces volume but adds operational complexity, so budget 10-20% of observability spend on traces. Cardinality matters: per-tenant labels on spans are useful but explode quickly, so cap the labels you index for alerting. Trace alerts also evaluate more slowly than metric alerts: expect a 1-3 minute lag versus 30-60 seconds for metrics.
- Trace storage expensive. 10-20% of observability spend; the budget item.
- Tail sampling adds complexity. Reduces volume but adds an operational tax.
- Cardinality watch on tenant labels. Per-tenant labels useful; explode quickly; cap indexed labels.
- 1-3 minute evaluation lag. Slower than metric alerts (30-60s); plan for the latency.
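Capping indexed labels can be as simple as an allowlist applied before spans reach the alerting index. The sketch below is one possible approach, not a prescribed one: it assumes a hypothetical `tenant.id` attribute and a tenant-to-tier lookup table, and it indexes the low-cardinality tier instead of the raw tenant ID.

```python
# Attributes allowed into the alerting index. "tenant.tier" is a hypothetical name.
INDEXED_KEYS = {"service.name", "http.route", "tenant.tier"}


def index_attributes(span_attributes, tenant_tiers):
    """Return only the attributes that should be indexed for alerting.

    The raw tenant ID (high cardinality) is replaced by its tier, which has
    a handful of possible values. Unknown tenants fall into "other".
    """
    attrs = dict(span_attributes)
    tenant_id = attrs.get("tenant.id")  # assumed attribute
    if tenant_id is not None:
        attrs["tenant.tier"] = tenant_tiers.get(tenant_id, "other")
    return {k: v for k, v in attrs.items() if k in INDEXED_KEYS}


tiers = {"t-42": "enterprise"}
raw = {"service.name": "checkout", "http.route": "/checkout",
       "tenant.id": "t-42", "user.email": "someone@example.com"}
print(index_attributes(raw, tiers))
# {'service.name': 'checkout', 'http.route': '/checkout', 'tenant.tier': 'enterprise'}
```

The full tenant ID can still live on the stored span for debugging. Only the indexed view used by alert rules is capped.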
When to add trace alerts
The threshold is scale and tier-based SLOs. Add trace alerts once your stack has more than 10 services or per-tenant SLOs; below that, metric alerts are enough. Start with 3 alerts: one for the most painful regression you have seen, one for the highest-value endpoint, and one for cross-service cascades. Don’t replicate metric alerts in traces, because traces are for what metrics cannot express.
- 10+ services or per-tenant SLOs. The investment threshold; below that, metric alerts suffice.
- Start with 3 alerts. Painful regression, highest-value endpoint, cross-service cascade.
- Don’t replicate metric alerts. Traces for what metrics cannot express; not duplication.
- Per-quarter trace-alert review. Justify each new alert; the review keeps the catalog disciplined.