Metrics vs Traces for Performance
Two complementary views of one system.
Overview
Metrics and traces are not competing observability signals; they are complementary views of the same system. Metrics show aggregate behaviour over time at low cost; traces show per-request detail at higher cost. Performance investigation almost always uses both: metrics narrow the question, traces answer it. Treating either one as “the” observability signal leaves the other half of the problem invisible.
- Two different views of the same system. Metrics aggregate; traces follow individual requests. Both are needed; neither is sufficient.
- Metrics for aggregate trends. Per-service latency, error rate, throughput over time. The dashboards on-call watches.
- Traces for per-request detail. The end-to-end path of a single request through the system. The data that explains why p99 spiked.
- Combined investigation plus per-tier budget. Metrics narrow the question and traces resolve it; a per-tier observability budget keeps both affordable.
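One way to picture the per-tier budget from the last item is a trace sampling rate that depends on the service tier, while metrics keep counting every request. The sketch below is illustrative only: the tier names, the rates, and the `should_sample` helper are assumptions, not part of any particular tracing library. It hashes the trace ID so that every service reaches the same keep-or-drop decision for a given request.

```python
import hashlib

# Hypothetical tiers and rates (assumptions for illustration only).
TRACE_SAMPLE_RATE_BY_TIER = {
    "critical": 1.0,   # keep every trace on revenue-critical paths
    "standard": 0.10,  # keep roughly 1 in 10
    "batch": 0.01,     # keep roughly 1 in 100
}
DEFAULT_SAMPLE_RATE = 0.01  # fallback for tiers not listed above


def should_sample(trace_id: str, tier: str) -> bool:
    """Decide deterministically whether to keep a trace.

    The decision depends only on the trace ID and the tier's rate, so
    repeated calls with the same inputs always agree.
    """
    rate = TRACE_SAMPLE_RATE_BY_TIER.get(tier, DEFAULT_SAMPLE_RATE)
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to an integer in [0, 2**64).
    bucket = int.from_bytes(digest[:8], "big")
    # Integer comparison avoids float rounding at the rate == 1.0 edge.
    return bucket < int(rate * 2**64)
```

Lowering a tier's rate reduces trace storage roughly in proportion, and the metrics for that tier are unaffected because they are aggregated from every request rather than from the sampled traces.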
The approach
Three habits produce fast root-cause analysis: metrics for trends, traces for investigation, and the discipline to use both together rather than picking a favourite. A per-tier budget keeps the practice affordable.
- Metrics for trends. Per-service latency, error rate, throughput on the standing dashboard. The view operations starts every shift with.
- Traces for investigation. Per-request path with span timing. The data that turns “p99 is high” into “this database call is slow”.
- Combined investigation flow. Metrics narrow which service or endpoint is misbehaving; traces explain why.
- Per-tier observability budget plus documented strategy. Sampling and retention are tuned per tier; each team’s observability strategy lives in its runbook.
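The metric-then-trace flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a real backend query: the `Span` record, the function names, the per-service p99 budgets, and the idea that `duration_ms` holds a span's self time (excluding children) are all hypothetical. Step one uses aggregated latencies to find which service breaches its budget; step two looks inside the slowest sampled trace for that service to find the dominant span.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    service: str
    operation: str
    duration_ms: float  # assumed to be self time, excluding child spans


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, -(-pct * len(ordered) // 100))  # integer ceiling
    return ordered[rank - 1]


def narrow_with_metrics(latencies_by_service, p99_budget_ms):
    """Metrics step: return services whose p99 exceeds budget, worst first."""
    breach_ratio = {}
    for service, samples in latencies_by_service.items():
        if not samples:
            continue  # no traffic in the window, nothing to judge
        observed = percentile(samples, 99)
        if observed > p99_budget_ms[service]:
            breach_ratio[service] = observed / p99_budget_ms[service]
    return sorted(breach_ratio, key=breach_ratio.get, reverse=True)


def explain_with_traces(spans, service):
    """Trace step: in the slowest trace touching the service, return the longest span."""
    by_trace = defaultdict(list)
    for span in spans:
        by_trace[span.trace_id].append(span)
    candidates = [
        trace for trace in by_trace.values()
        if any(s.service == service for s in trace)
    ]
    if not candidates:
        return None  # no sampled trace covers this service
    slowest = max(candidates, key=lambda t: sum(s.duration_ms for s in t))
    return max(slowest, key=lambda s: s.duration_ms)
```

A `None` result from the trace step is itself informative: the sampling rate for that tier may be too low to catch the slow requests the metrics are reporting.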
Why this compounds
Each combined investigation deepens the team’s observability fluency. The patterns transfer between services; new services inherit the metric/trace conventions instead of recreating them.
- Faster root cause. Choosing the right signal for the question cuts MTTR on recurring incident classes.
- Cost efficiency. Sampling traces and aggregating metrics keeps observability spend matched to value.
- Engineering culture shifts. Investigation moves from guessing to evidence. PR reviews start citing trace data.
- Year-one investment, year-two habit. The first combined investigation is a heavy lift. By year two, the metric-then-trace flow is muscle memory.