Distributed Tracing for People Who Have Never Set It Up
Distributed tracing is simpler than the marketing makes it look. Here is the mental model, the minimum setup, and what to actually do with the data.
The mental model
A trace is a single user request's journey through your system. It is made of spans; each span is one unit of work (an HTTP call, a DB query, a cache lookup). Spans have parents and children; together they form a tree that shows who called what, how long each took, and where the time went.
That is the whole idea. Everything else is implementation detail.
Span vs trace vs context
- Trace: one end-to-end request, identified by a trace ID.
- Span: one operation inside the trace (e.g., “GET /checkout”, “db.query”, “stripe.charge”), identified by a span ID and a parent span ID.
- Context propagation: the act of passing the trace ID and parent span ID across service boundaries via HTTP headers (W3C traceparent).
Propagation is where setups usually break. If service A doesn't pass the traceparent header to service B, B starts a new trace and the tree breaks.
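To make the propagation mechanics concrete, here is a sketch of what a W3C traceparent header looks like and how a service would build the outgoing one by hand. The helper names and the span ID values are illustrative, not from any library; in practice the auto-instrumentation does this for you.

```javascript
// A traceparent header is "version-traceid-spanid-flags", all lowercase hex.
// parseTraceparent is a hypothetical helper, shown for illustration.
function parseTraceparent(header) {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  return { version, traceId, spanId, flags };
}

// When service A calls service B, the outgoing header keeps the same
// trace ID but carries A's current span ID as the new parent.
function outgoingTraceparent(incoming, currentSpanId) {
  const ctx = parseTraceparent(incoming);
  if (!ctx) return null; // no valid context: the callee starts a new trace
  return `00-${ctx.traceId}-${currentSpanId}-${ctx.flags}`;
}
```

If `outgoingTraceparent` returns null (a malformed or missing header), that is exactly the point where the tree breaks and service B begins an unrelated trace.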
Minimum setup for one service
For a Node.js service with OpenTelemetry, the minimum is:
npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node
// tracing.js
// With no exporter configured, recent NodeSDK versions default to OTLP,
// reading the endpoint from OTEL_EXPORTER_OTLP_* environment variables.
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
new NodeSDK({ instrumentations: [getNodeAutoInstrumentations()] }).start();
// index.js
require('./tracing');
// ... rest of app
Auto-instrumentation wires up HTTP, Express, common DB drivers, and outbound fetches. With zero manual span creation, you already see traces in whatever backend (Jaeger, Tempo, Honeycomb) you pointed the exporter at.
Sampling, honestly
You can't store every trace at any real scale. Head-based sampling (decide at trace start) is simple and cheap but misses tail incidents. Tail-based sampling (decide after the trace completes) keeps all the slow and errored ones but costs more infrastructure.
A sensible starting policy: keep 100% of errored traces, 100% of traces slower than your p99 target, and a 1% random sample of the rest. Note that the first two require tail-based decisions, since status and duration aren't known at trace start. Tune from there.
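The starter policy above can be sketched as a single tail-based decision function. The threshold and sample rate are illustrative assumptions, and the trace shape is a simplified stand-in for whatever your collector hands you:

```javascript
// Assumed numbers, not recommendations: tune both to your system.
const P99_THRESHOLD_MS = 2000; // stand-in for your p99 latency target
const BASE_SAMPLE_RATE = 0.01; // 1% random sample of ordinary traces

// Tail-based decision: runs after the trace completes, so it can see
// the final status and total duration. `rand` is injectable for testing.
function keepTrace(trace, rand = Math.random) {
  if (trace.hasError) return true;                       // 100% of errors
  if (trace.durationMs >= P99_THRESHOLD_MS) return true; // 100% of slow traces
  return rand() < BASE_SAMPLE_RATE;                      // 1% of the rest
}
```

In a real deployment this logic lives in the collector (e.g., a tail-sampling processor), not in your application code; the sketch only shows the shape of the decision.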
Three things tracing tells you that logs can't
- Where the time actually went. A 2-second p99 latency, broken down: 400ms in auth, 80ms in DB, 1.4s in a third-party API you forgot about.
- The shape of a fan-out. One incoming request turns into 14 outgoing calls; 12 were sequential when they could have been parallel.
- Which service caused the upstream error. Three services deep, one returned a 500 and the top service logged a 502. The trace points at the root.
These are the patterns logs cannot see. Once you have them, you won't want to debug prod without them.
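The fan-out finding usually has a one-line fix. A sketch of the before and after, with hypothetical service calls passed in as functions:

```javascript
// The shape a trace exposes: each await blocks the next call,
// so total time is the sum of all three.
async function sequential(fetchProfile, fetchCart, fetchRecs) {
  const profile = await fetchProfile();
  const cart = await fetchCart();
  const recs = await fetchRecs();
  return { profile, cart, recs };
}

// The fix when the calls don't depend on each other: launch them
// together, so total time is roughly the slowest single call.
async function parallel(fetchProfile, fetchCart, fetchRecs) {
  const [profile, cart, recs] = await Promise.all([
    fetchProfile(), fetchCart(), fetchRecs(),
  ]);
  return { profile, cart, recs };
}
```

Both return the same data; the trace is what shows you which shape you actually have in production.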
Wiring propagation right the first time
The single most common broken-tracing setup is a service that does not forward the traceparent header on outbound calls. Auto-instrumentation forwards it for the common HTTP clients; for anything it doesn't cover, forward the header yourself.
Test propagation by sending a request and checking that the resulting trace has the expected service chain. If any link is missing, the outbound header is dropped there.
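That check can be automated against the exported spans. The span shape below is a simplified stand-in, not a real exporter format; adapt the field names to whatever your backend returns:

```javascript
// Given the spans of one trace, verify that every expected service
// appears and that they all share a single trace ID. A split trace ID
// or a missing service pinpoints where the header was dropped.
function checkServiceChain(spans, expectedServices) {
  const traceIds = new Set(spans.map(s => s.traceId));
  if (traceIds.size !== 1) {
    return { ok: false, reason: 'trace split: multiple trace IDs' };
  }
  const seen = new Set(spans.map(s => s.service));
  const missing = expectedServices.filter(svc => !seen.has(svc));
  return missing.length === 0
    ? { ok: true }
    : { ok: false, reason: `missing services: ${missing.join(', ')}` };
}
```

Run it after deploying any new service-to-service call path; a failing result names the link where propagation broke.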
Once propagation works, the rest of tracing is just deciding what to sample and where to store it. Get the propagation right before anything else.