SRE Tools By Samson Tanimawo, PhD Published Jun 18, 2025 10 min read

Distributed Tracing for People Who Have Never Set It Up

Distributed tracing is simpler than the marketing makes it look. Here is the mental model, the minimum setup, and what to actually do with the data.

The mental model

A trace is a single user request's journey through your system. It is made of spans; each span is one unit of work (an HTTP call, a DB query, a cache lookup). Spans have parents and children; together they form a tree that shows who called what, how long each took, and where the time went.

That is the whole idea. Everything else is implementation detail.
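The model fits in a few lines of plain JavaScript. The span names, durations, and IDs below are invented for illustration:

```javascript
// A toy trace: each span records its parent, so the set of spans forms a tree.
const spans = [
  { id: 'a', parentId: null, name: 'GET /checkout', durationMs: 2000 },
  { id: 'b', parentId: 'a',  name: 'auth.verify',   durationMs: 400 },
  { id: 'c', parentId: 'a',  name: 'db.query',      durationMs: 80 },
  { id: 'd', parentId: 'a',  name: 'payments.api',  durationMs: 1400 },
];

// Walk the tree and print who called what and where the time went.
function printTree(spans, parentId = null, depth = 0) {
  for (const s of spans.filter((s) => s.parentId === parentId)) {
    console.log(`${'  '.repeat(depth)}${s.name} (${s.durationMs}ms)`);
    printTree(spans, s.id, depth + 1);
  }
}

printTree(spans);
```

One root span for the incoming request, one child per downstream call; the tree view is exactly what a tracing UI renders.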

Span vs trace vs context

A trace groups spans; the context is the small bundle of identifiers (trace ID, parent span ID, sampling flag) that travels with the request so each service's spans attach to the right trace. Propagation is where setups usually break: if service A doesn't pass the traceparent header to service B, B starts a new trace and the tree breaks.

Minimum setup for one service

For a Node.js service with OpenTelemetry, the minimum is:

npm i @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node

// tracing.js — load this before any other module so the
// auto-instrumentations can patch http, express, etc. as they are required
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');

const sdk = new NodeSDK({ instrumentations: [getNodeAutoInstrumentations()] });
sdk.start();

// index.js
require('./tracing');
// ... rest of app

The auto-instrumentations wire up HTTP, Express, common DB drivers, and outbound fetches. With zero manual span creation, you already see traces in whatever backend (Jaeger, Tempo, Honeycomb) you pointed the exporter at.

Sampling, honestly

You can't store every trace at any real scale. Head-based sampling (decide at trace start) is simple and cheap but misses tail incidents. Tail-based sampling (decide after the trace completes) keeps all the slow and errored ones but costs more infrastructure.

Sensible starter: 100% of errors, 100% of p99 slow traces, 1% random sample of the rest. Tune from there.
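The starter policy is easy to express as a decision function. The 2000ms threshold below is a stand-in for your measured p99, and the rate is the 1% suggested above; both are tuning knobs, not magic numbers:

```javascript
// Keep a trace if it errored, if it was slow, or on a small random sample.
function shouldKeep(trace, { slowMs = 2000, sampleRate = 0.01, rand = Math.random } = {}) {
  if (trace.hasError) return true;             // 100% of errors
  if (trace.durationMs >= slowMs) return true; // 100% of slow traces
  return rand() < sampleRate;                  // small sample of the rest
}
```

Note that this needs the whole trace's outcome, so it is a tail-based decision; a head-based sampler can only use the random-sample branch.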

Three things tracing tells you that logs can't

  1. Where the time actually went. A 2-second p99 latency, broken down: 400ms in auth, 80ms in DB, 1.4s in a third-party API you forgot about.
  2. The shape of a fan-out. One incoming request turns into 14 outgoing calls; 12 were sequential when they could have been parallel.
  3. Which service caused the upstream error. Three services deep, one returned a 500 and the top service logged a 502. The trace points at the root.

These are the patterns logs cannot see. Once you have them, you won't want to debug prod without them.
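Pattern 2, sequential calls that could be parallel, falls straight out of span timestamps: sibling spans whose time ranges never overlap ran one after another. A sketch with invented timings:

```javascript
// Count sibling spans that ran strictly one-after-another (no time overlap).
function countSequential(siblings) {
  const sorted = [...siblings].sort((a, b) => a.startMs - b.startMs);
  let sequential = 0;
  for (let i = 1; i < sorted.length; i++) {
    if (sorted[i].startMs >= sorted[i - 1].startMs + sorted[i - 1].durationMs) sequential++;
  }
  return sequential;
}

const calls = [
  { name: 'svc-1', startMs: 0,   durationMs: 100 },
  { name: 'svc-2', startMs: 100, durationMs: 100 }, // starts after svc-1 ends
  { name: 'svc-3', startMs: 120, durationMs: 50 },  // overlaps svc-2: parallel
];
console.log(countSequential(calls)); // 1 sequential handoff
```

A high count among siblings under one parent is the signature of a fan-out that could be parallelized.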


Wiring propagation right the first time

The single most common broken-tracing setup is a service that does not forward the traceparent header on outbound calls. Every HTTP client library has a way to auto-forward; use it.

Test propagation by sending a request and checking that the resulting trace has the expected service chain. If any link is missing, the outbound header is dropped there.
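That check is scriptable: pull the trace's spans from your backend, collect the service names, and compare against the chain you expect. The fetch step is backend-specific, so the sketch below takes the spans as input; the service names are made up:

```javascript
// Given a trace's spans, verify the expected service chain is unbroken.
// Returns the first missing service, or null if the chain is intact.
function findBrokenLink(spans, expectedChain) {
  const seen = new Set(spans.map((s) => s.service));
  for (const service of expectedChain) {
    if (!seen.has(service)) return service; // the traceparent was dropped before here
  }
  return null;
}

const traceSpans = [
  { service: 'gateway' },
  { service: 'orders' },
  // 'payments' missing: orders did not forward the traceparent header
];
console.log(findBrokenLink(traceSpans, ['gateway', 'orders', 'payments'])); // 'payments'
```

Run it against a known request path after every deploy and a dropped header shows up as a named service instead of a mystery.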

Once propagation works, the rest of tracing is just deciding what to sample and where to store it. Get the propagation right before anything else.