What distributed tracing is and the problem it solves
Distributed tracing follows a single request as it travels through every service, queue, cache, and database that touches it, recording each step so you can see exactly where time and errors went. The problem it solves is specific and brutal: in a modern architecture, one user click does not run as a single function you can step through with a debugger. It fans out into a dozen or more network calls across services owned by different teams, and when that request is slow or fails, no single component can tell you why, because each one only sees its own slice of the work.
Consider the everyday failure. A customer reports that checkout is slow. Your metrics confirm it: the p99 latency on the checkout endpoint jumped from 200ms to 1.4 seconds. But the checkout service itself looks healthy: its CPU is fine and its own logs show nothing alarming. The slowness is somewhere downstream, in one of the eight services checkout calls, or one of the services those services call. Without tracing, finding it means paging several teams, each of whom checks their own dashboards, each of whom reports "looks fine on my side," while the minutes tick by. With tracing, you open one request, read the waterfall, and see that a single call to the inventory service took 1.1 seconds because it was waiting on a slow database query. The investigation goes from a multi-team meeting to a glance.
This is the gap tracing fills that the other signals cannot. To place it precisely: tracing is one of the three pillars of observability, the broad capability to understand a system from the telemetry it emits. The other two pillars are metrics and logs. Observability is the parent practice and the right page to read first if you are building the whole stack; this guide is the deep dive on the traces pillar specifically. Metrics tell you that something is wrong (latency is up). Logs tell you what happened on a single component (this query returned an error). Traces tell you where, across a distributed call path, the time and the errors actually went. Each is strong where the others are weak, and tracing is the only one that reconstructs the journey of a request across service boundaries. For the AI and LLM-specific version of this, where you trace prompts, token usage, and model calls, see the sibling guide to AI observability.
The value of tracing scales with the number of network hops per request. In a monolith, a single stack trace already shows you the full call path, so tracing adds little. The moment your architecture becomes distributed, tracing stops being optional polish and becomes the difference between a diagnosable system and an undiagnosable one. The rest of this guide is the full mechanism: the data model that records the journey, the propagation that keeps it connected, the instrumentation that produces it, the sampling that makes it affordable, how to read it, and how it drives faster resolution.
The data model: traces, spans, attributes, and links
To read traces fluently you need the vocabulary, and there are only a handful of concepts. Get these and the rest of distributed tracing follows.
The span: the atomic unit of work
The fundamental building block is the span: one named, timed operation. A span might represent "GET /checkout", "SELECT from orders", "publish to Kafka", or "call inventory service". Every span carries a small, fixed set of fields: a name (the operation), a start timestamp, a duration, a status (ok or error), a unique span ID, and a trace ID shared by every span in the same request. Spans are the rows that tracing is built out of, and almost everything else is a relationship between them.
The trace and the parent/child tree
A trace is the complete tree of spans for one request. It begins with a root span, the operation that entered your system, typically the inbound HTTP request. Every operation that root span triggers becomes a child span, and each child records the span ID of its parent. Those parent references are what assemble flat spans into a tree: the root has children, those children have their own children, and the nesting encodes causality and timing. Render that tree as a horizontal waterfall, with each span a bar positioned by its start time and sized by its duration, and you can read the whole request at a glance: which operations ran in sequence, which ran in parallel, and which one was the long pole.
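To make the tree assembly concrete, here is a minimal sketch in plain Python. It is not the OpenTelemetry data model or API; the `Span` class, the field names, and the sample values are assumptions made for illustration. It shows how parent references alone are enough to turn a flat, unordered list of spans into an indented outline.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    # Hypothetical minimal span record; real SDKs carry more fields.
    trace_id: str
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    duration_ms: int
    status: str = "ok"


def build_tree(spans):
    """Return (root, children_by_parent_id) from a flat list of spans."""
    children = {}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            children.setdefault(s.parent_id, []).append(s)
    # Order siblings by start time so the outline reads like a waterfall.
    for kids in children.values():
        kids.sort(key=lambda s: s.start_ms)
    return root, children


def render(span, children, depth=0, lines=None):
    """Produce one indented text line per span, parents before children."""
    lines = [] if lines is None else lines
    indent = "  " * depth
    lines.append(f"{indent}{span.name} [{span.start_ms}ms +{span.duration_ms}ms] {span.status}")
    for child in children.get(span.span_id, []):
        render(child, children, depth + 1, lines)
    return lines


# Spans arrive in arbitrary order; only the parent IDs relate them.
spans = [
    Span("t1", "b", "a", "query-db", 15, 60),
    Span("t1", "a", None, "GET /cart", 0, 100),
    Span("t1", "c", "a", "render", 80, 15),
]
root, children = build_tree(spans)
outline = render(root, children)
print("\n".join(outline))
```

A tracing UI does the same thing with bars instead of indentation: position each bar by its start time and size it by its duration.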
Attributes, events, and links
Three more concepts add the detail that makes traces diagnostic rather than just structural. Attributes (sometimes called tags) are key/value pairs attached to a span: http.status_code=500, db.statement, customer.tier=enterprise, region=eu-west. Attributes are what let you filter and group traces after the fact, so choosing good ones is most of the craft of instrumentation. Events are timestamped annotations inside a single span, useful for marking a discrete moment such as an exception being thrown or a cache miss occurring partway through an operation. Span links connect spans across different traces, which matters for the cases the simple parent/child tree cannot express: a batch job whose work was triggered by many upstream requests, or a message consumed long after the producing trace finished. Links are how you keep causality visible when one trace fans into or out of another.
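As a rough illustration of how these three concepts hang off a span, the sketch below models them with plain Python dataclasses. The class names, fields, and sample values are hypothetical and chosen for readability; they are not the OpenTelemetry SDK types.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str
    offset_ms: int  # time since the owning span started


@dataclass
class Link:
    # Points at a span that lives in a different trace.
    trace_id: str
    span_id: str


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    attributes: dict = field(default_factory=dict)
    events: list = field(default_factory=list)
    links: list = field(default_factory=list)


def where(spans, wanted):
    """Return the spans whose attributes match every key/value in `wanted`."""
    return [
        s for s in spans
        if all(s.attributes.get(k) == v for k, v in wanted.items())
    ]


checkout = Span(
    "trace-a", "s1", "POST /checkout",
    attributes={"customer.tier": "enterprise", "region": "eu-west"},
)
# A batch job triggered by two upstream requests links back to both traces.
batch = Span(
    "trace-batch", "s9", "nightly-reindex",
    attributes={"region": "eu-west"},
    events=[Event("cache.miss", 12)],
    links=[Link("trace-a", "s1"), Link("trace-b", "s4")],
)
spans = [checkout, batch]
enterprise = where(spans, {"customer.tier": "enterprise"})
print([s.name for s in enterprise])
```

The `where` helper is the after-the-fact filtering that good attributes make possible: without `customer.tier` on the span, there is nothing to filter on.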
A worked example
Put it together with a concrete checkout request. The root span POST /checkout starts at time zero and lasts 1,400ms. Under it sit four child spans: validate-cart (20ms), call-inventory (1,150ms), call-payment (180ms), and write-order (40ms). The call-inventory span has one child of its own, SELECT FROM stock (1,100ms), carrying the attribute db.statement with the offending query and a status of error. Reading this trace top to bottom answers the entire incident in one picture: the request was slow because call-inventory dominated it, and call-inventory was slow because of a single database query that also errored. No log-hopping, no cross-team paging. The waterfall named the service, the call, and the line of evidence all at once.
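The same checkout trace can be written down as data and interrogated. The sketch below uses a nested dictionary layout that is purely an assumption for illustration; the names and durations come from the example above, and the `slowest_path` helper is hypothetical.

```python
# The worked example as nested dictionaries (durations in milliseconds).
trace = {
    "name": "POST /checkout", "duration_ms": 1400, "children": [
        {"name": "validate-cart", "duration_ms": 20, "children": []},
        {"name": "call-inventory", "duration_ms": 1150, "children": [
            {"name": "SELECT FROM stock", "duration_ms": 1100,
             "status": "error", "children": []},
        ]},
        {"name": "call-payment", "duration_ms": 180, "children": []},
        {"name": "write-order", "duration_ms": 40, "children": []},
    ],
}


def slowest_path(span):
    """Follow the longest child at each level, from the root to a leaf."""
    path = [span]
    while span["children"]:
        span = max(span["children"], key=lambda s: s["duration_ms"])
        path.append(span)
    return path


path = slowest_path(trace)
names = [s["name"] for s in path]
inventory_share = path[1]["duration_ms"] / trace["duration_ms"]
print(" -> ".join(names))
print(f"call-inventory share of the request: {inventory_share:.0%}")
```

Following the longest child at each level is a crude heuristic, but on this trace it lands on the erroring query directly.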
Context propagation: how a trace stays connected
Spans are easy. The hard part of distributed tracing, the part that determines whether you have a real tracing system or a pile of disconnected fragments, is context propagation: making the spans from many independent services link into one trace.
The mechanism
The core idea is simple. When service A makes a call to service B, A injects two pieces of identity into the outbound request: the trace ID of the current request and the span ID of the span making the call. Service B reads that injected context, treats the incoming span ID as its parent, continues the same trace ID, and emits its own spans as children. Because every service does the same thing on every outbound call, the spans from a dozen services that never share a database all stitch together into one coherent tree, held together by nothing more than the trace context riding along in request headers.
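The inject and extract steps can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the two "services" are functions in one process, the `collected` list stands in for a tracing backend, and the `x-demo-*` header names are invented (real systems use the standard header described in the next section).

```python
import secrets

collected = []  # stand-in for a tracing backend


def start_span(name, trace_id=None, parent_id=None):
    """Create a span, continuing an existing trace if a trace ID is given."""
    span = {
        "trace_id": trace_id or secrets.token_hex(16),
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
        "name": name,
    }
    collected.append(span)
    return span


def inject(span, headers):
    # Hypothetical header names, for illustration only.
    headers["x-demo-trace-id"] = span["trace_id"]
    headers["x-demo-parent-id"] = span["span_id"]


def extract(headers):
    return headers.get("x-demo-trace-id"), headers.get("x-demo-parent-id")


def service_b(headers):
    # B continues A's trace and treats A's span as its parent.
    trace_id, parent_id = extract(headers)
    return start_span("B: reserve-stock", trace_id, parent_id)


def service_a():
    span = start_span("A: POST /checkout")
    headers = {}
    inject(span, headers)        # context rides along with the request
    child = service_b(headers)   # stands in for a network call
    return span, child


a_span, b_span = service_a()
print(b_span["trace_id"] == a_span["trace_id"], b_span["parent_id"] == a_span["span_id"])
```

If `service_a` skipped the `inject` call, `service_b` would receive no context, mint a fresh trace ID, and the two spans would never be joined.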
W3C Trace Context and the traceparent header
For cross-process calls, the context has to travel inside the request itself, and the industry has standardized on exactly how. The W3C Trace Context specification defines the traceparent HTTP header, a compact string that carries a version, the trace ID, the parent span ID, and trace flags that include the sampling decision. OpenTelemetry uses traceparent by default, which is why instrumentation from different vendors and languages interoperates: they all read and write the same header. Standardizing the wire format is what made tracing portable across a polyglot fleet, where one service is in Go, the next in Python, and the next in Java.
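For reference, a version 00 traceparent value is four lowercase hex fields joined by hyphens: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags whose lowest bit is the sampled flag. The sketch below formats and parses that layout in plain Python. The function names are hypothetical, and it deliberately handles only the version 00 layout; a production parser should follow the specification's rules for future versions.

```python
import re

# version - trace-id - parent-id - trace-flags, all lowercase hex
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def format_traceparent(trace_id, parent_id, sampled):
    """Build a version 00 traceparent value."""
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(header):
    """Return the parsed fields, or None if the value is not usable."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":
        return None  # simplification: this sketch only accepts version 00
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(header)
print(ctx)
```

Because the sampled bit travels in the header, a sampling decision made at the root can be honored by every downstream service.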
In-process vs cross-process, and the most common failure
Propagation happens at two scales. In-process propagation carries the current span across function calls and async boundaries within a single service, usually through a context object the SDK threads for you. Cross-process propagation carries it between services over the network via traceparent. Both have to hold for a trace to stay whole, and the single most common tracing failure in practice is a broken propagation chain: one hop that does not forward the context, so the trace splits into two unconnected fragments. The usual culprits are an HTTP client that was not instrumented, a message queue where context was not attached to the message, or a manually spawned thread that lost the in-process context. Getting propagation right on every hop, including async boundaries like queues and background jobs, is most of the real work of adopting tracing. When a trace looks suspiciously short, a dropped propagation hop is the first thing to check.
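The manually spawned thread case is easy to reproduce with the Python standard library. The sketch below uses `contextvars` as a stand-in for the context object a tracing SDK would manage; the variable name and span value are hypothetical. Depending on the Python version and build, a plain `threading.Thread` may start with an empty context, so the sketch captures the caller's context explicitly and runs the worker inside it.

```python
import contextvars
import threading

# Stand-in for the "current span" an SDK tracks per request.
current_span = contextvars.ContextVar("current_span", default=None)


def handle_request():
    current_span.set("span-abc")
    seen = {}

    def worker(label):
        # Whatever the worker reads here is what its child spans would
        # use as their parent.
        seen[label] = current_span.get()

    # Capture the caller's context and run the worker inside the copy,
    # so the background thread sees the same current span.
    ctx = contextvars.copy_context()
    thread = threading.Thread(target=ctx.run, args=(worker, "propagated"))
    thread.start()
    thread.join()
    return seen


result = handle_request()
print(result)
```

The same idea applies to thread pools and job queues: whatever hands work to another execution unit has to hand the context over with it.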
Instrumentation: OpenTelemetry and the Collector
Instrumentation is the code that actually produces spans. You have two ways to generate it, and one standard you should generate it against.
Manual vs automatic instrumentation
Automatic instrumentation uses agents or libraries that hook into common frameworks, your web server, HTTP client, database driver, and message-queue library, and emit spans for you with zero code changes. It is how you get a useful trace on day one: install the auto-instrumentation for your language, and inbound requests, outbound calls, and database queries start producing spans automatically. Manual instrumentation is code you write to create spans around your own business logic and to attach the attributes that make traces searchable in your domain, the customer tier, the feature flag, the order ID. The right approach is both: auto-instrumentation for broad framework-level coverage cheaply, manual spans and attributes for the application-specific detail that turns a generic trace into a diagnostic one. Start with automatic, then add manual where the auto-generated spans leave gaps.
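To show the shape of manual instrumentation, here is a small sketch built only on the standard library. It is not the OpenTelemetry API: the `span` context manager, the `finished` list, and the attribute names are assumptions made for illustration. The pattern it shows is the general one: wrap a unit of business logic, attach domain attributes, and record duration and error status on the way out.

```python
import time
from contextlib import contextmanager

finished = []  # stand-in for an exporter


@contextmanager
def span(name, **attributes):
    """Time a block of code and record it as a span-like dictionary."""
    record = {"name": name, "attributes": dict(attributes), "status": "ok"}
    started = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["attributes"]["exception.type"] = type(exc).__name__
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - started) * 1000
        finished.append(record)


def apply_discount(order_id, tier):
    # Domain attributes make the span searchable later.
    with span("apply-discount", order_id=order_id, customer_tier=tier) as s:
        rate = 0.1 if tier == "enterprise" else 0.0
        s["attributes"]["discount_rate"] = rate
        return rate


rate = apply_discount("o-1001", "enterprise")
print(finished[-1]["name"], finished[-1]["attributes"])
```

Auto-instrumentation would never produce an `apply-discount` span or know about `customer_tier`; that detail only exists if someone writes it.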
OpenTelemetry: the standard to instrument against
OpenTelemetry (OTel) is the open, vendor-neutral standard for generating and collecting traces, along with metrics and logs, and it has effectively won as the default. It gives you one tracing API and a set of SDKs to instrument against, auto-instrumentation libraries for common frameworks, the W3C context propagation described above out of the box, and a consistent data model across every language. The strategic payoff is portability: you instrument your code once against OTel, and your spans are not tied to any single backend. Before OTel, choosing a tracing vendor meant adopting their proprietary agent and getting locked in; with OTel, the instrumentation is yours and the backend is a swappable detail. If you are starting tracing in 2026, instrument with OpenTelemetry first and choose backends second.
The Collector: your control point
The OpenTelemetry Collector is a standalone service that sits between your instrumented applications and your tracing backend. It receives spans, processes them (batching, filtering, adding attributes, redacting sensitive data), applies sampling, and exports the result to one or more backends. The Collector matters because it is the single place where you shape and budget your telemetry without touching application code. Want to drop a noisy health-check span, redact a field, or switch backends? You change the Collector config, not every service. As the next section shows, the Collector is also where the most powerful form of sampling lives, which makes it the natural cost-control point for the whole tracing pipeline.
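The receive, process, export flow can be sketched as a chain of functions. This is a conceptual model in plain Python, not the Collector's configuration format or code (the real Collector is configured declaratively); the processor names, the `GET /healthz` span name, and the `user.email` attribute are all assumptions made for illustration.

```python
def drop_health_checks(spans):
    """Filter processor: discard noisy, low-value spans."""
    return [s for s in spans if s["name"] != "GET /healthz"]


def redact(keys):
    """Build a processor that masks the given attribute keys."""
    def processor(spans):
        return [
            {**s, "attributes": {
                k: ("[REDACTED]" if k in keys else v)
                for k, v in s["attributes"].items()
            }}
            for s in spans
        ]
    return processor


def run_pipeline(spans, processors, exporters):
    """Apply each processor in order, then hand the result to every exporter."""
    for process in processors:
        spans = process(spans)
    for export in exporters:
        export(spans)
    return spans


incoming = [
    {"name": "GET /healthz", "attributes": {}},
    {"name": "POST /checkout",
     "attributes": {"user.email": "a@example.com", "region": "eu-west"}},
]
exported = []  # stand-in for a backend
result = run_pipeline(
    incoming,
    processors=[drop_health_checks, redact({"user.email"})],
    exporters=[exported.extend],
)
print(result)
```

Adding a second backend is one more entry in `exporters`, and none of the services that produced the spans need to know.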
Sampling: head-based vs tail-based
Here is the constraint that shapes every real tracing deployment: you cannot keep every trace. A busy system generates an enormous volume of spans, and storing and querying all of them can cost nearly as much as running the system you are observing. Worse, the overwhelming majority of traces are uninteresting, successful requests that look exactly like every other successful request and carry almost no diagnostic value. Sampling is how you keep the rare, informative traces while discarding the redundant healthy majority, so you pay for signal instead of raw volume. There are two fundamental strategies, and the difference between them is one of the most consequential choices in tracing.
Head-based sampling
Head-based sampling decides whether to keep a trace at the very beginning, at the head, before the request has finished. The typical implementation keeps a fixed random percentage: trace 1% of requests, drop the rest, decided by a dice roll at the root span and propagated to every child via the sampling flag in traceparent. Its virtue is that it is cheap and simple, the decision is made once, up front, with no buffering, and it consumes almost no memory. Its fatal flaw is that it is blind: it decides before it knows whether the request errored or was slow, so it throws away errors and slow outliers just because they lost the dice roll. With 1% head sampling, you keep 1% of your errors too, which is exactly backwards from what you want.
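A small simulation makes the blindness visible. The sketch below is an assumption-laden illustration, not any SDK's sampler: it derives the keep/drop decision from the low 64 bits of a random trace ID (one common way to make the decision deterministic per trace), and the 2% error rate and request count are invented.

```python
import random


def head_sample(trace_id_hex, rate):
    """Keep a trace if the low 64 bits of its ID fall below rate * 2**64.

    The decision depends only on the trace ID, so it is made before
    anything is known about how the request will turn out.
    """
    bound = int(rate * (1 << 64))
    return int(trace_id_hex[-16:], 16) < bound


rng = random.Random(7)  # seeded so the simulation is repeatable
total = kept = errors = kept_errors = 0
for _ in range(100_000):
    trace_id = f"{rng.getrandbits(128):032x}"
    is_error = rng.random() < 0.02  # hypothetical 2% error rate
    keep = head_sample(trace_id, 0.01)
    total += 1
    kept += keep
    errors += is_error
    kept_errors += keep and is_error

print(f"kept {kept} of {total} traces")
print(f"kept {kept_errors} of {errors} error traces")
```

Roughly 1% of all traces survive, and so do roughly 1% of the error traces, because the sampler never looked at the outcome.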
Tail-based sampling
Tail-based sampling decides at the end, at the tail, after the request has completed and all its spans exist. The Collector buffers the spans of each in-flight trace until the trace finishes, then evaluates the whole thing and decides whether to keep it. Because it sees the complete trace, it can apply intelligent rules: keep every trace that contains an error, keep every trace slower than a latency threshold, and sample down the boring fast successes to a small percentage. This is what you actually want, full retention of the interesting traces, cheap sampling of the redundant ones. The cost is operational: the Collector must hold the spans of every active trace in memory until the trace is complete, which takes more resources and careful tuning than the stateless head-based approach. For most teams running real microservices, tail-based sampling at the Collector is worth that cost, because keeping 100% of errors and slow requests is the entire point of tracing.
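The buffer-then-decide logic can be sketched as a small class. This is an illustration under stated assumptions, not the Collector's tail sampling implementation: the `TailSampler` class and its parameters are hypothetical, and the caller signals explicitly that a trace is finished, whereas real systems usually have to infer completion by waiting for a period of time.

```python
import random
from collections import defaultdict


class TailSampler:
    def __init__(self, slow_ms, baseline_rate, rng=None):
        self.slow_ms = slow_ms
        self.baseline_rate = baseline_rate
        self.rng = rng or random.Random()
        self._buffer = defaultdict(list)  # trace_id -> spans held in memory

    def add_span(self, span):
        self._buffer[span["trace_id"]].append(span)

    def finish_trace(self, trace_id):
        """Decide on the complete trace; return its spans or None if dropped."""
        spans = self._buffer.pop(trace_id, [])
        if not spans:
            return None
        has_error = any(s["status"] == "error" for s in spans)
        root = next((s for s in spans if s["parent_id"] is None), None)
        is_slow = root is not None and root["duration_ms"] >= self.slow_ms
        if has_error or is_slow:
            return spans  # always keep the interesting traces
        if self.rng.random() < self.baseline_rate:
            return spans  # keep a small sample of healthy traces
        return None


def make_span(trace_id, parent_id, duration_ms, status="ok"):
    return {"trace_id": trace_id, "parent_id": parent_id,
            "duration_ms": duration_ms, "status": status}


sampler = TailSampler(slow_ms=1000, baseline_rate=0.0)
sampler.add_span(make_span("fast", None, 120))
sampler.add_span(make_span("slow", None, 1400))
sampler.add_span(make_span("err", None, 90))
sampler.add_span(make_span("err", "root", 30, status="error"))

decisions = {t: sampler.finish_trace(t) is not None for t in ("fast", "slow", "err")}
print(decisions)
```

The `_buffer` dictionary is the operational cost in miniature: every in-flight trace occupies memory until its decision is made.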
The practical pattern. Many teams run both. Use light head-based sampling as a cheap floor for baseline visibility, then layer tail-based sampling at the Collector to guarantee that every error and every slow request is retained regardless of the dice roll. The principle is the same one that governs all observability cost: pay for the traces that teach you something, not for a warehouse full of identical successes.
Reading traces in practice
Instrumentation and sampling get traces into a backend. The payoff is reading them, and a handful of patterns cover most of what you will diagnose.
Find the latency in the critical path
The critical path is the chain of spans that determines a request's total duration, the longest sequence of dependent work from root to leaf. When a request is slow, the question is always "which span owns the time?" In the waterfall, that span is the long bar everything else nests inside or runs beside. Crucially, a span that runs in parallel with a longer sibling does not extend total latency, so you ignore the wide-but-overlapping spans and zero in on the ones that sit on the critical path. In the checkout example, the four children add up to roughly the full 1,400ms, which suggests they ran one after another, but call-inventory at 1,150ms owns more than 80% of that time and the other three together account for 240ms; the fix targets inventory, not payment.
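One way to compute a critical path is to walk backwards from the end of the request: pick the child that finished last, recurse into it, then jump to that child's start and repeat. The sketch below implements that idea in plain Python. The span layout and the example request are assumptions made for illustration, and the algorithm ignores complications such as clock skew between services.

```python
def critical_path(span, children):
    """Return the spans on the critical path under `span`, root first.

    Walks backwards from the span's end time. A child is on the path if it
    ends at or before the current cursor; after taking it, the cursor moves
    to that child's start, so siblings that merely overlap it are skipped.
    """
    path = [span]
    cursor = span["start"] + span["duration"]
    kids = sorted(
        children.get(span["id"], []),
        key=lambda s: s["start"] + s["duration"],
        reverse=True,
    )
    for kid in kids:
        kid_end = kid["start"] + kid["duration"]
        if kid_end <= cursor:
            path.extend(critical_path(kid, children))
            cursor = kid["start"]
    return path


# Hypothetical request: two fetches run in parallel, then a render step.
root = {"id": "r", "name": "GET /page", "start": 0, "duration": 100}
children = {
    "r": [
        {"id": "a", "name": "fetch-profile", "start": 0, "duration": 40},
        {"id": "b", "name": "fetch-orders", "start": 0, "duration": 90},
        {"id": "c", "name": "render", "start": 90, "duration": 10},
    ]
}
on_path = [s["name"] for s in critical_path(root, children)]
print(on_path)
```

Here `fetch-profile` overlaps entirely with the longer `fetch-orders`, so speeding it up would not change the request's total latency.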
Spot the N+1 query
One of the most common and most satisfying things to find in a trace is the N+1 query: code that runs one query to fetch a list, then runs a separate query for each item in that list. In a waterfall it is unmistakable, a row of dozens or hundreds of near-identical short spans, each a few milliseconds, stacked in sequence so their sum dominates the request. No single one is slow, which is why metrics and logs miss it, but together they are the latency. The trace makes the pattern visible at a glance, and the fix, batch the queries into one, is usually a small code change with a large payoff.
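The same pattern a human spots in the waterfall can be detected mechanically: many spans with the same name under the same parent. The sketch below is a simple heuristic in plain Python; the span layout, the `min_repeats` threshold, and the example data are assumptions made for illustration.

```python
from collections import defaultdict


def find_n_plus_one(spans, min_repeats=10):
    """Flag (parent, name) groups that repeat at least `min_repeats` times."""
    groups = defaultdict(list)
    for s in spans:
        if s["parent_id"] is not None:
            groups[(s["parent_id"], s["name"])].append(s["duration_ms"])
    return [
        {"parent_id": parent_id, "name": name,
         "count": len(durations), "total_ms": sum(durations)}
        for (parent_id, name), durations in groups.items()
        if len(durations) >= min_repeats
    ]


# Hypothetical trace: one list query, then one query per item.
spans = [{"parent_id": None, "name": "GET /orders", "duration_ms": 230}]
spans.append({"parent_id": "root", "name": "SELECT orders", "duration_ms": 12})
spans.extend(
    {"parent_id": "root", "name": "SELECT item", "duration_ms": 4}
    for _ in range(50)
)
suspects = find_n_plus_one(spans)
print(suspects)
```

Each `SELECT item` span takes 4ms, which no latency alert would notice, yet the group adds up to 200ms of the request.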
Tell a slow dependency from a slow caller
Tracing resolves the most common finger-pointing argument in distributed systems: is service A slow, or is A just waiting on slow service B? The waterfall answers it directly. If A's span is long but almost all of that time is spent inside a child span calling B, then A is fine and B is the problem. If A's span is long but its child calls are all fast, the time is being spent inside A's own code, and the next step is a profiler to find the hot function. Tracing localizes the problem to the right service; profiling then localizes it to the right line.
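The quantity behind this judgment is often called self time: a span's duration minus the time covered by its children. The sketch below computes it in plain Python, merging overlapping child intervals so parallel calls are not double-counted. The span layout and the two example spans are assumptions made for illustration.

```python
def self_time(span, kids):
    """Duration of `span` not covered by any child span, in milliseconds."""
    start = span["start"]
    end = span["start"] + span["duration"]
    # Clip each child to the parent's window, then merge overlaps.
    intervals = sorted(
        (max(k["start"], start), min(k["start"] + k["duration"], end))
        for k in kids
    )
    covered = 0
    cur_start = cur_end = None
    for s, e in intervals:
        if e <= s:
            continue  # child lies outside the parent's window
        if cur_end is None or s > cur_end:
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = s, e
        else:
            cur_end = max(cur_end, e)
    if cur_end is not None:
        covered += cur_end - cur_start
    return span["duration"] - covered


service_a = {"start": 0, "duration": 500}

# Case 1: almost all of A's time is spent waiting on a call to B.
waiting_on_b = self_time(service_a, [{"start": 10, "duration": 470}])

# Case 2: A's child calls are short (and overlap); A's own code is slow.
slow_itself = self_time(service_a, [
    {"start": 10, "duration": 20},
    {"start": 20, "duration": 20},
])
print(waiting_on_b, slow_itself)
```

A low self time points at the dependency; a high self time points at the caller's own code, which is where a profiler takes over.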
Correlate the trace with logs and metrics
A trace is most powerful when it is not read alone. Because every span carries the shared trace ID, you can pivot from a single failing span straight to the logs emitted during that exact operation, filtered by trace ID, and to the metrics for the service over the same window. That correlation is the full evidence chain in one place: the metric spike told you something was wrong, the trace told you which span owned it, and the logs told you what that span actually did when it failed. Stitching the three pillars together by trace ID is what turns three separate tools into one coherent investigation, and it is exactly the correlation a good platform automates so you do not do it by hand mid-incident.
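The pivot from span to logs only works if the trace ID is written into every log line. A minimal sketch with the Python standard library follows; the logger name, the log format, and the `logs_for_trace` helper are assumptions made for illustration, and `contextvars` stands in for the context a tracing SDK would manage.

```python
import contextvars
import io
import logging

current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace ID of the current request."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


stream = io.StringIO()  # stand-in for a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)
handler.addFilter(TraceIdFilter())

log = logging.getLogger("inventory")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

# Two requests handled one after the other, each with its own trace ID.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("stock query timed out")
current_trace_id.set("aaaabbbbccccddddeeeeffff00001111")
log.info("stock query ok")


def logs_for_trace(text, trace_id):
    """Return only the log lines written during the given trace."""
    return [line for line in text.splitlines() if f"trace_id={trace_id} " in line]


matching = logs_for_trace(stream.getvalue(), "4bf92f3577b34da6a3ce929d0e0e4736")
print(matching)
```

With the ID in the line, filtering logs by trace becomes a string match in whatever log store you use.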
From traces to root cause and action
The point of all this machinery is not pretty waterfalls. It is faster, more certain resolution, and tracing accelerates two things specifically: root cause analysis and mean time to resolution.
Tracing transforms root cause analysis by collapsing the step that usually dominates an incident: localization. The hardest part of most investigations is not fixing the problem, it is figuring out which of your many services and calls is responsible. A trace answers that before the investigation even begins, because the waterfall names the responsible span. Root cause analysis that used to start from "the site is slow and we have no idea where" starts instead from "the inventory service's stock query is slow and erroring," which is most of the way to a fix. And by cutting the localization time, tracing directly reduces MTTR: instead of paging several teams to ask whose service is slow, the trace engages the right owner immediately, so resolution starts with the answer already half-found rather than after an hour of cross-team correlation.
This is where Nova AI Ops sits in the stack. Nova is not another place to store spans; it is the layer that consumes the traces you already produce with OpenTelemetry, alongside your metrics and logs, and turns them into resolved incidents. It correlates trace anomalies with the rest of your telemetry across AWS, GCP, Azure, Linux, and Windows in one model rather than five disconnected views, detects the anomalous span, ties a latency or error spike to the specific service and call responsible, assembles the supporting logs and metrics into one evidence chain, and auto-resolves routine incidents within a policy envelope you define. Tracing supplies the evidence; Nova reads the evidence, reaches the diagnosis, and acts on the well-understood cases so a human only sees the genuinely novel ones. For the broader categories this sits inside, see AIOps, AI incident response, and site reliability engineering. The relationship is complementary: you keep your OpenTelemetry instrumentation and your tracing backend of choice, and Nova converts their output into action.
A 90-day plan and a 10-point checklist
Here is a practical sequence for rolling out distributed tracing, one that proves value early and avoids the two classic traps: broken propagation and runaway cost. The principle throughout: instrument one service end to end first, then scale.
Days 1-14: Instrument one service with OpenTelemetry
Pick one important service that makes downstream calls and instrument it end to end with OpenTelemetry. Turn on auto-instrumentation for its framework, HTTP client, and database driver, stand up a Collector, and point it at a tracing backend (open or commercial, your choice). The goal is narrow and concrete: produce a real trace and read it as a waterfall. One well-instrumented service teaches the team more than ten half-instrumented ones.
Days 15-45: Propagate context across the critical request path
Extend instrumentation to the services on your most important request path so traces span multiple hops. This is the phase where propagation is the whole battle: verify that traceparent is forwarded on every outbound call, including across message queues and background jobs, and hunt down any hop that breaks the trace into fragments. Add manual spans and attributes (customer tier, region, order ID) so traces are searchable in your domain. By the end of this phase, a real cross-service request should produce one connected trace with no gaps.
Days 46-75: Add tail-based sampling and control cost
Now make tracing affordable at scale. Configure tail-based sampling at the Collector so you keep 100% of errors and slow requests and sample the fast successes down to a small percentage. Audit your span volume, drop noisy low-value spans like health checks at the Collector, and confirm your tracing cost tracks signal rather than raw request count. This is the phase where a careless rollout starts getting expensive, so do the cost hygiene before you scale to the rest of the fleet.
Days 76-90: Correlate and connect to action
Wire traces into the rest of your workflow. Make sure you can pivot from a span to its logs and metrics by trace ID, so investigations use all three pillars together. Then connect the signal to an action layer: this is where a platform like Nova AI Ops consumes the traces to correlate anomalies, identify the responsible span, and auto-resolve routine incidents, so the tracing investment converts into faster resolution rather than just better waterfalls. Document the before/after MTTR to justify expanding coverage to the remaining services.
The 10-point distributed tracing checklist
Score yourself honestly. Each "yes" is a level of maturity; the gaps are your roadmap.
- Are your critical-path services instrumented? Spans on the inbound request, outbound calls, and database queries for the services that matter, not just one.
- Is your instrumentation built on OpenTelemetry? Vendor-neutral spans, so you can switch tracing backends without re-instrumenting.
- Does trace context propagate on every hop? Across every service call and async boundary, with no traces breaking into fragments.
- Are you using the W3C traceparent header? The standard wire format, so polyglot services interoperate without custom glue.
- Do your spans carry useful attributes? Domain fields like customer tier, region, and order ID, so traces are searchable by what you care about.
- Do you run a Collector as your control point? A central place to process, redact, sample, and route spans without touching application code.
- Is your sampling tail-based? You keep 100% of errors and slow requests, not a blind random percentage that drops the interesting ones.
- Is your tracing cost predictable? Sampling, span filtering, and retention tuned so cost tracks signal rather than raw request volume.
- Can you correlate a trace with its logs and metrics? Pivot from a failing span to the relevant logs and metrics by trace ID, without manual stitching.
- Does the trace signal drive action? Tracing feeds root cause analysis, faster MTTR, and ideally autonomous remediation, not just dashboards nobody reads.
Most teams that have adopted tracing sit around five or six of these, usually strong on instrumentation and weak on propagation coverage, sampling, and correlation. The gap between six and ten is where tracing stops being a feature you turned on and starts measurably cutting your time to resolution.
Frequently asked questions
What is distributed tracing?
What is a span in distributed tracing?
What is trace context propagation?
What is the difference between distributed tracing and observability?
How does OpenTelemetry relate to distributed tracing?
What is the difference between head-based and tail-based sampling?
Why can't I keep every trace?
How does distributed tracing help find latency problems?
How does distributed tracing reduce mean time to resolution?
How does Nova AI Ops use distributed traces?
Related guides
This page is the deep dive on the traces pillar. For the parent practice and the other two pillars, start with observability (metrics, logs, and traces together), then the AI and LLM-specific angle in AI observability. On the rest of the telemetry stack: monitoring, the four golden signals, microservices monitoring, and anomaly detection. On turning trace evidence into resolution: root cause analysis, MTTR, AIOps, incident management, and AI incident response. On the reliability foundations and the action layer: site reliability engineering, AI SRE, Agentic SRE, self-healing infrastructure, and SLOs and error budgets. On the broader practice: DevOps, DevOps automation, and LLMOps. See the full platform on Nova features.
Turn your traces into resolved incidents.
Nova AI Ops is the Multi-Agent OS for SRE & DevOps. It consumes the traces you already produce with OpenTelemetry, correlates them with your logs and metrics across AWS, GCP, Azure, Linux, and Windows, ties each anomaly to the responsible span, and auto-resolves routine incidents within your policy envelope. Free tier available for small teams.