Distributed Tracing: The Complete Guide to Spans, Context, and Sampling (2026)

What distributed tracing is and the problem it solves

Distributed tracing follows a single request as it travels through every service, queue, cache, and database that touches it, recording each step so you can see exactly where time and errors went. The problem it solves is specific and brutal: in a modern architecture, one user click does not run as a single function you can step through with a debugger. It fans out into a dozen or more network calls across services owned by different teams, and when that request is slow or fails, no single component can tell you why, because each one only sees its own slice of the work.

Consider the everyday failure. A customer reports that checkout is slow. Your metrics confirm it: the p99 latency on the checkout endpoint jumped from 200ms to 1.4 seconds. But the checkout service itself looks healthy: its CPU is fine and its own logs show nothing alarming. The slowness is somewhere downstream, in one of the eight services checkout calls or in one of the services those services call. Without tracing, finding it means paging several teams, each of which checks its own dashboards and reports "looks fine on my side" while the minutes tick by. With tracing, you open one request, read the waterfall, and see that a single call to the inventory service took 1.1 seconds because it was waiting on a slow database query. The investigation goes from a multi-team meeting to a glance.

This is the gap tracing fills that the other signals cannot. To place it precisely: tracing is one of the three pillars of observability, the broad capability to understand a system from the telemetry it emits. The other two pillars are metrics and logs. Observability is the parent practice and the right page to read first if you are building the whole stack; this guide is the deep dive on the traces pillar specifically. Metrics tell you that something is wrong (latency is up). Logs tell you what happened on a single component (this query returned an error). Traces tell you where, across a distributed call path, the time and the errors actually went. Each is strong where the others are weak, and tracing is the only one that reconstructs the journey of a request across service boundaries. For the AI and LLM-specific version of this, where you trace prompts, token usage, and model calls, see the sibling guide to AI observability.

The value of tracing scales with the number of network hops per request. In a monolith, a single stack trace already shows you the full call path, so tracing adds little. The moment your architecture becomes distributed, tracing stops being optional polish and becomes the difference between a diagnosable system and an undiagnosable one. The rest of this guide is the full mechanism: the data model that records the journey, the propagation that keeps it connected, the instrumentation that produces it, the sampling that makes it affordable, how to read it, and how it drives faster resolution.

The data model: traces, spans, attributes, and links

To read traces fluently you need the vocabulary, and there are only a handful of concepts. Get these and the rest of distributed tracing follows.

The span: the atomic unit of work

The fundamental building block is the span: one named, timed operation. A span might represent "GET /checkout", "SELECT from orders", "publish to Kafka", or "call inventory service". Every span carries a small, fixed set of fields: a name (the operation), a start timestamp, a duration, a status (ok or error), a unique span ID, the span ID of its parent (empty for the root), and a trace ID shared by every span in the same request. Spans are the rows that tracing is built out of, and almost everything else is a relationship between them.

The trace and the parent/child tree

A trace is the complete tree of spans for one request. It begins with a root span, the operation that entered your system, typically the inbound HTTP request. Every operation that root span triggers becomes a child span, and each child records the span ID of its parent. Those parent references are what assemble flat spans into a tree: the root has children, those children have their own children, and the nesting encodes causality and timing. Render that tree as a horizontal waterfall, with each span a bar positioned by its start time and sized by its duration, and you can read the whole request at a glance: which operations ran in sequence, which ran in parallel, and which one was the long pole.

Attributes, events, and links

Three more concepts add the detail that makes traces diagnostic rather than just structural. Attributes (sometimes called tags) are key/value pairs attached to a span: http.status_code=500, db.statement, customer.tier=enterprise, region=eu-west. Attributes are what let you filter and group traces after the fact, so choosing good ones is most of the craft of instrumentation. Events are timestamped annotations inside a single span, useful for marking a discrete moment such as an exception being thrown or a cache miss occurring partway through an operation. Span links connect spans across different traces, which matters for the cases the simple parent/child tree cannot express: a batch job whose work was triggered by many upstream requests, or a message consumed long after the producing trace finished. Links are how you keep causality visible when one trace fans into or out of another.

A worked example

Put it together with a concrete checkout request. The root span POST /checkout starts at time zero and lasts 1,400ms. Under it sit four child spans: validate-cart (20ms), call-inventory (1,150ms), call-payment (180ms), and write-order (40ms). The call-inventory span has one child of its own, SELECT FROM stock (1,100ms), carrying the attribute db.statement with the offending query and a status of error. Reading this trace top to bottom answers the entire incident in one picture: the request was slow because call-inventory dominated it, and call-inventory was slow because of a single database query that also errored. No log-hopping, no cross-team paging. The waterfall named the service, the call, and the line of evidence all at once.
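The worked example can be sketched as data. This is a minimal illustration, not any backend's real schema: the Span class, the single-letter span IDs, and the start offsets are assumptions made up for the sketch, while the names, durations, and error status come from the example above. It takes the spans as a flat, unordered list, assembles the tree from parent references, and renders an indented text waterfall.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int             # offset from the start of the trace (assumed)
    duration_ms: int
    status: str = "ok"
    attributes: dict = field(default_factory=dict)


# Flat spans in arrival order, which is rarely tree order.
SPANS = [
    Span("e", "a", "write-order", 1355, 40),
    Span("c", "a", "call-inventory", 20, 1150),
    Span("a", None, "POST /checkout", 0, 1400),
    Span("f", "c", "SELECT FROM stock", 45, 1100, status="error",
         attributes={"db.statement": "<offending query>"}),
    Span("b", "a", "validate-cart", 0, 20),
    Span("d", "a", "call-payment", 1170, 180),
]


def children_of(spans):
    """Index spans by parent ID so the flat list can be walked as a tree."""
    index = {}
    for span in spans:
        index.setdefault(span.parent_id, []).append(span)
    for siblings in index.values():
        siblings.sort(key=lambda s: s.start_ms)
    return index


def render(spans):
    """Return the trace as indented text lines, depth-first by start time."""
    index = children_of(spans)
    lines = []

    def walk(span, depth):
        flag = " [error]" if span.status == "error" else ""
        lines.append(f"{'  ' * depth}{span.name} {span.duration_ms}ms{flag}")
        for child in index.get(span.span_id, []):
            walk(child, depth + 1)

    for root in index.get(None, []):
        walk(root, 0)
    return lines


def slowest_leaf(spans):
    """The longest span that has no children of its own."""
    parents = {s.parent_id for s in spans}
    leaves = [s for s in spans if s.span_id not in parents]
    return max(leaves, key=lambda s: s.duration_ms)


print("\n".join(render(SPANS)))
print("slowest leaf:", slowest_leaf(SPANS).name)
```

The only thing that turns the flat list into a tree is parent_id, which is why a span that arrives without the right parent reference cannot be placed in the waterfall.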

Context propagation: how a trace stays connected

Spans are easy. The hard part of distributed tracing, the part that determines whether you have a real tracing system or a pile of disconnected fragments, is context propagation: making the spans from many independent services link into one trace.

The mechanism

The core idea is simple. When service A makes a call to service B, A injects two pieces of identity into the outbound request: the trace ID of the current request and the span ID of the span making the call. Service B reads that injected context, treats the incoming span ID as its parent, continues the same trace ID, and emits its own spans as children. Because every service does the same thing on every outbound call, the spans from a dozen services that never share a database all stitch together into one coherent tree, held together by nothing more than the trace context riding along in request headers.
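The inject-and-extract handshake can be sketched with two functions standing in for two services. Everything here is an assumption for illustration: the x-trace-id and x-parent-span-id header names are hypothetical (real deployments would use the standard traceparent header), the function call stands in for a network request, and the COLLECTED list stands in for a tracing backend.

```python
import secrets

COLLECTED = []  # stand-in for the tracing backend


def record(service, name, trace_id, parent_id):
    """Create a span with a fresh span ID and hand it to the 'backend'."""
    span = {
        "service": service,
        "name": name,
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),
        "parent_id": parent_id,
    }
    COLLECTED.append(span)
    return span


def service_b(headers):
    # Extract: continue the caller's trace and adopt its span as the parent.
    return record(
        "B",
        "reserve-stock",
        trace_id=headers["x-trace-id"],
        parent_id=headers["x-parent-span-id"],
    )


def service_a():
    # A is the entry point, so it starts a new trace with a root span.
    root = record("A", "POST /checkout",
                  trace_id=secrets.token_hex(16), parent_id=None)
    # Inject: the outbound request carries the trace ID and A's span ID.
    headers = {
        "x-trace-id": root["trace_id"],
        "x-parent-span-id": root["span_id"],
    }
    child = service_b(headers)  # stands in for an HTTP call to service B
    return root, child


root, child = service_a()
print(child["trace_id"] == root["trace_id"], child["parent_id"] == root["span_id"])
```

Neither service knows anything about the other's storage; the two header values are the only shared state, and they are enough for a backend to join the spans into one tree.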

W3C Trace Context and the traceparent header

For cross-process calls, the context has to travel inside the request itself, and the industry has standardized on exactly how. The W3C Trace Context specification defines the traceparent HTTP header, a compact string that carries the trace ID, the parent span ID, and a sampling flag. OpenTelemetry uses traceparent by default, which is why instrumentation from different vendors and languages interoperates: they all read and write the same header. Standardizing the wire format is what made tracing portable across a polyglot fleet, where one service is in Go, the next in Python, and the next in Java.
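To make the wire format concrete, here is a small sketch of parsing and formatting a traceparent value. It handles only the version 00 layout (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags, joined by hyphens) and treats the lowest flag bit as the sampled flag. The TraceContext type and both function names are made up for this sketch; in practice an OpenTelemetry propagator does this for you.

```python
import re
from typing import NamedTuple, Optional

# version 00: "00-" + 32 hex trace-id + "-" + 16 hex parent-id + "-" + 2 hex flags
_TRACEPARENT_V00 = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the context carried by a version-00 header, or None if invalid."""
    match = _TRACEPARENT_V00.fullmatch(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid and must not be continued.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(ctx: TraceContext) -> str:
    """Serialize a context for the next outbound request."""
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.parent_id}-{flags}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(header)
print(ctx)
```

A service continuing the trace would keep trace_id, replace parent_id with the ID of its own calling span, and write the result back out with format_traceparent on each outbound request.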

In-process vs cross-process, and the most common failure

Propagation happens at two scales. In-process propagation carries the current span across function calls and async boundaries within a single service, usually through a context object the SDK threads for you. Cross-process propagation carries it between services over the network via traceparent. Both have to hold for a trace to stay whole, and the single most common tracing failure in practice is a broken propagation chain: one hop that does not forward the context, so the trace splits into two unconnected fragments. The usual culprits are an HTTP client that was not instrumented, a message queue where context was not attached to the message, or a manually spawned thread that lost the in-process context. Getting propagation right on every hop, including async boundaries like queues and background jobs, is most of the real work of adopting tracing. When a trace looks suspiciously short, a dropped propagation hop is the first thing to check.
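The manually spawned thread case is easy to reproduce with Python's standard library alone. This sketch assumes the "current span" lives in a contextvars.ContextVar, which is one common way an SDK could thread in-process context; the variable and function names are hypothetical. A plain thread may start with an empty context and lose the span, while copying the caller's context into the thread preserves it.

```python
import contextvars
import threading

# The "current span" for whatever code is executing right now.
current_span_id = contextvars.ContextVar("current_span_id", default=None)


def read_current_span(out, key):
    out[key] = current_span_id.get()


def run_in_thread(target, *args):
    """Naive hand-off: the new thread may not see the caller's context."""
    worker = threading.Thread(target=target, args=args)
    worker.start()
    worker.join()


def run_in_thread_with_context(target, *args):
    """Snapshot the caller's context and run the target inside that snapshot."""
    ctx = contextvars.copy_context()
    worker = threading.Thread(target=ctx.run, args=(target, *args))
    worker.start()
    worker.join()


results = {}
current_span_id.set("00f067aa0ba902b7")
run_in_thread(read_current_span, results, "naive")
run_in_thread_with_context(read_current_span, results, "copied")
print(results)
```

Whether the naive thread sees the span depends on the interpreter build and version, which is exactly why explicit context hand-off is the safer habit: any spans created in the "copied" thread would know their parent, while spans created in a thread without context would start a disconnected fragment.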


Instrumentation: OpenTelemetry and the Collector

Instrumentation is the code that actually produces spans. You have two ways to generate it, and one standard you should generate it against.

Manual vs automatic instrumentation

Automatic instrumentation uses agents or libraries that hook into common frameworks (your web server, HTTP client, database driver, and message-queue library) and emit spans for you with little or no code change. It is how you get a useful trace on day one: install the auto-instrumentation for your language, and inbound requests, outbound calls, and database queries start producing spans automatically. Manual instrumentation is code you write to create spans around your own business logic and to attach the attributes that make traces searchable in your domain: the customer tier, the feature flag, the order ID. The right approach is both: auto-instrumentation for broad, cheap framework-level coverage, and manual spans and attributes for the application-specific detail that turns a generic trace into a diagnostic one. Start with automatic, then add manual where the auto-generated spans leave gaps.

OpenTelemetry: the standard to instrument against

OpenTelemetry (OTel) is the open, vendor-neutral standard for generating and collecting traces, along with metrics and logs, and it has effectively won as the default. It gives you one tracing API and a set of SDKs to instrument against, auto-instrumentation libraries for common frameworks, the W3C context propagation described above out of the box, and a consistent data model across every language. The strategic payoff is portability: you instrument your code once against OTel, and your spans are not tied to any single backend. Before OTel, choosing a tracing vendor meant adopting their proprietary agent and getting locked in; with OTel, the instrumentation is yours and the backend is a swappable detail. If you are starting tracing in 2026, instrument with OpenTelemetry first and choose backends second.

The Collector: your control point

The OpenTelemetry Collector is a standalone service that sits between your instrumented applications and your tracing backend. It receives spans, processes them (batching, filtering, adding attributes, redacting sensitive data), applies sampling, and exports the result to one or more backends. The Collector matters because it is the single place where you shape and budget your telemetry without touching application code. Want to drop a noisy health-check span, redact a field, or switch backends? You change the Collector config, not every service. The Collector is also where the most powerful form of sampling lives, which makes it the natural cost-control point for the whole tracing pipeline.

Sampling: head-based vs tail-based

Here is the constraint that shapes every real tracing deployment: at scale, you cannot afford to keep every trace. A busy system generates an enormous volume of spans, and storing and querying all of them can cost as much as running the system you are observing. Worse, the overwhelming majority of traces are uninteresting: successful requests that look exactly like every other successful request and carry almost no diagnostic value. Sampling is how you keep the rare, informative traces while discarding the redundant healthy majority, so you pay for signal instead of raw volume. There are two fundamental strategies, and the difference between them is one of the most consequential choices in tracing.

Head-based sampling

Head-based sampling decides whether to keep a trace at the very beginning, at the head, before the request has finished. The typical implementation keeps a fixed random percentage: trace 1% of requests, drop the rest, decided by a dice roll at the root span and propagated to every child via the sampling flag in traceparent. Its virtue is that it is cheap and simple, the decision is made once, up front, with no buffering, and it consumes almost no memory. Its fatal flaw is that it is blind: it decides before it knows whether the request errored or was slow, so it throws away errors and slow outliers just because they lost the dice roll. With 1% head sampling, you keep 1% of your errors too, which is exactly backwards from what you want.
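A short simulation makes the blindness measurable. This is a sketch under stated assumptions: head_sample derives the keep/drop decision from the low 64 bits of the trace ID, which is similar in spirit to ratio-based samplers but is not presented as any SDK's exact implementation, and the 2% error rate, request count, and seed are arbitrary values chosen for the demo.

```python
import random


def head_sample(trace_id: str, rate: float) -> bool:
    """Decide at the root, from the trace ID alone, whether to keep a trace.

    Deriving the decision from the ID rather than a fresh random number
    means every service that repeats the check reaches the same answer.
    """
    bound = int(rate * (1 << 64))
    return int(trace_id[-16:], 16) < bound


def simulate(n_requests=100_000, error_rate=0.02, rate=0.01, seed=7):
    """Count how many traces, and how many error traces, survive sampling."""
    rng = random.Random(seed)
    kept = kept_errors = total_errors = 0
    for _ in range(n_requests):
        trace_id = f"{rng.getrandbits(128):032x}"
        is_error = rng.random() < error_rate  # not knowable at decision time
        keep = head_sample(trace_id, rate)
        total_errors += is_error
        kept += keep
        kept_errors += keep and is_error
    return kept, kept_errors, total_errors


kept, kept_errors, total_errors = simulate()
print(f"kept {kept} traces, {kept_errors} of {total_errors} error traces")
```

Because the decision never looks at the outcome, the fraction of errors kept tracks the sampling rate: roughly one error trace in a hundred survives, the same odds as a healthy request.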

Tail-based sampling

Tail-based sampling decides at the end, at the tail, after the request has completed and all its spans exist. The Collector buffers the spans of each in-flight trace until the trace finishes, then evaluates the whole thing and decides whether to keep it. Because it sees the complete trace, it can apply intelligent rules: keep every trace that contains an error, keep every trace slower than a latency threshold, and sample down the boring fast successes to a small percentage. This is what you actually want, full retention of the interesting traces, cheap sampling of the redundant ones. The cost is operational: the Collector must hold the spans of every active trace in memory until the trace is complete, which takes more resources and careful tuning than the stateless head-based approach. For most teams running real microservices, tail-based sampling at the Collector is worth that cost, because keeping 100% of errors and slow requests is the entire point of tracing.
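The buffering and decision logic can be sketched in a few lines. This is an illustration of the idea, not the Collector's implementation: the TailSampler class, its method names, and the 1,000ms threshold and 5% baseline are assumptions, and the sketch assumes something else (in practice, typically a wait timer) signals when a trace is complete.

```python
from collections import defaultdict


class TailSampler:
    def __init__(self, latency_threshold_ms=1000, baseline_rate=0.05):
        self.latency_threshold_ms = latency_threshold_ms
        self.baseline_rate = baseline_rate
        self.buffer = defaultdict(list)  # trace_id -> spans seen so far

    def on_span(self, span):
        """Hold a finished span in memory until its trace is complete."""
        self.buffer[span["trace_id"]].append(span)

    def on_trace_complete(self, trace_id):
        """Evaluate the whole trace; return its spans if kept, else []."""
        spans = self.buffer.pop(trace_id, [])
        if not spans:
            return []
        # Rule 1: keep every trace that contains an error.
        if any(s["status"] == "error" for s in spans):
            return spans
        # Rule 2: keep every trace slower than the latency threshold.
        start = min(s["start_ms"] for s in spans)
        end = max(s["start_ms"] + s["duration_ms"] for s in spans)
        if end - start >= self.latency_threshold_ms:
            return spans
        # Rule 3: fast success, keep a deterministic fraction by trace ID.
        bound = int(self.baseline_rate * (1 << 64))
        return spans if int(trace_id[-16:], 16) < bound else []


def span(trace_id, name, start_ms, duration_ms, status="ok"):
    return {"trace_id": trace_id, "name": name, "start_ms": start_ms,
            "duration_ms": duration_ms, "status": status}


sampler = TailSampler(baseline_rate=0.0)  # drop all boring traces for the demo
sampler.on_span(span("a" * 32, "POST /checkout", 0, 120))
sampler.on_span(span("a" * 32, "SELECT FROM stock", 10, 40, status="error"))
sampler.on_span(span("b" * 32, "POST /checkout", 0, 90))
print(len(sampler.on_trace_complete("a" * 32)), len(sampler.on_trace_complete("b" * 32)))
```

The memory cost is visible in the buffer: every in-flight trace occupies space until it is evaluated, which is the tuning burden the stateless head-based approach avoids.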

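The practical pattern. Many teams run both, but the order matters: a tail-based sampler can only keep traces that reach it, so any trace dropped by head-based sampling upstream is gone even if it contained an error. To guarantee that every error and every slow request is retained, send all spans for the traffic you care about to the Collector and let tail-based sampling make the decision there. Reserve head-based sampling for volume you are willing to lose blindly, such as health checks or other very high-throughput, low-value endpoints, or use it as a cap when the Collector cannot buffer everything. The principle is the same one that governs all observability cost: pay for the traces that teach you something, not for a warehouse full of identical successes.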

Reading traces in practice

Instrumentation and sampling get traces into a backend. The payoff is reading them, and a handful of patterns cover most of what you will diagnose.

Find the latency in the critical path

The critical path is the chain of spans that determines a request's total duration, the longest sequence of dependent work from root to leaf. When a request is slow, the question is always "which span owns the time?" In the waterfall, that span is the long bar everything else nests inside or runs beside. Crucially, a span that runs in parallel with a longer sibling does not extend total latency, so you ignore the spans that merely overlap and zero in on the ones that sit on the critical path. In the checkout example, call-inventory at 1,150ms owns most of the critical path, while the other three children add only 240ms between them; the fix targets inventory, not payment.
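One common way to extract the critical path from a span tree is to walk backwards from the end of each span. The sketch below assumes well-behaved timestamps (no clock skew, children contained within their parents) and uses a made-up trace in which two fetches run in parallel; the span names, IDs, and timings are hypothetical.

```python
def end_ms(span):
    return span["start_ms"] + span["duration_ms"]


def index_children(spans):
    """Group spans by parent ID."""
    children = {}
    for s in spans:
        children.setdefault(s["parent_id"], []).append(s)
    return children


def critical_path(span, children):
    """Return span names on the critical path, in chronological order.

    Walk backwards from the span's end: take the child that finished last,
    jump to that child's start, then take the child that finished last
    before that point, and so on. Children that only overlap a chosen
    sibling are skipped because they never held the request up.
    """
    cursor = end_ms(span)
    chosen = []
    for child in sorted(children.get(span["id"], []), key=end_ms, reverse=True):
        if end_ms(child) <= cursor:
            chosen.append(child)
            cursor = child["start_ms"]
    path = [span["name"]]
    for child in reversed(chosen):
        path.extend(critical_path(child, children))
    return path


TRACE = [
    {"id": "r", "parent_id": None, "name": "GET /product", "start_ms": 0, "duration_ms": 300},
    {"id": "a", "parent_id": "r", "name": "auth", "start_ms": 0, "duration_ms": 100},
    {"id": "b", "parent_id": "r", "name": "fetch-prices", "start_ms": 100, "duration_ms": 180},
    {"id": "c", "parent_id": "r", "name": "fetch-reviews", "start_ms": 100, "duration_ms": 50},
    {"id": "d", "parent_id": "b", "name": "SELECT prices", "start_ms": 110, "duration_ms": 160},
]

children = index_children(TRACE)
root = children[None][0]
print(critical_path(root, children))
```

Here fetch-reviews finishes while fetch-prices is still running, so speeding it up would not shorten the request; it is left off the path, and the optimization target is the price query.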

Spot the N+1 query

One of the most common and most satisfying things to find in a trace is the N+1 query: code that runs one query to fetch a list, then runs a separate query for each item in that list. In a waterfall it is unmistakable, a row of dozens or hundreds of near-identical short spans, each a few milliseconds, stacked in sequence so their sum dominates the request. No single one is slow, which is why metrics and logs miss it, but together they are the latency. The trace makes the pattern visible at a glance, and the fix, batch the queries into one, is usually a small code change with a large payoff.
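The pattern is regular enough to detect mechanically. A rough sketch: group spans by parent and operation name, and flag any group that repeats many times. The threshold of 10 repeats and the span fields used here are assumptions for illustration, and a real detector would likely also compare query text.

```python
from collections import defaultdict


def find_n_plus_one(spans, min_repeats=10):
    """Flag groups of same-named sibling spans that repeat suspiciously often."""
    groups = defaultdict(list)
    for s in spans:
        groups[(s["parent_id"], s["name"])].append(s)
    findings = []
    for (parent_id, name), members in groups.items():
        if len(members) >= min_repeats:
            findings.append({
                "parent_id": parent_id,
                "name": name,
                "count": len(members),
                "total_ms": sum(m["duration_ms"] for m in members),
            })
    return findings


# One query for the list, then one query per item: the N+1 shape.
spans = [{"parent_id": "p", "name": "SELECT orders", "duration_ms": 12}]
spans += [
    {"parent_id": "p", "name": "SELECT order_items", "duration_ms": 4}
    for _ in range(50)
]

for finding in find_n_plus_one(spans):
    print(finding)
```

Each repeated span is only a few milliseconds, so a per-span latency alert would never fire; it is the count and the summed duration under one parent that expose the problem.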

Tell a slow dependency from a slow caller

Tracing resolves the most common finger-pointing argument in distributed systems: is service A slow, or is A just waiting on slow service B? The waterfall answers it directly. If A's span is long but almost all of that time is spent inside a child span calling B, then A is fine and B is the problem. If A's span is long but its child calls are all fast, the time is being spent inside A's own code, and the next step is a profiler to find the hot function. Tracing localizes the problem to the right service; profiling then localizes it to the right line.
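The distinction comes down to self time: a span's duration minus the time covered by its direct children. The sketch below computes it while handling overlapping children, then applies a simple rule of thumb; the 50% cutoff, the function names, and the example timings are assumptions, not an established standard.

```python
def self_time_ms(span, children):
    """Span duration minus the time covered by its direct children.

    Child intervals are clipped to the parent and merged, so parallel
    children are not double counted.
    """
    span_start = span["start_ms"]
    span_end = span_start + span["duration_ms"]
    covered = 0
    cursor = span_start  # everything before the cursor is already counted
    intervals = sorted(
        (c["start_ms"], c["start_ms"] + c["duration_ms"]) for c in children
    )
    for start, end in intervals:
        start = max(start, cursor)
        end = min(end, span_end)
        if end > start:
            covered += end - start
            cursor = end
    return span["duration_ms"] - covered


def diagnose(span, children, cutoff=0.5):
    """Rule of thumb: who owns most of the span's time?"""
    own = self_time_ms(span, children)
    if own >= span["duration_ms"] * cutoff:
        return "slow caller"
    return "slow dependency"


service_a = {"name": "A", "start_ms": 0, "duration_ms": 1000}
waiting_on_b = [{"name": "call-B", "start_ms": 50, "duration_ms": 900}]
busy_itself = [
    {"name": "call-B", "start_ms": 0, "duration_ms": 20},
    {"name": "call-C", "start_ms": 30, "duration_ms": 20},
]

print(diagnose(service_a, waiting_on_b), self_time_ms(service_a, waiting_on_b))
print(diagnose(service_a, busy_itself), self_time_ms(service_a, busy_itself))
```

In the first case A spends 900 of its 1,000ms inside the call to B, so B is the place to look; in the second, A's own code accounts for 960ms and a profiler is the next tool.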

Correlate the trace with logs and metrics

A trace is most powerful when it is not read alone. Because every span carries the shared trace ID, you can pivot from a single failing span straight to the logs emitted during that exact operation, filtered by trace ID, and to the metrics for the service over the same window. That correlation is the full evidence chain in one place: the metric spike told you something was wrong, the trace told you which span owned it, and the logs told you what that span actually did when it failed. Stitching the three pillars together by trace ID is what turns three separate tools into one coherent investigation, and it is exactly the correlation a good platform automates so you do not do it by hand mid-incident.
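The pivot from span to logs only works if every log line carries the trace ID. One way to arrange that in Python, sketched here with the standard logging module, is a filter that stamps each record with the active trace ID. The logger name, the log format, and the logs_for_trace helper are assumptions for the sketch; instrumentation libraries can inject the ID for you.

```python
import contextvars
import io
import logging

current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace ID active when it was emitted."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("checkout-demo")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("trace_id=%(trace_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def logs_for_trace(log_text, trace_id):
    """The pivot: every log line emitted while handling one trace."""
    return [line for line in log_text.splitlines()
            if line.startswith(f"trace_id={trace_id} ")]


stream = io.StringIO()
log = build_logger(stream)

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.error("stock query failed")
current_trace_id.set("aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
log.info("checkout ok")

print(logs_for_trace(stream.getvalue(), "4bf92f3577b34da6a3ce929d0e0e4736"))
```

Once the ID is on every line, filtering logs for a failing span is a single equality match rather than a guess based on timestamps and hostnames.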

From traces to root cause and action

The point of all this machinery is not pretty waterfalls. It is faster, more certain resolution, and tracing accelerates two things specifically: root cause analysis and mean time to resolution.

Tracing transforms root cause analysis by collapsing the step that usually dominates an incident: localization. The hardest part of most investigations is not fixing the problem, it is figuring out which of your many services and calls is responsible. A trace answers that before the investigation even begins, because the waterfall names the responsible span. Root cause analysis that used to start from "the site is slow and we have no idea where" starts instead from "the inventory service's stock query is slow and erroring," which is most of the way to a fix. And by cutting the localization time, tracing directly reduces MTTR: instead of paging several teams to ask whose service is slow, the trace engages the right owner immediately, so resolution starts with the answer already half-found rather than after an hour of cross-team correlation.

This is where Nova AI Ops sits in the stack. Nova is not another place to store spans; it is the layer that consumes the traces you already produce with OpenTelemetry, alongside your metrics and logs, and turns them into resolved incidents. It correlates trace anomalies with the rest of your telemetry across AWS, GCP, Azure, Linux, and Windows in one model rather than five disconnected views, detects the anomalous span, ties a latency or error spike to the specific service and call responsible, assembles the supporting logs and metrics into one evidence chain, and auto-resolves routine incidents within a policy envelope you define. Tracing supplies the evidence; Nova reads the evidence, reaches the diagnosis, and acts on the well-understood cases so a human only sees the genuinely novel ones. For the broader categories this sits inside, see AIOps, AI incident response, and site reliability engineering. The relationship is complementary: you keep your OpenTelemetry instrumentation and your tracing backend of choice, and Nova converts their output into action.

A 90-day plan and a 10-point checklist

This is a practical sequence for rolling out distributed tracing that proves value early and avoids the two classic traps, broken propagation and runaway cost. The principle throughout: instrument one service end to end first, then scale.

Days 1-14: Instrument one service with OpenTelemetry

Pick one important service that makes downstream calls and instrument it end to end with OpenTelemetry. Turn on auto-instrumentation for its framework, HTTP client, and database driver, stand up a Collector, and point it at a tracing backend (open or commercial, your choice). The goal is narrow and concrete: produce a real trace and read it as a waterfall. One well-instrumented service teaches the team more than ten half-instrumented ones.

Days 15-45: Propagate context across the critical request path

Extend instrumentation to the services on your most important request path so traces span multiple hops. This is the phase where propagation is the whole battle: verify that traceparent is forwarded on every outbound call, including across message queues and background jobs, and hunt down any hop that breaks the trace into fragments. Add manual spans and attributes (customer tier, region, order ID) so traces are searchable in your domain. By the end of this phase, a real cross-service request should produce one connected trace with no gaps.

Days 46-75: Add tail-based sampling and control cost

Now make tracing affordable at scale. Configure tail-based sampling at the Collector so you keep 100% of errors and slow requests and sample the fast successes down to a small percentage. Audit your span volume, drop noisy low-value spans like health checks at the Collector, and confirm your tracing cost tracks signal rather than raw request count. This is the phase where a careless rollout starts getting expensive, so do the cost hygiene before you scale to the rest of the fleet.

Days 76-90: Correlate and connect to action

Wire traces into the rest of your workflow. Make sure you can pivot from a span to its logs and metrics by trace ID, so investigations use all three pillars together. Then connect the signal to an action layer: this is where a platform like Nova AI Ops consumes the traces to correlate anomalies, identify the responsible span, and auto-resolve routine incidents, so the tracing investment converts into faster resolution rather than just better waterfalls. Document the before/after MTTR to justify expanding coverage to the remaining services.

The 10-point distributed tracing checklist

Score yourself honestly. Each "yes" is a level of maturity; the gaps are your roadmap.

Are your critical-path services instrumented? Spans on the inbound request, outbound calls, and database queries for the services that matter, not just one.
Is your instrumentation built on OpenTelemetry? Vendor-neutral spans, so you can switch tracing backends without re-instrumenting.
Does trace context propagate on every hop? Across every service call and async boundary, with no traces breaking into fragments.
Are you using the W3C traceparent header? The standard wire format, so polyglot services interoperate without custom glue.
Do your spans carry useful attributes? Domain fields like customer tier, region, and order ID, so traces are searchable by what you care about.
Do you run a Collector as your control point? A central place to process, redact, sample, and route spans without touching application code.
Is your sampling tail-based? You keep 100% of errors and slow requests, not a blind random percentage that drops the interesting ones.
Is your tracing cost predictable? Sampling, span filtering, and retention tuned so cost tracks signal rather than raw request volume.
Can you correlate a trace with its logs and metrics? Pivot from a failing span to the relevant logs and metrics by trace ID, without manual stitching.
Does the trace signal drive action? Tracing feeds root cause analysis, faster MTTR, and ideally autonomous remediation, not just dashboards nobody reads.

Most teams that have adopted tracing sit around five or six of these, usually strong on instrumentation and weak on propagation coverage, sampling, and correlation. The gap between six and ten is where tracing stops being a feature you turned on and starts measurably cutting your time to resolution.

Frequently asked questions

What is distributed tracing?

Distributed tracing follows a single request as it travels through every service, queue, cache, and database that touches it. Each unit of work is recorded as a span, spans are linked by a shared trace context that propagates in request headers, and together they form a tree that shows exactly where time was spent and where errors occurred. In a microservices architecture it is the only signal that can attribute a slow or failed request to the specific downstream call responsible, which no single log or metric can do.

What is a span in distributed tracing?

A span is one named, timed operation inside a trace, for example an inbound HTTP request, a database query, or a message publish. Each span carries a start time, a duration, a status, a set of key/value attributes, and a reference to its parent span. Spans nest into a tree: the root span is the request entering your system, and every operation it triggers becomes a child span. Reading the spans as a waterfall is how you see where latency accumulated.

What is trace context propagation?

Trace context propagation is the mechanism that ties spans from different services into one trace. When service A calls service B, it injects the trace ID and current span ID into the outbound request headers, B reads them and continues the same trace, and B's spans link back to A's. The W3C Trace Context standard defines the traceparent header that OpenTelemetry uses for this. Propagation must hold across every hop, including async boundaries like queues, because a single missing hop breaks the trace into disconnected fragments.

What is the difference between distributed tracing and observability?

Observability is the broad capability to understand a system from its telemetry, and it rests on three pillars: metrics, logs, and traces. Distributed tracing is one of those three pillars, the one that follows a request across service boundaries. Metrics tell you something is wrong, logs tell you what happened on one component, and traces tell you where in a distributed call path the time and errors went. Tracing is a deep specialty inside the wider observability practice, not a replacement for it.

How does OpenTelemetry relate to distributed tracing?

OpenTelemetry is the open, vendor-neutral standard for generating and collecting traces, along with metrics and logs. It provides the tracing API and SDKs you instrument against, auto-instrumentation for common frameworks, the W3C context propagation it uses by default, and a Collector that receives, processes, samples, and exports spans to any backend. Instrumenting tracing with OpenTelemetry means your spans are portable: you can switch or run multiple tracing backends without re-instrumenting your code.

What is the difference between head-based and tail-based sampling?

Head-based sampling decides whether to keep a trace at the very start, before the request completes, usually by keeping a fixed random percentage. It is cheap and simple but blind, so it drops some errors and slow requests just because they lost the dice roll. Tail-based sampling buffers the spans of a trace until it finishes, then decides using the whole trace, so it can keep every error and every slow request and sample the boring successes. Tail-based is more valuable but needs the Collector to hold spans in memory until the trace is complete.

Why can't I keep every trace?

At scale, tracing every request produces an enormous volume of span data, and storing and querying all of it can cost as much as the system it observes. The vast majority of traces are uninteresting successful requests that look like every other successful request, so keeping them all wastes money for almost no diagnostic value. Sampling exists to keep the rare and informative traces, the errors and the slow outliers, while discarding the redundant healthy majority, so you pay for signal rather than raw volume.

How does distributed tracing help find latency problems?

A trace rendered as a waterfall shows the duration of every span and how they nest, so the critical path, the chain of spans that determines total latency, is visible at a glance. You can see a single database call eating 600ms of an 800ms request, spot an N+1 query as a row of near-identical repeated spans, and tell a slow dependency apart from a slow caller. That turns a slow-endpoint investigation from hours of cross-team log correlation into reading one picture.

How does distributed tracing reduce mean time to resolution?

Tracing collapses the localization step that dominates incident time. Instead of paging several teams to ask whose service is slow, the trace waterfall names the responsible service and the responsible call, so the right owner is engaged immediately and root cause analysis starts with the answer already half-found. Correlating the failing span with its logs and the surrounding metrics gives the full evidence chain in one place, which is exactly what shortens mean time to resolution.

How does Nova AI Ops use distributed traces?

Nova AI Ops consumes the traces you already produce with OpenTelemetry, alongside your metrics and logs, and correlates them across AWS, GCP, Azure, Linux, and Windows in one model rather than five disconnected views. It detects anomalous spans, ties a latency or error spike to the specific span and service responsible, assembles the supporting logs and metrics into one evidence chain, and auto-resolves routine incidents within a policy envelope you define. Tracing supplies the evidence; Nova reads it, reaches the diagnosis, and acts.

Related guides

This page is the deep dive on the traces pillar. For the parent practice and the other two pillars, start with observability (metrics, logs, and traces together), then the AI and LLM-specific angle in AI observability. On the rest of the telemetry stack: monitoring, the four golden signals, microservices monitoring, and anomaly detection. On turning trace evidence into resolution: root cause analysis, MTTR, AIOps, incident management, and AI incident response. On the reliability foundations and the action layer: site reliability engineering, AI SRE, Agentic SRE, self-healing infrastructure, and SLOs and error budgets. On the broader practice: DevOps, DevOps automation, and LLMOps. See the full platform on Nova features.

Turn your traces into resolved incidents.

Nova AI Ops is the Multi Agent Operating System for SRE, DevOps, and Reliability Teams. It consumes the traces you already produce with OpenTelemetry, correlates them with your logs and metrics across AWS, GCP, Azure, Linux, and Windows, ties each anomaly to the responsible span, and auto-resolves routine incidents within your policy envelope. Free tier available for small teams.

