Exemplars: The Missing Link Between Metrics and Traces
Metrics tell you something is slow. Traces tell you what was slow. Exemplars are the bridge: each metric data point can carry a trace ID, so you can jump from "latency is up" directly to a representative slow trace.
What an exemplar is
An exemplar is a metric data point with extra metadata attached: typically a trace ID and a few labels. When you see a latency p99 spike on a chart, exemplars let you click that point and jump to a real trace from that exact bucket.
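Concretely, in the OpenMetrics exposition format an exemplar rides along after a `#` on the sample it annotates. A histogram bucket line might look like this (metric name, trace ID, and values are illustrative):

```
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.5"} 1234 # {trace_id="4bf92f3577b34da6a3ce929d0e0e4736"} 0.42 1700000000.000
```

The bucket's cumulative count is 1234; the exemplar says "one observation in this bucket was 0.42s, recorded at this timestamp, and here is its trace ID."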
The capability bridge. Metrics tell you something is slow (across millions of requests, p99 went up). Traces tell you why an individual slow request was slow. Exemplars let you go from one to the other in 30 seconds; without them, you spend 20 minutes searching for a representative slow trace.
The deeper benefit. Exemplars preserve the connection between aggregate signal and individual instance. The team that has them debugs at the speed of dashboards; the team that doesn't debugs at the speed of correlation queries.
How they work
The instrumentation library samples a small fraction of measurements (1 per histogram bucket per scrape interval) and includes the active trace ID. The TSDB stores it alongside the count. The frontend (Grafana, Datadog) renders it as a clickable dot.
The implementation detail. Every histogram bucket can hold one exemplar per scrape interval (default 30 seconds). The exemplar represents one observation that fell into that bucket. The selection is random; it's a representative sample, not a worst case.
The space cost. Minimal. One exemplar per bucket per 30-second scrape works out to ~120 exemplars per bucket per hour, or a few thousand per histogram depending on bucket count. Compared to millions of metric samples, this is negligible. Exemplars don't materially affect storage cost.
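The per-bucket selection can be sketched in a few lines. This is a simplified, hypothetical reservoir, not any SDK's actual API, though real instrumentation libraries do something structurally similar per bucket:

```python
import bisect
import random

class ExemplarReservoir:
    """Keeps at most one exemplar per histogram bucket per collection interval."""

    def __init__(self, bucket_bounds):
        self.bounds = bucket_bounds  # e.g. [0.1, 0.5, 1.0] seconds (le-style buckets)
        self.exemplars = {}          # bucket index -> (value, trace_id)
        self.seen = {}               # bucket index -> observations this interval

    def observe(self, value, trace_id):
        i = bisect.bisect_left(self.bounds, value)  # which bucket the value falls in
        n = self.seen.get(i, 0) + 1
        self.seen[i] = n
        # Reservoir sampling with k=1: each observation in the interval has a
        # 1/n chance of becoming the bucket's exemplar, so the pick is uniform
        # random -- a representative sample, not the worst case.
        if random.randrange(n) == 0:
            self.exemplars[i] = (value, trace_id)

    def collect(self):
        """Called once per scrape: hand over exemplars and reset for the next interval."""
        out, self.exemplars, self.seen = self.exemplars, {}, {}
        return out
```

The reset in `collect` is what gives "one exemplar per bucket per scrape interval": each interval starts fresh, so every scrape can surface a new representative observation.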
Dashboards that exploit exemplars
Latency histograms and error-rate charts. The dashboard renders exemplars as small dots overlaid on the chart line. Click → trace view → root cause in 30 seconds instead of 30 minutes.
The Grafana flow. The histogram chart shows exemplars as dots over the line graph. Hover over a dot to see the trace ID. Click to deep-link into the tracing UI (Tempo, Jaeger, etc.). The trace UI loads with the trace ID pre-populated; the engineer is looking at the actual slow request within seconds.
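The click-through depends on the Prometheus data source knowing which exemplar label holds the trace ID and which tracing data source to deep-link into. In Grafana provisioning, that is a short YAML stanza; the names and UID below are placeholders for your setup:

```yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    jsonData:
      # Map the exemplar's trace_id label to the tracing backend for deep links
      exemplarTraceIdDestinations:
        - name: trace_id        # exemplar label that carries the trace ID
          datasourceUid: tempo  # UID of the tracing data source to open
```

Without this mapping, exemplars may render as dots but the click goes nowhere.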
The Datadog flow. Similar pattern; the implementation differs but the user experience is the same: click a metric data point, jump to a trace. Most modern observability platforms support this.
The config most teams forget
Your TSDB ingest needs --enable-feature=exemplar-storage (Prometheus) or equivalent. Most teams enable exemplar emission on the SDK side and forget the storage flag. The exemplars get dropped silently and nobody notices.
The silent failure. The application emits exemplars correctly. The Collector forwards them. The TSDB receives them and drops them silently because exemplar storage isn't enabled. The dashboards don't show exemplars; the engineers assume "we don't have exemplars set up." The flag fix takes 30 seconds.
The verification. Query the TSDB for exemplars on a known histogram. If results are empty after enabling SDK-side emission, the storage flag is missing. Most teams discover this only when they notice their dashboards lack exemplars.
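With Prometheus, the check can be done end to end from the command line. The histogram name and time window below are placeholders; adjust both to your setup:

```
# 1. Start Prometheus with exemplar storage enabled
prometheus --config.file=prometheus.yml --enable-feature=exemplar-storage

# 2. Query stored exemplars for a known histogram via the HTTP API.
#    An empty "data" array means exemplars are being dropped somewhere upstream.
curl -G 'http://localhost:9090/api/v1/query_exemplars' \
  --data-urlencode 'query=http_request_duration_seconds_bucket' \
  --data-urlencode 'start=2024-01-01T00:00:00Z' \
  --data-urlencode 'end=2024-01-01T01:00:00Z'
```

If the SDK-side emission is confirmed and this query returns nothing, the storage flag is the first suspect.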
When exemplars earn their keep
Any team with both metrics and tracing should turn exemplars on. The cost is one config flag and a small amount of additional storage. The savings is every incident where you skip the "find a representative trace" step.
The ROI math. A team running 50 incidents per year, each requiring 5-15 minutes of "find a representative trace" effort, saves roughly 4-12 engineer-hours per year on that step alone. Multiplied across the on-call rotation, the savings are real. The implementation cost is about one engineer-hour total. The ROI is overwhelming.
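The arithmetic, spelled out (the incident count and per-incident minutes are the illustrative figures above, not benchmarks):

```python
incidents_per_year = 50
minutes_saved_per_incident = (5, 15)  # low and high estimates for "find a trace" time

low, high = (incidents_per_year * m / 60 for m in minutes_saved_per_incident)
print(f"{low:.1f} to {high:.1f} engineer-hours saved per year")  # ~4.2 to 12.5
```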
The teams that benefit most. High-traffic services where any individual slow trace is hard to find by searching, and microservice systems where the request path crosses 5+ services. Each is a setting where "find me a slow trace" is the bottleneck.
Where exemplars shine
The slow-tail debug. Latency p99 spiked at 14:32. With exemplars, click the dot at 14:32 in the p99 line; jump to a representative trace from that minute; see exactly which service was slow. Without exemplars, search for traces in the time window, filter by latency, hope a representative trace was sampled.
The error-rate investigation. Error rate spiked at 09:15. With exemplars, click the spike; see a representative failed trace; understand the failure mode in seconds. Without exemplars, search logs/traces, correlate with the metric spike, sometimes spend 30 minutes finding a useful failure trace.
The customer report. Customer says "my requests were slow yesterday." With exemplars on a per-tenant histogram, click the customer's tier; see traces from their requests; understand their specific experience. Without exemplars, search traces by customer ID, hope they were sampled.
Common antipatterns
Emitting without storing. SDK is configured to emit exemplars; TSDB drops them. The most common failure mode; check end-to-end after enabling.
Exemplars on metrics that don't need them. Gauges and derived metrics don't benefit; there is no individual request to link to. Exemplars pay off on histograms and request-scoped counters (latency buckets, error counts), where each observation maps to a traceable request.
Trace retention shorter than metric retention. Click an exemplar from 30 days ago; trace is no longer available because trace retention is 7 days. Match retention windows; if metrics are kept 30 days, traces should be too (or accept old exemplars are dead links).
Exemplars without trace context propagation. Exemplars exist but the trace IDs don't lead to coherent traces because trace context wasn't propagated across services. Fix the trace propagation; the exemplars are useless without it.
What to do this week
Three moves. (1) Verify exemplar storage is enabled in your TSDB. Most teams haven't checked. (2) Pick your most-used latency dashboard. Confirm exemplars render on the panels. If not, the storage flag, SDK config, or tracing context propagation has a gap. (3) Train the on-call team on the exemplar workflow. The investment isn't useful until engineers know to click the dots.