Analytics API: Exact Time-Window Attribution
Multi-request traces now report exact time-window attribution. Here's the query model, the JSON shape that comes back, and the dashboards that fall out of it for free.
The attribution problem
A trace can span ten services, three databases, and a queue. When you ask "where did the time go?" the honest answer is rarely "service X took 280ms"; it's "service X held the request for 280ms but spent 240ms of that waiting on service Y." The old analytics API conflated those two and pinned all the blame on whichever service sat closest to the user. Engineers got tired of losing arguments to numbers that were technically correct and operationally misleading.
Exact time-window attribution fixes the math. Every span in a trace gets a "self time" (work done in this span) and a "wait time" (time spent blocked on spans this span was awaiting). When you query for service-level attribution over a window, the response splits both. Service X's contribution to a 500ms p99 might be 40ms self plus 460ms blocking on Y, which is a very different story from "X is slow."
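To make the split concrete, here's a minimal sketch of that service X scenario. The values are invented; the field names mirror the response shape described later in this post.

```ts
// Hypothetical attribution row for service X (illustrative values only).
const serviceX = {
  key: "service-x",
  self_ms: 40,   // work done in X's own spans
  wait_ms: 460,  // time X spent blocked on downstream spans (Y)
  total_ms: 500, // self_ms + wait_ms
};

// The invariant the split maintains: total minus wait is self.
console.assert(serviceX.total_ms - serviceX.wait_ms === serviceX.self_ms);
```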
The query model
The API takes four required parameters: start and end (RFC3339 timestamps), group_by (one of service, operation, endpoint, tenant), and aggregate (one of p50, p95, p99, sum, count). Optional filters layer on top: service tag, operation name, error class, deployment ID. The query language is JSON over POST; we considered a SQL-like dialect and decided it added complexity without adding power for the queries that actually run.
A representative request looks like this: group by service, aggregate p99 latency, filter to environment=prod, time window 1 hour, with the self/wait split enabled. The response is a sorted array of services with their p99 self time, p99 wait time, p99 total, and the count of contributing spans. Results always come back sorted by total, descending; if you want a different sort, sort client-side.
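A sketch of that request as JSON over POST. The endpoint URL, the exact filter shape, and the name of the split flag aren't spelled out here, so treat api.example.com, the filters object, and include_split as placeholders; the four required parameters are as documented above.

```ts
// Hypothetical endpoint path and split-flag name; required params per the docs.
const response = await fetch("https://api.example.com/v1/attribution/query", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    start: "2024-05-01T14:00:00Z",    // RFC3339, required
    end: "2024-05-01T15:00:00Z",      // 1-hour window
    group_by: "service",              // service | operation | endpoint | tenant
    aggregate: "p99",                 // p50 | p95 | p99 | sum | count
    filters: { environment: "prod" }, // optional; exact filter shape assumed
    include_split: true,              // placeholder name for the self/wait split
  }),
});
const body = await response.json();
```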
Windows accept sub-second precision. We've had teams pull 30-second windows during active incidents to see exactly which service was the bottleneck during a specific spike. The query latency for a 30-second window is under 100ms; for a 24-hour window, under 800ms; for a 7-day window, under 4 seconds — all measured at p95 across our largest tenant.
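Because windows accept sub-second precision, an incident-time query can be exactly as narrow as the spike. A sketch of such a 30-second window (timestamps illustrative):

```ts
// A 30-second window around a latency spike; fractional seconds are allowed.
const incidentQuery = {
  start: "2024-05-01T14:23:10.500Z",
  end: "2024-05-01T14:23:40.500Z",
  group_by: "service",
  aggregate: "p99",
};
```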
JSON response shape
The response is intentionally flat and easy to feed into anything. Top level: window (with start, end, seconds), group_by, aggregate, and results. Each result has key (the group-by value), self_ms, wait_ms, total_ms, count, and contribution_pct (this group's share of total time across all groups in the window).
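Spelled out as a sketch, with invented values and the fields listed above:

```ts
// Example response body; every field name below is from the docs.
const exampleResponse = {
  window: {
    start: "2024-05-01T14:00:00Z",
    end: "2024-05-01T15:00:00Z",
    seconds: 3600,
  },
  group_by: "service",
  aggregate: "p99",
  results: [
    // Sorted by total_ms descending.
    { key: "billing-api", self_ms: 40, wait_ms: 460, total_ms: 500, count: 18234, contribution_pct: 47.0 },
    { key: "ledger", self_ms: 310, wait_ms: 25, total_ms: 335, count: 17990, contribution_pct: 31.5 },
  ],
};
```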
The contribution_pct field is the one that makes attribution conversations short. "Service X is responsible for 47% of total wait time in this window" is unambiguous. Engineers stop arguing about whose fault it is and start arguing about how to fix it. That's the conversation we want.
Pagination is cursor-based; large result sets stream as NDJSON if you set format=ndjson. We default to JSON for the first 200 results and require explicit pagination beyond that. Most queries don't need pagination because most group-by fields have low cardinality.
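A sketch of consuming the NDJSON form. The format=ndjson parameter is documented; whether it rides as a query parameter or a body field isn't, so the query-string form here is an assumption.

```ts
const start = "2024-05-01T14:00:00Z";
const end = "2024-05-01T15:00:00Z";

// format=ndjson streams large result sets one JSON object per line.
const res = await fetch(
  "https://api.example.com/v1/attribution/query?format=ndjson",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ start, end, group_by: "endpoint", aggregate: "p99" }),
  },
);
const rows = (await res.text())
  .split("\n")
  .filter((line) => line.length > 0)
  .map((line) => JSON.parse(line)); // one result row per line
```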
Performance characteristics
The query path uses pre-aggregated rollups for windows over 5 minutes and live span data for windows under 5 minutes. The handoff is invisible to callers: same JSON shape, same semantics, just faster for the long windows. The rollups are computed every 30 seconds in a background pipeline, so recent data is always within 30 seconds of being queryable as a rollup.
Cardinality control is on you. Group-by service gives you ~50-200 results in a typical tenant; group-by endpoint can give you tens of thousands. We cap the response at 10,000 result rows by default and surface a warning if your query truncates. The cap is configurable up to 100,000 if you really need it.
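If a high-cardinality group-by genuinely needs more than the default, the cap can be raised per query. A sketch, with the caveat that the parameter name (max_results) is a placeholder; only the limits themselves (10,000 default, 100,000 maximum) are as documented.

```ts
// Placeholder knob name for the row cap; the limits are from the docs.
const highCardinalityQuery = {
  start: "2024-05-01T14:00:00Z",
  end: "2024-05-01T15:00:00Z",
  group_by: "endpoint", // can be tens of thousands of values in a big tenant
  aggregate: "p99",
  max_results: 50000,   // assumed name; default cap 10,000, max 100,000
};
```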
Filters are pushdown-enabled. A filter like service=billing-api reduces the working set before aggregation; a filter like error_class=* doesn't. Order your filters most-selective-first; the engine doesn't reorder them for you because reordering can change which index gets used and we'd rather be predictable than clever.
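A sketch of ordering filters most-selective-first. The representation is an assumption (an object whose key order is taken as the application order); the filter fields themselves are the documented ones.

```ts
// Most selective first. The engine applies filters in the order given
// (representation assumed; key order is meaningful here).
const filters = {
  deployment_id: "deploy-7f3a", // most selective: a single deploy
  service: "billing-api",       // narrows to one service
  environment: "prod",          // least selective: most of the fleet
};
// Patterns like error_class=* can't be pushed down, so they don't shrink
// the working set before aggregation.
```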
Free dashboards
Three dashboards fall out of the API for free. The first is a service-attribution dashboard: top 10 services by total time, with the self vs. wait split as stacked bars. The second is an endpoint hot-list: top 50 endpoints by p99, filtered to the last hour, refreshing every 30 seconds (the query behind it is sketched below). The third is a tenant-attribution dashboard for multi-tenant services: which tenant is consuming the most service time right now.
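As a concrete example, the endpoint hot-list is just the documented query shape with group_by set to endpoint over the last hour; the dashboard re-issues it every 30 seconds. Timestamps here are illustrative.

```ts
// The query behind the endpoint hot-list: top endpoints by p99, last hour.
const hotListQuery = {
  start: "2024-05-01T14:00:00Z", // the dashboard recomputes this as now - 1h
  end: "2024-05-01T15:00:00Z",
  group_by: "endpoint",
  aggregate: "p99",
};
// Results come back sorted by total descending; the dashboard keeps the top 50.
```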
All three ship as templates in Dashboard Studio. They're built from the same API the docs describe; nothing custom under the hood. If you want to clone one and adjust the filters or aggregates, the source manifest is one click away. We've found teams adjust the time windows and the group-by axis far more often than the rest of the layout.
The detail that makes the most difference in practice: the self/wait split. Engineers who used to spend incident calls arguing "no, we're not slow, the database is slow" can now point to a number that says exactly that. Total time minus wait time equals self time; if self time is small, the service is fast and the dependency is the issue. The API just makes it explicit.