Token Usage as a First-Class Observability Signal

Token spikes precede most agent failures by several seconds. This section covers the dashboard panels, the alerts, and the auto-throttle policy that use the token signal as a leading indicator.

Why tokens are a leading indicator

Token usage is the early-warning signal. Most agent failures show up as token spikes several seconds before they show up as latency or cost spikes. Token spikes precede loops, prompt-size explosions, and runaway tool chains; by the time cost shows up on the dashboard, the failure is already happening. Treat tokens as an SLO-relevant metric, not just a cost line.

Dashboard panels

Three panels make the signal visible. The first is tokens per run as a histogram: watch the long tail, because the p99 is where the failures live. The second is tokens per minute by agent role as a time series, where spikes are visible at a glance. The third is tokens per tool call as a distribution: outliers are tools that return huge payloads, and those payloads should be capped.
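The numbers behind the three panels can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular dashboard backend: the function names, the tuple shapes of the input events, and the cap value passed to oversized_tool_calls are all assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence (pct in 0..100)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile of empty sequence")
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def tokens_per_minute_by_role(events):
    """Bucket token counts into one-minute bins per agent role.

    events: iterable of (unix_seconds, role, tokens) tuples (assumed shape).
    Returns {(role, minute_start_seconds): total_tokens}.
    """
    series = defaultdict(int)
    for ts, role, tokens in events:
        minute_start = int(ts // 60) * 60
        series[(role, minute_start)] += tokens
    return dict(series)


def oversized_tool_calls(calls, cap_tokens):
    """Return the sorted names of tools with at least one call above the cap.

    calls: iterable of (tool_name, tokens) tuples (assumed shape).
    cap_tokens: hypothetical per-call cap chosen by the operator.
    """
    return sorted({tool for tool, tokens in calls if tokens > cap_tokens})
```

Calling percentile(run_tokens, 99) on a window of recent runs gives the tail value the histogram panel is meant to expose, and the tools returned by oversized_tool_calls are the candidates for a payload cap.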

Alerts

Three alert classes catch most failures. The threshold alert, “any run > 50k tokens”, pages on the rate of such runs, not on individual events. The anomaly alert, “per-minute token rate > 3x baseline”, catches collective failures such as a model regression or a prompt-bug rollout. The per-tenant alert, “single tenant tokens/minute > 10x median tenant”, catches a stuck integration on one customer.
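The three alert conditions translate fairly directly into predicates. The sketch below uses the thresholds stated above (50k tokens, 3x baseline, 10x median); the function names and the max_rate default of 1% for the paging rate are assumptions for illustration, since the text does not say what rate should page.

```python
from statistics import median

RUN_TOKEN_LIMIT = 50_000   # "any run > 50k tokens"
ANOMALY_FACTOR = 3         # "per-minute token rate > 3x baseline"
TENANT_FACTOR = 10         # "tokens/minute > 10x median tenant"


def oversized_run_rate(run_tokens):
    """Fraction of runs in the window that exceeded the per-run limit."""
    if not run_tokens:
        return 0.0
    over = sum(1 for tokens in run_tokens if tokens > RUN_TOKEN_LIMIT)
    return over / len(run_tokens)


def threshold_alert(run_tokens, max_rate=0.01):
    """Page on the rate of oversized runs, not on a single event.

    max_rate is a hypothetical value; tune it to the service.
    """
    return oversized_run_rate(run_tokens) > max_rate


def anomaly_alert(current_rate, baseline_rate):
    """Fire when the fleet-wide per-minute token rate exceeds 3x baseline."""
    return current_rate > ANOMALY_FACTOR * baseline_rate


def noisy_tenants(rates_by_tenant):
    """Return tenants whose tokens/minute exceed 10x the median tenant."""
    if not rates_by_tenant:
        return []
    typical = median(rates_by_tenant.values())
    return sorted(
        tenant
        for tenant, rate in rates_by_tenant.items()
        if rate > TENANT_FACTOR * typical
    )
```

Comparing against the median rather than the mean keeps the per-tenant check stable when the spiking tenant itself would otherwise drag the reference value upward.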

Auto-throttle on token signal

Auto-throttle is conservative protection. When a tenant’s token rate exceeds the threshold, the agent service rate-limits that tenant, so the customer’s runs queue or fail fast. This preserves availability for other customers at the cost of immediate availability for the spiking customer. The throttle is always communicated: the customer sees a clear message and the operator gets a notification, so throttling is loud, not silent.
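One way to sketch this policy is a per-tenant sliding window over the last 60 seconds of token usage. The class below is a hypothetical, in-memory illustration of the fail-fast variant: TenantThrottle, its method names, the notify callback, and the wording of the customer message are all assumptions, and a real service would likely keep this state in a shared store.

```python
import time
from collections import defaultdict, deque


class TenantThrottle:
    """Fail-fast per-tenant throttle over a 60-second sliding window."""

    def __init__(self, limit_per_minute, notify=print, clock=time.monotonic):
        self.limit = limit_per_minute
        self.notify = notify            # operator notification hook (assumed)
        self.clock = clock              # injectable for testing
        self.windows = defaultdict(deque)  # tenant -> deque of (time, tokens)
        self.throttled = set()

    def record(self, tenant, tokens):
        """Record tokens consumed by a tenant at the current time."""
        self.windows[tenant].append((self.clock(), tokens))

    def _used(self, tenant, now):
        """Drop entries older than 60 seconds and sum what remains."""
        window = self.windows[tenant]
        while window and now - window[0][0] >= 60:
            window.popleft()
        return sum(tokens for _, tokens in window)

    def admit(self, tenant):
        """Return (allowed, customer_message) for a new run."""
        over = self._used(tenant, self.clock()) > self.limit
        if not over:
            self.throttled.discard(tenant)
            return True, ""
        if tenant not in self.throttled:
            # Notify the operator once per throttle episode: loud, not silent.
            self.throttled.add(tenant)
            self.notify(f"tenant {tenant} throttled: above {self.limit} tokens/min")
        return False, "Token rate limit reached for this account; please retry shortly."
```

Only the spiking tenant is refused; other tenants keep their own windows and are admitted as usual. Returning a message alongside the decision keeps the customer-facing explanation attached to every rejected run.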

Post-incident review

The post-incident review tightens the three layers. After every token-related incident, review the timeline: did the dashboard surface it, did alerts fire, and did auto-throttle help? Each review tightens one of the three layers: the dashboard gets a panel, an alert gets calibrated, or the throttle gets refined. The target is that by quarter four no token incident surprises the team, because the signal is reliable and the response is automated.