Token Usage as a First-Class Observability Signal
Token spikes precede most agent failures by several seconds. This piece covers the dashboard panels, the alerts, and the auto-throttle policy that use the token signal as a leading indicator.
Why tokens are a leading indicator
Token usage is the early-warning signal. Most agent failures show up as token spikes several seconds before they show up as latency or cost spikes. Token spikes precede loops, prompt-size explosions, and runaway tool chains, whereas cost lags: by the time it shows up on the dashboard, the failure is already happening. Treat tokens as an SLO-relevant metric, not just a cost line.
- Tokens precede latency and cost. Several seconds ahead; the early-warning surface.
- Token spikes precede failures. Loops, prompt explosions, runaway tool chains.
- Cost is lagging. By the time it shows on the dashboard, failure is already happening.
- SLO-relevant metric. Tokens are a reliability signal; not just a cost line.
Dashboard panels
Three panels make the signal visible. Tokens per run, as a histogram: watch the long tail, because the p99 is where the failures live. Tokens per minute by agent role, as a time series: spikes are visible at a glance. Tokens per tool call, as a distribution: outliers are tools returning huge payloads, and those payloads should be capped.
- Tokens per run histogram. Long tail at p99 is where failures live.
- Tokens per minute by role. Time series; spikes visible at a glance.
- Tokens per tool call distribution. Outliers are tools returning huge payloads; cap them.
- Per-panel review cadence. Review each panel weekly so the signal keeps getting attention.
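The two tail-focused panels can be approximated offline before any dashboard exists. The following is a minimal sketch, assuming token counts are available as plain Python lists; the helper names (`percentile`, `tools_over_cap`), the nearest-rank percentile method, and the 8,000-token cap are illustrative assumptions, not part of any particular metrics stack.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def tools_over_cap(tool_calls, cap_tokens):
    """Return {tool_name: p99_tokens} for tools whose p99 payload exceeds the cap.

    tool_calls is an iterable of (tool_name, tokens) pairs.
    """
    by_tool = defaultdict(list)
    for tool, tokens in tool_calls:
        by_tool[tool].append(tokens)
    flagged = {}
    for tool, sizes in by_tool.items():
        p99 = percentile(sizes, 99)
        if p99 > cap_tokens:
            flagged[tool] = p99
    return flagged


# Hypothetical sample: most runs are small, two sit in the long tail.
run_tokens = [1_000] * 98 + [60_000, 80_000]
p50 = percentile(run_tokens, 50)
p99 = percentile(run_tokens, 99)

# Hypothetical tool calls: one tool occasionally returns a huge payload.
calls = [("search", 500)] * 50 + [("fetch_page", 400)] * 49 + [("fetch_page", 30_000)]
flagged = tools_over_cap(calls, cap_tokens=8_000)  # cap value is an assumption
```

In this sample the median looks healthy while the p99 does not, which is the reason to plot the histogram rather than an average.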
Alerts
Three alert classes catch most failures. A threshold alert on "any run > 50k tokens" pages on the rate of such runs, not on individual events. An anomaly alert on "per-minute token rate > 3x baseline" catches collective failures such as a model regression or a prompt-bug rollout. A per-tenant alert on "single tenant tokens/minute > 10x median tenant" catches a stuck integration on one customer.
- Threshold > 50k tokens. Page on rate, not on individual events.
- Anomaly > 3x baseline. Catches collective failures: regressions, bug rollouts.
- Per-tenant > 10x median. Catches stuck integration on one customer.
- Per-class threshold tuning. Calibrate each class separately so it fires at the right rate.
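The three alert classes can be sketched as one evaluation function. This is a minimal illustration, not a definitive rule set: the 50k, 3x, and 10x values come from the text above, while `AlertConfig`, `evaluate_alerts`, and the 1% paging rate for the threshold alert are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class AlertConfig:
    run_token_limit: int = 50_000      # from the text: any run > 50k tokens
    max_over_limit_rate: float = 0.01  # assumed: page when > 1% of recent runs exceed the limit
    anomaly_multiplier: float = 3.0    # from the text: > 3x baseline
    tenant_multiplier: float = 10.0    # from the text: > 10x median tenant


def evaluate_alerts(run_tokens, tokens_per_minute, baseline_per_minute,
                    tenant_rates, cfg=AlertConfig()):
    """Return the names of the alerts that should fire for one evaluation window."""
    alerts = []

    # Threshold alert: page on the rate of oversized runs, not on each run.
    if run_tokens:
        over = sum(1 for t in run_tokens if t > cfg.run_token_limit)
        if over / len(run_tokens) > cfg.max_over_limit_rate:
            alerts.append("threshold")

    # Anomaly alert: fleet-wide token rate against its baseline.
    if baseline_per_minute > 0 and tokens_per_minute > cfg.anomaly_multiplier * baseline_per_minute:
        alerts.append("anomaly")

    # Per-tenant alert: one tenant far above the median tenant.
    if tenant_rates:
        med = median(tenant_rates.values())
        if med > 0:
            for tenant in sorted(tenant_rates):
                if tenant_rates[tenant] > cfg.tenant_multiplier * med:
                    alerts.append(f"tenant:{tenant}")

    return alerts


# Hypothetical window: 3% of runs oversized, fleet at 4x baseline, one noisy tenant.
fired = evaluate_alerts(
    run_tokens=[2_000] * 97 + [60_000] * 3,
    tokens_per_minute=40_000,
    baseline_per_minute=10_000,
    tenant_rates={"a": 100, "b": 120, "c": 110, "d": 5_000},
)
quiet = evaluate_alerts([2_000] * 100, 10_000, 10_000, {"a": 100, "b": 120})
```

Keeping the multipliers in one config object makes the per-class tuning explicit: each value can be adjusted after a review without touching the evaluation logic.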
Auto-throttle on token signal
Auto-throttle is conservative protection. When a tenant's token rate exceeds the threshold, the agent service rate-limits that tenant, and the customer's runs either queue or fail fast. The trade-off is deliberate: auto-throttle preserves availability for other customers at the cost of immediate availability for the spiking customer. The throttle is also communicated: the customer sees a clear message and the operator gets a notification, so throttling is loud, not silent.
- Per-tenant rate limit. Threshold exceeded triggers limit; runs queue or fail fast.
- Conservative trade-off. Other customers protected at cost of spiking customer’s availability.
- Loud throttling. Customer sees message; operator notified; not silent.
- Per-tenant override path. A documented escalation route lifts the throttle when it fires wrongly.
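One way to sketch the throttle described above is a per-tenant sliding window over the last 60 seconds that fails fast once the limit is exceeded. Everything named here (`TenantTokenThrottle`, `Decision`, the `notify_operator` callback, the in-memory storage, the message wording) is a hypothetical illustration; a production service would more likely keep the counters in shared storage and could queue runs instead of rejecting them.

```python
import time
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    allowed: bool
    message: str  # shown to the customer, so throttling is never silent


class TenantTokenThrottle:
    WINDOW_SECONDS = 60

    def __init__(self, limit_per_minute, notify_operator=print, clock=time.monotonic):
        self.limit = limit_per_minute
        self.notify_operator = notify_operator  # assumed callback, e.g. a pager hook
        self.clock = clock
        self.events = defaultdict(deque)        # tenant -> deque of (timestamp, tokens)
        self.overrides = set()                  # tenants exempted via escalation

    def allow_override(self, tenant):
        """Escalation path: exempt a tenant when the throttle is wrong."""
        self.overrides.add(tenant)

    def record(self, tenant, tokens):
        """Record tokens consumed by a finished model or tool call."""
        self.events[tenant].append((self.clock(), tokens))

    def _used(self, tenant):
        now = self.clock()
        window = self.events[tenant]
        while window and now - window[0][0] >= self.WINDOW_SECONDS:
            window.popleft()  # drop usage older than the window
        return sum(tokens for _, tokens in window)

    def check(self, tenant):
        """Decide whether the tenant may start another run (fail-fast variant)."""
        if tenant in self.overrides:
            return Decision(True, "ok (override)")
        used = self._used(tenant)
        if used > self.limit:
            self.notify_operator(
                f"throttling tenant {tenant}: {used} tokens in last 60s (limit {self.limit})"
            )
            return Decision(
                False,
                f"Runs are temporarily limited: token usage exceeded {self.limit} per minute.",
            )
        return Decision(True, "ok")
```

A caller would invoke `check(tenant)` before starting a run and `record(tenant, tokens)` after each model or tool call. Injecting `clock` and `notify_operator` keeps the sketch testable without real time or a real paging system.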
Post-incident review
The post-incident review tightens the three layers. After every token-related incident, review the timeline: did the dashboard surface it, did the alerts fire, did auto-throttle help? Each review tightens one of the three layers: the dashboard gets a panel, an alert gets recalibrated, or the throttle gets refined. The target is that by the fourth quarter no token incident surprises the team, because the signal is reliable and the response is automated.
- Three-layer review. Did the dashboard surface it? Did the alerts fire? Did auto-throttle help?
- Each review tightens one layer. Dashboard panel, alert calibration, throttle refinement.
- Q4 target: no surprises. Signal reliable; response automated.
- Per-incident learning compounds. Each review improves the next response.