Token Usage as a First-Class Observability Signal
Token spikes precede most agent failures by several seconds. This section covers the dashboard panels, the alerts, and the auto-throttle policy that use the token signal as a leading indicator.
Why tokens are a leading indicator
Most agent failures show up as token spikes before they show up as latency or cost spikes. The token signal is several seconds ahead.
Loops, prompt-size explosions, and runaway tool chains all surface as token spikes first. By the time the cost shows up on the dashboard, the failure is already happening.
Treat tokens as an SLO-relevant metric, not just a cost line.
Dashboard panels
Tokens per run: histogram. Watch for the long tail; the p99 is where the failures live.
Tokens per minute by agent role: time series. Spikes are visible at a glance.
Tokens per tool call: distribution. Outliers are tools that are returning huge payloads; cap them.
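The statistics behind the first and third panels can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual query: the function names, the sample data, and the 8,000-token cap are assumptions chosen for the example, and the percentile uses the simple nearest-rank method.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def oversized_tools(tokens_by_tool, cap):
    """Return tools whose largest single call exceeded the cap, worst offender first."""
    worst = {tool: max(calls) for tool, calls in tokens_by_tool.items() if calls}
    return sorted((t for t, w in worst.items() if w > cap), key=lambda t: -worst[t])


# Hypothetical sample: nine ordinary runs and one runaway run.
run_tokens = [1200, 1500, 1800, 2100, 2400, 2600, 3000, 3500, 4200, 61000]
print(percentile(run_tokens, 50))  # 2400
print(percentile(run_tokens, 99))  # 61000

# Hypothetical per-tool samples; the cap of 8000 tokens is an assumed value.
tool_tokens = {"search": [300, 400], "fetch_page": [500, 42000], "sql": [9000, 100]}
print(oversized_tools(tool_tokens, cap=8000))  # ['fetch_page', 'sql']
```

The median looks healthy while the p99 exposes the runaway run, which is why the histogram panel is read from the tail.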
Alerts
Threshold alert: "any run > 50k tokens." Page on the rate of runs crossing the threshold, not on each individual event.
Anomaly alert: "per-minute token rate > 3x baseline." Catches collective failures (model regression, prompt-bug rollout).
Per-tenant alert: "single tenant tokens/minute > 10x median tenant." Catches a stuck integration on one customer.
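The three rules above can be expressed as small predicates. The sketch below is illustrative only: the 50k, 3x, and 10x values come from the rules as stated, while the function names and the 5% paging share for the threshold alert are assumptions, since the text does not say what rate should page.

```python
from statistics import median

RUN_TOKEN_LIMIT = 50_000      # "any run > 50k tokens"
MAX_OVER_LIMIT_SHARE = 0.05   # assumption: page when more than 5% of recent runs exceed the limit
ANOMALY_FACTOR = 3            # "per-minute token rate > 3x baseline"
TENANT_FACTOR = 10            # "single tenant tokens/minute > 10x median tenant"


def threshold_alert(run_tokens, max_share=MAX_OVER_LIMIT_SHARE):
    """Fire on the share of over-limit runs in the window, not on any single run."""
    if not run_tokens:
        return False
    over = sum(1 for tokens in run_tokens if tokens > RUN_TOKEN_LIMIT)
    return over / len(run_tokens) > max_share


def anomaly_alert(current_rate, baseline_rate):
    """Fire when the fleet-wide per-minute token rate exceeds 3x its baseline."""
    return current_rate > ANOMALY_FACTOR * baseline_rate


def tenant_alerts(rate_by_tenant):
    """Return tenants whose tokens/minute exceed 10x the median tenant, sorted by name."""
    if not rate_by_tenant:
        return []
    typical = median(rate_by_tenant.values())
    return sorted(t for t, rate in rate_by_tenant.items() if rate > TENANT_FACTOR * typical)


# Hypothetical window: two of ten runs exceed the limit, so the rate-based alert fires.
print(threshold_alert([1_000] * 8 + [60_000] * 2))   # True
print(anomaly_alert(current_rate=31_000, baseline_rate=10_000))  # True
print(tenant_alerts({"a": 1_000, "b": 1_200, "c": 900, "d": 50_000}))  # ['d']
```

Comparing each tenant to the median rather than the mean keeps one stuck integration from inflating the baseline it is measured against.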
Auto-throttle on token signal
When a tenant's token rate exceeds the threshold, the agent service rate-limits that tenant. That customer's runs queue or fail fast.
Auto-throttle is conservative: it preserves availability for other customers at the cost of immediate availability for the spiking customer.
The throttle is communicated. The customer sees a clear message; the operator gets a notification. Throttling is loud, not silent.
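A minimal sketch of such a throttle, assuming a sliding one-minute window per tenant and the fail-fast option rather than queuing. The class name, the `notify` callback, and the message text are hypothetical; the current time is passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque


class TenantThrottle:
    """Sliding-window token limiter per tenant that rejects loudly instead of silently."""

    def __init__(self, limit_per_minute, notify, window_seconds=60.0):
        self.limit = limit_per_minute
        self.window = window_seconds
        self.notify = notify                 # operator notification hook (assumed interface)
        self.usage = defaultdict(deque)      # tenant -> deque of (timestamp, tokens)
        self.totals = defaultdict(int)       # tenant -> tokens currently inside the window

    def admit(self, tenant, tokens, now):
        """Return (allowed, message); the message is what the customer would see."""
        events = self.usage[tenant]
        # Drop usage that has aged out of the window.
        while events and events[0][0] <= now - self.window:
            _, old_tokens = events.popleft()
            self.totals[tenant] -= old_tokens
        if self.totals[tenant] + tokens > self.limit:
            # Loud on both sides: notify the operator, tell the customer why.
            self.notify(tenant, self.totals[tenant], self.limit)
            return False, f"Token rate limit reached ({self.limit} tokens/min). Retry shortly."
        events.append((now, tokens))
        self.totals[tenant] += tokens
        return True, "ok"


notices = []
throttle = TenantThrottle(1000, lambda tenant, used, limit: notices.append((tenant, used, limit)))
first = throttle.admit("acme", 600, now=0.0)     # admitted
second = throttle.admit("acme", 600, now=10.0)   # rejected, operator notified
other = throttle.admit("other", 600, now=10.0)   # other tenants are unaffected
later = throttle.admit("acme", 600, now=61.0)    # admitted once the window has moved on
```

Only the spiking tenant is rejected, and each rejection produces both a customer-facing message and an operator notification, matching the policy described above.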
Post-incident review
After every token-related incident, review the timeline. Did the dashboard surface it? Did alerts fire? Did auto-throttle help?
Each review tightens one of the three layers. The dashboard gets a panel; the alert gets calibrated; the throttle gets refined.
Target: by the fourth quarter, no token incident should surprise the team. The signal is reliable; the response is automated.