Token Usage as a First-Class Observability Signal

Token spikes precede most agent failures by several seconds. This section covers the dashboard panels, the alerts, and the auto-throttle policy that use the token signal as a leading indicator.

Why tokens are a leading indicator

Token usage is the early-warning signal. Most agent failures show up as token spikes several seconds before they show up as latency or cost spikes. Token spikes precede loops, prompt-size explosions, and runaway tool chains, whereas cost lags: by the time it shows up on the dashboard, the failure is already happening. Treat tokens as an SLO-relevant metric, not just a cost line.

Tokens precede latency and cost. Several seconds ahead; the early-warning surface.
Token spikes precede failures. Loops, prompt explosions, runaway tool chains.
Cost is lagging. By the time it shows on the dashboard, the failure is already happening.
SLO-relevant metric. Tokens are a reliability signal, not just a cost line.

Dashboard panels

Three panels make the signal visible. Tokens per run, as a histogram: watch the long tail, because the p99 is where the failures live. Tokens per minute by agent role, as a time series: spikes are visible at a glance. Tokens per tool call, as a distribution: outliers are tools returning huge payloads, and those payloads should be capped.

Tokens per run histogram. Long tail at p99 is where failures live.
Tokens per minute by role. Time series; spikes visible at a glance.
Tokens per tool call distribution. Outliers are tools returning huge payloads; cap them.
Per-panel review cadence. Review each panel weekly so the panels keep getting attention.
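
The three panels can be fed by small aggregations over raw usage records. The sketch below is one possible shape, not a reference implementation: the record layouts (a list of per-run token counts, `(timestamp_seconds, role, tokens)` events, `(tool_name, tokens)` calls), the function names, and the 8,000-token cap in the test data are all assumptions made for illustration.

```python
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n avoids floating-point rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def tokens_per_run_summary(run_tokens):
    """Summary for the tokens-per-run histogram panel; the tail is the point."""
    return {
        "p50": percentile(run_tokens, 50),
        "p99": percentile(run_tokens, 99),
        "max": max(run_tokens),
    }


def tokens_per_minute_by_role(events):
    """Bucket (timestamp_seconds, role, tokens) events into per-minute totals."""
    series = defaultdict(int)
    for timestamp, role, tokens in events:
        series[(role, int(timestamp // 60))] += tokens
    return dict(series)


def oversized_tool_calls(calls, cap_tokens):
    """Return the (tool_name, tokens) calls whose payload exceeds the cap."""
    return [(tool, tokens) for tool, tokens in calls if tokens > cap_tokens]
```

In this sketch a wide gap between `p50` and `p99` is the long tail the histogram panel is meant to expose, and the output of `oversized_tool_calls` is the list of candidates for a payload cap.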

Alerts

Three alert classes catch most failures. A threshold alert, "any run > 50k tokens", pages on the rate of such runs, not on each individual event. An anomaly alert, "per-minute token rate > 3x baseline", catches collective failures such as a model regression or a prompt-bug rollout. A per-tenant alert, "single tenant tokens/minute > 10x median tenant", catches a stuck integration on one customer.

Threshold > 50k tokens. Page on rate, not on individual events.
Anomaly > 3x baseline. Catches collective failures: regressions, bug rollouts.
Per-tenant > 10x median. Catches stuck integration on one customer.
Per-class threshold tuning. Calibrate each class separately so it fires at the right rate.
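
The three alert classes reduce to three small predicates. The sketch below takes the 50k, 3x, and 10x thresholds from the text; everything else is an assumption for illustration, including the function names, the input shapes, and in particular `PAGE_RATE`, a hypothetical 1% cutoff standing in for "page on rate", which the text does not quantify.

```python
from statistics import median

RUN_TOKEN_LIMIT = 50_000  # "any run > 50k tokens"
ANOMALY_FACTOR = 3        # "per-minute token rate > 3x baseline"
TENANT_FACTOR = 10        # "single tenant > 10x median tenant"
PAGE_RATE = 0.01          # assumption: page when more than 1% of runs exceed the limit


def threshold_alert(run_tokens, limit=RUN_TOKEN_LIMIT, page_rate=PAGE_RATE):
    """Fire on the share of oversized runs, not on any single oversized run."""
    if not run_tokens:
        return False
    over = sum(1 for tokens in run_tokens if tokens > limit)
    return over / len(run_tokens) > page_rate


def anomaly_alert(current_rate, baseline_rate, factor=ANOMALY_FACTOR):
    """Fire when the fleet-wide per-minute rate exceeds factor x baseline."""
    return baseline_rate > 0 and current_rate > factor * baseline_rate


def tenant_alerts(tenant_rates, factor=TENANT_FACTOR):
    """Return tenants whose tokens/minute exceed factor x the median tenant."""
    if not tenant_rates:
        return []
    typical = median(tenant_rates.values())
    if typical <= 0:
        return []
    return sorted(t for t, rate in tenant_rates.items() if rate > factor * typical)
```

Keeping each class as its own function with its own parameters makes it possible to tune one threshold without touching the other two.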

Auto-throttle on token signal

Auto-throttle is conservative protection. When a tenant's token rate exceeds the threshold, the agent service rate-limits that tenant, and the customer's runs either queue or fail fast. This preserves availability for other customers at the cost of immediate availability for the spiking customer. The throttle is also communicated: the customer sees a clear message and the operator gets a notification, so throttling is loud, not silent.

Per-tenant rate limit. Threshold exceeded triggers limit; runs queue or fail fast.
Conservative trade-off. Other customers protected at cost of spiking customer’s availability.
Loud throttling. Customer sees message; operator notified; not silent.
Per-tenant override path. A documented escalation route for cases where the throttle is wrong.
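
A minimal sketch of the throttle policy, assuming a sliding one-minute window and the fail-fast variant (a queueing variant would hold the run instead of rejecting it). The class name `TenantThrottle`, its methods, the `notify` callback, and the limit value are hypothetical; the text does not specify an implementation.

```python
from collections import defaultdict, deque


class TenantThrottle:
    """Per-tenant token rate limit over a sliding window, with loud rejections."""

    def __init__(self, limit_tokens, window_seconds=60, notify=print):
        self.limit = limit_tokens
        self.window = window_seconds
        self.notify = notify              # operator notification hook (assumed)
        self.usage = defaultdict(deque)   # tenant -> deque of (timestamp, tokens)
        self.overrides = set()            # tenants exempted via escalation

    def allow_override(self, tenant):
        """Override path: exempt a tenant when the throttle is judged wrong."""
        self.overrides.add(tenant)

    def record(self, tenant, tokens, now):
        """Record token usage for a tenant at time `now` (seconds)."""
        self.usage[tenant].append((now, tokens))

    def _tokens_in_window(self, tenant, now):
        entries = self.usage[tenant]
        # Drop entries that have aged out of the window.
        while entries and entries[0][0] <= now - self.window:
            entries.popleft()
        return sum(tokens for _, tokens in entries)

    def check(self, tenant, now):
        """Return (allowed, message); the message is what the customer sees."""
        used = self._tokens_in_window(tenant, now)
        if tenant in self.overrides or used <= self.limit:
            return True, ""
        message = (
            f"Tenant {tenant} throttled: {used} tokens in the last "
            f"{self.window}s exceeds the limit of {self.limit}."
        )
        self.notify(message)  # throttling is loud: the operator is told too
        return False, message
```

Only the spiking tenant is rejected; other tenants keep separate usage queues and are unaffected, which is the trade-off the policy describes.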

Post-incident review

The post-incident review tightens the three layers. After every token-related incident, review the timeline: did the dashboard surface it, did alerts fire, did auto-throttle help? Each review tightens one of the three layers: the dashboard gets a panel, an alert gets calibrated, or the throttle gets refined. The target is that by the fourth quarter no token incident surprises the team, because the signal is reliable and the response is automated.

Three-layer review. Did the dashboard surface it, did alerts fire, did auto-throttle help?
Each review tightens one layer. Dashboard panel, alert calibration, throttle refinement.
Q4 target: no surprises. Signal reliable; response automated.
Per-incident learning compounds. Each review improves the next response.