Latency Budgets for Production Agents
Triage agents should respond in seconds, investigation agents within a minute, postmortem drafts within minutes, and audit reports within the hour. This article sets a latency budget for each agent type and shows how to enforce it without hurting quality.
Default budgets by role
Latency budgets are role-specific because operator patience varies wildly between “the page just fired” and “the report runs overnight.” Set the budget against the role, not the model.
- Triage. p95 under 6 seconds. The on-call is staring at the screen; long latency erodes the value proposition immediately.
- Investigation. p95 under 60 seconds. Acceptable when the agent is doing iterative reasoning across many tools.
- Postmortem drafting. p95 under 5 minutes. Nobody is blocked waiting for the draft; quality matters more than speed.
- Audit reports. p95 under 1 hour. Background workload; the operator wants results in the morning, not in real time.
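The defaults above can be kept as data so that enforcement code and dashboards read the same numbers. The following is a minimal sketch, not a prescribed API: `LatencyBudget`, `BUDGETS`, and `budget_for` are hypothetical names, and the 5x hard-timeout multiplier is an assumption for every role except triage, the only role for which this article states it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyBudget:
    """p95 latency target for one agent role (illustrative names)."""
    role: str
    p95_seconds: float
    # Assumption: the hard timeout is a multiple of the p95 target.
    # Only the triage value (5x) comes from the article itself.
    hard_timeout_multiplier: float = 5.0

    @property
    def hard_timeout_seconds(self) -> float:
        return self.p95_seconds * self.hard_timeout_multiplier


BUDGETS = {
    b.role: b
    for b in (
        LatencyBudget("triage", 6),
        LatencyBudget("investigation", 60),
        LatencyBudget("postmortem_drafting", 5 * 60),
        LatencyBudget("audit_report", 60 * 60),
    )
}


def budget_for(role: str) -> LatencyBudget:
    """Look up a role's budget; unknown roles fail loudly rather than run unbudgeted."""
    try:
        return BUDGETS[role]
    except KeyError:
        raise ValueError(f"no latency budget defined for role {role!r}") from None
```

Keying the table by role rather than by model keeps the budget attached to operator expectations, so swapping the model behind a role does not silently change the target.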
Enforcing the budget
Without enforcement, budgets are aspirations. The four controls below convert the SLO into actual runtime behaviour.
- Hard timeout per role. Triage runs are killed at 30 seconds, 5x the 6-second p95 budget. The hard timeout is the safety net that protects upstream SLOs.
- Soft warning at p95. Log a slow-run event when a run crosses the soft threshold. The slow runs route to a debug queue for review.
- Per-step latency caps. A single hanging tool call should not consume the whole run’s budget. Each tool gets its own timeout, much smaller than the run-level cap.
- Per-role concurrency caps. Limit how many simultaneous runs a role can issue. Saturation is a latency tax; capping concurrency keeps p99 from blowing out under load.
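The four controls above can be combined in one wrapper around a run. This is a rough sketch using Python's `asyncio`, under several assumptions: a run is modelled as a list of async callables (one per tool call), `run_with_budget` and `call_tool` are hypothetical names, and the "debug queue" is reduced to a log line. A real agent runtime will have its own cancellation and queueing hooks.

```python
import asyncio
import logging
import time

log = logging.getLogger("agent.latency")


async def call_tool(tool, step_timeout_s):
    """Per-step cap: one hanging tool call cannot consume the whole run budget."""
    return await asyncio.wait_for(tool(), timeout=step_timeout_s)


async def run_with_budget(role, steps, *, p95_s, hard_timeout_s, step_timeout_s, slots):
    """Run `steps` (async callables) under the four controls.

    `slots` is an asyncio.Semaphore shared by every run of the same role.
    """

    async def body():
        results = []
        for step in steps:
            results.append(await call_tool(step, step_timeout_s))
        return results

    async with slots:  # per-role concurrency cap
        # Note: the clock starts after a slot is acquired, so queue wait
        # is not counted here. Start it earlier if queueing should count.
        started = time.monotonic()
        try:
            # Hard timeout: the run is cancelled once it exceeds the cap.
            return await asyncio.wait_for(body(), timeout=hard_timeout_s)
        finally:
            elapsed = time.monotonic() - started
            if elapsed > p95_s:
                # Soft warning: a real system would also enqueue the run for review.
                log.warning(
                    "slow-run role=%s elapsed=%.2fs p95=%.2fs", role, elapsed, p95_s
                )
```

A caller would create one semaphore per role inside the running event loop, for example `asyncio.Semaphore(8)`, and pass it to every run of that role. The cap of 8 is a placeholder; the right value depends on downstream capacity.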
What to optimise when budgets miss
When p95 drifts above the budget, four levers usually account for it. Work through them in roughly this order; the cheap wins are at the top.
- Prompt size. Smaller prompts produce faster responses. Trim unused context, retired few-shot examples, and stale system instructions.
- Cache hits. A prompt-cache hit is much faster than a cold request. Verify the hit rate and move static blocks to the top of the prompt so the cached prefix stays stable.
- Tool latency. A single slow tool call can dominate the run. Optimise the slowest tool first; the rest only matters once the worst offender is below threshold.
- Model choice. A smaller model often produces good-enough quality at much lower latency. Re-evaluate periodically; the optimal pick changes as new model versions ship.
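The cache-hit lever depends on the static part of the prompt being byte-for-byte identical across runs. The sketch below assumes a provider whose prompt cache matches on a stable prefix, which is common but worth confirming for the provider in use. `build_prompt` and `prefix_fingerprint` are hypothetical helpers; the fingerprint is a local diagnostic, not the provider's actual cache key.

```python
import hashlib


def build_prompt(system_instructions, few_shot_examples, retrieved_context, user_message):
    """Order prompt blocks so static content comes first and volatile content last."""
    static_blocks = [system_instructions, *few_shot_examples]
    dynamic_blocks = [*retrieved_context, user_message]
    return static_blocks, dynamic_blocks


def prefix_fingerprint(static_blocks):
    """Hash the static prefix. If this value changes between runs, expect cache misses."""
    digest = hashlib.sha256()
    for block in static_blocks:
        digest.update(block.encode("utf-8"))
        digest.update(b"\x00")  # separator so block boundaries affect the hash
    return digest.hexdigest()
```

Logging the fingerprint alongside each run makes an unstable prefix visible: a timestamp or request ID that leaks into the system instructions shows up as a fingerprint that never repeats.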
Track budgets in production
Latency degrades silently until it does not. The dashboards below catch the drift weeks before an operator complains.
- Per-agent latency dashboard. p50, p95, and p99 over time. Trend lines across weeks tell the story; single days are too noisy.
- Per-step latency breakdown. Helps identify the bottleneck step in slow runs. Without it, you only know the total.
- Per-tool latency. Tools degrade silently as upstream services age; the dashboard catches the slowdown before it shows up in run-level latency.
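- Budget burn-rate alert. Page when the share of runs exceeding the p95 budget climbs above 10 percent over a 30-minute window. A healthy role sits near 5 percent by definition, so 10 percent means the budget is burning at twice the expected rate. Slow drift is what kills SLOs.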
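The dashboard numbers and the burn-rate check above reduce to a few small computations over recorded run latencies. This is a minimal sketch with hypothetical function names. It assumes latencies for the alert window are already collected in a list, and it uses the nearest-rank percentile definition; metrics backends often interpolate instead, so values can differ slightly.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies in seconds."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def over_budget_share(samples, budget_s):
    """Fraction of runs in the window that exceeded the p95 budget."""
    if not samples:
        return 0.0
    return sum(1 for s in samples if s > budget_s) / len(samples)


def should_page(window_samples, budget_s, threshold=0.10):
    """True when more than `threshold` of runs in the window exceeded the budget."""
    return over_budget_share(window_samples, budget_s) > threshold


def bottleneck_step(step_durations):
    """Name of the slowest step in one run, given a {step_name: seconds} mapping."""
    return max(step_durations, key=step_durations.get)
```

For example, with a 6-second triage budget, a 30-minute window in which 3 of 20 runs took longer than 6 seconds gives a share of 15 percent, so `should_page` returns `True`.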
When to trade quality for latency
The trade is product-specific. Document the chosen direction so the eval team and the prompt team optimise for the same target.
- Triage: yes. The on-call cannot wait; a 90 percent accurate triage in 4 seconds is more valuable than a 95 percent accurate one in 12 seconds.
- Postmortem: no. Wait the extra minute for higher quality. Nobody is paged on the latency of a postmortem draft.
- Document the choice. “Triage agent prefers latency over the last 5 percent of accuracy” is a written decision, not a default. Pin it next to the eval set.
- Re-visit quarterly. Cheaper, faster models change the math. Trade-offs that made sense last quarter may not survive the next benchmark cycle.