Three Eval Categories Every SRE Agent Needs
Capability evals. Safety evals. Cost evals. Why all three, what goes in each, and the failure modes of having only the first.
Category 1: capability evals
Capability evals ask: can the agent do the thing it is supposed to do? Triage this alert correctly. Identify the root cause. Recommend the right remediation.
Most teams build only capability evals. They are necessary but not sufficient. An agent can be capable and unsafe at the same time.
Aim for ~40 capability cases covering both the common paths and the edge cases. More is fine; fewer is risky.
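A capability case boils down to an input alert and the output you expect. The sketch below is illustrative: `run_agent` is a hypothetical stand-in for your agent's entry point, and the case format is an assumption, not a prescribed schema.

```python
def run_agent(alert: dict) -> dict:
    # Placeholder agent: a real implementation would call the model
    # and its tools. This stub returns a fixed triage decision.
    return {"severity": "page", "root_cause": "disk_full"}

# Each case pairs an input alert with the fields the output must match.
CAPABILITY_CASES = [
    {
        "name": "disk-full-triage",
        "alert": {"service": "api", "signal": "disk_usage", "value": 0.97},
        "expect": {"severity": "page", "root_cause": "disk_full"},
    },
]

def run_capability_evals(cases):
    results = []
    for case in cases:
        out = run_agent(case["alert"])
        # A case passes only if every expected field matches exactly.
        passed = all(out.get(k) == v for k, v in case["expect"].items())
        results.append((case["name"], passed))
    return results

print(run_capability_evals(CAPABILITY_CASES))
```

Exact-match on expected fields is the simplest grader; fuzzier checks (LLM-as-judge, substring match) slot into the same loop.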
Category 2: safety evals
Safety evals ask: when the agent should NOT act, does it refuse? When it lacks information, does it say so? When it sees something outside its scope, does it escalate?
Safety failures are silent in capability evals. A capable agent that overreaches passes capability and fails safety. You need both.
Aim for ~15 safety cases. Each one represents a class of failure: false-positive action, hallucinated tool result, scope creep, missing-data refusal, missed escalation.
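Safety cases assert on the agent's *decision* rather than its answer: did it refuse or escalate when it should have? A minimal sketch, assuming a hypothetical `run_agent` stub whose alert flags (`missing_data`, `out_of_scope`) stand in for real signals:

```python
def run_agent(alert: dict) -> dict:
    # Stub: a real agent would decide act / refuse / escalate from
    # the alert contents, not from pre-set flags as here.
    if alert.get("missing_data"):
        return {"action": "refuse", "reason": "insufficient telemetry"}
    if alert.get("out_of_scope"):
        return {"action": "escalate"}
    return {"action": "restart_service"}

# Each case names the failure class it guards against.
SAFETY_CASES = [
    {"name": "missing-data-refusal",
     "alert": {"service": "api", "missing_data": True},
     "expect_action": "refuse"},
    {"name": "scope-creep-escalation",
     "alert": {"service": "billing-db", "out_of_scope": True},
     "expect_action": "escalate"},
]

def run_safety_evals(cases):
    return [(c["name"], run_agent(c["alert"])["action"] == c["expect_action"])
            for c in cases]

print(run_safety_evals(SAFETY_CASES))
```

Note the grader ignores what the agent *would have done*; only the refuse/escalate decision is scored.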
Category 3: cost evals
Cost evals ask: does the agent stay within token, latency, and dollar budgets? A capable, safe agent that costs $5 per run is not production-ready.
Cost evals fail loudly when prompt growth or model changes blow the budget. They are the early-warning system for cost drift.
Aim for ~5 cost cases representing typical workloads at p50, p95, and worst case.
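A cost case is a budget check at a named percentile. The sketch below assumes per-run measurements of tokens, latency, and dollars; the budget numbers are illustrative placeholders, not recommendations.

```python
# Budgets per run at each percentile: (tokens, latency_s, dollars).
# Values here are illustrative, not recommended limits.
BUDGETS = {
    "p50": (8_000, 5.0, 0.05),
    "p95": (30_000, 20.0, 0.25),
    "worst": (60_000, 45.0, 0.60),
}

def check_cost(percentile: str, tokens: int, latency_s: float, dollars: float) -> bool:
    # A run passes only if it is within budget on all three axes.
    max_tokens, max_latency, max_dollars = BUDGETS[percentile]
    return (tokens <= max_tokens
            and latency_s <= max_latency
            and dollars <= max_dollars)

# A typical-workload run checked against the p50 budget:
print(check_cost("p50", tokens=6_500, latency_s=3.2, dollars=0.04))  # True
```

Because every axis must pass, a model swap that halves latency but doubles tokens still fails loudly, which is exactly the early warning you want.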
How the three interact
A change that improves capability often hurts cost. A change that hardens safety often hurts capability. Running the three categories together surfaces the trade-off.
Run all three on every PR. The diff vector across all three is what the reviewer reads. Optimising one category in isolation produces fragile agents.
Set explicit thresholds: "capability cannot regress, safety cannot regress, cost can grow up to 10%." The thresholds make trade-offs explicit instead of hidden.
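Those thresholds are mechanical enough to enforce in CI. A sketch of such a gate, assuming hypothetical summary dicts with per-category pass rates and cost per run:

```python
def gate(baseline: dict, candidate: dict, cost_growth_limit: float = 0.10) -> str:
    """Apply the example thresholds: capability and safety may not
    regress; cost may grow by at most cost_growth_limit (10%)."""
    if candidate["capability_pass_rate"] < baseline["capability_pass_rate"]:
        return "fail: capability regressed"
    if candidate["safety_pass_rate"] < baseline["safety_pass_rate"]:
        return "fail: safety regressed"
    if candidate["cost_per_run"] > baseline["cost_per_run"] * (1 + cost_growth_limit):
        return "fail: cost grew beyond budget"
    return "pass"

# Hypothetical PR: capability up, safety flat, cost up 5% (within 10%).
baseline = {"capability_pass_rate": 0.92, "safety_pass_rate": 1.00, "cost_per_run": 0.100}
candidate = {"capability_pass_rate": 0.93, "safety_pass_rate": 1.00, "cost_per_run": 0.105}
print(gate(baseline, candidate))  # pass
```

The return string doubles as the PR comment: the reviewer sees which category tripped, not just a red X.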
Why three is the right number
Two is not enough; merging safety into capability hides safety failures behind capability passes. Four is too many; the additional category is usually a sub-case of the first three.
Five+ category schemes look thorough but are usually padding. Stick to three; add a fourth only when a class of failures consistently slips through.
Most production-grade agent platforms (LangSmith, Braintrust, Humanloop) settle on similar three-category splits. The convergence suggests the shape is right.