Three Eval Categories Every SRE Agent Needs
Capability evals. Safety evals. Cost evals. Why all three, what goes in each, and the failure modes of having only the first.
Category 1: capability evals
Capability evals are the category every team starts with. They are necessary but not sufficient on their own; a capable agent can still be unsafe or expensive.
- Question. Can the agent do the thing it is supposed to do? Triage this alert correctly, identify this cause, recommend this action.
- Most teams stop here. A green capability suite says nothing about whether the agent refuses when it should or stays within budget.
- Target size. Aim for around 40 capability cases covering both common and edge cases. More is fine; fewer leaves gaps in coverage.
- Pin the cases. Treat capability cases as load-bearing tests. Removing one requires the same scrutiny as removing a unit test.
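One way to treat capability cases as pinned tests is to keep an explicit list of case IDs and fail the suite when one goes missing. The sketch below is a minimal illustration, assuming a hypothetical `agent` callable that maps an alert string to a recommended action. `CapabilityCase`, `missing_pinned_cases`, and `pass_rate` are illustrative names, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CapabilityCase:
    case_id: str
    alert: str            # input handed to the agent
    expected_action: str  # action a correct triage should recommend


def missing_pinned_cases(
    cases: Iterable[CapabilityCase], pinned_ids: Iterable[str]
) -> List[str]:
    """Return pinned case IDs that are no longer present in the suite."""
    present = {case.case_id for case in cases}
    return sorted(set(pinned_ids) - present)


def pass_rate(
    cases: Iterable[CapabilityCase], agent: Callable[[str], str]
) -> float:
    """Fraction of cases where the (hypothetical) agent picks the expected action."""
    case_list = list(cases)
    if not case_list:
        return 0.0
    passed = sum(
        1 for case in case_list if agent(case.alert) == case.expected_action
    )
    return passed / len(case_list)
```

A CI step could fail the build whenever `missing_pinned_cases` returns a non-empty list, so that deleting a case becomes a deliberate, reviewed change.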
Category 2: safety evals
Safety evals catch the failures that capability evals miss. A capable agent that overreaches passes capability and fails safety; both are required.
- Question. When the agent should not act, does it refuse? When it lacks information, does it say so? When it sees something outside its scope, does it escalate?
- Silent overreach. An agent that takes a plausible action outside its remit still passes capability evals. Only a safety case catches it, so neither category can substitute for the other.
- Failure-class coverage. Aim for around 15 safety cases. Each case represents a failure class, such as false-positive action, hallucinated tool result, scope creep, missing-data refusal, or escalation trigger.
- Adversarial cases. Include adversarial prompts designed to provoke unsafe behaviour. The safety suite is the right place to keep them.
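Safety cases can be tagged with the failure class they represent, so that a gap in coverage is visible at a glance. The following is a sketch under stated assumptions: the enum members mirror the classes listed above, the `agent` callable returning a `Behaviour` is hypothetical, and all names are illustrative.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, List


class FailureClass(Enum):
    FALSE_POSITIVE_ACTION = "false-positive action"
    HALLUCINATED_TOOL_RESULT = "hallucinated tool result"
    SCOPE_CREEP = "scope creep"
    MISSING_DATA_REFUSAL = "missing-data refusal"
    ESCALATION_TRIGGER = "escalation trigger"


class Behaviour(Enum):
    ACT = "act"
    REFUSE = "refuse"
    REPORT_MISSING_INFO = "report missing info"
    ESCALATE = "escalate"


@dataclass(frozen=True)
class SafetyCase:
    case_id: str
    failure_class: FailureClass
    prompt: str
    expected: Behaviour        # what a safe agent should do
    adversarial: bool = False  # True for prompts written to provoke unsafe behaviour


def uncovered_classes(cases: Iterable[SafetyCase]) -> List[FailureClass]:
    """Failure classes with no case in the suite, in enum definition order."""
    covered = {case.failure_class for case in cases}
    return [fc for fc in FailureClass if fc not in covered]


def safety_failures(
    cases: Iterable[SafetyCase], agent: Callable[[str], Behaviour]
) -> List[str]:
    """IDs of cases where the (hypothetical) agent did not behave as expected."""
    return [case.case_id for case in cases if agent(case.prompt) != case.expected]
```

Keeping `expected` as a behaviour rather than an exact output lets the same case survive prompt and model changes.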
Category 3: cost evals
Cost evals are the early-warning system for cost drift. Without them, prompt growth and model changes silently blow the per-run budget.
- Question. Does the agent stay within token, latency, and dollar budgets? A capable, safe agent that costs $5 per run is not production-ready.
- Loud regressions. Cost evals fail loudly when prompt growth or a model change pushes a run over budget. That early warning is what prevents a surprise cost blowup.
- Target size. Aim for around 5 cost cases spanning p50, p95, and worst-case workloads. This category is smaller than the other two on purpose.
- Latency too. Track latency alongside dollar cost. A 2x slowdown shows up here before the SLO dashboard sees it.
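A cost case reduces to comparing measured token, latency, and dollar figures against a budget. The sketch below assumes the metrics have already been collected for each workload; `Budget`, `RunMetrics`, and `budget_violations` are hypothetical names, and the limits are placeholders chosen for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Budget:
    max_tokens: int
    max_latency_s: float
    max_cost_usd: float


@dataclass(frozen=True)
class RunMetrics:
    workload: str  # e.g. "p50", "p95", "worst-case"
    tokens: int
    latency_s: float
    cost_usd: float


def budget_violations(run: RunMetrics, budget: Budget) -> List[str]:
    """Return one message per budget dimension the run exceeded."""
    violations = []
    if run.tokens > budget.max_tokens:
        violations.append(f"{run.workload}: tokens {run.tokens} > {budget.max_tokens}")
    if run.latency_s > budget.max_latency_s:
        violations.append(
            f"{run.workload}: latency {run.latency_s}s > {budget.max_latency_s}s"
        )
    if run.cost_usd > budget.max_cost_usd:
        violations.append(
            f"{run.workload}: cost ${run.cost_usd} > ${budget.max_cost_usd}"
        )
    return violations
```

Reporting each dimension separately keeps the failure loud and specific: a latency regression is not hidden behind a passing dollar figure.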
How the three interact
The three categories trade against each other. Optimising one in isolation produces fragile agents; what the reviewer reads is the diff vector, meaning the change in each category's result relative to the baseline.
- Capability vs cost. Capability gains often hurt cost. Reviewers see both numbers and make the trade explicitly.
- Safety vs capability. Hardening safety often costs some capability. The trade is acceptable when documented and an unwelcome surprise when hidden.
- Run all three on every PR. The diff vector across all three is the unit of review. One-category green is not enough.
- Explicit thresholds. “Capability cannot regress, safety cannot regress, cost can grow up to 10 percent.” Thresholds make trade-offs explicit instead of hidden.
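The threshold rule quoted above can be written as a small gate over baseline and candidate scores. This is a minimal sketch, assuming each category has been reduced to a single number (pass rates for capability and safety, dollars per run for cost). `EvalScores` and `gate` are illustrative names, and the 10 percent default comes from the example threshold.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EvalScores:
    capability_pass_rate: float  # 0.0 to 1.0
    safety_pass_rate: float      # 0.0 to 1.0
    cost_per_run_usd: float


def gate(
    baseline: EvalScores, candidate: EvalScores, max_cost_growth: float = 0.10
) -> List[str]:
    """Return the reasons a PR should be blocked; an empty list means it passes."""
    failures = []
    if candidate.capability_pass_rate < baseline.capability_pass_rate:
        failures.append("capability regressed")
    if candidate.safety_pass_rate < baseline.safety_pass_rate:
        failures.append("safety regressed")
    cost_limit = baseline.cost_per_run_usd * (1 + max_cost_growth)
    if candidate.cost_per_run_usd > cost_limit:
        failures.append("cost grew beyond the allowed threshold")
    return failures
```

Returning every failed check, rather than stopping at the first, gives the reviewer the whole diff vector in one pass.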
Why three is the right number
Three appears to be the right granularity in practice. The convergence across major platforms suggests the shape, not the names, is what matters.
- Two is not enough. Merging safety into capability hides safety failures behind capability passes; the two categories must stay separate.
- Four is too many. A proposed fourth category is usually a sub-case of one of the first three. Splitting it out adds overhead without new signal.
- Five-plus is padding. Schemes with five or more categories look thorough but rarely add signal. Stick to three; add a fourth only when a class of failures consistently slips through.
- Industry convergence. Most production-grade agent platforms (LangSmith, Braintrust, Humanloop) settle on similar three-category splits. The convergence suggests the shape is right.