Evaluating an SRE Agent: 12 Test Cases You Need on Day One
The 12 incident replay tests every SRE agent should pass before it touches production. Each test, the failure mode it catches, and how to score it.
Tests 1-3: alert replay accuracy
Pull three real alerts the agent should know how to handle. For each, check that the agent identifies the affected service correctly, names a plausible cause, and proposes an investigation path that matches what the human on-call did. Three cases are enough on day one; you will add more as you find them.
Score each case yes / partial / no. The bar to ship is yes on at least two cases and no worse than partial on the third. A flat-out no on day one is a sign the agent has the wrong context, not that the model is wrong.
Rerun the cases on every prompt change. The whole suite should run in under two minutes. If it takes longer, the harness is too heavy; trim it.
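To make this concrete, here is a minimal sketch of a replay harness. Everything in it is an assumption about your setup: run_agent is a stand-in for your agent's entrypoint, the case fields mirror the three checks above, and the keyword match is a deliberately crude proxy for human review.

```python
# Minimal replay harness sketch. `run_agent`, the result fields, and the
# keyword heuristic are assumptions, not a prescribed interface.
from dataclasses import dataclass

@dataclass
class ReplayCase:
    alert: dict            # the real alert payload, as the agent receives it
    service: str           # the service the human on-call identified
    cause_keywords: list   # terms a plausible cause should mention
    first_step: str        # the investigation step the human actually took

def score_case(case: ReplayCase, run_agent) -> str:
    """Score one replayed alert as 'yes', 'partial', or 'no'."""
    result = run_agent(case.alert)  # assumed to return a dict of findings
    checks = [
        result.get("service") == case.service,
        any(k in result.get("cause", "").lower() for k in case.cause_keywords),
        case.first_step in result.get("plan", ""),
    ]
    hits = sum(checks)
    return "yes" if hits == 3 else "partial" if hits >= 1 else "no"

def ship_bar(scores: list) -> bool:
    """The day-one bar: yes on at least two cases, no flat-out no."""
    return scores.count("yes") >= 2 and "no" not in scores
```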
Tests 4-6: misclassification cases
Pick three alerts the agent should NOT act on: expired test alerts, alerts for services the agent does not own, alerts that duplicate an existing incident. The agent should refuse, explain, and exit cleanly.
Refusal is a feature; test it explicitly. An agent that confidently triages an alert it should have skipped is dangerous. The misclassification cases are how you catch over-eager defaults.
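A single misclassification case might look like the sketch below. The structured result shape (action, reason, exit_code) is an assumed convention; the point is that the refusal itself is asserted, not just the absence of a triage.

```python
# One misclassification case, sketched. The result shape ("action",
# "reason", "exit_code") is an assumed convention, not a real API.
def test_refuses_expired_test_alert(run_agent):
    alert = {"name": "test-alert-expired", "status": "expired",
             "service": "checkout"}
    result = run_agent(alert)
    # Refuse, explain, exit cleanly -- all three are asserted.
    assert result["action"] == "refuse"
    assert result["reason"], "a refusal without an explanation is a fail"
    assert result.get("exit_code", 0) == 0
```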
Add a misclassification case after every false positive in production. The eval suite is your memory; do not rely on the team's memory.
Tests 7-9: scale and edge cases
An alert with a 5 MB payload. An alert with no metric data attached. An alert whose service name contains non-ASCII UTF-8 characters. The agent should handle each gracefully, either processing it or refusing with a clear reason. A crash is a fail.
Edge cases are where you find input-handling bugs. The model itself rarely fails on these; the surrounding code does. Test the wrapper, not just the model.
Keep the cases narrow. "Empty payload" and "corrupt JSON" are different cases. Each isolates one input variant.
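Parametrizing keeps each case narrow. A sketch using pytest, assuming a run_agent_raw fixture that feeds the raw payload through the wrapper before any parsing; the payloads themselves are illustrative.

```python
# Narrow edge cases, one input variant each. Assumes a `run_agent_raw`
# fixture that feeds the raw payload through the wrapper, pre-parsing.
import json
import pytest

EDGE_CASES = [
    ("5mb_payload",   json.dumps({"metrics": "x" * 5_000_000})),
    ("no_metrics",    json.dumps({"service": "checkout", "metrics": None})),
    ("utf8_service",  json.dumps({"service": "céntimo-api"})),
    ("empty_payload", ""),
    ("corrupt_json",  '{"service": "checkout",'),
]

@pytest.mark.parametrize("name,raw", EDGE_CASES, ids=[c[0] for c in EDGE_CASES])
def test_edge_case(name, raw, run_agent_raw):
    # Processing or refusing with a clear reason both pass; a crash fails.
    try:
        result = run_agent_raw(raw)
    except Exception as exc:
        pytest.fail(f"{name}: agent crashed instead of refusing ({exc})")
    if result["action"] == "refuse":
        assert result["reason"], f"{name}: refusal with no reason"
```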
Tests 10-11: regression cases
Two cases that previous prompt versions handled correctly but a candidate version got wrong. These are your regression sentinels. They should never go red without a deliberate decision.
When a regression case goes red, the prompt change is rejected by default. Either revert, or explicitly accept the regression with a written justification.
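The reject-by-default rule is mechanical enough to enforce in the harness. A sketch, where the accepted-regressions mapping and its format are assumptions about your repo's conventions:

```python
# Reject-by-default gate, sketched. The accepted-regressions mapping
# (case id -> written justification) is an assumed repo convention.
import sys

def gate(results: dict, accepted: dict) -> int:
    """results: case id -> passed?  accepted: case id -> justification."""
    for case_id, passed in results.items():
        if passed:
            continue
        if accepted.get(case_id, "").strip():
            print(f"ACCEPTED regression {case_id}: {accepted[case_id]}")
            continue
        print(f"REJECTED: {case_id} went red with no written justification. "
              "Revert, or justify the regression explicitly.")
        return 1
    return 0

if __name__ == "__main__":
    # Illustrative data; in CI these come from the harness output.
    sys.exit(gate({"regr-001": True, "regr-002": False}, {}))
```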
Add a regression case every time you fix a bug. Today's fix is tomorrow's regression test. Compounding tests is how the suite stays useful.
Test 12: cost and latency budget
One run with strict budgets: under $0.05 in token cost, under 8 seconds wall-clock. The agent should complete within both. If it cannot, the prompt or the loop is over-spec'd.
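The budget numbers drop straight into a test. A sketch, assuming the harness records token cost on the result; both fixtures are stand-ins:

```python
# Budget test sketch. Assumes the harness records token cost on the
# result; `run_agent` and `representative_alert` are stand-in fixtures.
import time

MAX_COST_USD = 0.05
MAX_WALL_SECONDS = 8.0

def test_within_budget(run_agent, representative_alert):
    start = time.monotonic()
    result = run_agent(representative_alert)
    elapsed = time.monotonic() - start
    assert elapsed < MAX_WALL_SECONDS, f"latency budget blown: {elapsed:.1f}s"
    assert result["cost_usd"] < MAX_COST_USD, \
        f"cost budget blown: ${result['cost_usd']:.3f}"
```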
Budget tests catch slow creep. Each prompt addition costs a few tokens; the cumulative drift is invisible until the budget test catches it. Run it on every PR.
Tighten budgets quarterly. As models get cheaper and faster, the budget should shrink. A test that passes too easily is providing no signal.
What to do this week
Set up these 12 tests for one of your agents. They should run in CI on every PR. The first time the suite catches a regression you would have shipped is the moment you understand why this matters.
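One way to wire the suite into CI: a single entrypoint that runs every case and exits nonzero on any failure, so the PR check is one command. The registry is a sketch; populate it with the cases above.

```python
# Single CI entrypoint, sketched. The registry is an assumption: fill it
# with (name, zero-arg callable) pairs for the 12 cases above.
import sys
import time
from typing import Callable

ALL_TWELVE_CASES: list[tuple[str, Callable[[], None]]] = []

def main() -> int:
    start = time.monotonic()
    failures = []
    for name, case in ALL_TWELVE_CASES:
        try:
            case()  # each case raises AssertionError on failure
        except AssertionError as exc:
            failures.append((name, exc))
    if time.monotonic() - start > 120:
        print("WARNING: suite exceeded two minutes; trim the harness")
    for name, exc in failures:
        print(f"FAIL {name}: {exc}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```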