Agentic SRE Advanced · By Samson Tanimawo, PhD · Published Jul 16, 2026 · 5 min read

Building an Eval Harness for Incident Triage Agents

An eval harness is half the engineering. This post walks through the schema, the runner, the scoring rubric, and the regression dashboard, with code you can lift directly.

Schema for an eval case

Each case is a YAML document with named fields: id, description, input (alert payload + context), expected (hypothesis, action, confidence range), tags (regression, edge, capability). The schema is enforced; missing fields fail the loader.
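A minimal sketch of one case; the id, payload fields, and values below are illustrative, not a fixed contract:

```yaml
# Hypothetical case; every value here is illustrative.
id: redis-evict-storm-001
description: Eviction storm on the session cache should point at memory pressure.
input:
  alert:
    source: prometheus
    name: RedisEvictedKeysHigh
    severity: page
  context:
    recent_deploys: []
    related_alerts: [RedisMemoryFragmentationHigh]
expected:
  hypothesis: memory-pressure
  action: increase-maxmemory
  confidence_range: [0.6, 0.9]
tags: [regression, capability]
```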

The id is stable across versions. When a case fails, the id is what you grep for in the codebase to understand what it was protecting against. Stable ids = useful history.

Tags drive subset selection. "Run only regression cases" or "run only edge cases" lets you slice the suite by purpose. CI runs the full suite; ad-hoc runs use slices.
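A loader sketch under those assumptions, using PyYAML; `load_cases` and `REQUIRED_FIELDS` are hypothetical names matching the case sketch above:

```python
import glob
import yaml  # PyYAML

REQUIRED_FIELDS = {"id", "description", "input", "expected", "tags"}

def load_cases(pattern, tags=None):
    """Load YAML cases, fail fast on missing fields, optionally slice by tag."""
    cases = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            case = yaml.safe_load(f)
        missing = REQUIRED_FIELDS - set(case)
        if missing:
            raise ValueError(f"{path}: missing fields {sorted(missing)}")
        if tags is None or set(tags) & set(case["tags"]):
            cases.append(case)
    return cases
```

`load_cases("cases/*.yaml")` is the full CI run; passing `tags=["regression"]` gives an ad-hoc slice.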

The runner

Iterate the suite, run each case in parallel up to a worker cap, collect outputs into a structured result file. The runner is 80 lines of Python. Reuse it across agents.

Each case has a hard timeout. Hit the timeout, fail the case. No partial credit, no warnings. Determinism beats charity in eval.
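A minimal runner sketch on `concurrent.futures`, assuming a `run_agent(input) -> dict` callable; the names are hypothetical:

```python
import concurrent.futures as cf

def run_suite(cases, run_agent, max_workers=8, timeout_s=120):
    """Run every case in parallel up to a worker cap; a timeout fails the case outright."""
    results = {}
    with cf.ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_agent, c["input"]): c["id"] for c in cases}
        for fut, case_id in futures.items():
            try:
                # Per-call timeout: waits overlap across cases, so the effective
                # deadline is approximate in this sketch. A thread pool cannot kill
                # a hung task; it stops waiting, which is enough to fail the case.
                results[case_id] = {"output": fut.result(timeout=timeout_s), "error": None}
            except cf.TimeoutError:
                results[case_id] = {"output": None, "error": "timeout"}
            except Exception as exc:
                results[case_id] = {"output": None, "error": repr(exc)}
    return results
```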

Output goes to JSON: per-case result, aggregated scores, deltas vs the last run on main. JSON makes downstream tooling (CI checks, dashboards) trivial.
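A sketch of the output step, assuming each per-case record carries a boolean `passed`; `write_results` and `flipped_cases` are hypothetical names:

```python
import json

def write_results(path, per_case, aggregate):
    """Serialize per-case results plus aggregate scores as one JSON artifact."""
    with open(path, "w") as f:
        json.dump({"cases": per_case, "aggregate": aggregate}, f, indent=2, sort_keys=True)

def flipped_cases(current_path, main_path):
    """Cases whose pass/fail status changed vs the last run on main."""
    with open(current_path) as f:
        current = json.load(f)["cases"]
    with open(main_path) as f:
        baseline = json.load(f)["cases"]
    return {
        case_id: {"before": baseline.get(case_id, {}).get("passed"),
                  "after": rec.get("passed")}
        for case_id, rec in current.items()
        if baseline.get(case_id, {}).get("passed") != rec.get("passed")
    }
```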

Scoring rubric

Three dimensions: hypothesis match (does the agent's primary hypothesis match the expected one), action match (does the proposed action match the expected one), and confidence in range (does the agent's self-reported confidence fall within the expected range).

Scores are binary per dimension: pass or fail. No partial scores; partial scoring tempts subjective grading and weakens the regression signal.

Aggregate to per-case pass-rate (3-of-3, 2-of-3, etc.). The dashboard shows the aggregate; the per-case detail is one click away.
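A sketch of the rubric under the case shape from the YAML example, assuming the agent's output dict exposes `hypothesis`, `action`, and `confidence` keys:

```python
def score_case(expected, output):
    """Binary pass/fail on each rubric dimension; a timed-out case scores 0-of-3."""
    output = output or {}
    lo, hi = expected["confidence_range"]
    return {
        "hypothesis": output.get("hypothesis") == expected["hypothesis"],
        "action": output.get("action") == expected["action"],
        "confidence": lo <= output.get("confidence", -1.0) <= hi,
    }

def aggregate(scored):
    """Per-case dimensions-passed counts plus a suite-level 3-of-3 pass-rate."""
    counts = {case_id: sum(dims.values()) for case_id, dims in scored.items()}
    pass_rate = sum(1 for n in counts.values() if n == 3) / max(len(counts), 1)
    return {"per_case": counts, "pass_rate": pass_rate}
```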

Regression dashboard

The dashboard shows: pass-rate over time per agent, pass-rate per tag, recent regressions (cases that flipped red). One screen, three charts. No more.

Ship the dashboard as a static HTML file generated by CI. No external service, no auth, no fragility. The runner publishes it; the team reads it.
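A deliberately minimal sketch of the generator, assuming the JSON shape from the sketches above; the per-tag and over-time charts hang off the same data and are omitted here:

```python
import html
import json

def render_dashboard(results_path, out_path="dashboard.html"):
    """Turn the JSON results into one static HTML page; CI publishes it as an artifact."""
    with open(results_path) as f:
        data = json.load(f)
    rows = "".join(
        f"<tr><td>{html.escape(case_id)}</td><td>{n}/3</td></tr>"
        for case_id, n in sorted(data["aggregate"]["per_case"].items())
    )
    with open(out_path, "w") as f:
        f.write(
            "<html><body><h1>Eval dashboard</h1>"
            f"<p>Pass-rate: {data['aggregate']['pass_rate']:.0%}</p>"
            f"<table><tr><th>Case</th><th>Dimensions passed</th></tr>{rows}</table>"
            "</body></html>"
        )
```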

Link the dashboard from every PR comment that touches the agent. The reviewer reads the eval delta as part of the review, not as an afterthought.

Scaling the harness

First milestone: 12 cases, the day-one suite. Second milestone: 50 cases, after the first month of production. Third milestone: 200 cases, after a year.

Resist case explosion. Cases that overlap in coverage are noise; collapse them. The valuable suite is small and high-signal.

Periodically prune: any case that has not flipped red in 12 months is providing little signal. Move it to an archive folder; keep the active suite tight.
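A pruning sketch, assuming you fold past result files into a mapping from case id to the timezone-aware timestamp of its last red flip; `stale_cases` is a hypothetical name:

```python
from datetime import datetime, timedelta, timezone

def stale_cases(last_red_flip, months=12):
    """Ids of cases whose last red flip is older than the window: archive candidates."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=30 * months)
    return sorted(
        case_id
        for case_id, when in last_red_flip.items()
        if when is None or when < cutoff
    )
```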

What to do this week

Stand up the harness for one agent: 12 starter cases, the runner, the scoring, the dashboard. Run it on the next PR that touches the agent. The reviewer should be able to read the eval delta in 30 seconds; if they cannot, the dashboard needs work.