The Golden Run Pattern for Agent Eval Suites
Ten incidents with hand-validated correct outputs become your golden runs. The pattern, the maintenance burden, and the protection it gives against silent regressions.
What a golden run is
A golden run is an eval case where the expected output is fully validated by humans, recorded in detail, and treated as authoritative. The agent's output is compared to the golden run; deviations are bugs.
Golden runs are expensive to create (each one represents hours of human review). They are cheap to maintain (replay is automatic; only periodic re-validation takes human time). They are invaluable for regression detection.
Aim for 10 golden runs per agent. Below that, coverage is thin. Above that, maintenance cost grows faster than benefit.
Creating a golden run
Pick a real incident. Have the team's most senior on-call walk through it: what they checked, what they concluded, what they recommended. Capture every step.
Convert the walkthrough into structured expected output: hypotheses (with confidence), actions (with reasoning), tools called (with args). Each field is fully specified.
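The structured expected output described above can be sketched as a small schema. This is a minimal illustration, not a standard format; the class and field names (Hypothesis, Action, ToolCall, GoldenRun) are my own.

```python
from dataclasses import dataclass

# Illustrative schema for a golden run's expected output.
# All names here are hypothetical, not a standard format.

@dataclass
class Hypothesis:
    statement: str
    confidence: float  # 0.0-1.0, as judged by the senior on-call

@dataclass
class Action:
    description: str
    reasoning: str

@dataclass
class ToolCall:
    tool: str
    args: dict

@dataclass
class GoldenRun:
    incident_id: str
    hypotheses: list[Hypothesis]
    actions: list[Action]
    tool_calls: list[ToolCall]
    alternative_correct: bool = False  # set when another answer is also valid

# Example golden run built from a (fictional) incident walkthrough
golden = GoldenRun(
    incident_id="INC-1042",
    hypotheses=[Hypothesis("Connection pool exhaustion", 0.8)],
    actions=[Action("Restart the pool", "Clears leaked connections")],
    tool_calls=[ToolCall("query_metrics", {"metric": "db.pool.active"})],
)
```

Every field is explicit; nothing in the expected output is left to the grader's judgment at eval time.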
Validate by replaying the case against the current production agent. Where the agent disagrees, decide: the agent is wrong (golden stands), the agent is right (golden updates), or both are reasonable (tag the golden as alternative-correct).
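The three-way decision during validation can be made mechanical once a human has assigned a verdict to each disagreement. A sketch, assuming golden and agent outputs are flat dicts of fields; the Verdict names and triage helper are hypothetical:

```python
from enum import Enum

# Hypothetical triage helper for replay disagreements. The three
# Verdict values mirror the three validation outcomes in the text.

class Verdict(Enum):
    GOLDEN_STANDS = "agent wrong; golden is authoritative"
    GOLDEN_UPDATES = "agent right; revise the golden"
    ALTERNATIVE_CORRECT = "both reasonable; tag as alternative-correct"

def diff_fields(golden: dict, agent: dict) -> list[str]:
    """Return the fields where the agent's output deviates from the golden."""
    return [k for k in golden if agent.get(k) != golden[k]]

def triage(golden: dict, agent: dict, verdicts: dict[str, Verdict]) -> dict:
    """Apply a human verdict to each deviating field; return the resolved golden."""
    resolved = dict(golden)
    tags = []
    for field in diff_fields(golden, agent):
        verdict = verdicts[field]
        if verdict is Verdict.GOLDEN_UPDATES:
            resolved[field] = agent[field]       # agent was right
        elif verdict is Verdict.ALTERNATIVE_CORRECT:
            tags.append(field)                   # golden stands, but tagged
        # GOLDEN_STANDS: keep the golden value unchanged
    resolved["alternative_correct_fields"] = tags
    return resolved
```

The point of the structure is that the human makes each judgment once, during validation, and the suite replays it for free afterwards.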
Maintaining over time
Re-validate annually. The team's understanding of the system changes; what was correct a year ago might not be correct now.
When a golden run starts firing red consistently, do not just change the golden. Investigate first. Either the system changed (update golden), the prompt regressed (fix prompt), or the case has lost relevance (retire).
Track golden run age (time since last validation). Cases older than 18 months are suspect. Cases older than 3 years are usually retired.
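The age policy above is simple enough to encode directly. A minimal sketch; the thresholds come from the text, while the status labels ("fresh", "suspect", "retire") are my own naming:

```python
from datetime import date, timedelta

# Age thresholds from the maintenance policy (approximated in days)
SUSPECT_AFTER = timedelta(days=18 * 30)   # ~18 months
RETIRE_AFTER = timedelta(days=3 * 365)    # ~3 years

def golden_status(last_validated: date, today: date) -> str:
    """Classify a golden run by time since its last validation."""
    age = today - last_validated
    if age > RETIRE_AFTER:
        return "retire"
    if age > SUSPECT_AFTER:
        return "suspect"
    return "fresh"
```

Run this over the suite quarterly and the re-validation queue writes itself.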
What golden runs protect against
Silent regressions. The agent still passes 95% of regular eval cases but fails the golden run. The golden run is the high-bar case that catches subtle regressions normal cases miss.
Model swaps. Switching from one model to another usually moves all metrics; the golden run tells you whether the swap is acceptable on the highest-stakes cases.
Prompt simplification. Refactoring the prompt should not change behaviour. The golden run confirms whether it did.
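All three protections reduce to the same gate: an aggregate pass rate cannot excuse a golden failure. A sketch of such a release gate; the function name and 95% threshold are illustrative:

```python
# Hypothetical release gate: golden runs veto the release regardless
# of the aggregate pass rate, which can hide a high-stakes regression.

def release_gate(regular_pass_rate: float, golden_failures: list[str],
                 threshold: float = 0.95) -> tuple[bool, str]:
    """Return (ok, reason). Golden failures block even a high pass rate."""
    if golden_failures:
        return False, f"golden run(s) failed: {', '.join(golden_failures)}"
    if regular_pass_rate < threshold:
        return False, f"pass rate {regular_pass_rate:.0%} below {threshold:.0%}"
    return True, "ok"
```

Wire this into CI so a model swap or prompt refactor cannot ship on aggregate metrics alone.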
Budget for golden runs
Each golden run costs roughly 3-5 senior engineer-hours upfront. 10 golden runs is 30-50 hours, spread across a team.
Maintenance is 1 hour per golden run per quarter on average: ten hours per quarter for the suite. This is sustainable; do not skimp.
If the budget is too high, you have too many golden runs. Cut the redundant ones. Quality over quantity applies to golden runs more than to any other eval category.