Eval-Driven Development for Production Agents
Tests-first works for code. Evals-first works for agents. The workflow that keeps quality compounding instead of regressing with every prompt tweak.
The flip from code to evals
In code, you write a test, then write the code that makes the test pass. In agents, you write an eval case, then write the prompt that makes the eval pass. Same workflow, different artefact.
Eval-first feels slow at the start because writing a good case takes thought. The payoff compounds, because every prompt change is now backed by tests.
The discipline is hard. Engineers want to iterate on the prompt and check by eye. The reflex of "write the eval first" needs to be enforced by code review until it becomes muscle memory.
The eval-first flow
Step one: receive a bug report or a feature request. Step two: write the eval case that captures the desired behaviour. Step three: confirm the case fails against the current prompt (this proves the case is meaningful). Step four: change the prompt until the case passes. Step five: re-run the full suite to confirm no regressions.
Each step is small. Each step is committed separately. The PR is a chain of small, reviewable changes.
Steps three and five are non-negotiable. Skipping step three lets you ship a case that already passes (no protection). Skipping step five lets regressions through.
Smells that signal you are not eval-driven
PR descriptions like "improved triage prompt" with no eval delta attached. Improvement should be measurable; if it is not, you do not actually know it improved.
Eval cases written after the prompt change. The case fits the prompt rather than the prompt fitting the case. This is theatre, not testing.
Eval cases that always pass. Either the cases are too easy or the prompt is over-fit to the cases. Periodically remove easy cases and add harder ones.
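The always-passing smell is easy to detect mechanically. A hedged sketch, assuming you log per-case pass/fail results from each eval run: flag any case that has passed every one of its recent runs as a candidate for removal or hardening.

```python
# Sketch: flag eval cases that passed every recent run. The
# (case_id, passed) history format is an assumption about your logs.
from collections import defaultdict

def flag_always_passing(history, window=20):
    """history: list of (case_id, passed) tuples from recent eval runs."""
    runs = defaultdict(list)
    for case_id, passed in history:
        runs[case_id].append(passed)
    return sorted(
        case_id
        for case_id, results in runs.items()
        if len(results) >= window and all(results[-window:])
    )

history = [("easy_greeting", True)] * 25 + [
    ("hard_refund", True),
    ("hard_refund", False),
]
print(flag_always_passing(history))  # → ['easy_greeting']
```

Cases the report flags are not necessarily bad, but each one deserves a look: is it too easy, or is the prompt over-fit to it?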
Why this compounds
An eval-first agent, after six months, has 80+ cases. The cases protect against regressions every time the prompt is touched. The prompt is robust by construction.
An eval-after agent, after six months, has 12 cases (the day-one suite). Every prompt change is a leap of faith. Quality drifts unpredictably.
The gap widens over time. By year two, the eval-first agent is in a different reliability regime than the eval-after one.
Making it a team norm
Code review checklist item: "is there an eval case attached?" If not, request one. Be ruthless about this; one exception breaks the discipline.
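The checklist item can also be enforced in CI. A hedged sketch, assuming prompts live under prompts/ and eval cases under evals/ (adjust to your repo layout): fail the build when a diff touches a prompt but no eval.

```python
# Hypothetical CI guard: a prompt change must come with an eval change.
# The prompts/ and evals/ directory names are assumptions.
import subprocess

def changed_files(base: str = "origin/main") -> list[str]:
    """Files changed on this branch relative to base (requires git)."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{base}...HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

def eval_guard_ok(files: list[str]) -> bool:
    """False only when a prompt changed with no accompanying eval change."""
    touches_prompt = any(f.startswith("prompts/") for f in files)
    touches_eval = any(f.startswith("evals/") for f in files)
    return not touches_prompt or touches_eval

# In CI, wire it up roughly as:
#   if not eval_guard_ok(changed_files()):
#       sys.exit("Prompt change without an eval case; add one first.")
```

Automating the check keeps the "be ruthless" part from depending on any one reviewer's vigilance.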
Pair the eval-writer and the prompt-writer when possible. Two perspectives produce better cases and prevent the eval from being trivially easy.
Celebrate eval coverage as much as feature shipping. "Agent X added 5 new eval cases this week" is changelog-worthy.