Agentic SRE Advanced · By Samson Tanimawo, PhD · Published Jul 13, 2026

Regression Detection for Agent Behavior Changes

Every prompt change is a deploy. You need a diff that tells you whether the new agent is better, worse, or differently broken than the last one.

The diff that matters

Compare two runs of the eval suite: one on the current main, one on the candidate. Cases that flip from pass to fail are regressions. Cases that flip from fail to pass are improvements. Cases that flip direction between reruns need investigation.

The diff is computed at the case level, not the aggregate. A 5% aggregate drop could be five cases going red, or ten going red and five going green; you need to know which cases moved.
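A case-level diff is a small computation. The sketch below assumes each run is reduced to a per-case pass/fail map; the `EvalDiff` shape and case IDs are illustrative, not from any specific eval framework.

```python
# Sketch: case-level diff between two eval runs (main vs. candidate).
from dataclasses import dataclass, field

@dataclass
class EvalDiff:
    regressions: list = field(default_factory=list)   # pass on main, fail on candidate
    improvements: list = field(default_factory=list)  # fail on main, pass on candidate

def diff_runs(main: dict, candidate: dict) -> EvalDiff:
    """Compare per-case pass/fail maps; cases absent from the candidate are skipped."""
    d = EvalDiff()
    for case_id, main_pass in main.items():
        cand_pass = candidate.get(case_id)
        if cand_pass is None:
            continue  # case removed in the candidate; handle that separately
        if main_pass and not cand_pass:
            d.regressions.append(case_id)
        elif not main_pass and cand_pass:
            d.improvements.append(case_id)
    return d

d = diff_runs(
    {"case-1": True, "case-2": True, "case-3": False},
    {"case-1": True, "case-2": False, "case-3": True},
)
# d.regressions == ["case-2"], d.improvements == ["case-3"]
```

Note the diff never collapses to a single number: both lists survive to the PR surface.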

Surface the diff on the PR. The reviewer sees the regression list inline, with the case description and the agent's wrong output. The reviewer can decide to accept, reject, or request a fix.

Strict by default

Default policy: any case that flips red blocks the merge. The contributor either fixes the prompt, removes the case, or writes a justification that an approver signs off on.
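The strict-by-default policy can be expressed as a small gate function. This is a minimal sketch: the override record shape (a reason plus an approver sign-off, keyed by case ID) is an assumption, not a real CI API.

```python
# Sketch: block the merge unless every regressed case has a signed-off override.
def gate(regressions: list[str], overrides: dict[str, dict]) -> tuple[bool, list[str]]:
    """Return (merge_allowed, cases still blocking)."""
    unjustified = [
        case for case in regressions
        if case not in overrides
        or not overrides[case].get("reason")
        or not overrides[case].get("approver")
    ]
    return (len(unjustified) == 0, unjustified)

ok, blocking = gate(
    ["case-12"],
    {"case-12": {"reason": "fixes prod bug from last week", "approver": "sam"}},
)
# ok is True; with an empty overrides dict, case-12 would block the merge
```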

Strict mode trains discipline. Without it, regressions accumulate quietly. Each one seems small; the cumulative drift is large.

The escape valve is an override with a written reason. Overrides are logged and reviewed monthly. Repeat overrides on the same case are a signal to fix the case or the prompt for real.
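The monthly review reduces to counting overrides per case in the log. A sketch, assuming an append-only log of records with a `case_id` field (the record shape and threshold are assumptions):

```python
# Sketch: flag cases with repeat overrides for a real fix.
from collections import Counter

def repeat_overrides(log: list[dict], threshold: int = 2) -> list[str]:
    """Cases overridden `threshold`+ times: fix the case or the prompt."""
    counts = Counter(record["case_id"] for record in log)
    return [case for case, n in counts.items() if n >= threshold]

log = [{"case_id": "case-12"}, {"case_id": "case-7"}, {"case_id": "case-12"}]
# repeat_overrides(log) == ["case-12"]
```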

Distinguish noise from regression

Stochastic models produce occasional wrong answers. A case that fails 10% of the time is noisy, not regressed. Run the eval N times per case and decide by vote.

N=3 catches most noise; the 3x eval cost is cheap next to shipping a regression. Unanimous 3-of-3 results are trusted outright; cases that split 2-1 either way get a confirming re-run before the majority verdict counts in the diff.
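One way to combine the N runs into a verdict, as a sketch: a majority vote per case, with any split result also flagged for a confirming re-run.

```python
# Sketch: per-case verdict from N boolean run outcomes (True = pass).
def verdict(outcomes: list[bool]) -> tuple[str, bool]:
    """Return (majority verdict, needs_rerun flag for split results)."""
    passes = sum(outcomes)
    fails = len(outcomes) - passes
    result = "pass" if passes > fails else "fail"
    split = 0 < passes < len(outcomes)  # neither unanimous pass nor unanimous fail
    return result, split

# verdict([True, True, True])  == ("pass", False)
# verdict([True, False, True]) == ("pass", True)   # 2-1 split: re-run to confirm
```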

Track noise per case. If the noise rate is consistently above 20%, the case is poorly designed; either the input is ambiguous or the expected output is too rigid. Fix the case.

Multi-dimensional regressions

A prompt change might improve hypothesis quality and degrade cost. A change might fix one regression but cause two new ones. The eval scorer needs to track each dimension separately.

Display the change as a vector: hypothesis +3 cases, action -1 case, cost +12%, latency -200ms. The reviewer reads the vector, not a single number.

Aggregate scores can lie; vector diffs cannot. Pay the small cost of multi-dimensional reporting. It catches the trades you would otherwise miss.
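The vector diff is just a per-dimension subtraction, never summed. A sketch; the dimension names mirror the example above and are assumptions about what the scorer reports.

```python
# Sketch: multi-dimensional regression vector between two runs.
def diff_vector(main: dict, cand: dict) -> dict:
    """Per-dimension delta (candidate minus main); never collapsed to one score."""
    return {dim: cand[dim] - main[dim] for dim in main}

def render(vec: dict) -> str:
    """Signed, human-readable summary for the PR surface."""
    return ", ".join(f"{dim} {delta:+g}" for dim, delta in vec.items())

v = diff_vector(
    {"hypothesis_cases": 40, "action_cases": 31, "cost_usd": 1.00, "latency_ms": 1200},
    {"hypothesis_cases": 43, "action_cases": 30, "cost_usd": 1.12, "latency_ms": 1000},
)
# render(v) == "hypothesis_cases +3, action_cases -1, cost_usd +0.12, latency_ms -200"
```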

When to ship anyway

Sometimes a regression is acceptable. The new prompt fixes a bug worth more than the regression costs. The override is the right answer here.

Document the trade. "Accepted regression on case-12 because the new prompt fixes the production behaviour observed last week." The audit trail justifies the call.

Schedule a follow-up: "return to case-12 in two weeks and try to recover the lost behaviour." Otherwise the regression becomes permanent by neglect.