Agentic SRE Advanced · By Samson Tanimawo, PhD · Published Jul 15, 2026 · 5 min read

Replay-Driven Evals from Past Incidents

Your last 50 incidents are your best eval suite. This piece walks through the pipeline that anonymises them, replays them against new agent versions, and surfaces regressions before deploy.

Source incidents from your own history

The last 50 incidents are the best eval suite you can build. They are real, they cover your actual surface area, and they include the long tail your agent will face in production.

Pull from your incident management system: IDs, timelines, impacted services, root causes, remediation actions. Filter for ones the agent should have handled or assisted with.
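A minimal sketch of the selection step. The `Incident` record and the `agent_relevant` flag are hypothetical names, not any particular incident-management API; in practice you would populate these from your system's export.

```python
from dataclasses import dataclass

@dataclass
class Incident:
    # Fields mirroring what most incident-management exports carry.
    incident_id: str
    impacted_service: str
    root_cause: str
    remediation: str
    agent_relevant: bool  # should the agent have handled or assisted?

def select_eval_candidates(incidents: list[Incident]) -> list[Incident]:
    """Filter for incidents the agent should have handled or assisted with."""
    return [i for i in incidents if i.agent_relevant]
```

The flag can be set by hand during triage or derived from a rule such as "paged the on-call rotation the agent covers".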

Anonymise. Customer names, internal hostnames, and secrets are all stripped before the data lands in the eval pipeline. The redaction step is non-negotiable.
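One way to sketch the redaction pass, assuming a simple regex table. The patterns below are placeholders; a real pipeline would extend them with your organisation's hostname conventions and customer-name list, and would fail closed on anything unrecognised.

```python
import re

# Hypothetical patterns; extend and audit before trusting in production.
REDACTIONS = [
    (re.compile(r"\b[\w.-]+\.internal\.example\.com\b"), "<HOSTNAME>"),
    (re.compile(r"\b(?:AKIA|ASIA)[0-9A-Z]{16}\b"), "<AWS_KEY>"),
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
]

def redact(text: str) -> str:
    """Apply every redaction pattern in order; placeholders replace matches."""
    for pattern, placeholder in REDACTIONS:
        text = pattern.sub(placeholder, text)
    return text
```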

Convert each incident into a case

Input: the alert payload that fired, the metrics window for the affected service, the recent deploys at the time. Capture the inputs the agent would have seen at the moment of the page.

Expected output: the cause the human team identified, the action they took, the time they spent. The agent's score is how close it gets to each.

Edge cases get tagged. "Misdiagnosed initially" or "required two-team coordination" or "had non-obvious cause" are tags worth carrying forward.
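The case structure above can be captured in a small schema. The field names here are illustrative, not a standard; the point is that inputs, expected outputs, and tags travel together as one record.

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    # Inputs: what the agent would have seen at the moment of the page.
    alert_payload: dict
    metrics_window: dict
    recent_deploys: list
    # Expected output: what the human team found, did, and spent.
    expected_cause: str
    expected_action: str
    human_minutes: float
    # Edge-case tags worth carrying forward, e.g. "misdiagnosed-initially".
    tags: list = field(default_factory=list)
```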

The replay loop

Run each case against the agent. Capture the full trace: prompt, response, tool calls, final hypothesis. Compare to expected.

Differences are signal, not failure. The agent might propose a different valid cause; mark it as alternative-correct, not wrong. The eval scorer needs to handle this.
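A scorer that honours the three-way distinction might look like this sketch. The accepted-alternatives set is an assumption: in practice it is populated during human review of the trace, not known ahead of time.

```python
def score_hypothesis(expected: str, hypothesis: str, alternatives=()) -> str:
    """Return 'correct', 'alternative-correct', or 'wrong'.

    A hypothesis that differs from the recorded cause but appears in the
    human-reviewed alternatives set is signal, not failure.
    """
    if hypothesis == expected:
        return "correct"
    if hypothesis in alternatives:
        return "alternative-correct"
    return "wrong"
```

Exact string matching is the crudest version; a real scorer would normalise or use a judge model, but the three-way outcome is the part that matters.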

Cases the agent fails repeatedly become flagged for prompt review. The prompt is missing something the human knew; figure out what and add it.

Regression detection across versions

Run the replay suite on every prompt change. The CI report shows the cases that pass on v_old but fail on v_new. These are regressions; they block merge by default.
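The merge gate reduces to a set comparison. A minimal sketch, assuming pass/fail results keyed by case ID; the function names are illustrative:

```python
def find_regressions(results_old: dict, results_new: dict) -> list:
    """Case IDs that passed on the old version but fail on the new one.

    A case missing from the new run counts as failed.
    """
    return sorted(
        case_id
        for case_id, passed in results_old.items()
        if passed and not results_new.get(case_id, False)
    )

def gate_merge(results_old, results_new, overrides=frozenset()):
    """Block merge unless every regression carries a logged override."""
    unexplained = [
        c for c in find_regressions(results_old, results_new)
        if c not in overrides
    ]
    return len(unexplained) == 0, unexplained
```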

An override is allowed but must be deliberate. "This regression is acceptable because the new behaviour is better in the cases that matter." The override is logged.

Track the regression rate over time. Healthy: low and stable. Unhealthy: trending up. The latter is a sign the prompt or the model is drifting.
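One cheap way to flag the unhealthy trend: compare the mean regression rate of the most recent runs against the window before it. The window size and the 25% threshold are arbitrary assumptions to tune, not recommendations.

```python
def regression_trend(rates: list, window: int = 5) -> str:
    """Classify a series of per-run regression rates.

    Compares the mean of the most recent `window` runs to the mean of
    the `window` runs before that; a >25% jump counts as trending up.
    """
    if len(rates) < 2 * window:
        return "insufficient-data"
    recent = sum(rates[-window:]) / window
    prior = sum(rates[-2 * window:-window]) / window
    return "trending-up" if recent > prior * 1.25 else "stable"
```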

Treating past incidents responsibly

The team that lived through the incident should review the case before it lands in the suite. They might disagree with the framing, the redaction, or the inclusion. Their judgement wins.

Cases involving personnel decisions or root-cause-as-individual stay out. Eval cases are technical regressions, not blame archives.

If a case is contentious, drop it. The corpus has 50+ cases; one less rarely matters. The trust of the team always matters.

What to do this week

Pull your last 30 incidents. With the team's blessing, convert 10 into eval cases. Run them against the current agent. The cases the agent handles correctly become regression sentinels. The cases the agent fails become the next prompt-engineering work.