Agent-Caused Incidents: How to Run the Postmortem
When the agent caused the incident, the postmortem template needs new sections. This guide covers those template additions, the questions to ask, the typical contributing factors, and the action items that actually help.
Template additions
Section: agent decision history. The full chain of decisions the agent made, with the inputs at each step.
Section: guardrail review. Which guardrails were active, which fired, and which did not fire but should have.
Section: prompt or model lineage. Which prompt version and model version were in use, and was either changed recently?
Section: eval coverage. Was this scenario in the eval suite? If yes, why did it pass eval but fail production? If no, why was it missed?
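The four sections above can be captured as a structured record so that none of them is silently skipped. The following is a minimal sketch, not a prescribed schema: every class and field name here (AgentPostmortem, DecisionStep, GuardrailStatus, and so on) is a hypothetical name chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class DecisionStep:
    """One decision the agent made, with the inputs it saw at that step."""
    step: int
    inputs: dict[str, Any]
    decision: str


@dataclass
class GuardrailStatus:
    """Guardrail review entry: was it active, did it fire, should it have?"""
    name: str
    active: bool
    fired: bool
    should_have_fired: bool


@dataclass
class AgentPostmortem:
    """Hypothetical container for the agent-specific postmortem sections."""
    decision_history: list[DecisionStep] = field(default_factory=list)
    guardrails: list[GuardrailStatus] = field(default_factory=list)
    prompt_version: str = ""
    model_version: str = ""
    lineage_recently_changed: bool = False
    # None means "not yet answered"; True/False is the actual finding.
    scenario_in_eval_suite: Optional[bool] = None
    eval_explanation: str = ""

    def missed_guardrails(self) -> list[str]:
        """Guardrails that did not fire but should have."""
        return [g.name for g in self.guardrails
                if g.should_have_fired and not g.fired]

    def unfilled_sections(self) -> list[str]:
        """Names of sections that still have no content."""
        missing = []
        if not self.decision_history:
            missing.append("agent decision history")
        if not self.guardrails:
            missing.append("guardrail review")
        if not (self.prompt_version and self.model_version):
            missing.append("prompt or model lineage")
        if self.scenario_in_eval_suite is None:
            missing.append("eval coverage")
        return missing
```

Calling unfilled_sections() before the review meeting gives a quick list of what still needs to be gathered, and missed_guardrails() feeds directly into the guardrail discussion.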
Questions to ask
Was the agent's input correct? Bad input is the most common cause: garbage in, garbage out.
Was the agent's reasoning correct given the input? Sometimes the inputs were right and the reasoning still went wrong.
Was the action correct given the reasoning? Sometimes reasoning is right but the action implementation has a bug.
Why did the guardrails not catch this? The most important question.
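The first three questions build on each other (input, then reasoning given the input, then action given the reasoning), so one way to use them is to walk them in order and stop at the earliest layer that failed. The sketch below assumes the reviewers have already answered each question with a yes or no; the Layer enum and first_failing_layer function are hypothetical names for illustration.

```python
from enum import Enum
from typing import Optional


class Layer(Enum):
    INPUT = "input"
    REASONING = "reasoning"
    ACTION = "action"


def first_failing_layer(input_ok: bool,
                        reasoning_ok: bool,
                        action_ok: bool) -> Optional[Layer]:
    """Return the earliest layer that went wrong, or None if all were fine.

    Later answers are only meaningful when the earlier layers were correct:
    reasoning is judged given the input, and the action given the reasoning.
    """
    if not input_ok:
        return Layer.INPUT
    if not reasoning_ok:
        return Layer.REASONING
    if not action_ok:
        return Layer.ACTION
    return None
```

Whatever this returns, the guardrail question still applies: a failure at any layer reached production only because no guardrail stopped it.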
Common contributing factors
Stale context: the agent acted on data that was no longer accurate.
Tool output misinterpretation: the agent read the tool output incorrectly.
Eval gap: the scenario was not in the test suite, so the agent's behaviour on it had never been checked.
Guardrail misconfiguration: the guardrail existed but was tuned to allow this case through.
Prompt regression: a recent prompt update changed behaviour in a way the eval missed.
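Of the factors above, stale context is the easiest to test for mechanically during the review: compare when the data was fetched with when the agent acted on it. This is a minimal sketch under the assumption that both timestamps were logged; is_stale and the max_age threshold are hypothetical, and the right threshold depends on how fast the underlying data changes.

```python
from datetime import datetime, timedelta, timezone


def is_stale(fetched_at: datetime,
             acted_at: datetime,
             max_age: timedelta) -> bool:
    """True if the agent acted on data older than max_age.

    Both timestamps should be timezone-aware (or both naive) so the
    subtraction is well defined.
    """
    return acted_at - fetched_at > max_age


# Illustrative values only, not taken from a real incident.
fetched = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
acted = fetched + timedelta(minutes=10)
stale = is_stale(fetched, acted, max_age=timedelta(minutes=5))
```

The same check can later be moved into the agent itself as a guardrail that refuses to act when the context is too old.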
Action items that actually help
Add the failure case to the eval suite. Cheapest, highest-leverage action.
Tighten the relevant guardrail. Specific to the failure mode.
Add observability that would have surfaced the issue earlier. In many incidents the agent keeps acting for too long before anyone notices.
Document the failure pattern for future agent designs.
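Adding the failure case to the eval suite can be as small as recording the input from the incident together with the action that must not be repeated. The sketch below assumes the agent can be called as a function from input text to an action name; EvalCase, run_eval, and the example values are all hypothetical and stand in for whatever eval harness the team already uses.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """A regression case derived from an incident."""
    name: str
    agent_input: str
    forbidden_action: str  # the action the agent took during the incident


def run_eval(agent: Callable[[str], str],
             cases: list[EvalCase]) -> list[str]:
    """Return the names of cases where the agent repeats the forbidden action."""
    return [case.name for case in cases
            if agent(case.agent_input) == case.forbidden_action]


# Made-up example case; a real one would copy the input from the incident.
incident_case = EvalCase(
    name="regression-example",
    agent_input="clean up old log files",
    forbidden_action="delete_all",
)
```

An empty list from run_eval means no known incident is being repeated; any name in the list points straight back to the postmortem that produced the case.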
Blameless toward the agent
The agent is a tool; agents do not get blamed. The blame, if any, is on the design that allowed the failure.
Avoid "the agent was wrong" framing. Prefer "the agent's design did not handle this case." Same fact, different action implications.
The team built the agent. The team owns the failure. The team learns and improves.