Agentic SRE Advanced By Samson Tanimawo, PhD Published Jul 5, 2026 5 min read

The Top 7 Failure Modes of SRE Agents in Production

Hallucinated tool output. Loop spins. Cost bombs. Stale context. Silent fallback. Unbounded scope creep. Wrong-environment actions. The seven, with detection patterns for each.

Failure 1: hallucinated tool output

The agent claims it ran a tool and got a result that the tool never actually returned. The model fabricated the output.

Detection: log every tool call and every tool response. Compare what the model claims with what the tool actually emitted. Mismatches are hallucinations.

Prevention: structured tool I/O so the model cannot easily fabricate. Validation that rejects malformed responses. Eval cases that test for fabrication explicitly.

Failure 2: spinning loop

The agent enters a loop: same hypothesis, same tool calls, same outcomes, repeat. Burns budget without progress.

Detection: hash recent agent state; if the hash matches the previous step, the agent is looping. Cheap to compute; effective in practice.

Prevention: hard iteration cap. Aggressive escalation on repeats. Loops are not bugs to debug live; they are bugs to halt and triage offline.

Failure 3: cost bomb

Stuck agent calls a tool 200 times, each tool call queries 1MB of metric data, prompt grows past 100k tokens. Single run costs $50.

Detection: per-run cost cap. The cap fires before the bomb finishes; the run stops; the human investigates.

Prevention: token budgets, tool-call rate limits, prompt-size caps. Multiple guardrails because no single one catches every cost path.

Failure 4: stale context

The agent acts on data that is several minutes old. The world has moved on; the action is now wrong.

Detection: timestamp every piece of context. The agent should refuse to act on context older than a threshold (typically 5 minutes for triage).

Prevention: refresh context immediately before any action. Cheaper than acting on stale data and discovering it post-hoc.

Failure 5: scope creep

The agent decides to do more than it was asked. Triage turns into remediation; remediation turns into restart-everything.

Detection: log the agent's stated scope at the start of each run. Compare to actions taken. Divergence is scope creep.

Prevention: bounded action sets per role. Approval gates for actions outside the agent's stated scope.

Failure 6 & 7: silent fallback and wrong-environment actions

Silent fallback: the agent's primary tool fails, it falls back to a less-capable tool, and continues without flagging the degradation.

Wrong-environment: the agent runs in production thinking it is staging (or vice versa). Actions land in the wrong place.

Prevention: surface every fallback in the logs and the dashboard. Pre-flight check that asserts the environment before any action.

What to do this week

Pick the failure mode you have actually seen in production. Add detection for it (logging + dashboard). Add prevention for it (guardrail in the loop). Run the eval suite to confirm no regression. Move to the next failure mode after this one is locked in.