When SRE Agents Hallucinate Tool Output (and How to Detect It)
Agents sometimes invent tool results that the tool never returned. The detection harness, the most common provocations, and the prompt-level fixes that work.
How it happens
Hallucinated tool output happens because the model is trained to produce confident answers. Four structural patterns make it worse, and each has a fix.
- Confidence bias. When a tool returns nothing, the model often fills the gap rather than reporting the empty result.
- Vague tool descriptions. “Returns metric data” invites hallucination; “returns a JSON object with fields x, y, z, or null on no data” is specific.
- Long contexts. The model loses track of what was actually fetched and what it inferred. The longer the context, the worse the drift.
- Implicit conversion. When the model paraphrases tool output rather than quoting, it has already inserted an interpretation layer where hallucinations grow.
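The contrast in the tool-description bullet can be made concrete. The following is a minimal sketch, assuming a hypothetical `get_metric` tool backed by an in-memory store. The field names (`metric`, `value`, `unit`) are illustrative and do not come from any real API.

```python
import json
from typing import Optional, TypedDict


class MetricResult(TypedDict):
    metric: str
    value: float
    unit: str


# Hypothetical in-memory store standing in for a real metrics backend.
_FAKE_STORE = {
    "cpu_usage": MetricResult(metric="cpu_usage", value=0.42, unit="ratio"),
}

# A specific description: names every field and states the no-data contract.
GET_METRIC_DESCRIPTION = (
    "Returns a JSON object with fields metric (string), value (number), "
    "unit (string), or null when no data exists for the requested metric."
)


def get_metric(name: str) -> Optional[MetricResult]:
    """Return the metric record, or None (serialised as JSON null) on no data."""
    return _FAKE_STORE.get(name)


# What the model would see for a hit and for a miss.
hit_payload = json.dumps(get_metric("cpu_usage"), sort_keys=True)
miss_payload = json.dumps(get_metric("disk_usage"))
```

The point of the explicit `null` is that an empty result becomes a value the model can report, rather than a gap it is tempted to fill.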
Detection harness
The harness wraps tool calls and cross-checks the model’s output against what the tool actually returned. Without it, hallucinations are invisible until they cause a downstream incident.
- Wrap every tool call. Log tool name, args, return value, return time. Pair the log with the model’s subsequent output for the same step.
- Daily cross-check job. For each agent run, find statements in the model output that reference tool data. Cross-check against the actual tool returns; mismatches get flagged.
- Human review first. Route every mismatch to human review at first. Once the harness is calibrated and the false-positive rate is low, flag automatically without review.
- Tightly-scoped diff. Compare claims to specific tool returns rather than the full run. The narrower the comparison, the lower the noise.
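The wrapper and the tightly-scoped comparison described above might look roughly like the sketch below. It assumes that claims have already been extracted from the model output as plain values, which is the hard part in practice and is not shown. `Harness`, `ToolCallRecord`, `cross_check`, and the `get_metric` tool are hypothetical names used for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Iterator


@dataclass
class ToolCallRecord:
    step: int
    tool: str
    args: dict
    returned: Any
    returned_at: float
    model_output: str = ""  # filled in once the model responds for this step


@dataclass
class Harness:
    records: list = field(default_factory=list)

    def call(self, step: int, tool: Callable[..., Any], **args: Any) -> Any:
        """Run the tool and log its name, args, return value, and return time."""
        result = tool(**args)
        self.records.append(
            ToolCallRecord(step, tool.__name__, args, result, time.time())
        )
        return result

    def attach_output(self, step: int, text: str) -> None:
        """Pair the model's subsequent output with the logged call for that step."""
        for rec in self.records:
            if rec.step == step:
                rec.model_output = text


def leaf_values(obj: Any) -> Iterator[Any]:
    """Yield every scalar value nested inside a tool return."""
    if isinstance(obj, dict):
        for value in obj.values():
            yield from leaf_values(value)
    elif isinstance(obj, (list, tuple)):
        for value in obj:
            yield from leaf_values(value)
    else:
        yield obj


def cross_check(record: ToolCallRecord, claimed_values: list) -> list:
    """Return claimed values that are absent from this one step's tool return."""
    actual = list(leaf_values(record.returned))
    return [value for value in claimed_values if value not in actual]


def get_metric(name: str) -> dict:
    """Hypothetical tool used only for this demonstration."""
    return {"metric": name, "value": 0.42}


harness = Harness()
harness.call(1, get_metric, name="cpu_usage")
harness.attach_output(1, "CPU usage is 0.97")
mismatches = cross_check(harness.records[0], [0.97])  # claim not in the return
```

Comparing against a single record rather than the whole run is the design choice that keeps noise down: a value that happens to appear in some other step's output cannot mask a mismatch here.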
Common provocations
Four prompt patterns provoke hallucination reliably. Knowing them by name makes them avoidable in design review.
- Summarising empty output. Asking the model to summarise tool output that was empty. Fix in the prompt: tell the model to say so explicitly when there is nothing to summarise.
- Combining multiple tools. The model may attribute facts to the wrong source. Fix with structured output that names the source per claim.
- Extrapolating beyond data. “Based on the trend, what is next week’s value” lets the model fabricate a trend if the data did not show one.
- Comparing to memory. Asking the model to compare current output to a prior run encourages confabulation when the prior run is not in context.
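Two of the provocations above, empty output and mixed sources, can be blunted before the prompt is even assembled. This is a minimal sketch, assuming tool results are rendered into the prompt as text. The marker string and the `render_tool_result` helper are hypothetical.

```python
import json
from typing import Any

NO_DATA_MARKER = "TOOL RETURNED NO DATA"  # hypothetical marker text


def is_empty(result: Any) -> bool:
    """Treat None, empty strings, empty lists, and empty dicts as 'no data'."""
    return result is None or result == "" or result == [] or result == {}


def render_tool_result(tool: str, result: Any) -> str:
    """Render a tool result for the prompt, labelling its source and emptiness."""
    if is_empty(result):
        body = NO_DATA_MARKER
    else:
        body = json.dumps(result, sort_keys=True)
    return f"[source: {tool}] {body}"


# An empty alert list becomes an explicit statement, not a blank to fill in.
empty_line = render_tool_result("get_alerts", [])
metric_line = render_tool_result("get_metric", {"value": 1})
```

Note that a legitimate zero or `False` is not treated as empty, so a real reading of `0` still reaches the model as data.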
Prompt-level fixes
Four prompt-level changes catch most hallucinations before they ship. None require model changes.
- Explicit grounding clause. “Only use information that came from a tool call. If a tool returned no data, say so.” Include the clause verbatim.
- Show raw output. Show the model the raw tool output, not a pre-summarised version. Summarisation is where hallucinations creep in.
- Sourced structured output. Use structured output with a “sources” field. Each fact must attribute to a source; missing source raises a flag.
- Refusal pattern. When confidence is below threshold, the prompt instructs the agent to say “I cannot determine this from the available tools” rather than guessing.
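The sourced structured output check could be sketched as follows. The output schema (a `facts` list whose items carry `claim` and `sources`) and the call-ID format are assumptions made for illustration; the grounding clause is the one quoted above.

```python
import json

GROUNDING_CLAUSE = (
    "Only use information that came from a tool call. "
    "If a tool returned no data, say so."
)


def unsourced_claims(model_json: str, known_call_ids: set) -> list:
    """Return claims whose sources are missing or not among the logged tool calls."""
    payload = json.loads(model_json)
    flagged = []
    for fact in payload.get("facts", []):
        sources = fact.get("sources") or []
        # Flag a fact with no source, or with a source the harness never logged.
        if not sources or not set(sources) <= known_call_ids:
            flagged.append(fact.get("claim", ""))
    return flagged


# Hypothetical model output: one grounded fact, one unsourced, one mis-sourced.
example_output = json.dumps(
    {
        "facts": [
            {"claim": "CPU is 0.42", "sources": ["call-1"]},
            {"claim": "Disk is full", "sources": []},
            {"claim": "Memory is fine", "sources": ["call-9"]},
        ]
    }
)
flags = unsourced_claims(example_output, {"call-1"})
```

Checking sources against the set of logged call IDs, rather than only checking that the field is non-empty, also catches the case where the model invents a plausible-looking source.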
The cost of detection
The cost of detection is mostly compute and storage, not engineering effort. The harness pays for itself through caught regressions.
- Per-run overhead. The harness adds 10 to 20 percent to per-run cost. Mostly storage and one extra cross-check job.
- Cost recovered. Each hallucination caught early prevents a downstream incident. The arithmetic favours always-on detection.
- On for risky changes. Where always-on detection is not affordable, disable it in production only after the harness is calibrated and the agent is stable, and re-enable it on every prompt change.
- Eval-set integration. Tool-output evals run the harness inline. Regressions in grounding fail the eval, not just the production run.
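The arithmetic behind the overhead and cost-recovery bullets can be written down as a break-even rate. This is a rough sketch under strong assumptions: every caught hallucination would otherwise have become an incident, and the cost figures in the usage lines are placeholders, not measurements. Only the 10 to 20 percent overhead range comes from the text above.

```python
def break_even_incident_rate(
    run_cost: float, overhead_fraction: float, incident_cost: float
) -> float:
    """Fraction of runs that must yield a caught incident for detection to pay off."""
    if incident_cost <= 0:
        raise ValueError("incident_cost must be positive")
    return (run_cost * overhead_fraction) / incident_cost


# Placeholder costs in arbitrary units; substitute real figures.
low = break_even_incident_rate(run_cost=1.0, overhead_fraction=0.10, incident_cost=1000.0)
high = break_even_incident_rate(run_cost=1.0, overhead_fraction=0.20, incident_cost=1000.0)
```

If the observed rate of caught hallucinations per run is above the break-even rate, detection is recovering its cost; if incidents are cheap relative to runs, the margin narrows.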