Debugging an Agent That Made the Wrong Call
The five-question debug rubric. Was the tool result wrong? Was the prompt missing context? Was the model confused? Was the plan flawed? Was the output mis-parsed? Asked in this order, the questions usually locate the bug within the first two.
The five-question rubric
The rubric isolates the wrong call to one of five layers. Walking it in order shortens the debugging path because the most common bugs sit in the first two questions.
- Q1: was the tool result correct? Compare what the tool returned with reality. Garbage in, garbage out is the most common cause.
- Q2: did the prompt have the right context? The model can only reason on what it sees; missing context produces wrong calls.
- Q3: was the model confused? This is the residual once Q1 and Q2 are ruled out. Fixing it means prompt or model changes.
- Q4 and Q5: was the plan flawed, or was the output mis-parsed? Either the model correctly executed a wrong plan, or it produced the right answer and the parser misread it.
Why this order
The order is empirical. Walking it in this sequence shortens the average debug time by half compared to starting with prompt rewrites.
- Steps 1 and 2 cover 70 percent of cases. Bad data or missing context: most agent bugs sit here, so solve these first.
- Step 3 is the residual. Model confusion. Fixing it requires prompt or model changes; do not jump to it without ruling out 1 and 2.
- Steps 4 and 5 are infrastructure bugs. Less common but harder to diagnose without the rubric to direct attention.
- Stop on first match. Once one step explains the wrong call, stop. Working through the remaining steps is wasted effort.
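The ordered walk with stop-on-first-match can be sketched as a short loop. This is a minimal illustration, not a prescribed implementation: the `Run` record and its boolean fields are hypothetical, and in practice each flag is the verdict a person reaches after inspecting the logs for that layer.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical verdicts for one agent run; field names are assumptions.
@dataclass
class Run:
    tool_result_ok: bool
    context_complete: bool
    model_followed_prompt: bool
    plan_sound: bool
    output_parsed_ok: bool

# Each check returns True when that layer explains the wrong call.
# The list order is the rubric order.
RUBRIC: list[tuple[str, Callable[[Run], bool]]] = [
    ("Q1: tool result wrong", lambda r: not r.tool_result_ok),
    ("Q2: prompt missing context", lambda r: not r.context_complete),
    ("Q3: model confused", lambda r: not r.model_followed_prompt),
    ("Q4: plan flawed", lambda r: not r.plan_sound),
    ("Q5: output mis-parsed", lambda r: not r.output_parsed_ok),
]

def first_failing_layer(run: Run) -> Optional[str]:
    """Walk the rubric in order and stop on the first match."""
    for label, is_culprit in RUBRIC:
        if is_culprit(run):
            return label
    return None  # no layer explains the wrong call

run = Run(True, False, False, True, True)
print(first_failing_layer(run))  # prints "Q2: prompt missing context"
```

Here the model also looks confused (Q3), but the walk stops at Q2: a model that never saw the needed context cannot be blamed for reasoning without it.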
Artefacts you need to debug
Without the artefacts below, the rubric collapses into guesswork. Make them required logging on every agent run, not on-demand.
- Full prompt and response. The exact text sent to the model and the exact text returned. No summaries.
- Tool calls and responses. Each call, each response, each timestamp. The audit trail for steps 1 and 2.
- Surrounding infrastructure logs. Sometimes the bug is in the tool, not the agent. The only way to know is to compare the tool’s own logs with what the agent actually received from it.
- Reproducer. Eval case or production case that reproduces the wrong call. Without a reproducer, debugging is guesswork.
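One way to make these artefacts required rather than on-demand is to have every run write a single structured record. The sketch below assumes a hypothetical `RunRecord` shape; the class and field names are illustrative, not part of any particular framework.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ToolCall:
    name: str
    arguments: dict
    response: str   # exact tool output as the agent received it
    timestamp: str  # ISO 8601, UTC

@dataclass
class RunRecord:
    run_id: str
    prompt: str     # exact text sent to the model, no summary
    response: str   # exact text returned by the model
    tool_calls: list[ToolCall] = field(default_factory=list)

    def log_tool_call(self, name: str, arguments: dict, response: str) -> None:
        """Append one call with the time it was recorded."""
        stamp = datetime.now(timezone.utc).isoformat()
        self.tool_calls.append(ToolCall(name, arguments, response, stamp))

    def to_json(self) -> str:
        """Serialize the whole run as one JSON line for the log store."""
        return json.dumps(asdict(self), ensure_ascii=False)

# Hypothetical run: the values below are placeholders for illustration.
record = RunRecord(run_id="example-run", prompt="full prompt text",
                   response="full model response")
record.log_tool_call("lookup_order", {"order_id": 1}, '{"status": "shipped"}')
print(record.to_json())
```

Storing the tool response exactly as the agent saw it is what makes the later comparison against the tool’s own logs possible.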
Once you know which step is wrong
The fix depends on which question failed. Each step has a canonical remedy that scales without rewriting the whole agent.
- Step 1. Fix the tool, fix the wrapper, or add validation. The fix lives upstream of the model.
- Step 2. Add the missing context to the prompt or the agent’s input. Often a working-memory assembly bug.
- Step 3. Rephrase the prompt or switch models. Last resort because the change surface is large.
- Steps 4 and 5. Refactor the planning logic, or tighten the parser and use structured output.
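For the step 5 remedy, a tightened parser might look like the following. It is a sketch under assumptions: the model is asked to return a JSON object with `action` and `reason` keys, and the action set is hypothetical. The point is that the parser rejects anything unexpected instead of guessing.

```python
import json

ALLOWED_ACTIONS = {"refund", "escalate", "reply"}  # hypothetical action set

class ParseError(ValueError):
    """Raised when model output does not match the expected structure."""

def parse_decision(raw: str) -> dict:
    """Parse model output strictly; fail loudly rather than guess."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ParseError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ParseError("expected a JSON object")
    action = data.get("action")
    # Check the type first so unhashable values never reach the set lookup.
    if not isinstance(action, str) or action not in ALLOWED_ACTIONS:
        raise ParseError(f"unknown action: {action!r}")
    reason = data.get("reason")
    if not isinstance(reason, str):
        raise ParseError("missing or non-string 'reason'")
    return {"action": action, "reason": reason}

print(parse_decision('{"action": "escalate", "reason": "chargeback filed"}'))
```

A loud `ParseError` in the logs answers question 5 immediately, whereas a lenient parser that extracts a best guess from free text hides the mismatch until the wrong call has already been made.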
Add an eval case after the fix
The reproducer is the artefact that turns a one-off debug into a regression guard. Always commit the case alongside the fix.
- Reproducer becomes eval. The wrong call already has a reproducer from the debugging session. Convert it into an eval case verbatim.
- Run before and after. Run the case before the fix to prove it captures the bug, and again after the fix to prove the fix works. Commit both states.
- Loud regressions. Future regressions on this case will fail the eval gate. The suite has gained one more useful case.
- Avoid duplicate cases. If an existing eval case already covers the bug, update its assertions rather than committing a duplicate.
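The before-and-after check can be sketched as follows. Everything here is hypothetical: `EvalCase`, the two stand-in agent functions, and the sample input exist only to show the shape of the workflow, with the real agent taking the place of the stand-ins.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    agent_input: str      # reproducer input, copied verbatim
    expected_action: str  # what the agent should have done

def passes(case: EvalCase, agent: Callable[[str], str]) -> bool:
    """Return True when the agent makes the expected call on this case."""
    return agent(case.agent_input) == case.expected_action

# Hypothetical stand-ins for the agent before and after the fix.
def agent_before_fix(text: str) -> str:
    return "refund"

def agent_after_fix(text: str) -> str:
    return "escalate" if "chargeback" in text else "refund"

case = EvalCase(
    case_id="example-wrong-call",
    agent_input="Customer filed a chargeback on order 1",
    expected_action="escalate",
)

assert not passes(case, agent_before_fix)  # the case captures the bug
assert passes(case, agent_after_fix)       # the fix resolves it
```

If the first assertion fails, the case does not actually reproduce the wrong call and is not yet a regression guard.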