Why Your Agent Logs Should Pre-Date the LLM Call
Most agent logs start at the LLM response and miss the most important data: what the agent decided to send. The pre-call log line, with rationale, and how to use it to debug regressions.
The pre-call gap most logs have
Most agent logs start at the LLM response. That snapshot is too late to debug a wrong answer; the data the model used to produce the answer is already gone.
- Where logs typically start. “The model said X.” The data the model used to say X is lost before the operator looks.
- Why it hurts. When the model says something wrong, you cannot debug without that data. The operator stares at a wrong answer with no context.
- Pre-call logging fills the gap. Log what was sent before what came back. Five extra log lines per run are worth their weight.
- Cost vs benefit. The storage cost is small relative to the cost of an undebuggable agent regression.
What to log pre-call
Pre-call entries should be enough to reconstruct the call exactly. Four fields cover almost every case.
- Full prompt. System message, conversation messages, tool definitions. If too large to log raw, hash the body and log key fields.
- Model name and version. “GPT-4o” is not enough; include the date version so model upgrades are traceable.
- Tools available at call time. The agent’s available tools change per run; the snapshot matters when debugging tool-selection bugs.
- Working memory snapshot. The structured state the agent had assembled before the call. The model saw this; the operator must too.
What to log post-call
Post-call entries pair with the pre-call ones. Together they form the request-response unit that supports replay and debugging.
- Full response. Text, tool calls, and structured output if applicable. Raw, not summarised.
- Token counts. Input, output, and cache tokens. Latency tuning depends on the cache hit rate.
- End-to-end latency. Request-sent to response-received. Server-side latency alone is not enough.
- Errors and partial responses. Truncations, timeouts, retry triggers. Without these, post-mortems on failed runs hit a wall.
Why this matters when something breaks
The pre-call log lets you split the bug into one of three layers cleanly. Without it, the operator guesses across all three at once.
- Bug type 1: model wrong. Pre-call log shows the prompt was correct; the bug is the model. Switch model or prompt-engineer.
- Bug type 2: missing context. Pre-call log shows the missing field; the bug is in the agent’s working-memory assembly. Fix upstream.
- Bug type 3: malformed prompt. Pre-call log shows the malformation; the bug is in the prompt template. Fix the template.
- Bug type 4: tool drift. Pre-call tool snapshot shows a tool the agent expected was missing or had a changed schema. Fix the registry.
Managing log size
Full prompts are large; logging them all costs more than it returns. The tiered approach below keeps debugging cheap without paying the worst-case storage bill.
- Sample raw, hash the rest. Log raw for the first 1 percent of runs (debug sample); hash the rest with a body-stored-elsewhere reference.
- Cheap body store. The body store is object storage. Pull on demand when debugging; day-to-day operations do not pay the size cost.
- Tiered retention. Hot logs 7 days, body store 30 days. Past 30 days, only the hash and key fields remain.
- Sensitive data scrub. Strip secrets and PII at log time, not at query time. The cost of getting this wrong is much larger than the storage savings.