Why Your Agent Logs Should Pre-Date the LLM Call

Most agent logs start at the LLM response and miss the most important data: what the agent decided to send. The pre-call log line, with rationale, and how to use it to debug regressions.

The pre-call gap most logs have

Most agent logs start at the LLM response. That snapshot is too late to debug a wrong answer; the data the model used to produce the answer is already gone.

Where logs typically start. “The model said X.” The data the model used to say X is lost before the operator looks.
Why it hurts. When the model says something wrong, you cannot debug without that data. The operator stares at a wrong answer with no context.
Pre-call logging fills the gap. Log what was sent before what came back. Five extra log lines per run are worth their weight.
Cost vs benefit. The storage cost is small relative to the cost of an undebuggable agent regression.

What to log pre-call

Pre-call entries should be enough to reconstruct the call exactly. Four fields cover almost every case.

Full prompt. System message, conversation messages, tool definitions. If too large to log raw, hash the body and log key fields.
Model name and version. “GPT-4o” is not enough; include the date version so model upgrades are traceable.
Tools available at call time. The agent’s available tools change per run; the snapshot matters when debugging tool-selection bugs.
Working memory snapshot. The structured state the agent had assembled before the call. The model saw this; the operator must too.

What to log post-call

Post-call entries pair with the pre-call ones. Together they form the request-response unit that supports replay and debugging.

Full response. Text, tool calls, and structured output if applicable. Raw, not summarised.
Token counts. Input, output, and cache tokens. Latency tuning depends on the cache hit rate.
End-to-end latency. Request-sent to response-received. Server-side latency alone is not enough.
Errors and partial responses. Truncations, timeouts, retry triggers. Without these, post-mortems on failed runs hit a wall.

Why this matters when something breaks

The pre-call log lets you split the bug into one of three layers cleanly. Without it, the operator guesses across all three at once.

Bug type 1: model wrong. Pre-call log shows the prompt was correct; the bug is the model. Switch model or prompt-engineer.
Bug type 2: missing context. Pre-call log shows the missing field; the bug is in the agent’s working-memory assembly. Fix upstream.
Bug type 3: malformed prompt. Pre-call log shows the malformation; the bug is in the prompt template. Fix the template.
Bug type 4: tool drift. Pre-call tool snapshot shows a tool the agent expected was missing or had a changed schema. Fix the registry.

Managing log size

Full prompts are large; logging them all costs more than it returns. The tiered approach below keeps debugging cheap without paying the worst-case storage bill.

Sample raw, hash the rest. Log raw for the first 1 percent of runs (debug sample); hash the rest with a body-stored-elsewhere reference.
Cheap body store. The body store is object storage. Pull on demand when debugging; day-to-day operations do not pay the size cost.
Tiered retention. Hot logs 7 days, body store 30 days. Past 30 days, only the hash and key fields remain.
Sensitive data scrub. Strip secrets and PII at log time, not at query time. The cost of getting this wrong is much larger than the storage savings.