Agent Initialization: Loading Context Without Burning Tokens
The naive init loads everything into the prompt and burns thousands of tokens. A better init loads what is needed, in the order it is needed, with caching where caching helps.
The naive init that burns tokens
Naive init dumps everything: service description, runbook, last 100 incidents, full metric history, deploy log, on-call schedule. The prompt is 30k tokens before the agent does anything. Every run pays for the same setup.
Most of that context is irrelevant to the current run. The model still has to attend to every token in the prompt, but the agent uses almost none of it. The waste is silent and large.
At scale, the naive init is the dominant cost line. A team running 10k agent invocations a day, each with a 30k-token init, is paying for 300M tokens of mostly-wasted context per day.
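For contrast, here is a minimal sketch of what that naive build looks like; the section names and the dict-of-strings shape are illustrative, not from any particular codebase.

```python
# Hypothetical naive init: every section is concatenated up front,
# whether or not this run needs it. Section names are illustrative.
def build_naive_prompt(context: dict[str, str], alert: str) -> str:
    sections = [
        context["service_description"],
        context["runbook"],
        context["last_100_incidents"],
        context["metric_history"],
        context["deploy_log"],
        context["oncall_schedule"],
    ]
    # Roughly 30k tokens of setup before the agent has done anything.
    return "\n\n".join(sections) + "\n\nAlert:\n" + alert
```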
Just-in-time loading
Load the smallest context that lets the agent start: the alert payload, the service identifier, a one-paragraph service description. Everything else is loaded on demand via tool calls.
The agent decides what to load based on its initial reasoning. "I need recent deploys for service X" becomes a tool call; the result is added to working memory. The agent loads what it needs, when it needs it.
Just-in-time trades a bit of latency (each tool call is a network hop) for dramatic cost savings. For most production agents, the trade is favourable by 5-10x on cost.
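A minimal sketch of the just-in-time shape, assuming a generic tool-calling loop: the first prompt carries only the alert, the service identifier, and a one-paragraph summary, and everything else is a tool the agent calls when its reasoning asks for it. The loader stubs and the `model_call` interface are illustrative assumptions, not a specific SDK.

```python
def get_recent_deploys(service_id: str, hours: int = 24) -> str:
    return f"(deploys for {service_id}, last {hours}h)"  # stand-in for a deploy-API call

def get_recent_incidents(service_id: str, limit: int = 5) -> str:
    return f"({limit} most recent incidents for {service_id})"  # stand-in for incident store

TOOLS = {
    "get_recent_deploys": get_recent_deploys,
    "get_recent_incidents": get_recent_incidents,
}

def run(model_call, service_id: str, summary: str, alert: str) -> str:
    # Smallest context that lets the agent start: alert, service id, one-paragraph summary.
    messages = [{
        "role": "user",
        "content": (
            f"You are the on-call agent for {service_id}.\n"
            f"Service summary: {summary}\n\nAlert:\n{alert}\n\n"
            "Load deploys, incidents, or metrics via tools only when you need them."
        ),
    }]
    while True:
        reply = model_call(messages, tools=list(TOOLS))
        if reply["type"] == "final":
            return reply["text"]
        # Only context the agent explicitly asked for enters working memory.
        result = TOOLS[reply["name"]](service_id, **reply.get("arguments", {}))
        messages.append({"role": "tool", "name": reply["name"], "content": result})
```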
Where caching wins
Static context (the service description, the runbook, the team's on-call policy) is identical across runs. Cache it with the model provider: Anthropic prompt caching, OpenAI cached prompts, or the equivalent mechanism elsewhere.
Cache reads are 10x cheaper than cache misses. A 30k-token static block becomes essentially free after the first run that warms the cache. The economics flip.
Cache TTLs matter. The default 5-minute TTL works for high-frequency agents. For lower-frequency agents, raise the TTL or warm the cache deliberately. Cold caches are a tax you can engineer around.
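As one concrete mechanism, here is a hedged sketch using Anthropic's prompt caching via `cache_control` on a static system block; the runbook file path and model id are placeholders, and other providers differ in the details.

```python
import anthropic

client = anthropic.Anthropic()

# Identical across runs; placeholder path for the static runbook + service description.
STATIC_SYSTEM = open("runbook_and_service_description.md").read()

def triage(alert_payload: str):
    return client.messages.create(
        model="claude-3-5-sonnet-20241022",  # placeholder model id
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": STATIC_SYSTEM,
                # Mark the static prefix cacheable; reads are billed at a fraction
                # of the base input rate once the cache is warm (default ~5-minute TTL).
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Volatile content goes after the cached prefix.
        messages=[{"role": "user", "content": alert_payload}],
    )
```

For lower-frequency agents, a scheduled call against the same static prefix is one way to keep the cache warm instead of paying a cold-cache write on every real run.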
Prompt layout for cache hits
Put the static content first. Static system prompt, static runbook, static service description. Then the volatile content: the alert payload, the metrics window, the action history.
This ordering is non-negotiable for cache effectiveness. Caches key on the prompt prefix; any volatile content placed early invalidates everything after it. Discipline the layout from day one.
Validate the cache hit rate in your observability stack. The model providers report it in usage metadata; surface it in your dashboards. A drop from 80% to 30% is a regression worth investigating immediately.
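A sketch of surfacing the hit rate from the usage metadata the provider returns; the field names below are Anthropic's (`cache_read_input_tokens`, `cache_creation_input_tokens`), and the metrics emitter is a placeholder for whatever your dashboards ingest.

```python
def record_cache_metrics(usage, emit=print) -> float:
    # With Anthropic prompt caching, input_tokens covers only the uncached portion;
    # cached reads and cache writes are reported in separate fields.
    cached = getattr(usage, "cache_read_input_tokens", 0) or 0
    written = getattr(usage, "cache_creation_input_tokens", 0) or 0
    fresh = usage.input_tokens
    total = cached + written + fresh
    hit_rate = cached / total if total else 0.0
    emit({"agent.cache_hit_rate": hit_rate, "agent.prompt_tokens": total})
    return hit_rate
```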
Token budgets per run
Set a per-run token budget. Triage agents: 8k tokens. Investigation agents: 30k tokens. Postmortem agents: 100k tokens. Different roles, different budgets.
Track the distribution of actual usage. If p99 sits well below the budget, the budget is loose. If p50 approaches the budget, the budget is tight. Calibrate to leave headroom for unusual cases.
Alert on budget exhaustion. A run that hit its budget cap is a run that did not finish; surface it. Budget caps should be rare events, not routine.
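A sketch of those budgets as configuration, with a cap-exhaustion alert; the budget numbers come from the text above, and the alert hook is a placeholder for your paging or logging path.

```python
TOKEN_BUDGETS = {            # per-run budgets by agent role
    "triage": 8_000,
    "investigation": 30_000,
    "postmortem": 100_000,
}

def check_budget(role: str, tokens_used: int, alert=print) -> bool:
    budget = TOKEN_BUDGETS[role]
    if tokens_used >= budget:
        # A run that hit its cap did not finish; this should be rare, so surface it loudly.
        alert(f"{role} run exhausted its {budget:,}-token budget ({tokens_used:,} used)")
        return False
    return True
```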
What to do this week
Profile one of your agents. Measure tokens per run at p50, p95, and p99, then measure how much of the prompt is static versus volatile. If static content is over 40% of the prompt, add prompt caching, reorganise the prompt to put the static content first, and watch the cost graph drop. Most teams see a 50-70% cost reduction from this single change.
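A sketch of that profiling step, assuming you already log per-run prompt token counts and can tag which portion of each prompt was static; the record shape is an assumption.

```python
from statistics import quantiles

def profile(run_records: list[dict]) -> dict:
    # run_records: one dict per run, e.g. {"prompt_tokens": 31200, "static_tokens": 24000}
    totals = sorted(r["prompt_tokens"] for r in run_records)
    cuts = quantiles(totals, n=100)            # percentile cut points
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    static_share = sum(r["static_tokens"] for r in run_records) / sum(totals)
    return {"p50": p50, "p95": p95, "p99": p99, "static_share": static_share}
```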