Long-Running Agents: Memory, Recovery, Cost
A 30-second chatbot turn is easy. A 2-hour autonomous agent run is hard. State, recovery, and cost control are the engineering problems that appear once runs get long.
What changes at long horizons
For runs under a minute, the agent’s “state” is just its prompt. Long-running agents need persistent state, fault tolerance, and cost ceilings: problems familiar to anyone who has built distributed systems.
Durable memory
Three levels of memory:
- Working memory: the current LLM context. Limited to whatever fits in the context window.
- Episodic memory: what the agent has done so far in this run. Stored externally; queryable.
- Long-term memory: knowledge accumulated across runs. Stored as embeddings or structured data; retrieved when relevant.
The split is critical: working memory holds tactical detail, while episodic and long-term memory hold the strategic context that no longer fits in the window.
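The three tiers can be sketched roughly as follows. This is an illustration under stated assumptions, not a reference design: the class name `AgentMemory` is hypothetical, a word count stands in for real token counting, and keyword overlap stands in for embedding-based retrieval.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Hypothetical three-tier memory; names and sizes are illustrative."""
    max_working_tokens: int = 50                     # stand-in for the context window
    working: deque = field(default_factory=deque)    # what the LLM currently sees
    episodic: list = field(default_factory=list)     # full log of this run
    long_term: list = field(default_factory=list)    # facts kept across runs

    @staticmethod
    def _tokens(text: str) -> int:
        # Crude assumption: one whitespace-separated word is one token.
        return len(text.split())

    def record(self, event: str) -> None:
        """Log an event to episodic memory and add it to working memory."""
        self.episodic.append(event)
        self.working.append(event)
        # Evict the oldest entries once the working set outgrows the window.
        while (len(self.working) > 1 and
               sum(self._tokens(e) for e in self.working) > self.max_working_tokens):
            self.working.popleft()

    def remember(self, fact: str) -> None:
        """Store a fact that should outlive the current run."""
        self.long_term.append(fact)

    def recall(self, query: str, k: int = 3) -> list:
        """Return up to k long-term facts that share words with the query."""
        words = set(query.lower().split())
        scored = [(len(words & set(f.lower().split())), f) for f in self.long_term]
        scored = [(s, f) for s, f in scored if s > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [f for _, f in scored[:k]]
```

Events evicted from `working` remain in `episodic`, so the agent can still query what it did earlier in the run even after the detail has left the context.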
Recovery from failure
A 2-hour agent that crashes after 1.5 hours and starts over throws away 90 minutes of compute. Production agents persist their progress to a durable workspace at every step and resume from the last checkpoint on restart.
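A minimal sketch of per-step checkpointing, assuming the state fits in a JSON file on local disk. The function names, the `next_step`/`results` layout, and the `crash_at` parameter (used only to simulate a failure) are all hypothetical.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write to a temp file, then rename, so a crash mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # rename over the old checkpoint (atomic on POSIX)


def load_checkpoint(path: str) -> dict:
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"next_step": 0, "results": []}
    with open(path) as f:
        return json.load(f)


def run_agent(path: str, steps: list, crash_at=None) -> list:
    """Run steps in order, skipping those already recorded in the checkpoint."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        if crash_at == i:
            raise RuntimeError(f"simulated crash before step {i}")
        state["results"].append(steps[i]())  # results must be JSON-serialisable
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state["results"]
```

A crash between running a step and saving its checkpoint still causes that one step to run again on restart, so checkpointing alone does not guarantee that side effects happen exactly once.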
Idempotent tool calls matter here: if the agent restarts and re-issues an action, it must not double-charge a customer or send the same email twice.
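One common way to get this property is to derive a deterministic key for each action and skip any action whose key has already been recorded. The sketch below assumes a hypothetical `IdempotentExecutor` that keys on run ID, step index, tool name, and arguments; the in-memory dict stands in for durable storage.

```python
import hashlib
import json


class IdempotentExecutor:
    """Runs each (run, step, tool, args) action at most once.

    Hypothetical helper: `completed` is an in-memory dict here, but it would
    need to live in durable storage to survive a restart.
    """

    def __init__(self, completed=None):
        self.completed = {} if completed is None else completed

    @staticmethod
    def key(run_id: str, step: int, tool: str, args: dict) -> str:
        # sort_keys makes the key independent of argument ordering.
        payload = json.dumps([run_id, step, tool, args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, run_id: str, step: int, tool: str, args: dict, fn):
        k = self.key(run_id, step, tool, args)
        if k in self.completed:
            # Already executed: return the recorded result, no side effect.
            return self.completed[k]
        result = fn(**args)
        self.completed[k] = result
        return result
```

Including the step index lets the agent deliberately repeat an identical action later in the run. A crash between executing `fn` and recording its result is still possible, so where the downstream API accepts a client-supplied idempotency key, passing the same key through closes that gap on the server side.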
Cost control
Long-running agents can quietly burn through budget. Three guardrails:
- Hard token limit: agent halts if total tokens exceed N.
- Time limit: agent halts after T minutes regardless of progress.
- Step limit: agent halts after K iterations.
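The three limits overlap on purpose, belt and braces, because each catches a failure the others can miss: the token limit stops a few very expensive calls, the time limit stops a hung tool that consumes no tokens, and the step limit stops a tight loop of cheap calls. Skip any one of them and a runaway agent can eventually hand you a five-figure surprise.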
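The three guardrails can share one small object that the agent loop consults before every step. This is a sketch under assumptions: `Budget` and `BudgetExceeded` are hypothetical names, and the clock is injectable only so the limits can be tested without waiting.

```python
import time
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when any one of the three limits has been reached."""


class Budget:
    def __init__(self, max_tokens: int, max_seconds: float, max_steps: int,
                 clock: Callable[[], float] = time.monotonic):
        self.max_tokens = max_tokens
        self.max_seconds = max_seconds
        self.max_steps = max_steps
        self._clock = clock
        self._started = clock()
        self.tokens = 0
        self.steps = 0

    def charge(self, tokens: int) -> None:
        """Record one completed step and the tokens it consumed."""
        self.tokens += tokens
        self.steps += 1

    def check(self) -> None:
        """Call before each step; raises if any limit has been reached."""
        if self.tokens >= self.max_tokens:
            raise BudgetExceeded(f"token limit reached: {self.tokens}")
        elapsed = self._clock() - self._started
        if elapsed >= self.max_seconds:
            raise BudgetExceeded(f"time limit reached: {elapsed:.0f}s")
        if self.steps >= self.max_steps:
            raise BudgetExceeded(f"step limit reached: {self.steps}")
```

In an agent loop this would look like `budget.check()` before each LLM call and `budget.charge(tokens_used)` after it. Because the check runs between steps, it cannot interrupt a single call that hangs, so individual tool and LLM calls still need their own timeouts.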
Production patterns
- Manager + workers: a long-running manager dispatches short-lived worker tasks. Each worker has narrow scope. Manager state persists.
- Resumable jobs: the whole agent state (workspace, plan, results so far) is a serialisable JSON object. It is snapshotted at every step, so the run can resume from any snapshot.
- Human-checkpointed: the agent pauses at strategic milestones, summarises progress, and asks a human to confirm before starting the next phase.
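The three patterns compose naturally. The sketch below is one hypothetical way to combine them: a manager loops over phases, hands each task to a short-lived `worker` callable, persists its JSON-serialisable state after every task, and calls a `confirm` callback between phases. All names (`run_manager`, `new_state`, `persist`, `confirm`) are assumptions, and task names are assumed to be unique strings.

```python
import json
from typing import Callable


def new_state() -> dict:
    """Fresh manager state; everything in it is JSON-serialisable."""
    return {"phase": 0, "results": {}, "status": "running"}


def run_manager(phases: list, worker: Callable[[str], str],
                confirm: Callable[[str], bool], state: dict,
                persist: Callable[[dict], None]) -> dict:
    """Run phases in order, snapshotting after each task and each milestone."""
    state["status"] = "running"
    while state["phase"] < len(phases):
        idx = state["phase"]
        for task in phases[idx]:
            if task not in state["results"]:   # skip work done before a restart
                state["results"][task] = worker(task)
                persist(state)
        summary = f"phase {idx} done: {len(phases[idx])} task(s)"
        # Milestone: ask a human before starting the next phase, if there is one.
        if idx + 1 < len(phases) and not confirm(summary):
            state["status"] = "paused"
            persist(state)
            return state
        state["phase"] = idx + 1
        persist(state)
    state["status"] = "finished"
    persist(state)
    return state
```

Resuming is then just loading the last snapshot and calling `run_manager` again with it: completed tasks are skipped, and a paused run asks for confirmation again at the same milestone.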
Long-running agents are economically interesting precisely because they replace many short human work sessions with one autonomous run. The engineering to make them reliable is non-trivial; the patterns above are what production systems converge on.