By Samson Tanimawo, PhD. Published Feb 10, 2026.

Long-Running Agents: Memory, Recovery, Cost

A 30-second chatbot turn is easy. A 2-hour autonomous agent run is hard. State, recovery, and cost control are the engineering problems that emerge at that scale.

What changes at long horizons

For runs under a minute, the agent’s “state” is just its prompt. Long-running agents need persistent state, fault tolerance, and cost ceilings: problems familiar to anyone who has built distributed systems.

Durable memory

Three levels of memory:

Working memory: the current LLM context. Limited to whatever fits in the context window.
Episodic memory: what the agent has done so far in this run. Stored externally; queryable.
Long-term memory: knowledge accumulated across runs. Stored as embeddings or structured data; retrieved when relevant.

The split is critical: working memory holds tactical detail; episodic and long-term hold strategic context.
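The three levels can be sketched as a single small class. This is a minimal illustration, not a reference design: the class name, field names, and the `context_budget` parameter are all hypothetical, and the in-memory containers stand in for whatever external store (database, vector index) a real agent would use.

```python
from dataclasses import dataclass, field


@dataclass
class AgentMemory:
    """Illustrative three-tier memory; names are assumptions, not a real API."""
    context_budget: int = 3                                   # max entries kept in working memory
    working: list = field(default_factory=list)               # what goes into the next prompt
    episodic: list = field(default_factory=list)              # full log of this run
    long_term: dict = field(default_factory=dict)             # stand-in for a cross-run store

    def record_step(self, step: int, action: str, result: str) -> None:
        # Every step is logged in full to episodic memory...
        self.episodic.append({"step": step, "action": action, "result": result})
        # ...but only the most recent entries stay in working memory.
        self.working.append(f"{action} -> {result}")
        self.working = self.working[-self.context_budget:]

    def recall_episode(self, keyword: str) -> list:
        # Substring match as a stand-in for a real query (SQL, vector search).
        return [e for e in self.episodic
                if keyword in e["action"] or keyword in e["result"]]

    def remember(self, key: str, fact: str) -> None:
        # In a real system this write would go to storage that outlives the run.
        self.long_term[key] = fact
```

Working memory is deliberately lossy here: old steps fall out of the prompt but remain queryable through `recall_episode`.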

Recovery from failure

A 2-hour agent that crashes after 1.5 hours and starts over throws away 90 minutes of compute and tokens. Production agents persist their progress to a durable workspace at every step and resume from the last checkpoint on restart.
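A minimal sketch of checkpoint-and-resume, assuming the agent's progress fits in a JSON file on local disk. The function names, the state layout (`next_step`, `results`), and the `crash_at` test hook are hypothetical choices for illustration; a production system would more likely checkpoint to a database or object store.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional


def save_checkpoint(path: Path, state: dict) -> None:
    # Write to a temp file, then atomically rename it into place, so a crash
    # mid-write never leaves a half-written checkpoint behind.
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, str(path))


def load_checkpoint(path: Path) -> dict:
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "results": []}


def run_agent(path: Path, steps: list, crash_at: Optional[int] = None) -> dict:
    """Run steps in order, checkpointing after each one.

    crash_at is a test hook that simulates a failure before the given step.
    """
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        if crash_at == i:
            raise RuntimeError(f"simulated crash before step {i}")
        state["results"].append(steps[i]())
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state
```

Calling `run_agent` again with the same path after a crash skips every step already recorded. A crash between running a step and saving its checkpoint would still replay that one step.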

Idempotent tool calls matter here: if the agent restarts and re-issues an action, it shouldn’t double-charge a customer or send two emails.
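One common way to get this property is an idempotency key: derive a stable key from the run, the step, and the call's arguments, and record the result under that key before moving on. The sketch below is an assumption-laden illustration; `IdempotentExecutor` and its methods are hypothetical names, and the in-memory dict stands in for a ledger that would have to survive restarts.

```python
import hashlib
import json


class IdempotentExecutor:
    """Runs each (run_id, step, tool, args) combination at most once.

    The dict ledger is a stand-in; a real ledger must be durable
    (for example, a database table keyed by the idempotency key).
    """

    def __init__(self, ledger=None):
        self.ledger = {} if ledger is None else ledger

    @staticmethod
    def key(run_id: str, step: int, tool_name: str, args: dict) -> str:
        # sort_keys makes the key independent of argument ordering.
        payload = json.dumps([run_id, step, tool_name, args], sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def call(self, run_id: str, step: int, tool_name: str, tool_fn, args: dict):
        k = self.key(run_id, step, tool_name, args)
        if k in self.ledger:
            # Replay after a restart: return the recorded result, no side effect.
            return self.ledger[k]
        result = tool_fn(**args)
        self.ledger[k] = result
        return result
```

A local ledger still leaves a small window between the side effect and the ledger write. Where the downstream service accepts a client-supplied idempotency key, passing the same key through closes that window on the server side.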

Cost control

Long-running agents can quietly burn through budget. Three guardrails:

Hard token limit: agent halts if total tokens exceed N.
Time limit: agent halts after T minutes regardless of progress.
Step limit: agent halts after K iterations.

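The three overlap by design, belt and braces: a loop of cheap steps can run for hours without reaching the token limit, and a handful of huge-context steps can exhaust the budget long before the step limit. Skip any one of them and a runaway agent can eventually cost you a five-figure surprise.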
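The three guardrails fit naturally into one object that the agent loop consults before every step. This is a sketch under stated assumptions: `Budget`, `BudgetExceeded`, and `run_with_budget` are hypothetical names, the time limit is expressed in seconds rather than minutes, and `step_fn` is assumed to report how many tokens each step consumed.

```python
import time
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when any ceiling is hit; the agent loop should stop cleanly."""


@dataclass
class Budget:
    max_tokens: int      # N: hard token limit
    max_seconds: float   # T: wall-clock limit (seconds in this sketch)
    max_steps: int       # K: iteration limit
    tokens_used: int = 0
    steps_taken: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def check(self) -> None:
        # Called before each step; any single ceiling is enough to halt.
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token limit: {self.tokens_used} > {self.max_tokens}")
        elapsed = time.monotonic() - self.started_at
        if elapsed > self.max_seconds:
            raise BudgetExceeded(f"time limit: {elapsed:.0f}s > {self.max_seconds}s")
        if self.steps_taken >= self.max_steps:
            raise BudgetExceeded(f"step limit: {self.steps_taken} steps")

    def charge(self, tokens: int) -> None:
        # Called after each step with that step's token usage.
        self.tokens_used += tokens
        self.steps_taken += 1


def run_with_budget(budget: Budget, step_fn) -> Budget:
    """step_fn() must return (done, tokens_used_by_this_step)."""
    while True:
        budget.check()
        done, tokens = step_fn()
        budget.charge(tokens)
        if done:
            return budget
```

Because the check runs before each step, a single step can overshoot the token limit by its own usage; a stricter variant would estimate the cost of the step before issuing it.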

Production patterns

Manager + workers: a long-running manager dispatches short-lived worker tasks. Each worker has narrow scope. Manager state persists.
Resumable jobs: the whole agent state (workspace, plan, results so far) is a serialisable JSON object. Snapshot it at every step, and the job can resume from any snapshot.
Human-checkpointed: the agent pauses at strategic milestones, summarises progress, and asks a human to confirm before the next phase.
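The first two patterns combine naturally, and a compact sketch shows how: a manager holds all persistent state as a JSON-serialisable dict and hands narrow tasks to short-lived workers. Everything here is hypothetical, including the `Manager` and `worker` names and the state layout; the worker's uppercase transform is a placeholder for real LLM or tool calls.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def worker(task: dict) -> dict:
    # Short-lived and narrowly scoped: a worker sees only its own task.
    # Placeholder logic; a real worker would call an LLM or a tool here.
    return {"id": task["id"], "output": task["text"].upper()}


class Manager:
    """Long-running coordinator whose entire state is JSON-serialisable."""

    def __init__(self, state: dict):
        self.state = state

    @classmethod
    def new(cls, tasks: list) -> "Manager":
        return cls({"pending": list(tasks), "done": {}})

    def snapshot(self) -> str:
        return json.dumps(self.state)

    @classmethod
    def restore(cls, blob: str) -> "Manager":
        return cls(json.loads(blob))

    def run_batch(self, batch_size: int = 2) -> None:
        # Dispatch the next few pending tasks to workers, then record results.
        batch = self.state["pending"][:batch_size]
        with ThreadPoolExecutor(max_workers=batch_size) as pool:
            for result in pool.map(worker, batch):
                self.state["done"][result["id"]] = result["output"]
        self.state["pending"] = self.state["pending"][batch_size:]
```

Taking a `snapshot()` after each batch and calling `Manager.restore` on restart gives the resumable-job behaviour. A human checkpoint could be added by pausing between batches until someone approves the next one.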

Long-running agents are economically interesting precisely because they replace many short human work sessions with one autonomous run. The engineering to make them reliable is non-trivial; the patterns above are what production systems converge on.