By Samson Tanimawo, PhD. Published Aug 15, 2026.

Long-Running Agents: Memory, Recovery, Cost

A 30-second chatbot turn is easy. A 2-hour autonomous agent run is hard. State, recovery, and cost-control are the engineering problems that change at scale.

What changes at long horizons

Short agents finish in minutes. Long-running agents persist for hours, days, or weeks. The patterns that work at minutes break at days. State doesn't fit in context; processes crash and need recovery; cost compounds; debugging spans many sessions. Building long-running agents is engineering work that goes beyond LLM prompting.

The state shift. Short agents fit their entire state in the prompt. Long agents accumulate state that exceeds context. They need durable memory (a database, a vector store, or a structured plan log) that the agent reads and writes across sessions.

The reliability shift. Short agents that crash can be rerun. Long agents that crash mid-task lose hours of work; rerunning isn't an option. Recovery becomes a first-class concern: which step were we on, what's already been done, where do we resume.

The cost shift. Short agents cost cents. Long agents cost dollars to hundreds of dollars. Cost monitoring goes from "nice to have" to "required". A misconfigured long agent can spend $1,000 in a day; you need alerts before, not after.

The observability shift. Short agents are debuggable from logs. Long agents need telemetry: per-session traces, plan-state visualisation, replay capability. The investment in observability is what makes long-running agents debuggable in production.

Durable memory

Three patterns: a vector store, a structured store, and a running document.

Most production agents use all three. Vector for semantic recall; structured for typed state; document for narrative. Each plays a different role; combining them is more powerful than any alone.

The vector pattern in detail. Embed every observation; store with metadata (timestamp, source, type). On each step, retrieve the top-K most relevant observations for the current context. Pros: scales to huge memory; flexible recall. Cons: retrieval quality matters a lot; bad retrieval = bad context.
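The retrieve-top-K step can be sketched as below. This is a minimal illustration, not a production design: `Observation`, `cosine`, and `top_k` are hypothetical names, the three-dimensional vectors stand in for real embeddings, and the linear scan would normally be replaced by a vector store.

```python
import math
from dataclasses import dataclass


@dataclass
class Observation:
    text: str
    embedding: list[float]  # assumed to come from an embedding model
    timestamp: float        # metadata stored alongside the vector
    source: str
    kind: str               # the observation "type" from the text


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: list[float], memory: list[Observation], k: int = 3) -> list[Observation]:
    """Return the k observations most similar to the query embedding."""
    ranked = sorted(memory, key=lambda o: cosine(query, o.embedding), reverse=True)
    return ranked[:k]


memory = [
    Observation("deploy failed on staging", [1.0, 0.0, 0.0], 1.0, "ci", "event"),
    Observation("customer asked for a refund", [0.0, 1.0, 0.0], 2.0, "email", "message"),
    Observation("rollback completed", [0.9, 0.1, 0.0], 3.0, "ci", "event"),
]
relevant = top_k([1.0, 0.0, 0.0], memory, k=2)
```

With the query pointing along the first axis, the two CI events rank above the unrelated email, which is the behaviour the agent relies on when it rebuilds context for a step.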

The structured pattern in detail. Define entities (Customer, Ticket, Order) and relationships. The agent reads via queries ("get all open tickets for Customer X") and writes via mutations ("close Ticket Y"). Pros: precise; supports complex queries. Cons: requires schema design; brittle if data shape evolves.
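As a rough sketch of the query-and-mutation interface, here is the Customer and Ticket example on an in-memory SQLite database. The table layout and the helper names `open_tickets` and `close_ticket` are assumptions made for illustration; a real deployment would use its own schema and a persistent database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE ticket (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        status TEXT NOT NULL CHECK (status IN ('open', 'closed'))
    );
""")
conn.execute("INSERT INTO customer (id, name) VALUES (1, 'Customer X')")
conn.executemany(
    "INSERT INTO ticket (id, customer_id, status) VALUES (?, ?, ?)",
    [(10, 1, "open"), (11, 1, "open"), (12, 1, "closed")],
)
conn.commit()


def open_tickets(customer_id: int) -> list[int]:
    """Read path: get all open tickets for a customer."""
    rows = conn.execute(
        "SELECT id FROM ticket WHERE customer_id = ? AND status = 'open' ORDER BY id",
        (customer_id,),
    )
    return [row[0] for row in rows]


def close_ticket(ticket_id: int) -> None:
    """Write path: close a ticket. Running it twice leaves the same state."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE ticket SET status = 'closed' WHERE id = ?", (ticket_id,))


before = open_tickets(1)
close_ticket(10)
after = open_tickets(1)
```

The `CHECK` constraint is a small example of the schema work this pattern demands: it rejects states the agent should never write, at the price of a migration whenever the set of statuses changes.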

The document pattern in detail. The agent maintains a markdown file: project goals, decisions, open questions, next steps. Reads the file at session start; writes updates at session end. Pros: human-inspectable; great for handoff to operators. Cons: prone to bloat; agent must summarise periodically.

Recovery

Long agents will crash. Plan for it. Checkpoint state after every significant action. On restart, load the latest checkpoint. The agent should resume rather than restart: if every crash sends it back to the beginning, the retries can end up costing more than the original task.

The checkpoint design. After each tool call (or each plan-step completion), persist: current plan, completed steps, pending steps, observations gathered, intermediate results. The granularity is a trade-off: too fine (per-token) is overhead; too coarse (per-day) loses too much on crash. Per-step is usually right.
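One way to persist a per-step checkpoint is a JSON file written atomically, so a crash during the write never leaves a half-written file. The sketch below assumes a local filesystem; the `Checkpoint` fields mirror the list above, and `save_checkpoint` and `load_checkpoint` are hypothetical helper names.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field


@dataclass
class Checkpoint:
    plan: list[str]
    completed_steps: list[str] = field(default_factory=list)
    pending_steps: list[str] = field(default_factory=list)
    observations: list[str] = field(default_factory=list)
    intermediate_results: dict[str, str] = field(default_factory=dict)


def save_checkpoint(cp: Checkpoint, path: str) -> None:
    """Write to a temp file in the same directory, then rename over the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(asdict(cp), f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach disk before the rename
    os.replace(tmp_path, path)  # atomic replacement of the previous checkpoint


def load_checkpoint(path: str) -> Checkpoint:
    with open(path) as f:
        return Checkpoint(**json.load(f))


workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "checkpoint.json")
steps = ["fetch", "analyse", "report"]
cp = Checkpoint(plan=steps, pending_steps=list(steps))

# After a step completes, move it from pending to completed and persist.
cp.completed_steps.append(cp.pending_steps.pop(0))
cp.observations.append("fetched 120 rows")
save_checkpoint(cp, path)

restored = load_checkpoint(path)
```

Calling `save_checkpoint` once per completed step matches the per-step granularity described above. A database row per step would serve the same purpose when several workers share the task.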

The resume protocol. On startup, load latest checkpoint. Reconstruct context (the agent's prompt) from the checkpoint state. Continue from where the previous run stopped. The reconstruction is a non-trivial step; build it deliberately, not as an afterthought.
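The reconstruction step might look like the following sketch. It assumes the checkpoint has been loaded into a plain dict with the fields named in the checkpoint design; `build_resume_prompt` and the prompt wording are illustrative choices, not a prescribed format.

```python
def build_resume_prompt(state: dict) -> str:
    """Rebuild the agent's working context from checkpointed state."""
    done = set(state["completed_steps"])
    lines = ["You are resuming an interrupted task.", "", "Plan:"]
    for step in state["plan"]:
        mark = "x" if step in done else " "
        lines.append(f"[{mark}] {step}")
    lines += ["", "Observations so far:"]
    lines += [f"- {obs}" for obs in state["observations"]]
    lines.append("")
    if state["pending_steps"]:
        lines.append(f"Continue with: {state['pending_steps'][0]}")
    else:
        lines.append("All steps are complete; write the final summary.")
    return "\n".join(lines)


state = {
    "plan": ["fetch", "analyse", "report"],
    "completed_steps": ["fetch"],
    "pending_steps": ["analyse", "report"],
    "observations": ["fetched 120 rows"],
}
prompt = build_resume_prompt(state)
```

Keeping this as a pure function of the checkpoint makes it easy to unit-test: the same checkpoint should always produce the same prompt, regardless of what the crashed process had in memory.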

The idempotency requirement. Replaying a checkpoint shouldn't redo work that is already done (re-emailing the customer, re-charging the card). Tool calls must be idempotent: either naturally so (an HTTP DELETE is) or guarded by application-level dedup (request IDs, idempotency keys).
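A minimal sketch of application-level dedup with idempotency keys follows. The in-memory `_completed` dict stands in for a durable store, and `idempotency_key`, `call_once`, and `send_email` are hypothetical names; nothing here is a specific library's API.

```python
import hashlib
import json

_completed: dict[str, str] = {}  # stand-in for a durable key -> result store
sent_emails: list[str] = []      # records the side effect for the example


def idempotency_key(task_id: str, step: str, args: dict) -> str:
    """Derive a stable key from the task, the step, and the call arguments."""
    payload = json.dumps({"task": task_id, "step": step, "args": args}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


def send_email(to: str) -> str:
    """A non-idempotent tool: calling it twice sends two emails."""
    sent_emails.append(to)
    return f"sent to {to}"


def call_once(key: str, fn, *args) -> str:
    """Run fn only if this key has no recorded result; otherwise replay the result."""
    if key in _completed:
        return _completed[key]
    result = fn(*args)
    _completed[key] = result
    return result


key = idempotency_key("task-42", "notify-customer", {"to": "a@example.com"})
first = call_once(key, send_email, "a@example.com")
second = call_once(key, send_email, "a@example.com")  # replay after a restart
```

This guard still leaves a gap if the process dies after the email goes out but before the key is recorded. Where the downstream service accepts an idempotency key of its own, passing the same key through closes that gap.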

The crash-test discipline. Periodically kill the agent mid-task; verify it resumes correctly. If you don't test recovery, recovery is broken when you need it. Once a week is reasonable; "never" guarantees production failures.
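The shape of such a test can be sketched in-process, with an exception standing in for the kill. All names here are hypothetical, and a real crash test would terminate the actual process and reload the checkpoint from durable storage; the assertion to carry over is that every step runs exactly once across the crash and the restart.

```python
class SimulatedCrash(Exception):
    """Stands in for the process being killed mid-task."""


def run(steps, checkpoint, executed, crash_before=None):
    """Run pending steps, recording progress in `checkpoint` after each one."""
    for step in steps:
        if step in checkpoint["completed"]:
            continue  # already done in a previous run
        if step == crash_before:
            raise SimulatedCrash(step)
        executed.append(step)                 # stand-in for the real side effect
        checkpoint["completed"].append(step)  # persist after every step


steps = ["fetch", "analyse", "report"]
checkpoint = {"completed": []}  # survives the crash, like a file or database row
executed = []                   # every side effect that actually happened

try:
    run(steps, checkpoint, executed, crash_before="analyse")
except SimulatedCrash:
    pass  # the first run died after "fetch"

run(steps, checkpoint, executed)  # restart from the surviving checkpoint
```

After the second run, `executed` holds each step once and in order. A duplicated or missing entry would point to a checkpointing or idempotency bug.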

Cost control

Long agents can run up surprising bills. Set hard caps per task and per day. Alert when spend approaches a cap. Cap tasks, not requests: a stuck loop stays under any per-request limit while the total keeps climbing.

The per-task cap. "This task is allowed at most $X of compute spend." When the cap is hit, the agent stops and surfaces to a human. The cap forces conscious budgets; without it, edge cases produce $1,000 invoices.
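A per-task cap can be as small as the sketch below. `TaskBudget` and `BudgetExceeded` are hypothetical names, and the example assumes the caller can estimate a call's cost before making it; the dollar figures are made up for illustration.

```python
class BudgetExceeded(Exception):
    """Raised when a charge would push a task past its cap."""


class TaskBudget:
    def __init__(self, task_id: str, cap_usd: float):
        self.task_id = task_id
        self.cap_usd = cap_usd
        self.spent_usd = 0.0

    def charge(self, amount_usd: float) -> None:
        """Record spend, refusing any charge that would exceed the cap."""
        if self.spent_usd + amount_usd > self.cap_usd:
            raise BudgetExceeded(
                f"{self.task_id}: cap ${self.cap_usd:.2f} reached "
                f"(spent ${self.spent_usd:.2f}); needs human review"
            )
        self.spent_usd += amount_usd


budget = TaskBudget("task-42", cap_usd=1.00)
calls = 0
stopped = ""
try:
    while True:              # a stuck loop: every request looks cheap on its own
        budget.charge(0.25)  # check the budget before each model or tool call
        calls += 1
except BudgetExceeded as exc:
    stopped = str(exc)       # surface this to a human instead of retrying
```

The loop makes four calls and is stopped on the fifth. No single request was expensive, which is why the limit has to sit at the task level.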

The per-day cap. "Total agent spend across all tasks shall not exceed $Y/day." When hit, all running tasks pause. Stops runaway scenarios where one bug spawns 100 expensive tasks.

The alert thresholds. 50% of cap is information; 75% is warning; 90% is action. Engineers should hear about budget approach before exhaustion, not after. The alert protocol is what prevents one bad day from being a six-figure bad day.
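The threshold ladder maps directly onto a small lookup. This sketch uses the 50/75/90 split from the paragraph above; `THRESHOLDS` and `alert_level` are hypothetical names, and routing each level to a channel is left to whatever alerting system is in use.

```python
from typing import Optional

# (fraction of cap, level), checked from most to least severe
THRESHOLDS = [(0.90, "action"), (0.75, "warning"), (0.50, "info")]


def alert_level(spent_usd: float, cap_usd: float) -> Optional[str]:
    """Return the most severe level whose threshold has been crossed, if any."""
    if cap_usd <= 0:
        raise ValueError("cap_usd must be positive")
    fraction = spent_usd / cap_usd
    for threshold, level in THRESHOLDS:
        if fraction >= threshold:
            return level
    return None


levels = [alert_level(spent, 100.0) for spent in (40.0, 50.0, 80.0, 95.0)]
```

In practice each level would be sent once per task or per day rather than on every check, so that a task hovering at 76% does not page someone repeatedly.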

The cost-attribution requirement. Tag every spend by task and feature. Aggregate spend isn't actionable; per-task spend tells you which tasks are economic and which aren't. Attribution is also what funds optimisation work: concrete cost figures justify concrete engineering investment.
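Attribution needs nothing more than tags on each spend record and a group-by. The sketch below assumes spend events are dicts with `task`, `feature`, and `usd` keys; the field names, the `spend_by` helper, and the figures are all invented for illustration.

```python
from collections import defaultdict

# Each model or tool call emits one tagged spend event.
spend_events = [
    {"task": "t1", "feature": "triage", "usd": 0.50},
    {"task": "t1", "feature": "summary", "usd": 0.25},
    {"task": "t2", "feature": "triage", "usd": 1.50},
]


def spend_by(events: list[dict], tag: str) -> dict[str, float]:
    """Total spend grouped by one tag, such as 'task' or 'feature'."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event[tag]] += event["usd"]
    return dict(totals)


by_task = spend_by(spend_events, "task")
by_feature = spend_by(spend_events, "feature")
```

The same events answer both questions: which task was expensive (`by_task`) and which capability drives the bill across tasks (`by_feature`).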

Production patterns

Two patterns we've seen work: plan-bounded and time-bounded.

Both bound the agent's autonomy. Pure unbounded agents are operationally too risky.

The plan-bounded pattern in detail. Phase 1: the agent reads the task and writes a plan. Phase 2: it executes the plan. The plan lives in durable memory, and updates to it are logged. The plan is also the user-visible artifact: operators can read it to see what is being done.

The time-bounded pattern in detail. Agent runs for X hours doing whatever it thinks best. At X hours, it surfaces a status: what was done, what's next, what blockers. Human reviews and decides: continue, stop, redirect. The time bound is the safety valve.
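A time-bounded loop can be sketched as follows. `run_time_bounded` is a hypothetical helper, `step_fn` stands in for one agent step, and the clock is injectable so the example runs instantly; in real use the default `time.monotonic` would apply and the budget would be measured in hours.

```python
import time


def run_time_bounded(step_fn, budget_seconds: float, clock=time.monotonic) -> dict:
    """Run agent steps until the time budget is spent, then return a status."""
    deadline = clock() + budget_seconds
    done = []
    finished = False
    while clock() < deadline:
        result = step_fn()
        if result is None:  # the agent reports that nothing is left to do
            finished = True
            break
        done.append(result)
    # The status is what a human reviews before choosing continue, stop, or redirect.
    return {"done": done, "finished": finished, "needs_review": not finished}


class FakeClock:
    """Test clock: time only moves when a step advances it."""

    def __init__(self):
        self.now = 0.0

    def __call__(self) -> float:
        return self.now


clock = FakeClock()


def step() -> str:
    clock.now += 1.0  # pretend each step takes one second
    return f"step finished at t={clock.now:.0f}"


status = run_time_bounded(step, budget_seconds=3.0, clock=clock)
```

The deadline is checked between steps, so a single long step can overrun the budget. A hard timeout on individual tool calls would be needed to tighten that.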

The hybrid. Time-bounded with periodic plan review. Every N hours, the agent surfaces a plan update; human approves; agent continues. Bounds long-running autonomy while still extracting value over multi-day timescales.

Common antipatterns

Stuffing all state in the prompt. Eventually exceeds context; the agent starts forgetting. Move long-term state to durable memory.

No checkpoints. One crash and a day's work is lost. Checkpoint per significant step.

No cost caps. One bug runs for 12 hours and bills $5k. Caps prevent the worst-case scenarios.

No human checkpoint for autonomy > 4 hours. The longer the agent runs unsupervised, the more drift accumulates. Build in periodic human review even for "trusted" agents.

What to do this week

Three moves. (1) For your longest-running agent, document its state model: what's in prompt context, what's in durable memory, what's in external systems. The map surfaces gaps. (2) Add per-task cost caps if missing. The cost of building the cap is small; the cost of not having one is the next runaway bill. (3) Test recovery: kill an agent mid-task; verify it resumes correctly. If recovery doesn't work, you've found a bug worth fixing immediately.