Why SRE Agents Need Two Memory Tiers
Working memory for the current incident. Long-term memory for past ones. The schema, the retrieval strategy, and the eviction policy that keep the right context in the prompt.
Why one memory tier is not enough
Working memory holds the current incident: the alert, the metrics, the actions taken so far. It is small, hot, and volatile. Long-term memory holds patterns from past incidents: what worked, what did not, who owned what. It is large, cold, and stable.
Mixing the two breaks both. If you put past incidents into working memory, your prompt blows up. If you treat the current incident as long-term, it pollutes future retrievals. Two tiers is the minimum architecture.
Most agents that struggle in production are running on one tier. The fix is rarely in the model; it is in splitting the memory.
Working memory belongs in the prompt
Working memory is whatever the agent needs to keep in context for the current run. The alert payload, the recent metric values, the steps taken so far. It is shaped by the run, not the system. It dies when the run ends.
Keep working memory tight. The temptation is to dump everything in. Resist. A well-structured working memory of 2k tokens beats a sprawling one of 20k. The model attends better; the cost is lower.
Use a structured format: JSON with named fields, not free-form text. The model handles structure well, and you can reason about what is in working memory at any point.
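A minimal sketch of that structure, assuming a hypothetical schema (the field names `alert`, `recent_metrics`, `actions_taken`, and `hypothesis` are illustrative, not a standard):

```python
import json
from dataclasses import dataclass, field, asdict

# Hypothetical working-memory schema; dies with the run, never indexed.
@dataclass
class WorkingMemory:
    alert: dict                  # the triggering alert payload
    recent_metrics: dict         # latest value per metric name
    actions_taken: list = field(default_factory=list)  # steps so far
    hypothesis: str = ""         # current working theory

    def to_prompt(self) -> str:
        # Serialize as JSON with named fields, not free-form text,
        # so you can always see exactly what is in working memory.
        return json.dumps(asdict(self), indent=2)

wm = WorkingMemory(
    alert={"service": "checkout", "symptom": "p99 latency spike"},
    recent_metrics={"p99_ms": 2300, "error_rate": 0.04},
)
wm.actions_taken.append("restarted checkout-worker pool")
```

Because every field is named, trimming working memory is a deliberate choice about which field to drop, not a guess about which paragraph to cut.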
Long-term memory belongs in a vector store
Long-term memory is the corpus of past incidents, runbooks, and postmortems. It is too big to fit in the prompt. It is queried via retrieval, with the top-k matching documents pasted into the prompt at the moment of need.
Pick a vector store that supports filters. "Find me past incidents matching this signature, but only for service X, in the last 90 days, that resulted in a successful remediation." Filters are how you keep retrievals on-topic.
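The quoted query can be sketched as a metadata-filtered search. This is an in-memory stand-in, not a real vector-store API: a real store applies the same predicate server-side before ranking by similarity, and the document fields here are assumptions.

```python
# Illustrative corpus; "score" stands in for cosine similarity
# against the incident-signature query embedding.
incidents = [
    {"id": 1, "service": "checkout", "age_days": 30,  "remediated": True,  "score": 0.91},
    {"id": 2, "service": "checkout", "age_days": 200, "remediated": True,  "score": 0.95},
    {"id": 3, "service": "billing",  "age_days": 10,  "remediated": True,  "score": 0.88},
    {"id": 4, "service": "checkout", "age_days": 45,  "remediated": False, "score": 0.90},
]

def filtered_search(docs, service, max_age_days=90, top_k=3):
    # Filter on metadata first, then rank survivors by similarity.
    matches = [d for d in docs
               if d["service"] == service
               and d["age_days"] <= max_age_days
               and d["remediated"]]
    return sorted(matches, key=lambda d: d["score"], reverse=True)[:top_k]

hits = filtered_search(incidents, service="checkout")
```

Note that the highest-scoring document (id 2) is excluded by the age filter; that is the point. Similarity alone would have surfaced a stale incident.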
Index the corpus thoughtfully. The naive index treats every postmortem as one document; the better index splits it into sections, each tagged with the incident's signature. Retrieval quality is dominated by indexing quality.
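A sketch of section-level indexing: split each postmortem on its headings and tag every chunk with the incident's signature, so retrieval matches at section granularity. The `## ` heading convention and the chunk fields are assumptions about your postmortem format.

```python
def index_postmortem(text, signature):
    # Split one postmortem into per-section chunks, each carrying
    # the incident signature as retrieval metadata.
    chunks = []
    section, lines = "summary", []
    for line in text.splitlines():
        if line.startswith("## "):
            if lines:
                chunks.append({"signature": signature,
                               "section": section,
                               "text": "\n".join(lines)})
            section, lines = line[3:].strip().lower(), []
        else:
            lines.append(line)
    if lines:
        chunks.append({"signature": signature, "section": section,
                       "text": "\n".join(lines)})
    return chunks

doc = "## Timeline\n09:00 alert fired\n## Remediation\nrolled back deploy"
chunks = index_postmortem(doc, signature="checkout-p99-spike")
```

With this split, a query about remediation retrieves the remediation section alone instead of the whole postmortem, which is what keeps excerpts small later.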
The retrieval strategy that works in production
Retrieve at two points: at the start of the run (broad context about the service and recent incidents), and at decision points (targeted lookup for the current hypothesis). Two retrievals, each scoped, beats one broad retrieval.
Cap the retrieval payload size. The agent does not need every matching document; it needs the top three with the most relevant excerpt from each. Caps keep the prompt focused and the cost predictable.
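The cap can be sketched as a small shaping step between the store and the prompt. `results` is assumed to be (document, score) pairs already sorted by relevance; the top-3 and excerpt-length caps are the illustrative knobs.

```python
def retrieval_payload(results, top_k=3, excerpt_chars=500):
    # Keep only the top-k documents and a bounded excerpt from each,
    # so prompt size and cost stay predictable.
    payload = []
    for doc, score in results[:top_k]:
        payload.append({"title": doc["title"],
                        "excerpt": doc["text"][:excerpt_chars],
                        "score": round(score, 2)})
    return payload

results = [({"title": "INC-101", "text": "x" * 2000}, 0.93),
           ({"title": "INC-087", "text": "rollback fixed it"}, 0.88),
           ({"title": "INC-044", "text": "dns cache expiry"}, 0.71),
           ({"title": "INC-012", "text": "unrelated"}, 0.40)]
payload = retrieval_payload(results)
```

The fourth match is dropped and the 2,000-character document is trimmed to its excerpt; worst-case prompt growth per retrieval is now a constant you chose.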
Always pass the retrieval query through the prompt explicitly. The model should know what it asked for and what it got back. Hidden retrieval is hard to debug when the agent makes an unexpected choice.
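One way to make the retrieval visible, assuming hypothetical labels and hit fields: render the query and its results as one explicit block in the prompt.

```python
def retrieval_block(query, hits):
    # Show the model (and the debugging human) both what was asked
    # and what came back, including the empty case.
    lines = [f"RETRIEVAL QUERY: {query}", "RESULTS:"]
    for i, hit in enumerate(hits, 1):
        lines.append(f"{i}. [{hit['id']}] {hit['excerpt']}")
    if not hits:
        lines.append("(no matches)")
    return "\n".join(lines)

block = retrieval_block(
    "checkout p99 spike after deploy",
    [{"id": "INC-101", "excerpt": "rollback of deploy resolved the spike"}],
)
```

When the agent later makes an odd choice, the transcript shows whether the problem was the query, the results, or the reasoning on top of them.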
The eviction policy that prevents drift
Long-term memory grows. Old incidents become misleading once the system changes. Evict aggressively: anything more than 12 months old, anything tagged as superseded, anything from a service that has been retired.
Eviction is a quarterly job, not a continuous one. Pause the agent for a few minutes; clean the index; resume. The cleanup catches stale knowledge that retrieval would otherwise surface as authoritative.
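The three eviction rules above can be sketched as one predicate run over the index during the pause. The metadata field names (`indexed_on`, `superseded`, `service`) and the retired-service list are assumptions about your schema.

```python
from datetime import date, timedelta

RETIRED_SERVICES = {"legacy-cart"}  # illustrative

def should_evict(doc, today):
    # Evict: older than 12 months, tagged superseded, or from a
    # retired service.
    age = today - doc["indexed_on"]
    return (age > timedelta(days=365)
            or doc.get("superseded", False)
            or doc["service"] in RETIRED_SERVICES)

docs = [
    {"id": 1, "service": "checkout",    "indexed_on": date(2024, 1, 5)},
    {"id": 2, "service": "checkout",    "indexed_on": date(2025, 6, 1), "superseded": True},
    {"id": 3, "service": "legacy-cart", "indexed_on": date(2025, 5, 1)},
    {"id": 4, "service": "billing",     "indexed_on": date(2025, 7, 1)},
]
kept = [d for d in docs if not should_evict(d, today=date(2025, 9, 1))]
```

Running it as a batch over a paused index, rather than filtering at query time, means stale documents cannot slip through an unfiltered retrieval path.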
Track retrieval staleness as a metric. The fraction of retrieved documents that the agent decided were irrelevant is a leading indicator of index rot.
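A minimal sketch of the metric, assuming you log per run which retrieved documents the agent marked irrelevant (the log structure here is hypothetical):

```python
def retrieval_staleness(runs):
    # Fraction of retrieved documents the agent judged irrelevant,
    # aggregated across runs. A rising value signals index rot.
    retrieved = sum(len(r["retrieved"]) for r in runs)
    irrelevant = sum(len(r["marked_irrelevant"]) for r in runs)
    return irrelevant / retrieved if retrieved else 0.0

runs = [
    {"retrieved": ["a", "b", "c"], "marked_irrelevant": ["c"]},
    {"retrieved": ["d", "e"],      "marked_irrelevant": ["d", "e"]},
]
staleness = retrieval_staleness(runs)
```

Trend it per service; a service whose staleness climbs after a migration is the first place to point the next quarterly eviction pass.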
What to do this week
Audit your agent's memory. If it is one tier, split it. Move past incidents and runbooks into a vector store; keep current run state in a structured working memory. Re-run your eval suite; the gains usually show up in long-tail cases where the agent now retrieves the right historical context.