Shared Memory for Multi-Agent SRE Systems
Agents need to know what their peers learned. This piece covers the shared scratchpad, the consistency model, and the pruning policy that keeps the scratchpad from becoming a kitchen sink.
The shared scratchpad
The shared scratchpad is the run’s working memory. Without it, specialists either re-derive each other’s work or hand off through unstructured prose, both of which compound errors.
- Structured object. A typed object that every agent in the run can read and append to. Each agent’s findings become available to subsequent agents immediately.
- Entry schema. Each entry carries agent_id, timestamp, type, content. Type is one of hypothesis, evidence, decision, action_taken.
- Run scope. Persisted for the duration of the run; archived after. Not a long-term memory; the run boundary is the lifecycle.
- Visibility. All agents in the run can see all entries. Cross-agent privacy belongs at a different layer; the scratchpad is shared by definition.
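The entry schema and run scope described above can be sketched as a small typed object. This is a minimal illustration, not a reference implementation: the names `EntryType`, `Entry`, `Scratchpad`, and `run_id` are hypothetical, and the `timestamp` field is moved to the end only so that it can default to the current time.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class EntryType(str, Enum):
    """The four entry types named in the schema."""
    HYPOTHESIS = "hypothesis"
    EVIDENCE = "evidence"
    DECISION = "decision"
    ACTION_TAKEN = "action_taken"


@dataclass(frozen=True)  # frozen: an entry cannot be changed once created
class Entry:
    agent_id: str
    type: EntryType
    content: str
    # Assumption: timezone-aware UTC timestamps, assigned at creation time.
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class Scratchpad:
    """Run-scoped working memory; every agent in the run sees every entry."""

    def __init__(self, run_id: str) -> None:
        self.run_id = run_id  # hypothetical identifier for the run boundary
        self._entries: list[Entry] = []

    def append(self, entry: Entry) -> None:
        self._entries.append(entry)

    def read(self) -> tuple[Entry, ...]:
        # Return an immutable snapshot so readers cannot edit the log.
        return tuple(self._entries)


pad = Scratchpad(run_id="run-001")
pad.append(Entry("db-specialist", EntryType.HYPOTHESIS, "connection pool exhausted"))
pad.append(Entry("net-specialist", EntryType.EVIDENCE, "no packet loss on db subnet"))
```

Any agent holding a reference to `pad` sees both entries on its next `read()`, which is the whole point of the shared visibility rule.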
Consistency model
The consistency rules below are minimal but load-bearing. Each one prevents a specific failure mode that real multi-agent runs hit.
- Append-only. Agents cannot modify or delete prior entries. Conflicting hypotheses become two entries; the orchestrator picks between them.
- Read-your-writes. An agent can read the entry it just wrote immediately. Other agents see it on their next read.
- Total ordering by timestamp. The scratchpad is a sorted log; reads return entries in timestamp order so reasoning over history is deterministic. Timestamps alone can collide, so the order needs a stable tie-breaker, such as a per-run sequence number, to stay total.
- No partial writes. Each entry is atomic. A failed write is not visible to other agents.
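The four rules above can be sketched as an in-process log. This is one possible shape, under stated assumptions: `AppendOnlyLog`, the `seq` field, and the injectable `clock` are hypothetical names, a single lock stands in for whatever storage layer a real system uses, and the sequence number exists only to break timestamp ties.

```python
import threading
import time
from dataclasses import dataclass
from itertools import count

VALID_TYPES = {"hypothesis", "evidence", "decision", "action_taken"}


@dataclass(frozen=True)  # frozen entries: no in-place modification
class Entry:
    seq: int          # assumption: per-run sequence number used as tie-breaker
    timestamp: float
    agent_id: str
    type: str
    content: str


class AppendOnlyLog:
    def __init__(self, clock=time.time) -> None:
        self._clock = clock
        self._lock = threading.Lock()
        self._entries: list[Entry] = []
        self._seq = count()

    def append(self, agent_id: str, type: str, content: str) -> Entry:
        # Validate before touching shared state, so a failed write
        # leaves nothing behind for other agents to see.
        if type not in VALID_TYPES:
            raise ValueError(f"unknown entry type: {type!r}")
        with self._lock:
            entry = Entry(next(self._seq), self._clock(), agent_id, type, content)
            self._entries.append(entry)  # the only mutation: append
        return entry

    def read(self) -> list[Entry]:
        # Sort by (timestamp, seq) so the order is total even on ties.
        with self._lock:
            return sorted(self._entries, key=lambda e: (e.timestamp, e.seq))


log = AppendOnlyLog(clock=lambda: 100.0)  # frozen clock forces timestamp ties
a = log.append("agent-a", "hypothesis", "disk full on node-3")
b = log.append("agent-b", "hypothesis", "log rotation stalled")
try:
    log.append("agent-c", "guess", "not a valid type")
except ValueError:
    pass  # the rejected write never reached the log
```

There is deliberately no `update` or `delete` method: the two conflicting hypotheses sit side by side in the log for the orchestrator to choose between.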
Pruning policy
Most runs do not need pruning. The default is to leave the scratchpad alone; pruning logic is overhead unless the run goes long.
- Default off. The scratchpad stays small, typically under 100 entries even for long runs, so pruning costs more than it saves.
- Threshold-triggered summarisation. If the scratchpad exceeds a size threshold, summarise the oldest half and replace those entries with the summary. Rare in practice.
- Archive after run. When the run completes, archive the scratchpad to cold storage. Useful for debugging; out of the working set.
- Audit pinning. Decision and action_taken entries are pinned and never pruned, even when summarisation runs. The audit trail must survive.
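Threshold-triggered summarisation with audit pinning might look like the sketch below. Several choices here are assumptions rather than anything the policy specifies: the `prune` function and its `summarise` callback are hypothetical, the summary is stored as an `evidence` entry written by a made-up `summariser` agent, and the function returns a pruned working view instead of mutating the underlying log.

```python
from dataclasses import dataclass
from typing import Callable

PINNED_TYPES = {"decision", "action_taken"}  # never pruned


@dataclass(frozen=True)
class Entry:
    agent_id: str
    timestamp: float
    type: str
    content: str


def prune(
    entries: list[Entry],
    threshold: int,
    summarise: Callable[[list[Entry]], str],
) -> list[Entry]:
    """Return a working view; summarise the oldest half once over threshold."""
    ordered = sorted(entries, key=lambda e: e.timestamp)
    if len(ordered) <= threshold:
        return ordered  # default path: leave the scratchpad alone

    half = len(ordered) // 2
    oldest, newest = ordered[:half], ordered[half:]
    pinned = [e for e in oldest if e.type in PINNED_TYPES]
    prunable = [e for e in oldest if e.type not in PINNED_TYPES]
    if not prunable:
        return ordered  # nothing that may be summarised

    summary = Entry(
        agent_id="summariser",               # hypothetical agent name
        timestamp=prunable[-1].timestamp,    # place summary where the last pruned entry was
        type="evidence",                     # assumption: no dedicated summary type
        content=summarise(prunable),
    )
    return sorted(pinned + [summary] + newest, key=lambda e: e.timestamp)


entries = [
    Entry("db", 1.0, "hypothesis", "pool exhausted"),
    Entry("orchestrator", 2.0, "decision", "restart pooler"),
    Entry("db", 3.0, "evidence", "pool at 100/100"),
    Entry("net", 4.0, "evidence", "no packet loss"),
    Entry("db", 5.0, "action_taken", "pooler restarted"),
    Entry("db", 6.0, "hypothesis", "leak in client library"),
]
# Stand-in for an LLM summarisation call.
view = prune(entries, threshold=4, summarise=lambda es: f"summary of {len(es)} entries")
```

In this example the oldest half holds one pinned decision and two prunable entries, so the view shrinks from six entries to five while both audit entries survive.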
Resolving conflicts between agents
Conflicts are inevitable when specialists overlap. The rules below decide which conflicts auto-resolve and which escalate.
- Contradictory hypotheses. The orchestrator reads both and picks based on confidence and supporting evidence; the loser stays in the log as context.
- Contradictory actions. Hard stop. One of them is wrong. Escalate to a human; do not pick an action automatically when two agents disagree.
- Updates as new data arrives. The append-only model means the latest entry is authoritative; older entries become reasoning context, not stale state.
- Tie-break by recency. When confidence is identical, prefer the entry written by the most recently scheduled agent; that agent saw the freshest data.
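The split between auto-resolved and escalated conflicts can be sketched as two small functions. The names `Hypothesis`, `scheduled_order`, `pick_hypothesis`, `resolve_actions`, and `EscalateToHuman` are hypothetical, and the sketch assumes that `confidence` already reflects the supporting evidence, since the rules above do not say how the two are combined.

```python
from dataclasses import dataclass


class EscalateToHuman(Exception):
    """Raised when agents propose contradictory actions."""


@dataclass(frozen=True)
class Hypothesis:
    agent_id: str
    content: str
    confidence: float      # assumption: already accounts for supporting evidence
    scheduled_order: int   # assumption: higher means scheduled more recently


def pick_hypothesis(a: Hypothesis, b: Hypothesis) -> Hypothesis:
    # Higher confidence wins; on identical confidence, the most recently
    # scheduled agent wins. The loser is not deleted; it stays in the log.
    return max((a, b), key=lambda h: (h.confidence, h.scheduled_order))


def resolve_actions(proposed: dict[str, str]) -> str:
    """Map of agent_id -> proposed action. Never auto-picks on disagreement."""
    if not proposed:
        raise ValueError("no actions proposed")
    distinct = set(proposed.values())
    if len(distinct) > 1:
        raise EscalateToHuman(f"conflicting actions: {sorted(distinct)}")
    return distinct.pop()


h1 = Hypothesis("db-specialist", "connection pool exhausted", 0.7, scheduled_order=1)
h2 = Hypothesis("net-specialist", "DNS resolution timeouts", 0.7, scheduled_order=2)
winner = pick_hypothesis(h1, h2)  # equal confidence, so recency decides

try:
    resolve_actions({"db-specialist": "restart pooler", "net-specialist": "fail over DNS"})
    escalated = False
except EscalateToHuman:
    escalated = True  # hard stop: hand the decision to a human
```

Keeping the two paths as separate functions makes it hard to route an action conflict through the hypothesis tie-break by accident.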
Evaluating shared memory
The eval set covers the four ways shared memory can fail. Run it on every change to the scratchpad schema or the orchestrator.
- Cross-agent influence. One agent’s finding should change a later agent’s behaviour. Pass if it does; fail if downstream agents act on stale state.
- Conflict resolution. Two agents conflict. Pass if the orchestrator picks correctly or escalates; fail if it picks the wrong one silently.
- Scratchpad bloat. A long run pushes the scratchpad past the size threshold. Pass if summarisation kicks in cleanly and the prompt stays bounded; fail if the agent’s prompt grows without bound.
- Audit completeness. Pinned decisions survive every prune. Pass if the audit trail is intact end to end; fail if any decision is missing from the archived log.
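Of the four checks, audit completeness is the most mechanical, so it makes a reasonable first eval to automate. The sketch below assumes entries are plain dictionaries carrying the four schema fields and compares the full run log against the archived log; `missing_audit_entries` and `entry_key` are hypothetical helper names.

```python
PINNED_TYPES = {"decision", "action_taken"}


def entry_key(entry: dict) -> tuple:
    # Identify an entry by all four schema fields.
    return (entry["agent_id"], entry["timestamp"], entry["type"], entry["content"])


def missing_audit_entries(run_log: list[dict], archived_log: list[dict]) -> list[dict]:
    """Return pinned entries from the run that are absent from the archive."""
    archived = {entry_key(e) for e in archived_log}
    return [
        e for e in run_log
        if e["type"] in PINNED_TYPES and entry_key(e) not in archived
    ]


run_log = [
    {"agent_id": "db-specialist", "timestamp": 1.0, "type": "hypothesis", "content": "pool exhausted"},
    {"agent_id": "orchestrator", "timestamp": 2.0, "type": "decision", "content": "restart pooler"},
    {"agent_id": "db-specialist", "timestamp": 3.0, "type": "action_taken", "content": "pooler restarted"},
]
good_archive = run_log[1:]  # hypothesis summarised away, pinned entries kept
bad_archive = run_log[2:]   # the decision was lost during pruning

passed = missing_audit_entries(run_log, good_archive) == []
failed = missing_audit_entries(run_log, bad_archive)
```

The eval passes when the returned list is empty and fails with the exact missing entries otherwise, which makes a regression easy to diagnose.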