Agentic SRE | Advanced | By Samson Tanimawo, PhD | Published May 23, 2026 | 5 min read

The Memory-Pressure Investigation Agent: A Case Study

From a single OOM page to root cause in 9 minutes. The exact prompts, tool calls, and decisions an agent made, with where it nearly went wrong.

Starting from the OOM page

Alert fires: pod X was OOMKilled. The agent receives the pod name, the timestamp of the kill, and the pod's memory limit.

First action: pull memory metrics for the pod over the last 30 minutes. Was this a sudden spike or gradual growth?

Second action: pull the pod's recent log lines. OOM-causing operations sometimes log just before the kill.

The reasoning path

Sudden spike: usually a single operation that allocated a large object. Look for log lines mentioning large data sets, batch jobs, or large API responses.

Gradual growth: usually a memory leak. Look at heap dumps if available; otherwise, check deploy timing and feature-flag flips, which often correlate with the start of the growth.

Periodic spikes: usually a scheduled job. Cron entries, scheduled API calls, or background tasks.
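The three shapes above can be told apart with a rough heuristic. The following is a minimal sketch, not the agent's actual logic: the function name, the `spike_ratio` and `rising_fraction` thresholds, and the assumption of evenly spaced samples are all illustrative choices.

```python
from statistics import median


def classify_memory_pattern(samples, limit, spike_ratio=0.5, rising_fraction=0.8):
    """Classify evenly spaced memory samples (oldest first) as
    'sudden', 'gradual', 'periodic', or 'unknown'.

    All thresholds here are assumptions chosen for illustration.
    """
    if len(samples) < 4:
        return "unknown"

    baseline = median(samples)
    # A sample counts as a spike once it is more than spike_ratio of the
    # way from the baseline to the memory limit.
    threshold = baseline + spike_ratio * (limit - baseline)

    # Count separate excursions above the threshold.
    excursions = 0
    above = False
    for value in samples:
        if value > threshold and not above:
            excursions += 1
        above = value > threshold

    if excursions >= 2:
        return "periodic"

    # Gradual growth: nearly every step is higher than the previous one.
    rising = sum(1 for prev, cur in zip(samples, samples[1:]) if cur > prev)
    if rising / (len(samples) - 1) >= rising_fraction:
        return "gradual"

    if excursions == 1:
        return "sudden"
    return "unknown"
```

The gradual check runs before the sudden check on purpose: a steady climb also ends above the threshold, and it should not be mistaken for a single large allocation.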

9-minute resolution

Minutes 1-2: gather metrics and logs.

Minutes 3-4: classify (sudden / gradual / periodic).

Minutes 5-6: identify the trigger (deploy, batch, leak).

Minutes 7-8: propose remediation (raise limit, restart, rollback).

Minute 9: surface the proposal for human approval if remediation is risky.

Where the agent nearly went wrong

The agent misattributed a spike to a deploy that had happened earlier. The deploy was 6 hours old; the spike was new. Time correlation matters, and the agent's prompt was tightened to require it before a deploy is named as the trigger.
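A time-correlation requirement like this can also be enforced outside the prompt. The sketch below is one possible guard, under assumptions: the function name, the `(name, timestamp)` event shape, and the 30-minute window are hypothetical and not taken from the team's setup.

```python
from datetime import datetime, timedelta


def plausible_triggers(onset, events, window=timedelta(minutes=30)):
    """Return names of events that happened at or before the onset of
    the memory spike and no more than `window` before it.

    `events` is an iterable of (name, timestamp) pairs. The 30-minute
    default is an assumed value, not one from the case study.
    """
    return [
        name
        for name, at in events
        if timedelta(0) <= onset - at <= window
    ]


onset = datetime(2026, 5, 1, 12, 0)
events = [
    ("deploy-a", onset - timedelta(hours=6)),     # too old to blame
    ("flag-flip", onset - timedelta(minutes=10)),  # close enough to suspect
    ("deploy-b", onset + timedelta(minutes=5)),    # happened after the spike
]
candidates = plausible_triggers(onset, events)
```

With these inputs, only the flag flip survives the filter; the 6-hour-old deploy is dropped rather than left for the model to rationalize.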

The agent recommended raising the memory limit when the underlying problem was a leak. Raising the limit delays the OOM but does not fix the leak. The agent's prompt now distinguishes the two cases.
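One way to keep the two cases apart is to derive the proposal from the classified pattern instead of from the OOM alone. This is a hedged sketch: the `Proposal` type, the action strings, and the exact mapping are assumptions for illustration; only the rule "do not raise the limit for a leak" comes from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    action: str      # "raise_limit", "restart", "rollback", or "escalate"
    rationale: str


def propose_remediation(pattern: str, recent_deploy: bool) -> Proposal:
    """Map a classified memory pattern to a proposed remediation.

    The mapping is an illustrative assumption, not the agent's real policy.
    """
    if pattern == "gradual":
        # Suspected leak: a higher limit only delays the next OOM.
        if recent_deploy:
            return Proposal("rollback", "growth began after a recent deploy")
        return Proposal("restart", "suspected leak; restart is a stopgap, not a fix")
    if pattern in ("sudden", "periodic"):
        # A single large allocation or a scheduled job may simply need headroom.
        return Proposal("raise_limit", "workload needs more headroom than the limit allows")
    return Proposal("escalate", "pattern not classified; hand off to a human")
```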

The agent suggested a restart that would have masked the issue. It now suggests "restart and watch for recurrence within N minutes" rather than just "restart."
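The "watch for recurrence" half can be expressed as a simple check over post-restart samples. The sketch below assumes a 30-minute watch window for N and treats 90% of the limit as "recurring"; both values, and the function name, are hypothetical.

```python
from datetime import datetime, timedelta


def recurred_after_restart(restart_at, samples, limit_bytes,
                           watch=timedelta(minutes=30), danger_ratio=0.9):
    """Return True if memory climbed back near the limit within the
    watch window after a restart.

    `samples` is an iterable of (timestamp, used_bytes) pairs. The
    window and ratio defaults are assumptions for illustration.
    """
    deadline = restart_at + watch
    return any(
        restart_at <= at <= deadline and used >= danger_ratio * limit_bytes
        for at, used in samples
    )


restart_at = datetime(2026, 5, 1, 12, 0)
samples = [
    (restart_at + timedelta(minutes=5), 300),
    (restart_at + timedelta(minutes=20), 950),  # back near the 1000-byte limit
]
recurred = recurred_after_restart(restart_at, samples, limit_bytes=1000)
```

A True result means the restart only bought time, and the investigation should stay open rather than close with the restart.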

Trust earned

After 90 days of the memory agent running in read-only mode, the team trusted its hypotheses. It was then granted the ability to raise memory limits within bounds.
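"Within bounds" can be made concrete with a small guard that the agent's action passes through. The article does not state the actual bounds, so the 1.5x growth factor and the 4096 MiB ceiling below are purely assumed values, as is the function name.

```python
def bounded_limit_raise(current_mib, requested_mib,
                        max_factor=1.5, ceiling_mib=4096):
    """Return the new limit if the agent may apply it on its own,
    or None if the request needs human approval.

    max_factor and ceiling_mib are illustrative assumptions.
    """
    if requested_mib <= current_mib:
        return None  # not a raise; nothing for the agent to apply
    allowed = min(int(current_mib * max_factor), ceiling_mib)
    if requested_mib > allowed:
        return None  # outside the agent's bounds
    return requested_mib
```

Returning None instead of clamping the request keeps the agent from silently applying a limit different from the one it reasoned about.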

After 180 days, the agent was trusted to restart pods automatically. Restart-without-fix patterns dropped because the agent paired restarts with leak detection.

Trust is the long-tail value of an agent. The first 90 days build it; the next 180 days harvest it.