The Memory-Pressure Investigation Agent: A Case Study
From a single OOM page to root cause in 9 minutes: the exact prompts, tool calls, and decisions an agent made, including where it nearly went wrong.
Starting from the OOM page
Alert fires: pod X was OOMKilled. The agent receives the pod name, the timestamp, and the memory limit.
First action: pull memory metrics for the pod over the last 30 minutes. Was this a sudden spike or gradual growth?
Second action: pull the pod's recent log lines. OOM-causing operations sometimes log just before the kill.
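The first two actions can be sketched as follows. This is a minimal illustration, not the agent's actual tooling: `OomAlert`, `metrics_window`, and `logs_before_kill` are hypothetical names, and the 2-minute log tail is an assumed default. Only the 30-minute metrics lookback comes from the case study.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class OomAlert:
    """The three fields the agent receives with the page (names are hypothetical)."""
    pod: str
    killed_at: datetime
    memory_limit_bytes: int


def metrics_window(alert, lookback=timedelta(minutes=30)):
    """Return the (start, end) time range of memory metrics to pull."""
    return alert.killed_at - lookback, alert.killed_at


def logs_before_kill(lines, alert, tail=timedelta(minutes=2)):
    """Keep messages logged within `tail` before the kill.

    `lines` is an iterable of (timestamp, message) pairs. The 2-minute
    default is an assumption, not a value from the case study.
    """
    start = alert.killed_at - tail
    return [msg for ts, msg in lines if start <= ts <= alert.killed_at]


# Example alert with made-up values.
alert = OomAlert("pod-x", datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), 512 * 1024**2)
start, end = metrics_window(alert)
```

Anchoring both queries to the kill timestamp rather than to "now" keeps the window stable if the agent runs a few minutes after the page.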
The reasoning path
Sudden spike: usually a single operation that allocated a large object. Look for log lines mentioning large data sets, batch jobs, or large API responses.
Gradual growth: usually a memory leak. Look at heap dumps if available; otherwise, check recent deploys and feature-flag flips, which often coincide with the start of the growth.
Periodic spikes: usually a scheduled job. Cron entries, scheduled API calls, or background tasks.
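The three-way classification above could be approximated with a simple heuristic over evenly spaced memory samples. The sketch below is one possible approach, not the agent's actual logic: the function name `classify`, the midpoint-crossing rule for periodicity, and the `spike_fraction` and `tail` defaults are all assumptions chosen for illustration. Noisy data near the midpoint would need smoothing first.

```python
def classify(samples, spike_fraction=0.5, tail=3):
    """Label a memory series as 'sudden', 'gradual', 'periodic', or 'unknown'.

    samples: evenly spaced memory readings, oldest first.
    spike_fraction and tail are illustrative defaults, not tuned values.
    """
    if len(samples) < tail + 2:
        return "unknown"
    lo, hi = min(samples), max(samples)
    span = hi - lo
    if span == 0:
        return "unknown"  # flat series: nothing to classify

    # Count separate excursions above the midpoint of the observed range.
    mid = lo + span / 2
    excursions = 0
    above = False
    for s in samples:
        if s > mid and not above:
            excursions += 1
        above = s > mid
    if excursions >= 2:
        return "periodic"

    # Sudden: most of the total rise happened in the last `tail` samples.
    late_rise = samples[-1] - samples[-1 - tail]
    if late_rise >= spike_fraction * span:
        return "sudden"
    return "gradual"


print(classify([100] * 27 + [110, 400, 900]))   # flat, then a jump at the end
print(classify(list(range(100, 400, 10))))      # steady climb
print(classify([100, 100, 800, 100] * 3))       # repeated peaks
```

The label only narrows where to look next (large-object logs, heap dumps and deploys, or schedules); it is not a diagnosis on its own.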
9-minute resolution
Minutes 1-2: gather metrics and logs.
Minutes 3-4: classify (sudden / gradual / periodic).
Minutes 5-6: identify the trigger (deploy, batch, leak).
Minutes 7-8: propose remediation (raise limit, restart, rollback).
Minute 9: surface the proposal for human approval if remediation is risky.
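The last two steps, proposing a remediation and routing risky ones to a human, might look like the sketch below. The trigger-to-remediation mapping and the set of actions treated as risky are assumptions made for illustration; the case study names the triggers and remediations but does not say how they pair up or which ones the team considered risky.

```python
from dataclasses import dataclass

# Hypothetical mapping from identified trigger to proposed remediation.
REMEDIATION_FOR_TRIGGER = {
    "deploy": "rollback",
    "batch": "raise_limit",
    "leak": "restart",
}

# Assumption: which actions count as risky is a team decision.
RISKY_ACTIONS = frozenset({"rollback", "restart"})


@dataclass(frozen=True)
class Proposal:
    trigger: str
    action: str
    needs_approval: bool


def propose(trigger):
    """Turn an identified trigger into a proposal, flagging risky actions."""
    action = REMEDIATION_FOR_TRIGGER.get(trigger)
    if action is None:
        # Unknown trigger: propose no automatic action and ask a human.
        return Proposal(trigger, "investigate", True)
    return Proposal(trigger, action, action in RISKY_ACTIONS)


print(propose("deploy"))
```

Defaulting unknown triggers to human review keeps the agent from guessing when its classification does not match any known case.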
Where the agent nearly went wrong
Misattributed a spike to an earlier deploy. The deploy was 6 hours old; the spike was new. Time correlation matters: the agent's prompt was tightened to require that a suspected trigger fall close in time to the onset of the spike.
Recommended raising the memory limit when the underlying problem was a leak. Raising the limit delays the OOM but does not fix the leak. The agent's prompt now distinguishes the two cases.
Suggested a restart that would have masked the issue. The agent now suggests "restart and watch for recurrence within N minutes" rather than just "restart."
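Two of these corrections lend themselves to small, checkable rules. The sketch below shows a time-correlation check and a restart suggestion that always carries a watch window. The function names and the 30-minute `max_gap` default are assumptions; the case study says the prompt was tightened, not that these checks exist as code.

```python
from datetime import datetime, timedelta


def plausibly_caused_by(event_at, onset_at, max_gap=timedelta(minutes=30)):
    """True if the event precedes the onset by no more than `max_gap`.

    The 30-minute default is an illustrative assumption.
    """
    gap = onset_at - event_at
    return timedelta(0) <= gap <= max_gap


def restart_suggestion(watch_minutes):
    """Phrase a restart so that recurrence is checked rather than assumed away."""
    return f"restart and watch for recurrence within {watch_minutes} minutes"


onset = datetime(2024, 1, 1, 12, 0)  # made-up spike onset
print(plausibly_caused_by(onset - timedelta(hours=6), onset))     # 6-hour-old deploy
print(plausibly_caused_by(onset - timedelta(minutes=10), onset))  # recent deploy
print(restart_suggestion(15))
```

With these defaults, the 6-hour-old deploy from the first near-miss would be rejected as a trigger, while a deploy 10 minutes before the onset would remain a candidate.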
Trust earned
After 90 days of the memory agent running in read-only mode, the team trusted its hypotheses and granted it the ability to raise memory limits within bounds.
After 180 days, the agent was trusted to restart pods automatically. Restart-without-fix patterns dropped because the agent paired restarts with leak detection.
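A bounded limit-raise permission could be expressed as a small policy check like the one below. The case study does not state what the bounds were, so `RaiseBounds`, the 1.5x growth factor, and the 4 GiB ceiling are purely hypothetical placeholders.

```python
from dataclasses import dataclass

GIB = 1024 ** 3


@dataclass(frozen=True)
class RaiseBounds:
    """Hypothetical bounds; real values would be set by the team."""
    max_factor: float = 1.5       # largest allowed multiple of the current limit
    ceiling_bytes: int = 4 * GIB  # absolute cap regardless of the current limit


def may_raise_limit(current_bytes, requested_bytes, bounds=RaiseBounds()):
    """True if the agent may apply the raise itself; otherwise escalate."""
    if requested_bytes <= current_bytes:
        return False  # not a raise at all
    if requested_bytes > current_bytes * bounds.max_factor:
        return False  # grows too much in one step
    return requested_bytes <= bounds.ceiling_bytes


print(may_raise_limit(1 * GIB, int(1.25 * GIB)))
print(may_raise_limit(1 * GIB, 2 * GIB))
```

Requests that fall outside the bounds would go back to a human rather than being clamped, so the agent never silently applies a different change than the one it proposed.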
Trust is the long-tail value of an agent. The first 90 days build it; the months that follow harvest it.