The Memory-Pressure Investigation Agent: A Case Study
From a single OOM page to root cause in 9 minutes: the prompts, tool calls, and decisions an agent made, including where it nearly went wrong.
Starting from the OOM page
The memory-pressure agent starts from a structured OOMKilled alert. The first two actions, pulling memory metrics and recent log lines, are mechanical and bound the search space before reasoning begins.
- Alert payload. Pod name, timestamp, memory limit. The agent receives the structured fields, not the raw alert text.
- Pull memory metrics. 30-minute window for the pod. The shape of the curve answers the first question: spike or growth.
- Pull recent log lines. OOM-causing operations sometimes log immediately before the kill; that line is high-leverage context.
- Bound the window. Cap initial gathers at 30 minutes; broaden only if the shape requires it. Wider windows multiply tokens with diminishing return.
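The gather step above can be sketched as a small data structure plus a window calculation. This is a minimal illustration, not the agent's actual code: the `OomAlert` class, its field names, and `gather_window` are hypothetical, and only the three payload fields and the 30-minute cap come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# The 30-minute cap comes from the text; the constant name is hypothetical.
INITIAL_WINDOW = timedelta(minutes=30)


@dataclass(frozen=True)
class OomAlert:
    """Structured alert fields handed to the agent (field names are assumed)."""
    pod: str
    killed_at: datetime
    memory_limit_bytes: int


def gather_window(alert: OomAlert, width: timedelta = INITIAL_WINDOW):
    """Return (start, end) of the metrics and log window, ending at the kill."""
    return alert.killed_at - width, alert.killed_at


alert = OomAlert(
    pod="api-7f9c",  # hypothetical pod name
    killed_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    memory_limit_bytes=512 * 2**20,
)
start, end = gather_window(alert)
```

Passing a larger `width` is the explicit "broaden" step, so a wider gather is always a deliberate choice rather than a default.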
The reasoning path
The memory shape selects the reasoning branch. Three base shapes cover almost every OOM cause; a fourth, mixed shape combines two of them.
- Sudden spike. Usually a single operation that allocated a large object. Look for log lines mentioning large data sets, batch jobs, or large API responses.
- Gradual growth. Usually a memory leak. Look at heap dumps if available; deploy timing and feature flag flips often correlate.
- Periodic spikes. Usually a scheduled job. Check cron entries, scheduled API calls, and background tasks.
- Mixed shape. Sudden spike on top of gradual growth. The agent reports both and treats them as independent causes rather than collapsing them.
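One way to make the shape decision concrete is a heuristic over evenly spaced memory samples. The sketch below is an assumption-laden illustration rather than the agent's classifier: the function name, the 20% thresholds, and the "unclassified" fallback are all hypothetical. It counts large upward jumps and separately sums the small sample-to-sample changes to estimate slow drift.

```python
from typing import Sequence


def classify_shape(samples: Sequence[float], limit: float,
                   jump_frac: float = 0.2, drift_frac: float = 0.2) -> str:
    """Classify a memory curve as sudden, gradual, periodic, or mixed.

    samples: memory usage at evenly spaced times, same unit as limit.
    jump_frac and drift_frac are illustrative thresholds, not tuned values.
    """
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    jump = jump_frac * limit
    # A jump is a single-step increase larger than the jump threshold.
    n_jumps = sum(1 for d in deltas if d > jump)
    # Drift is net growth from small steps only, so spikes and the drops
    # that follow them do not count toward it.
    drift = sum(d for d in deltas if abs(d) <= jump)
    has_drift = drift > drift_frac * limit

    if n_jumps and has_drift:
        return "mixed"       # spike on top of growth: report both causes
    if n_jumps >= 2:
        return "periodic"    # repeated spikes; spacing is not checked here
    if n_jumps == 1:
        return "sudden"
    if has_drift:
        return "gradual"
    return "unclassified"


LIMIT = 1000.0
sudden = classify_shape([300.0] * 10 + [950.0], LIMIT)
gradual = classify_shape([300.0 + 60 * i for i in range(11)], LIMIT)
periodic = classify_shape([300, 300, 800, 300, 300, 800, 300, 300, 800], LIMIT)
mixed = classify_shape([300.0 + 30 * i for i in range(10)] + [950.0], LIMIT)
```

Checking for the mixed case first mirrors the rule in the list above: a spike on top of growth is reported as two findings instead of being collapsed into whichever signal is louder.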
9-minute resolution
The agent budgets nine minutes end to end. The pacing below keeps the run within the on-call’s patience window while leaving time for human approval.
- Minutes 1-2: gather. Metrics and logs. Do not branch on partial data.
- Minutes 3-4: classify. Sudden, gradual, or periodic. Mixed-shape detection happens here.
- Minutes 5-6: identify the trigger. Deploy, batch, or leak. Cite the supporting evidence inline.
- Minutes 7-9: propose and surface. Recommend raise limit, restart, or rollback; surface for human approval if the action is risky.
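The pacing above amounts to a small lookup from elapsed time to the phase the agent should be in. A minimal sketch, assuming elapsed time is measured in minutes from the alert; the phase names and the `over_budget` sentinel are hypothetical.

```python
# (end_minute, phase) pairs taken from the schedule above; names are assumed.
PHASES = [
    (2, "gather"),
    (4, "classify"),
    (6, "identify_trigger"),
    (9, "propose_and_surface"),
]


def phase_at(elapsed_minutes: float) -> str:
    """Return the phase the run should be in after elapsed_minutes."""
    for end_minute, name in PHASES:
        if elapsed_minutes < end_minute:
            return name
    return "over_budget"
```

A run that is still gathering when `phase_at` returns `classify` has a concrete signal that it is behind schedule.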
Where the agent nearly went wrong
Three near-misses shaped the prompt. Each surfaced a real failure mode the eval set now covers.
- Stale-deploy attribution. Misattributed a spike to a deploy that was 6 hours old. The prompt was tightened to require time correlation between trigger and spike.
- Limit-raise on leak. Recommended raising the memory limit when the underlying problem was a leak. The prompt now distinguishes the two cases explicitly.
- Bare restart. Suggested a restart that would have masked the issue. The agent now recommends “restart and watch for recurrence within N minutes,” not just “restart.”
- Each fix shipped with an eval case. The near-miss is a regression guard; future prompt changes have to keep passing it.
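The stale-deploy fix can be expressed as a time-correlation guard with the near-miss kept as a regression case. This is a sketch under stated assumptions: the function name and the 30-minute maximum lag are hypothetical, and the guard is meant for spike attribution only, since a leak introduced by a deploy can take hours to surface.

```python
from datetime import datetime, timedelta

# Assumed threshold; the text requires time correlation but gives no number.
MAX_TRIGGER_LAG = timedelta(minutes=30)


def deploy_is_plausible_trigger(deployed_at: datetime, spike_at: datetime,
                                max_lag: timedelta = MAX_TRIGGER_LAG) -> bool:
    """True only if the deploy precedes the spike by at most max_lag."""
    lag = spike_at - deployed_at
    return timedelta(0) <= lag <= max_lag


spike = datetime(2024, 1, 1, 12, 0)

# Regression case from the near-miss: a 6-hour-old deploy must not be blamed.
stale = deploy_is_plausible_trigger(spike - timedelta(hours=6), spike)
# A deploy 10 minutes before the spike remains a candidate trigger.
recent = deploy_is_plausible_trigger(spike - timedelta(minutes=10), spike)
```

Keeping the stale case as a fixed assertion is what turns the near-miss into a guard: a later prompt or threshold change that re-admits the 6-hour-old deploy fails immediately.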
Trust earned
Trust is the long-tail value of an operating agent. The first 90 days build it; the tiers that follow convert it into capability.
- 0-90 days: read-only. Hypotheses surface for human review; the agent never acts. The team builds confidence in the reasoning before granting capability.
- 90-180 days: bounded action. Granted capability to raise memory limits within bounds. Bounds matter more than the grant itself.
- 180+ days: paired restart. Trusted to restart pods. Restart-without-fix patterns dropped because the agent pairs restarts with leak-detection follow-up.
- Trust is reversible. A single bad action drops the agent back a tier. The ratchet only moves forward while the trust history stays clean.
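The tiers above behave like a small state machine. The sketch below is illustrative only: the `Tier` names, the action labels, and `next_tier` are hypothetical, and it assumes promotion is driven by `clean_days` (days since launch or since the last bad action), which the text does not specify.

```python
from enum import IntEnum


class Tier(IntEnum):
    READ_ONLY = 0        # 0-90 days: hypotheses only
    BOUNDED_ACTION = 1   # 90-180 days: raise memory limits within bounds
    PAIRED_RESTART = 2   # 180+ days: restart pods, paired with follow-up


# Actions each tier may take without a human acting for it (labels assumed).
ALLOWED = {
    Tier.READ_ONLY: frozenset(),
    Tier.BOUNDED_ACTION: frozenset({"raise_limit"}),
    Tier.PAIRED_RESTART: frozenset({"raise_limit", "restart"}),
}

# Clean days required to leave each tier, from the schedule above.
PROMOTE_AFTER_DAYS = {Tier.READ_ONLY: 90, Tier.BOUNDED_ACTION: 180}


def next_tier(tier: Tier, clean_days: int, bad_action: bool) -> Tier:
    """Demote one tier on a bad action; otherwise promote once the
    clean-day threshold for the current tier is met."""
    if bad_action:
        return Tier(max(tier - 1, Tier.READ_ONLY))
    threshold = PROMOTE_AFTER_DAYS.get(tier)
    if threshold is not None and clean_days >= threshold:
        return Tier(tier + 1)
    return tier
```

Handling the bad-action check before promotion means a demotion can never be cancelled out by a threshold reached in the same review.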