Pod-Level CrashLoopBackOff: An Agent Triage Playbook

Logs, events, image, config, dependencies. The order an agent should check them, the costs of each, and the recoveries the agent can apply on its own.

The order to check

CrashLoopBackOff is one of the few Kubernetes states where the diagnostic order is almost always the same. The five checks (logs, events, image, config, dependencies, in that order) cover roughly 95 percent of cases and run in under a minute.
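The fixed order can be sketched as a list of checks that the agent walks until one produces a finding. This is a minimal illustration, not the playbook's implementation: the snapshot dict, its keys (previous_logs, last_termination_reason, expected_image, missing_config_refs, unreachable_dependencies), and the keyword heuristics are all assumptions made for the example. In practice the snapshot would be filled from sources such as `kubectl logs <pod> --previous` and `kubectl describe pod <pod>`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical shape: data already collected for one crashing pod.
Snapshot = dict


@dataclass
class Finding:
    check: str   # which of the five checks produced the finding
    detail: str  # short human-readable explanation


def check_logs(snap: Snapshot) -> Optional[str]:
    # Placeholder heuristic: keyword match on the crashed container's logs.
    logs = snap.get("previous_logs", "").lower()
    if "panic" in logs or "fatal" in logs:
        return "previous container logged a fatal error"
    return None


def check_events(snap: Snapshot) -> Optional[str]:
    # OOMKilled is a real termination reason; the snapshot key is assumed.
    if snap.get("last_termination_reason") == "OOMKilled":
        return "container was OOMKilled"
    return None


def check_image(snap: Snapshot) -> Optional[str]:
    expected, actual = snap.get("expected_image"), snap.get("image")
    if expected and actual and expected != actual:
        return f"image {actual} differs from expected {expected}"
    return None


def check_config(snap: Snapshot) -> Optional[str]:
    missing = snap.get("missing_config_refs", [])
    return f"missing config references: {', '.join(missing)}" if missing else None


def check_dependencies(snap: Snapshot) -> Optional[str]:
    down = snap.get("unreachable_dependencies", [])
    return f"unreachable dependencies: {', '.join(down)}" if down else None


# The order matters: logs, events, image, config, dependencies.
CHECKS: List[Tuple[str, Callable[[Snapshot], Optional[str]]]] = [
    ("logs", check_logs),
    ("events", check_events),
    ("image", check_image),
    ("config", check_config),
    ("dependencies", check_dependencies),
]


def triage(snap: Snapshot) -> Optional[Finding]:
    """Run the checks in order and stop at the first one that fires."""
    for name, check in CHECKS:
        detail = check(snap)
        if detail is not None:
            return Finding(check=name, detail=detail)
    return None


if __name__ == "__main__":
    snap = {"last_termination_reason": "OOMKilled",
            "unreachable_dependencies": ["db"]}
    print(triage(snap))
```

Stopping at the first finding keeps the run short; an agent that wants full evidence could collect every finding instead and rank them afterwards.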

Recoveries the agent can apply

Three recoveries cover most CrashLoopBackOff (CLBO) situations. The agent picks one based on the cause class and falls back to escalation when none fits cleanly.
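The pick-or-escalate rule can be sketched as a lookup with escalation as the default. The three recovery names and the cause classes they map from are placeholders invented for this example; the text above does not name them. The `confident` flag is likewise an assumed stand-in for "fits cleanly".

```python
from enum import Enum


class Action(Enum):
    # Hypothetical recovery names, used only for illustration.
    ROLL_BACK_IMAGE = "roll_back_image"
    REVERT_CONFIG = "revert_config"
    RESTART_AFTER_DEPENDENCY = "restart_after_dependency"
    ESCALATE = "escalate"


# Assumed mapping from cause class to recovery.
RECOVERY_BY_CAUSE = {
    "image": Action.ROLL_BACK_IMAGE,
    "config": Action.REVERT_CONFIG,
    "dependencies": Action.RESTART_AFTER_DEPENDENCY,
}


def pick_action(cause_class: str, confident: bool) -> Action:
    """Return a recovery only when the cause is known and the fit is clean."""
    if not confident:
        return Action.ESCALATE
    # Any cause class without a mapped recovery falls back to escalation.
    return RECOVERY_BY_CAUSE.get(cause_class, Action.ESCALATE)


if __name__ == "__main__":
    print(pick_action("config", confident=True))
    print(pick_action("config", confident=False))
    print(pick_action("node-pressure", confident=True))
```

Making escalation the default branch means a new or unrecognized cause class can never trigger an automatic change by accident.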

When to escalate

Escalation is the safety valve. Three classes of cause sit outside the agent’s scope; escalating quickly is more useful than guessing.

Output structure

The agent’s output is structured so the on-call can scan it in seconds. Free-text output erodes the value of the triage.
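One way to keep the output scannable is to emit a fixed record rather than prose. The sketch below assumes a field set (pod, namespace, cause_class, evidence, action) chosen for illustration; the actual schema is not specified in the text.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class TriageReport:
    # All field names are assumptions made for this example.
    pod: str
    namespace: str
    cause_class: str                      # e.g. "logs", "events", "config"
    evidence: List[str] = field(default_factory=list)
    action: str = "escalate"              # default to the safe outcome


def render(report: TriageReport) -> str:
    """Serialize the report as stable, sorted JSON for the on-call."""
    return json.dumps(asdict(report), indent=2, sort_keys=True)


if __name__ == "__main__":
    report = TriageReport(
        pod="api-7d9f",
        namespace="prod",
        cause_class="config",
        evidence=["missing config references: app-config"],
        action="revert_config",
    )
    print(render(report))
```

Sorted keys and a fixed schema let the on-call, or a downstream tool, find the same field in the same place on every report.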

Eval cases for this agent

The eval set covers the four shapes the triage gets wrong most often. Run it on every change to the prompt or the toolset.
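A small harness is enough to rerun the eval set on every change. The sketch below is an assumption about how such a harness might look: the case names, snapshot keys, and the toy triage function are placeholders, and the four failure shapes themselves are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    name: str
    snapshot: dict          # hypothetical input to the triage function
    expected_cause: str     # the cause class the agent should report


def run_evals(triage_fn: Callable[[dict], str],
              cases: List[EvalCase]) -> List[str]:
    """Return one failure message per case the triage function gets wrong."""
    failures = []
    for case in cases:
        got = triage_fn(case.snapshot)
        if got != case.expected_cause:
            failures.append(
                f"{case.name}: expected {case.expected_cause}, got {got}")
    return failures


def toy_triage(snapshot: dict) -> str:
    # Stand-in for the real agent; it only recognizes one cause.
    if snapshot.get("last_termination_reason") == "OOMKilled":
        return "events"
    return "unknown"


# Placeholder cases, not the playbook's actual eval set.
CASES = [
    EvalCase("oom-kill", {"last_termination_reason": "OOMKilled"}, "events"),
    EvalCase("bad-config", {"missing_config_refs": ["app-config"]}, "config"),
]

if __name__ == "__main__":
    for failure in run_evals(toy_triage, CASES):
        print(failure)
```

An empty failure list is the pass condition, so the harness can gate prompt or toolset changes in CI.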