Pod-Level CrashLoopBackOff: An Agent Triage Playbook
Logs, events, image, config, dependencies. The order an agent should check them, the costs of each, and the recoveries the agent can apply on its own.
The order to check
CrashLoopBackOff is one of the few Kubernetes states where the diagnostic order is almost always the same. The five checks below cover roughly 95 percent of cases and run in under a minute.
- Pod logs. Last 50 lines. Most CLBO causes are visible here; start with what the application itself said before it died.
- Pod events. kubectl describe. Surfaces image-pull failures, mount failures, OOM kills, and probe failures the application logs cannot show.
- Image and config. Did the image pull and run cleanly recently? Are env vars, ConfigMap mounts, and Secret mounts present and correct? A missing or malformed config is the most common runtime cause.
- Dependencies. Can the pod reach its database and its other services? Network and DNS failures routinely masquerade as CLBO.
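The first three checks map onto read-only kubectl calls. The following is a minimal Python sketch, assuming the agent shells out to kubectl. The function name triage_commands and its check labels are hypothetical. The --previous flag is an addition the playbook does not mention; it is included on the assumption that the useful output of a crash-looping container sits in the prior, terminated instance. The dependency check is left out because it depends on what the pod talks to.

```python
from typing import List, Tuple


def triage_commands(pod: str, namespace: str) -> List[Tuple[str, List[str]]]:
    """Return read-only kubectl calls for the first checks, in playbook order.

    Each entry is (check label, argv). Nothing is executed here; the caller
    decides how to run the commands and how to bound their runtime.
    """
    target = ["-n", namespace]
    return [
        # 1. Logs: last 50 lines of the previous (crashed) container instance.
        ("logs", ["kubectl", "logs", pod, *target, "--tail=50", "--previous"]),
        # 2. Events: describe surfaces pull, mount, OOM, and probe failures.
        ("events", ["kubectl", "describe", "pod", pod, *target]),
        # 3. Image and config: pod spec and container statuses as JSON.
        ("image_and_config", ["kubectl", "get", "pod", pod, *target, "-o", "json"]),
    ]


if __name__ == "__main__":
    # Hypothetical pod and namespace names, for illustration only.
    for label, argv in triage_commands("example-pod", "example-ns"):
        print(label, " ".join(argv))
```

Returning argument lists rather than running them keeps the ordering testable without a cluster; an agent could pass each list to subprocess.run with a timeout.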
Recoveries the agent can apply
Three recoveries cover most CLBO situations, and a bounded-retry rule caps all of them. The agent picks one recovery based on the cause class and falls back to escalation when none fits cleanly.
- Restart the deployment. When the cause looks transient. Logged and time-bounded; if the restart does not help, the agent does not loop.
- Roll back the image. When the cause is a recent deploy. Requires human approval if the rollback affects more than one service.
- Pause the deployment. When the cause is unclear and the impact is contained. Better to leave one pod stuck than cascade across the cluster.
- Bounded retry. Restarts cap at three attempts within ten minutes. Beyond that, the agent escalates rather than amplifying the loop.
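The bounded-retry rule above can be sketched as a small budget object. This is one possible reading, not a prescribed implementation: it assumes "three attempts within ten minutes" means a sliding ten-minute window, and the class name RestartBudget and method try_acquire are hypothetical. Timestamps are passed in explicitly so the logic can be tested without waiting.

```python
from collections import deque
from typing import Deque


class RestartBudget:
    """Allow at most `max_attempts` restarts inside a sliding `window_s` window."""

    def __init__(self, max_attempts: int = 3, window_s: float = 600.0) -> None:
        self.max_attempts = max_attempts
        self.window_s = window_s
        self._attempts: Deque[float] = deque()

    def try_acquire(self, now: float) -> bool:
        """Record a restart at time `now` (seconds) if the budget allows it.

        Returns False when the cap is reached, which is the signal to
        escalate instead of restarting again.
        """
        # Drop attempts that have aged out of the window.
        while self._attempts and now - self._attempts[0] >= self.window_s:
            self._attempts.popleft()
        if len(self._attempts) >= self.max_attempts:
            return False
        self._attempts.append(now)
        return True


if __name__ == "__main__":
    budget = RestartBudget()
    for t in (0.0, 60.0, 120.0, 180.0):
        print(t, "restart" if budget.try_acquire(t) else "escalate")
```

A refused attempt is not recorded, so repeated checks while the budget is exhausted do not extend the lockout.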
When to escalate
Escalation is the safety valve. Four classes of cause sit outside the agent’s scope; escalating quickly is more useful than guessing.
- Unowned config. Cause is a config the agent does not own. Surface the field; escalate to the team that owns it.
- Multi-service coordination. The agent cannot orchestrate cross-team rollbacks; escalate to on-call.
- Unclear cause after five steps. The agent has done its job; the human picks up with the diagnostics already gathered.
- Repeat offender. The same pod enters CLBO three times in a week with different causes; escalate for a deeper review beyond the immediate fix.
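The escalation rules above can be written as a single decision function. This is a sketch under stated assumptions: the TriageState fields and the reason strings are hypothetical, the order in which rules are tested is a choice the playbook does not specify, and "different causes" is read here as more than one distinct cause in the week.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TriageState:
    cause_is_config: bool           # evidence points at a config field
    agent_owns_config: bool         # result of a hypothetical ownership lookup
    services_affected: int          # services the fix would touch
    checks_completed: int           # how many of the five checks have run
    cause_identified: bool
    clbo_count_this_week: int       # CLBO incidents for this pod in 7 days
    distinct_causes_this_week: int  # distinct causes across those incidents


def escalation_reason(s: TriageState) -> Optional[str]:
    """Return the escalation class that applies, or None if the agent may proceed."""
    if s.cause_is_config and not s.agent_owns_config:
        return "unowned_config"
    if s.services_affected > 1:
        return "multi_service"
    if s.checks_completed >= 5 and not s.cause_identified:
        return "unclear_after_five_steps"
    if s.clbo_count_this_week >= 3 and s.distinct_causes_this_week > 1:
        return "repeat_offender"
    return None
```

Returning a named reason rather than a boolean lets the agent route the escalation: to the owning team for unowned config, to on-call for multi-service coordination.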
Output structure
The agent’s output is structured so the on-call can scan it in seconds. Free-text output erodes the value of the triage.
- Cause class. One of image, config, dependency, code. Single class per output, not a list.
- Specific evidence. The log line or event that established the cause class. No claim without a citation.
- Recommended action. A specific next step (restart, roll back, pause, or escalate) with the exact command if applicable.
- Confidence and time. Confidence in the reported cause class plus elapsed time. Investigations should complete in under 30 seconds; longer indicates an inefficient agent.
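The four output fields above could be enforced with a small typed record. This is a sketch, not the agent's actual schema: the names CauseClass and TriageReport, the 0-to-1 confidence scale, and the example log line are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class CauseClass(Enum):
    IMAGE = "image"
    CONFIG = "config"
    DEPENDENCY = "dependency"
    CODE = "code"


@dataclass(frozen=True)
class TriageReport:
    cause_class: CauseClass   # exactly one class, never a list
    evidence: str             # the log line or event that supports the class
    action: str               # the recommended next step
    command: Optional[str]    # exact command, when one applies
    confidence: float         # assumed scale: 0.0 to 1.0
    elapsed_s: float          # wall-clock seconds spent investigating

    def __post_init__(self) -> None:
        if not isinstance(self.cause_class, CauseClass):
            raise ValueError("cause_class must be a single CauseClass")
        if not self.evidence.strip():
            raise ValueError("evidence is required: no claim without a citation")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must be between 0 and 1")
        if self.elapsed_s < 0:
            raise ValueError("elapsed_s cannot be negative")


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    report = TriageReport(
        cause_class=CauseClass.CONFIG,
        evidence="FATAL: DATABASE_URL is not set",
        action="escalate",
        command=None,
        confidence=0.9,
        elapsed_s=12.5,
    )
    print(report.cause_class.value, report.action)
```

Rejecting an empty evidence string at construction time makes the "no claim without a citation" rule a property of the type rather than a convention.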
Eval cases for this agent
The eval set covers the four shapes the triage gets wrong most often. Run it on every change to the prompt or the toolset.
- Image-pull failure. Agent should identify the image as the cause within three steps.
- Missing-secret case. Agent should identify the missing mount within three steps and name the specific secret.
- Multi-cause case. Image is pullable but a missing config makes it crash. Agent should identify config, not image.
- Recovery case. A transient dependency failure that resolves on its own. Agent should observe and not act prematurely.
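The four eval cases above can be expressed as data plus a small checker, so the set runs the same way on every prompt or toolset change. This is a sketch under assumptions: EvalCase, AgentResult, and failures are hypothetical names, the secret name example-secret is a placeholder, and treating the missing-secret case as cause class config is an interpretation of the playbook rather than something it states.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class EvalCase:
    name: str
    expected_cause: Optional[str] = None  # None: cause class is not checked
    max_steps: Optional[int] = None       # None: no step limit stated
    must_mention: Optional[str] = None    # text the evidence has to contain
    must_not_act: bool = False            # agent should observe only


@dataclass(frozen=True)
class AgentResult:
    cause: Optional[str]
    steps: int
    evidence: str = ""
    acted: bool = False


CASES: List[EvalCase] = [
    EvalCase("image_pull_failure", expected_cause="image", max_steps=3),
    EvalCase("missing_secret", expected_cause="config", max_steps=3,
             must_mention="example-secret"),
    EvalCase("multi_cause", expected_cause="config"),
    EvalCase("recovery", must_not_act=True),
]


def passes(case: EvalCase, result: AgentResult) -> bool:
    """Check one agent result against the expectations of one case."""
    if case.expected_cause is not None and result.cause != case.expected_cause:
        return False
    if case.max_steps is not None and result.steps > case.max_steps:
        return False
    if case.must_mention is not None and case.must_mention not in result.evidence:
        return False
    if case.must_not_act and result.acted:
        return False
    return True


def failures(results: Dict[str, AgentResult]) -> List[str]:
    """Return names of failing cases; a case with no result counts as failed."""
    return [c.name for c in CASES
            if c.name not in results or not passes(c, results[c.name])]
```

A change to the prompt or toolset would be accepted only when failures returns an empty list for a fresh run of the agent on all four cases.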