Agentic SRE | Advanced | By Samson Tanimawo, PhD | Published May 25, 2026 | 5 min read

Pod-Level CrashLoopBackOff: An Agent Triage Playbook

Logs, events, image, config, dependencies. The order an agent should check them, the costs of each, and the recoveries the agent can apply on its own.

The order to check

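1. Pod logs (last 50 lines). Most CrashLoopBackOff causes are visible here. Read the previous container's logs (kubectl logs --previous), because the current container may have restarted before it logged anything.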

2. Pod events (kubectl describe). Surfaces image-pull failures, mount failures, OOM kills.

3. Image: was it pulled successfully, and has it run successfully recently? An image-pull failure needs a different fix from a runtime failure.

4. Config: env vars, ConfigMap mounts, Secret mounts. A missing or malformed config is the most common runtime cause.

5. Dependencies: can the pod reach its database and the other services it calls? Network or DNS failures often masquerade as CrashLoopBackOff.
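The ordering above can be sketched as a short-circuiting classifier over data the agent has already collected. This is a minimal sketch, not a definitive implementation: `classify_pod` and `Finding` are hypothetical names, the input is assumed to be the parsed output of `kubectl get pod <name> -o json` plus the tail of the previous container's logs, and the log markers are illustrative guesses that would need tuning per workload.

```python
from dataclasses import dataclass
from typing import Optional

# Waiting reasons the kubelet reports in
# status.containerStatuses[].state.waiting.reason.
IMAGE_REASONS = {"ErrImagePull", "ImagePullBackOff"}
CONFIG_REASONS = {"CreateContainerConfigError"}

# Illustrative log markers only (assumption): tune these per workload.
CONFIG_MARKERS = ("missing required env", "no such file or directory")
DEPENDENCY_MARKERS = ("connection refused", "no such host", "i/o timeout")


@dataclass
class Finding:
    cause_class: str  # "image", "config", "dependency", or "code"
    evidence: str
    step: int         # which of the five checks produced the finding


def classify_pod(pod: dict, log_tail: str) -> Optional[Finding]:
    """Walk the checks in order and stop at the first one with evidence."""
    log_lower = log_tail.lower()

    # Step 1: logs. Config and dependency errors usually show up here.
    for marker in CONFIG_MARKERS:
        if marker in log_lower:
            return Finding("config", f"log contains '{marker}'", 1)
    for marker in DEPENDENCY_MARKERS:
        if marker in log_lower:
            return Finding("dependency", f"log contains '{marker}'", 1)

    for cs in pod.get("status", {}).get("containerStatuses", []):
        name = cs.get("name", "?")
        waiting = cs.get("state", {}).get("waiting", {}).get("reason", "")
        last = cs.get("lastState", {}).get("terminated", {})
        # Step 2: events and status. An OOM kill is recorded as the last
        # termination reason. Treated as "code" here (assumption); a memory
        # limit set too low would make it a config cause instead.
        if last.get("reason") == "OOMKilled":
            return Finding("code", f"{name} was OOMKilled", 2)
        # Step 3: image.
        if waiting in IMAGE_REASONS:
            return Finding("image", f"{name} waiting: {waiting}", 3)
        # Step 4: config.
        if waiting in CONFIG_REASONS:
            return Finding("config", f"{name} waiting: {waiting}", 4)

    # Nothing conclusive: step 5 (dependency probes) or escalation comes next.
    return None
```

The log tail could come from `kubectl logs <pod> --previous --tail=50`. Returning `None` instead of a guess keeps the "unclear after the five steps" outcome explicit.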

Recoveries the agent can apply

Restart the deployment if the cause looks transient. The restart should be logged and time-bounded; if it does not help, do not loop.

Roll back to the last known-good image if the cause is a recent image change. The rollback requires human approval if it affects more than one service.

Pause the deployment if the cause is unclear and the impact is contained. Better to leave one pod stuck than to cascade the failure.
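The recovery rules above can be written as one small decision function. This is a sketch under stated assumptions: `Diagnosis`, `choose_recovery`, and the `MAX_RESTARTS` limit of one are hypothetical, and the returned action names are labels only. In a real agent they might map to `kubectl rollout restart`, `kubectl rollout undo`, and `kubectl rollout pause`.

```python
from dataclasses import dataclass

MAX_RESTARTS = 1  # assumption: one agent-initiated restart, then stop


@dataclass
class Diagnosis:
    cause_class: str        # "image", "config", "dependency", "code", "unknown"
    looks_transient: bool
    services_affected: int  # how many services a rollback would touch
    impact_contained: bool
    restarts_attempted: int = 0


def choose_recovery(d: Diagnosis) -> dict:
    """Return the action and whether a human must approve it first."""
    if d.looks_transient:
        if d.restarts_attempted < MAX_RESTARTS:
            return {"action": "restart", "needs_approval": False}
        # The restart did not help: do not loop, hand over instead.
        return {"action": "escalate", "needs_approval": False}
    if d.cause_class == "image":
        # Rolling back more than one service needs a human in the loop.
        return {"action": "rollback", "needs_approval": d.services_affected > 1}
    if d.cause_class == "unknown" and d.impact_contained:
        return {"action": "pause", "needs_approval": False}
    return {"action": "escalate", "needs_approval": False}
```

Tracking `restarts_attempted` in the diagnosis is one way to enforce the "do not loop" rule without relying on the agent's memory of earlier turns.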

When to escalate

Cause is a config that the agent does not own. Surface the config field; escalate to the team that owns it.

Cause involves multi-service coordination. The agent cannot orchestrate cross-team rollbacks; escalate to the on-call.

Cause is unclear after the five steps. The agent has done its job; the human picks up.
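The three escalation rules can be sketched as a routing function. All names here are hypothetical (`Escalation`, `route_escalation`, the `sre-agent` team label, and the `on-call` and `human` targets), and the sketch assumes the agent already knows which team owns a given config field.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Escalation:
    target: str
    summary: str


def route_escalation(cause_class: str,
                     services_involved: int,
                     steps_completed: int,
                     config_field: Optional[str] = None,
                     config_owner: Optional[str] = None,
                     agent_team: str = "sre-agent") -> Optional[Escalation]:
    """Return where to escalate, or None if the agent may keep going."""
    # Rule 1: config the agent does not own goes to the owning team,
    # with the offending field named in the summary.
    if cause_class == "config" and config_owner and config_owner != agent_team:
        return Escalation(config_owner,
                          f"config field '{config_field}' is missing or malformed")
    # Rule 2: cross-service coordination goes to the on-call.
    if services_involved > 1:
        return Escalation("on-call",
                          f"{services_involved} services involved; coordination needed")
    # Rule 3: all five checks ran and nothing conclusive came out.
    if cause_class == "unknown" and steps_completed >= 5:
        return Escalation("human", "cause unclear after all five checks")
    return None
```

For example, `route_escalation("config", 1, 4, config_field="DB_URL", config_owner="payments")` would route to the hypothetical `payments` team and name `DB_URL` in the summary.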

Output structure

Cause class (image, config, dependency, code). Specific evidence. Recommended action.

Confidence per cause class. Confidence spread across several classes reflects ambiguity that the agent cannot resolve.

Time spent. A CrashLoopBackOff investigation should complete in under 30 seconds; a longer run points to an inefficient agent.
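One possible shape for that output is a small report object. This is a sketch, not a required schema: `TriageReport`, its field names, and the 0.2 ambiguity margin are assumptions. Only the four cause classes and the 30-second budget come from the text above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List

TIME_BUDGET_SECONDS = 30.0


@dataclass
class TriageReport:
    confidence: Dict[str, float]  # per cause class: image, config, dependency, code
    evidence: List[str]
    recommended_action: str
    seconds_spent: float

    def top_cause(self) -> str:
        """Cause class with the highest confidence."""
        return max(self.confidence, key=lambda k: self.confidence[k])

    def is_ambiguous(self, margin: float = 0.2) -> bool:
        """True when the top two classes are closer than `margin` (assumed value)."""
        ranked = sorted(self.confidence.values(), reverse=True)
        return len(ranked) > 1 and ranked[0] - ranked[1] < margin

    def over_budget(self) -> bool:
        return self.seconds_spent > TIME_BUDGET_SECONDS

    def to_json(self) -> str:
        payload = asdict(self)
        payload["cause_class"] = self.top_cause()
        payload["ambiguous"] = self.is_ambiguous()
        payload["over_budget"] = self.over_budget()
        return json.dumps(payload, indent=2)
```

Keeping the full confidence map in the output, rather than only the top class, lets the human reader see when the agent was torn between two causes.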

Eval cases for this agent

Image-pull failure case: agent should identify the image as the cause within 3 steps.

Missing-secret case: agent should identify the missing mount within 3 steps.

Multi-cause case: image is pullable but a missing config makes it crash. Agent should identify config, not image.

Recovery case: a transient dependency failure that resolves on its own. Agent should observe and not act prematurely.
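The four cases can be encoded as data and run through a small harness. This is a sketch under stated assumptions: the agent is modeled as a callable that takes a scenario fixture and returns a dict with `cause_class`, `steps_used`, and `action`, and the scenario keys, the `observe` action label, and `stub_agent` are all hypothetical stand-ins for a real agent and real fixtures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class EvalCase:
    name: str
    scenario: Dict[str, object]      # hypothetical fixture handed to the agent
    expected_cause: Optional[str] = None
    max_steps: Optional[int] = None
    expected_action: Optional[str] = None


CASES = [
    EvalCase("image-pull failure", {"waiting_reason": "ImagePullBackOff"},
             expected_cause="image", max_steps=3),
    EvalCase("missing secret", {"event_reason": "FailedMount"},
             expected_cause="config", max_steps=3),
    EvalCase("multi-cause", {"image_pullable": True, "log": "missing required env"},
             expected_cause="config"),
    EvalCase("transient dependency", {"log": "connection refused", "recovers_after_s": 20},
             expected_action="observe"),
]


def run_evals(agent: Callable[[Dict[str, object]], Dict[str, object]],
              cases: List[EvalCase]) -> Dict[str, bool]:
    """Run each case and check only the expectations that case declares."""
    results = {}
    for case in cases:
        out = agent(case.scenario)
        ok = True
        if case.expected_cause is not None:
            ok = ok and out.get("cause_class") == case.expected_cause
        if case.max_steps is not None:
            ok = ok and int(out.get("steps_used", 10**9)) <= case.max_steps
        if case.expected_action is not None:
            ok = ok and out.get("action") == case.expected_action
        results[case.name] = ok
    return results


def stub_agent(scenario: Dict[str, object]) -> Dict[str, object]:
    """Stand-in for the real agent so the harness can be exercised."""
    if "waiting_reason" in scenario:
        return {"cause_class": "image", "steps_used": 3, "action": "rollback"}
    if "event_reason" in scenario:
        return {"cause_class": "config", "steps_used": 2, "action": "escalate"}
    if "recovers_after_s" in scenario:
        return {"cause_class": "dependency", "steps_used": 5, "action": "observe"}
    return {"cause_class": "config", "steps_used": 4, "action": "escalate"}
```

Calling `run_evals(stub_agent, CASES)` returns a pass/fail flag per case name, which makes regressions in a single behavior easy to spot.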