By Nova AI Ops Team · Published Sep 9, 2026

Debugging Kubernetes Pod Crashes: A Triage Tree

When a pod is in trouble, the difference between a 5-minute fix and a 5-hour debugging session is knowing exactly which question to ask first. Here is the triage tree we follow at 3 a.m., with kubectl commands and root-cause patterns for each branch.

Step 0: Get the Pod Status

Every pod-crash investigation starts with the same three commands. Memorize them.

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous

The first command tells you the high-level state (Running, Pending, CrashLoopBackOff, etc.). The second tells you the events Kubernetes recorded about the pod, including image pulls, scheduling decisions, and health-check failures. The third gives you the application logs from the previous (crashed) container instance, which is where the actual error message lives.
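
A quick way to pull the crashed container's last termination state (reason, exit code, timestamps) directly, assuming a single-container pod (adjust the index if you run sidecars):

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated}'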

The pod's high-level status determines which branch of the triage tree you take. Each section below covers one branch.

CrashLoopBackOff: The Application Is Failing

What it means: The container started, ran briefly, exited (or was killed), and Kubernetes is repeatedly restarting it with exponentially increasing backoff delays.

First command:

kubectl logs <pod-name> -n <namespace> --previous

This shows the last log output from before the crash. 80% of CrashLoopBackOff diagnoses end here. Look for a stack trace, an unhandled exception, a config file parse error, or a "connection refused" message pointing at a missing dependency.

Common root causes:

- A missing or malformed config file, environment variable, or Secret that the application reads at startup
- A dependency (database, cache, downstream API) that is unreachable when the process boots, so it exits immediately
- An unhandled exception or panic in application code, visible as a stack trace in the previous logs
- A wrong command or entrypoint in the pod spec, so the process exits as soon as it starts
- Exit code 137, meaning the container was killed rather than crashing on its own (often OOM; see the OOMKilled section below)

Useful follow-up commands:

kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Last State"
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
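
If the previous logs are empty because the container dies before it can log anything, one option (on clusters where the kubectl debug subcommand is available) is to run a copy of the pod with an interactive shell instead of the normal entrypoint and poke at the filesystem and config by hand:

kubectl debug <pod-name> -n <namespace> -it \
  --copy-to=<pod-name>-debug \
  --container=<container-name> \
  -- sh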

ImagePullBackOff: The Image Cannot Be Fetched

What it means: Kubernetes tried to pull the container image and failed. The pod will retry with backoff until it succeeds or you fix the underlying issue.

First command:

kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events:"

The events section will tell you exactly why the pull failed. The five most common reasons:

1. A typo in the image name or tag, or a tag that was never pushed to the registry.
2. The image lives in a private registry and the pod has no (or the wrong) imagePullSecrets.
3. Registry credentials exist but have expired or lack pull permission for that repository.
4. Registry rate limiting, especially anonymous pulls from Docker Hub.
5. The node cannot reach the registry at all: network policy, proxy, or node-level DNS problems.
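
If the cause is a private registry, a sketch of the standard fix (the secret name regcred and the placeholder values are ours, not from this post):

kubectl create secret docker-registry regcred \
  --docker-server=<registry-url> \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

Then reference it from the pod spec:

spec:
  imagePullSecrets:
    - name: regcred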

OOMKilled: The Container Exceeded Memory Limit

What it means: The kernel's OOM killer terminated the container because it tried to use more memory than its resources.limits.memory allowed.

First command:

kubectl describe pod <pod-name> -n <namespace> | grep -i "OOMKilled"
kubectl top pod <pod-name> -n <namespace> --containers

Look for "Reason: OOMKilled" in the Last Terminated State. Then pull historical memory metrics from your monitoring system to understand the pattern.

Three common scenarios and the right fix for each:

1. The limit is simply too low for the application's steady-state working set. Fix: raise resources.limits.memory (and usually requests) to observed usage plus headroom.
2. Memory grows steadily until the kill: a leak. Fix: raising the limit only buys time; the real fix is in the application, and the sawtooth metric pattern climbing to the limit is the evidence to hand to the owning team.
3. Usage is fine on average but spikes past the limit: a large request, a batch job, a cache warm-up. Fix: either raise the limit to cover the spike or smooth the spike (pagination, streaming, bounded caches).
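
For reference, a minimal resources block; the numbers here are placeholders to replace with your own usage data:

resources:
  requests:
    memory: "512Mi"   # sized from typical observed usage
    cpu: "250m"
  limits:
    memory: "1Gi"     # headroom above the observed peak, not a guess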

Common gotcha: the JVM and similar runtimes need explicit heap configuration. Older JVMs ignore cgroup limits entirely and size the default heap from host memory, so a Java app with -Xmx unset can try to claim roughly 25% of the node's RAM, usually far more than the container limit. Even newer JVMs that do respect cgroups default to a 25% heap and still carry off-heap overhead on top of it. Set -Xmx (or -XX:MaxRAMPercentage) to roughly 70-75% of the container memory limit.
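
One way to wire that in, assuming a JVM new enough to support MaxRAMPercentage (8u191+ or 10+); JAVA_TOOL_OPTIONS is picked up automatically by the JVM:

env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"   # heap capped at ~75% of the container limit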

Pending: The Pod Cannot Be Scheduled

What it means: The scheduler cannot find a node that satisfies the pod's requirements (resource requests, node selectors, affinity rules, taints, or volume bindings).

First command:

kubectl describe pod <pod-name> -n <namespace> | grep -A20 "Events:"

The scheduler writes detailed messages explaining exactly why each candidate node was rejected. Read them carefully.

Common root causes:

- No node has enough unreserved CPU or memory to satisfy the pod's requests ("Insufficient cpu" / "Insufficient memory" in the events).
- A nodeSelector or affinity rule matches no node, or matches only nodes that are already full.
- Every candidate node carries a taint the pod does not tolerate.
- A PersistentVolumeClaim is unbound, or the bound volume sits in a zone that conflicts with the scheduling constraints.
- The cluster autoscaler (if you run one) cannot add a node shape that would fit the pod.
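
Two quick ways to see what capacity and taints the scheduler is working with:

kubectl describe node <node-name> | grep -A8 "Allocated resources"
kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints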

Init Container Failures

What it means: One of the pod's init containers failed, blocking the main containers from starting.

First command:

kubectl logs <pod-name> -c <init-container-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Init Containers"

Init containers run sequentially and any failure blocks pod startup. Common patterns:

- A wait-for-dependency loop that never succeeds because the service name, namespace, or port is wrong.
- A database migration or schema setup step that exits non-zero.
- A step that fetches config or secrets from an external system that is unreachable from the cluster.
- Missing RBAC or filesystem permissions for whatever the init container needs to do.
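
For reference, the common wait-for-dependency shape (the service name and image tag are placeholders); if this loop never exits, the init container is exactly where the pod is stuck:

initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ["sh", "-c", "until nslookup <db-service>; do echo waiting for <db-service>; sleep 2; done"]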

Liveness Probe Failures

What it means: The container started successfully but the kubelet is killing it because the liveness probe is failing.

First command:

kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Liveness"
kubectl logs <pod-name> -n <namespace> --previous

You will see "Liveness probe failed" events with the probe's HTTP status, exec exit code, or TCP connection error. This indicates either:

Common gotcha: A liveness probe that calls a database can cause cascading failures. If the database goes down, every pod fails its liveness probe and gets restarted, often making the database situation worse. Liveness probes should test only the application's own health, not external dependencies. Use readiness probes (which take the pod out of service rotation but do not restart the container) for dependency checks.
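
As a rough sketch of that split (the paths, port, and timing values are illustrative, not prescriptive): the liveness probe hits a local, in-process health endpoint, while the readiness probe is the one allowed to look at dependencies:

livenessProbe:
  httpGet:
    path: /healthz          # in-process health only, no external calls
    port: 8080
  initialDelaySeconds: 10   # must exceed normal startup time
  periodSeconds: 10
  timeoutSeconds: 2
  failureThreshold: 3       # consecutive failures before the kubelet restarts the container
readinessProbe:
  httpGet:
    path: /ready            # may check dependencies; failure only removes the pod from endpoints
    port: 8080
  periodSeconds: 5
  failureThreshold: 3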

Common Gotchas (DNS, ConfigMap, Secrets)

Three failure patterns that account for a disproportionate share of pod incidents:

1. DNS resolution intermittently fails: CoreDNS is overloaded, or NodeLocal DNSCache is misconfigured. Symptoms: random "no such host" errors that come and go. Fix: scale CoreDNS, deploy NodeLocal DNSCache, and if lookups of external FQDNs are slow, lower the pod's ndots value via dnsConfig (the default of 5 forces several search-domain lookups before the absolute name is tried); see the dnsConfig sketch after this list.

2. ConfigMap or Secret update does not take effect: Mounted ConfigMap and Secret values update lazily (typically within a minute or two), values consumed as environment variables or mounted via subPath never update at all, and most applications read configuration once at startup and never re-read it. Symptom: you updated the ConfigMap, but the pod is still using the old value. Fix: roll the deployment after the update (kubectl rollout restart deployment/<name>), or use a config reload sidecar.

3. Secret volume mount missing keys: The Secret exists but a specific key is absent. Symptom: a file in the mount path is missing entirely (not empty, not present). Fix: validate the Secret has the expected keys with kubectl get secret -o yaml.
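
For the first pattern, a hedged sketch of lowering ndots on a pod spec; the value 2 is illustrative, and the default in most clusters is 5:

spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"   # fewer search-domain expansions for external FQDNs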

When to Bring in Observability Tools

The triage tree above gets you 80% of the way for most pod crashes. The remaining 20% are the genuinely hard cases: intermittent failures, multi-pod cascades, or anomalies that only show up under production load. For these, you need observability tools that look across pods and clusters.

The three signals that move investigations beyond kubectl describe:

1. Metrics over time: restart counts, memory and CPU curves, and node pressure, so you see the pattern leading up to the crash instead of a single snapshot.
2. Logs aggregated across replicas and neighboring services, so a multi-pod cascade shows up as one correlated timeline rather than per-pod fragments.
3. Traces, or at least request-level telemetry, so you can tell whether the crashing pod is the root cause or just the loudest victim of an upstream failure.

How Nova AI Ops Auto-Resolves These

Nova AI Ops automates this triage tree. When a pod enters CrashLoopBackOff, the Workload Diagnostics agent runs the same investigation steps you would: fetch the previous logs, parse the stack trace, check recent ConfigMap/Secret changes, validate resource limits against historical usage, and compare against similar past incidents.

The agent then either applies a fix automatically (if the trust score for that service permits and the change is in the safe-action list, like rolling a deployment after a ConfigMap update) or pages a human with the root cause already identified, the diagnostic data attached, and a recommended action ready to approve. The same agent handles ImagePullBackOff (validates registry credentials, suggests fixes), OOMKilled (correlates with memory metrics, recommends limit changes), and Pending pods (analyzes scheduler events and proposes the right cluster scaling action).

The result is that the most common pod-crash modes resolve in seconds without any human intervention, and the genuinely novel cases get escalated to a human with full context instead of a 3 a.m. mystery. Try Nova to see it run on your real cluster.