Debugging Kubernetes Pod Crashes: A Triage Tree
When a pod is in trouble, the difference between a 5-minute fix and a 5-hour debugging session is knowing exactly which question to ask first. Here is the triage tree we follow at 3 a.m., with kubectl commands and root-cause patterns for each branch.
Step 0: Get the Pod Status
Every pod-crash investigation starts with the same three commands. Memorize them.
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
The first command tells you the high-level state (Running, Pending, CrashLoopBackOff, etc.). The second tells you the events Kubernetes recorded about the pod, including image pulls, scheduling decisions, and health-check failures. The third gives you the application logs from the previous (crashed) container instance, which is where the actual error message lives.
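For reference, a crashing pod in the first command's output looks roughly like this (the pod name and counts here are made up for illustration):

NAME                        READY   STATUS             RESTARTS      AGE
checkout-7d4b9c6f5d-x2x8q   0/1     CrashLoopBackOff   6 (42s ago)   11m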
The pod's high-level status determines which branch of the triage tree you take. Each section below covers one branch.
CrashLoopBackOff: The Application Is Failing
What it means: The container started, ran briefly, exited (or was killed), and Kubernetes is repeatedly restarting it with exponentially increasing backoff delays.
First command:
kubectl logs <pod-name> -n <namespace> --previous
This shows the last log output from before the crash. 80% of CrashLoopBackOff diagnoses end here. Look for a stack trace, an unhandled exception, a config file parse error, or a "connection refused" message pointing at a missing dependency.
Common root causes:
- Missing or misconfigured env var: The application requires DATABASE_URL or API_KEY but the value is empty or wrong. Fix: check the Deployment manifest and the referenced ConfigMap or Secret (see the sketch after this list).
- Database not reachable: "Connection refused on db.svc:5432." Fix: validate the Service exists, the DNS resolves, and the database is actually running.
- Migration failure on startup: The app runs DB migrations as part of init and one fails. Fix: check the migration log, fix the schema, redeploy.
- Wrong entrypoint: The container starts but the command exits immediately. Fix: check the Dockerfile CMD and the Deployment's command/args fields.
- Permission denied on a mount: The app cannot write to /var/data because the volume is read-only or owned by the wrong UID. Fix: check the securityContext and PersistentVolumeClaim.
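For the first cause, a minimal sketch of wiring config into the container from a ConfigMap and a Secret; the names here (app, app-config, app-secrets, the image) are illustrative, not a prescription:

# Fragment of a Deployment's pod template
containers:
  - name: app
    image: registry.example.com/app:1.2.3
    env:
      - name: DATABASE_URL
        valueFrom:
          secretKeyRef:
            name: app-secrets      # Secret must exist in the same namespace
            key: database-url      # key must be present in the Secret
    envFrom:
      - configMapRef:
          name: app-config         # imports every key as an env var

If the referenced Secret or key is missing, the pod shows CreateContainerConfigError rather than CrashLoopBackOff, which is itself a useful signal.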
Useful follow-up commands:
kubectl describe pod <pod-name> -n <namespace> | grep -A10 "Last State"
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
ImagePullBackOff: The Image Cannot Be Fetched
What it means: Kubernetes tried to pull the container image and failed. The pod will retry with backoff until it succeeds or you fix the underlying issue.
First command:
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Events:"
The events section will tell you exactly why the pull failed. The five most common reasons:
- "manifest unknown" / "404 not found": The image tag does not exist in the registry. You may have a typo, or you pushed the wrong tag. Fix: check the image name and tag, run
docker pull <image>from your laptop to verify. - "unauthorized" / "401": The cluster cannot authenticate to a private registry. Fix: create or fix the imagePullSecrets reference in the Pod spec or ServiceAccount.
- "connection timeout": The cluster cannot reach the registry, usually because of a network or proxy issue. Fix: check egress rules, proxy configuration, and DNS resolution from inside the cluster.
- "forbidden" / "403": The image exists but the credentials lack permission. Fix: validate the registry IAM policy or robot account permissions.
- "toomanyrequests" / "rate limit exceeded": Docker Hub anonymous pull rate limit. Fix: authenticate with a Docker Hub account or mirror the image to a private registry.
OOMKilled: The Container Exceeded Memory Limit
What it means: The kernel's OOM killer terminated the container because it tried to use more memory than its resources.limits.memory allowed.
First commands:
kubectl describe pod <pod-name> -n <namespace> | grep -i "OOMKilled"
kubectl top pod <pod-name> -n <namespace> --containers
Look for "Reason: OOMKilled" in the Last Terminated State. Then pull historical memory metrics from your monitoring system to understand the pattern.
Three common scenarios and the right fix for each:
- Limit too low for normal workload: The app legitimately needs more memory than the limit allows. Fix: raise the limit. Use the Vertical Pod Autoscaler in recommendation mode to find the right value.
- Memory leak in the application: Usage grows linearly until the kill. Fix is application-side; the limit just contains the blast radius. Investigate with a memory profiler.
- Memory spike from input data: The app is fine 99% of the time but a large input causes a memory blow-up. Fix: implement input size limits, streaming processing, or per-request memory budgets.
Common gotcha: runtimes size their heaps from the memory they think they have, and not every runtime sees the cgroup limit. A Java app with -Xmx unset defaults its maximum heap to 25% of detected memory; on a JVM that is not container-aware (before 8u191), "detected memory" is the host's RAM, which is usually far larger than the container limit. Set -Xmx explicitly, or -XX:MaxRAMPercentage on container-aware JVMs, so the heap lands at roughly 70-75% of the container memory limit and leaves headroom for non-heap usage.
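Pulling these fixes together, a minimal sketch of a container spec with explicit memory settings and a container-aware JVM heap cap; the numbers are illustrative, not recommendations for your workload:

containers:
  - name: app
    image: registry.example.com/app:1.2.3
    resources:
      requests:
        memory: 512Mi              # what the scheduler reserves for the pod
      limits:
        memory: 1Gi                # cgroup ceiling; exceeding it triggers the OOM kill
    env:
      - name: JAVA_TOOL_OPTIONS
        value: "-XX:MaxRAMPercentage=75"   # heap capped at ~75% of the 1Gi limit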
Pending: The Pod Cannot Be Scheduled
What it means: The scheduler cannot find a node that satisfies the pod's requirements (resource requests, node selectors, affinity rules, taints, or volume bindings).
First command:
kubectl describe pod <pod-name> -n <namespace> | grep -A20 "Events:"
The scheduler writes detailed messages explaining exactly why each candidate node was rejected. Read them carefully.
Common root causes:
- "Insufficient cpu/memory": No node has enough free resources. Fix: scale up the cluster (add nodes), reduce the pod's requests, or kill other pods.
- "node(s) had untolerated taint": The pod is missing a toleration for a taint on the only suitable nodes. Fix: add the toleration or remove the taint.
- "node(s) didn't match pod affinity rules": A pod or node affinity rule cannot be satisfied. Fix: relax the affinity to
preferredDuringSchedulingIgnoredDuringExecutionor fix the topology. - "unbound immediate PersistentVolumeClaims": A PVC in the pod cannot bind to a PV. Fix: check the StorageClass exists, the provisioner is healthy, and the requested size is available.
- "nodes are available, but didn't have ready volumes": The pod uses a zonal PV but the only suitable node is in a different zone. Fix: use multi-zone storage or a topology-aware scheduler.
Init Container Failures
What it means: One of the pod's init containers failed, blocking the main containers from starting.
First commands:
kubectl logs <pod-name> -c <init-container-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Init Containers"
Init containers run sequentially and any failure blocks pod startup. Common patterns:
- Wait-for-DB init that times out: The app is waiting for a database that is not yet ready. Fix: extend the wait timeout, or use proper readiness gating instead of init containers.
- Migration init that fails on a schema conflict: A migration cannot apply cleanly. Fix: roll back the migration, fix the schema, redeploy.
- Permissions init that fails because the volume mount is read-only: An init container tries to chown a file system but the volume is RO. Fix: use fsGroup in the pod's securityContext instead of init-time chown (see the sketch below).
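A minimal sketch of the fsGroup approach, which makes the kubelet set group ownership on the volume at mount time instead of chowning from an init container; the GID is illustrative:

spec:
  securityContext:
    fsGroup: 2000                        # kubelet chowns supported volumes to this GID
    fsGroupChangePolicy: OnRootMismatch  # skip the recursive chown when ownership already matches
  containers:
    - name: app
      image: registry.example.com/app:1.2.3
      volumeMounts:
        - name: data
          mountPath: /var/data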
Liveness Probe Failures
What it means: The container started successfully but the kubelet is killing it because the liveness probe is failing.
First commands:
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Liveness"
kubectl logs <pod-name> -n <namespace> --previous
You will see "Liveness probe failed" events with the probe's HTTP status, exec exit code, or TCP connection error. This indicates either:
- The probe is wrong: Wrong path, wrong port, wrong scheme. Fix: validate the probe definition against the application's actual health endpoint.
- The probe is too aggressive: The application takes longer than the probe's initialDelaySeconds to start. Fix: increase the initial delay or use a startupProbe to give the app time to boot before the liveness probe activates (see the sketch at the end of this section).
- The application is genuinely unhealthy: The probe correctly detects that the app is hung or deadlocked. Fix is application-side.
Common gotcha: A liveness probe that calls a database can cause cascading failures. If the database goes down, every pod fails its liveness probe and gets restarted, often making the database situation worse. Liveness probes should test only the application's own health, not external dependencies. Use readiness probes (which take the pod out of service rotation but do not restart it) for dependency checks.
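A minimal sketch of that probe split: a startupProbe that tolerates slow boots, a liveness probe that checks only the process itself, and a readiness probe that gates traffic on dependencies. Paths, port, and timings are illustrative:

containers:
  - name: app
    image: registry.example.com/app:1.2.3
    startupProbe:                  # allows up to 30 x 5s = 150s for boot
      httpGet: { path: /healthz, port: 8080 }
      failureThreshold: 30
      periodSeconds: 5
    livenessProbe:                 # restarts the container only if the app itself hangs
      httpGet: { path: /healthz, port: 8080 }
      periodSeconds: 10
    readinessProbe:                # pulls the pod from rotation if a dependency is down
      httpGet: { path: /ready, port: 8080 }
      periodSeconds: 5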
Common Gotchas (DNS, ConfigMap, Secrets)
Three failure patterns that account for a disproportionate share of pod incidents:
1. DNS resolution intermittently fails: CoreDNS is overloaded, or NodeLocal DNSCache is misconfigured. Symptoms: random "no such host" errors that come and go. Fix: scale CoreDNS, deploy NodeLocal DNSCache, or lower the pod's dnsConfig ndots option so fully qualified names resolve directly instead of walking every search domain first (the default ndots of 5 multiplies query volume); see the sketch after this list.
2. ConfigMap or Secret update does not take effect: Mounted ConfigMap and Secret volumes update lazily (typically within about a minute), env vars and subPath mounts never update at all, and most applications read config once at startup anyway. Symptom: you updated the ConfigMap, but the pod is still using the old value. Fix: roll the deployment after the update (command after this list), or use a config reload sidecar.
3. Secret volume mount missing keys: The Secret exists but a specific key is absent. Symptom: a file is missing from the mount path entirely (the file does not exist, as opposed to existing but being empty). Fix: validate the Secret has the expected keys with kubectl get secret <secret-name> -o yaml.
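For the first two patterns, a minimal sketch; the deployment name is illustrative:

# Pod spec fragment: resolve dotted names directly instead of
# walking every search domain first
spec:
  dnsConfig:
    options:
      - name: ndots
        value: "2"

# After editing a ConfigMap, roll the pods so they re-read it
kubectl rollout restart deployment/app -n <namespace>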
When to Bring in Observability Tools
The triage tree above gets you 80% of the way for most pod crashes. The remaining 20% are the genuinely hard cases: intermittent failures, multi-pod cascades, or anomalies that only show up under production load. For these, you need observability tools that look across pods and clusters.
The three signals that move investigations beyond kubectl describe:
- Cluster-wide metrics from kube-state-metrics: restart counts, OOMKilled rates per workload, failed scheduling attempts. These surface patterns invisible at the per-pod level.
- Application traces: A flame graph showing where a request actually spent time. Indispensable for "the app is restarting because of slow startup" diagnoses.
- Recent change events: Deployments, ConfigMap edits, image updates correlated with the incident timeline. The "what changed" question is the most useful single question in production debugging.
How Nova AI Ops Auto-Resolves These
Nova AI Ops automates this triage tree. When a pod enters CrashLoopBackOff, the Workload Diagnostics agent runs the same investigation steps you would: fetch the previous logs, parse the stack trace, check recent ConfigMap/Secret changes, validate resource limits against historical usage, and compare against similar past incidents.
The agent then either applies a fix automatically (if the trust score for that service permits and the change is in the safe-action list, like rolling a deployment after a ConfigMap update) or pages a human with the root cause already identified, the diagnostic data attached, and a recommended action ready to approve. The same agent handles ImagePullBackOff (validates registry credentials, suggests fixes), OOMKilled (correlates with memory metrics, recommends limit changes), and Pending pods (analyzes scheduler events and proposes the right cluster scaling action).
The result is that the most common pod-crash modes resolve in seconds without any human intervention, and the genuinely novel cases get escalated to a human with full context instead of a 3 a.m. mystery. Try Nova to see it run on your real cluster.