Pod Eviction Debugging
Pod evicted? The debugging path.
Describe
Pod eviction debugging is the practice of working out why pods are being evicted. An eviction is a signal, not a diagnosis: the cause varies, and the fix depends on the cause. The goal is to identify the cause quickly and apply the matching remediation.
What investigation looks like:
- kubectl describe pod surfaces the eviction event: The describe output includes the eviction event with its timestamp, reason, and message. Reading this event is the first stop in the investigation.
- Read the most recent reason: The most recent eviction is the immediate concern. Older evictions may reveal a pattern, but start with the latest one.
- Eviction events are short-lived: Kubernetes retains events for a limited time (one hour by default, configurable through the API server's --event-ttl flag). Investigate promptly, because older eviction events may already be gone.
- Check cluster events too: kubectl get events shows evictions across the cluster. Multiple pods evicted from the same node point to a node-level problem rather than a single misbehaving pod.
- Preserve events with tooling: Some teams forward events to long-term storage. The historical record supports investigating patterns that span longer windows than the default retention.
The describe output is the starting point. Without it, eviction debugging is guesswork.
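The two reads described above, the most recent eviction and the per-node pattern, can be sketched in a few lines of Python. This is a minimal sketch, assuming the input is the parsed JSON from `kubectl get events -A -o json` and that kubelet eviction events carry the reason `Evicted` with the node name in `source.host`; the sample data and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timezone


def event_time(event):
    """Best-effort timestamp; some events populate eventTime instead of lastTimestamp."""
    raw = (
        event.get("lastTimestamp")
        or event.get("eventTime")
        or event.get("metadata", {}).get("creationTimestamp")
    )
    if raw is None:
        # No usable timestamp: sort this event before all others.
        return datetime.min.replace(tzinfo=timezone.utc)
    return datetime.fromisoformat(raw.replace("Z", "+00:00"))


def summarize_evictions(event_list):
    """Return (most recent eviction event or None, eviction count per node)."""
    evictions = [e for e in event_list.get("items", []) if e.get("reason") == "Evicted"]
    latest = max(evictions, key=event_time, default=None)
    per_node = Counter((e.get("source") or {}).get("host", "unknown") for e in evictions)
    return latest, per_node


# Hypothetical sample shaped like `kubectl get events -A -o json` output.
sample = {
    "items": [
        {"reason": "Evicted", "message": "The node was low on resource: memory.",
         "lastTimestamp": "2024-05-01T10:00:00Z",
         "involvedObject": {"namespace": "shop", "name": "api-1"},
         "source": {"component": "kubelet", "host": "node-a"}},
        {"reason": "Evicted", "message": "The node was low on resource: memory.",
         "lastTimestamp": "2024-05-01T10:05:00Z",
         "involvedObject": {"namespace": "shop", "name": "worker-2"},
         "source": {"component": "kubelet", "host": "node-a"}},
        {"reason": "Scheduled", "message": "Successfully assigned shop/api-3 to node-b",
         "lastTimestamp": "2024-05-01T10:06:00Z",
         "involvedObject": {"namespace": "shop", "name": "api-3"},
         "source": {"component": "default-scheduler"}},
    ]
}

latest, per_node = summarize_evictions(sample)
print(latest["involvedObject"]["name"], "-", latest["message"])
print(per_node.most_common())
```

Two evictions on the same node within minutes, as in the sample, is the kind of pattern that suggests looking at the node before looking at either pod.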
Reasons
Different eviction reasons require different fixes. The investigation matches the reason to the appropriate response; the discipline is recognizing the patterns.
- Memory pressure: The node is running low on memory, and the kubelet evicts pods to reclaim it. The fix is to size pods correctly (memory requests close to real usage, with limits that bound it) and to make sure the node has enough memory for its workload.
- Disk pressure: The node is running low on disk space, and the kubelet evicts pods to free it. The fix is ephemeral-storage limits, which stop one pod from filling the disk, plus adequate node disk sizing.
- Voluntary disruption: A drain operation evicted the pod. The eviction was an intended part of a cluster operation such as an upgrade or maintenance, and a controller-managed pod is recreated on another node, so no fix is needed.
- Each reason has its own fix: Memory pressure calls for different remediation than disk pressure, and a voluntary disruption calls for none. Confirm the reason before changing anything.
- Cumulative patterns: Repeated evictions for the same reason indicate a systemic issue. Treat the pattern as a prompt for structural improvement rather than fixing symptoms one at a time.
The reason categorizes the eviction. Each category has its own fix, and matching the fix to the category is the discipline.
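The matching of reason to fix can be written down as a small lookup. The following is a sketch, not a complete classifier: the substrings it looks for (`memory`, `ephemeral-storage`, `disk`) are assumptions based on commonly seen kubelet messages such as "The node was low on resource: memory.", and `EvictionByEvictionAPI` is the reason recent Kubernetes versions attach to a pod's DisruptionTarget condition after an API-initiated eviction such as a drain. Exact wording varies by version, and the function name is hypothetical.

```python
def classify_eviction(reason, message):
    """Map an eviction's reason and message to (category, suggested fix)."""
    text = message.lower()
    if reason == "EvictionByEvictionAPI":
        return ("voluntary disruption",
                "none needed; confirm a drain or upgrade was in progress")
    if "ephemeral-storage" in text or "disk" in text:
        return ("disk pressure",
                "set ephemeral-storage limits and review node disk sizing")
    if "memory" in text:
        return ("memory pressure",
                "right-size memory requests and limits; review node memory")
    # Anything unrecognized goes back to a human rather than being guessed at.
    return ("unknown", "read the full event message and the node conditions")


category, fix = classify_eviction("Evicted", "The node was low on resource: memory.")
print(f"{category}: {fix}")
```

The explicit "unknown" branch is deliberate: an unrecognized message is a reason to read the event in full, not to apply the nearest-looking fix.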
Prevent
Prevention layers reduce eviction frequency. Resource limits, PDBs, anti-affinity, and priority classes all contribute; the layers together produce evictions that are bounded and intentional rather than common and disruptive.
- Resource limits: Pods whose memory and CPU requests and limits match their actual usage cannot consume more than intended, which bounds the pressure they put on the node.
- PDB: Pod Disruption Budgets limit how many pods of a workload voluntary evictions can remove at once, so availability is preserved during planned disruptions. They do not apply to node-pressure evictions by the kubelet.
- Anti-affinity: Anti-affinity rules keep multiple pods of the same workload from landing on the same node. A node failure or eviction then does not remove all replicas, and the workload stays available.
- Higher priority class: Critical workloads get a higher priority class. Under node pressure the kubelet considers pod priority when choosing what to evict, so lower-priority pods tend to go first. Priority is not absolute protection: pods using more than their requests are ranked for eviction ahead of pods that stay within them.
- Layer them: Each layer addresses a different scenario. Combined, they make evictions rarer, and the evictions that do happen are less likely to reach customers.
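One way to keep the layers honest is to check a workload's pod spec for each of them. The checker below is a hypothetical sketch: it takes a pod spec as a plain dictionary (for example, the `spec.template.spec` of a Deployment loaded from JSON) and a flag saying whether a PodDisruptionBudget selects the workload. The field names (`resources.requests`, `resources.limits`, `affinity.podAntiAffinity`, `priorityClassName`) are the standard pod spec fields; the function name, messages, and sample specs are made up for illustration.

```python
def missing_prevention_layers(pod_spec, has_pdb):
    """List the prevention layers that a pod spec appears to lack."""
    missing = []
    for container in pod_spec.get("containers", []):
        resources = container.get("resources") or {}
        if "memory" not in (resources.get("requests") or {}):
            missing.append(f"container {container['name']}: no memory request")
        if "memory" not in (resources.get("limits") or {}):
            missing.append(f"container {container['name']}: no memory limit")
    if not has_pdb:
        missing.append("no PodDisruptionBudget selects this workload")
    if "podAntiAffinity" not in (pod_spec.get("affinity") or {}):
        missing.append("no pod anti-affinity")
    if not pod_spec.get("priorityClassName"):
        missing.append("no priorityClassName")
    return missing


# Hypothetical specs: one with no layers, one with all four.
bare = {"containers": [{"name": "app", "image": "example/app:1.0"}]}
hardened = {
    "priorityClassName": "critical-service",  # hypothetical class name
    "affinity": {"podAntiAffinity": {"preferredDuringSchedulingIgnoredDuringExecution": []}},
    "containers": [{
        "name": "app",
        "image": "example/app:1.0",
        "resources": {"requests": {"memory": "256Mi"}, "limits": {"memory": "512Mi"}},
    }],
}

for finding in missing_prevention_layers(bare, has_pdb=False):
    print(finding)
print(missing_prevention_layers(hardened, has_pdb=True))
```

The check only confirms that each layer is present, not that it is well tuned; whether the requests match real usage still has to come from telemetry.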
Pod eviction debugging is one of those Kubernetes operational disciplines that pays off in incident response. Nova AI Ops integrates with cluster events and pod telemetry, surfaces eviction patterns, and provides the per-workload visibility that the platform team uses to drive prevention.