Liveness Probe Restart Loops
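A misconfigured liveness probe can trap a pod in a restart loop. This article covers how to recognize the pattern, what causes it, and how to fix it.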
Symptoms
Liveness restart loops are a recurring Kubernetes operational pattern. The pod's container restarts repeatedly; the application never reaches a healthy state; the customer impact compounds. The discipline is recognizing the pattern early and applying the correct fix.
What the symptoms look like:
- Pod restarts repeatedly.: The pod's restart count grows. Each restart goes through the same startup sequence; the cycle repeats; the pod never serves traffic.
- Logs show same startup sequence.: Looking at the pod's logs, the team sees the same startup messages repeatedly. Each restart starts fresh; the application never gets past startup; the pattern is the symptom.
- CrashLoopBackOff.: Kubernetes' status reflects the cycle. After repeated restarts the pod reports CrashLoopBackOff; the kubelet waits longer before each new attempt (an exponential back-off, capped at five minutes), so the container is not restarted constantly.
- Restart count is the metric.: The team's monitoring catches the high restart count. The alert fires; the investigation begins; the fix is applied.
- Customer impact.: If the workload is customer-facing, the restart loop produces customer impact. The pod cannot serve traffic; capacity is reduced; the team responds urgently.
The symptoms are recognizable. Once the team sees the pattern, the cause and fix follow.
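The symptoms above can also be checked programmatically. The following is a minimal Python sketch, assuming the pod has been exported as JSON (for example with kubectl get pod NAME -o json) and loaded into a dict. The function name find_restart_loops, the min_restarts threshold of 5, and the sample pod are illustrative assumptions; only the field names (containerStatuses, restartCount, state.waiting.reason, lastState.terminated) come from the Pod status schema.

```python
from __future__ import annotations

from typing import Any


def find_restart_loops(pod: dict[str, Any], min_restarts: int = 5) -> list[dict[str, Any]]:
    """Return one finding per container that looks stuck in a restart loop.

    A container is flagged when its restart count reaches min_restarts
    (an arbitrary threshold chosen for this sketch) or when it is
    currently waiting with reason CrashLoopBackOff.
    """
    findings = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        waiting = cs.get("state", {}).get("waiting") or {}
        last = cs.get("lastState", {}).get("terminated") or {}
        restarts = cs.get("restartCount", 0)
        in_backoff = waiting.get("reason") == "CrashLoopBackOff"
        if restarts >= min_restarts or in_backoff:
            findings.append(
                {
                    "container": cs.get("name"),
                    "restarts": restarts,
                    "in_backoff": in_backoff,
                    # Exit code of the previous run, if the kubelet recorded one.
                    "last_exit_code": last.get("exitCode"),
                }
            )
    return findings


# Hypothetical pod status, trimmed to the fields the sketch reads.
SAMPLE_POD = {
    "status": {
        "containerStatuses": [
            {
                "name": "api",
                "restartCount": 12,
                "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                "lastState": {"terminated": {"reason": "Error", "exitCode": 137}},
            },
            {
                "name": "sidecar",
                "restartCount": 0,
                "state": {"running": {}},
                "lastState": {},
            },
        ]
    }
}

if __name__ == "__main__":
    for finding in find_restart_loops(SAMPLE_POD):
        print(finding)
```

A flagged container only establishes the symptom. Whether the liveness probe is responsible still has to be confirmed from the pod's events and the probe configuration.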
Cause
The cause is usually a misconfigured liveness probe. The probe fires before the application is ready; the kubelet kills and restarts the container (the pod itself stays scheduled, which is why its restart count grows); the cycle continues.
- Liveness fails before startup completes.: The application takes some time to start. The liveness probe is configured to start checking too early; the application has not started yet; the probe fails; Kubernetes restarts.
- Downstream check that always fails.: Some liveness probes check downstream dependencies. If the dependency is unavailable, the probe fails; the pod restarts; restarting does not fix the dependency. Dependency checks belong in a readiness probe, if anywhere.
- Probe is too sensitive.: The liveness probe might be too sensitive to transient issues, for example a low failureThreshold or a short timeoutSeconds. Brief network blips cause the probe to fail; the pod restarts; short-lived, harmless glitches become outages.
- Liveness vs readiness confusion.: Sometimes the team uses liveness when they meant readiness. Readiness gates traffic; liveness restarts the pod; the wrong choice produces restart loops.
- Investigation flow.: Check the liveness probe configuration; check the application's actual startup behavior; identify the mismatch; apply the fix.
Understanding the cause is the foundation. Without it, fixes are speculative.
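The mismatch between probe timing and startup time can be estimated with simple arithmetic. The sketch below is an approximation under stated assumptions: the first liveness check runs at about initialDelaySeconds, checks repeat every periodSeconds, and the container is restarted after failureThreshold consecutive failures. It ignores timeoutSeconds and scheduling jitter. The Probe class and both function names are hypothetical helpers; the default values match the documented Kubernetes probe defaults.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """Subset of liveness probe timing fields, with Kubernetes defaults."""

    initial_delay_seconds: int = 0
    period_seconds: int = 10
    failure_threshold: int = 3


def first_restart_after(probe: Probe) -> int:
    """Approximate seconds until a never-healthy container is restarted.

    The first check happens at initial_delay_seconds; each later check is
    one period apart, so the Nth failure lands (N - 1) periods later.
    """
    return probe.initial_delay_seconds + (probe.failure_threshold - 1) * probe.period_seconds


def kills_during_startup(probe: Probe, startup_seconds: int) -> bool:
    """True if the probe would give up before the application has started."""
    return startup_seconds > first_restart_after(probe)


if __name__ == "__main__":
    default_probe = Probe()
    patched_probe = Probe(initial_delay_seconds=60)
    # Assume, for illustration, an application that needs 45 seconds to start.
    print(first_restart_after(default_probe), kills_during_startup(default_probe, 45))
    print(first_restart_after(patched_probe), kills_during_startup(patched_probe, 45))
```

Under these assumptions a default probe gives up after about 20 seconds, so an application that needs 45 seconds to start is killed on every attempt and never gets further.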
Fix
The fixes are well known: add a startup probe to allow for startup time, increase the liveness initial delay, or, in some cases, remove the liveness probe entirely.
- Add startup probe.: The startup probe runs only at startup. Liveness and readiness probes are held off until the startup probe succeeds. The application has time to start; the liveness probe activates only after startup completes. If the startup probe itself exhausts its failureThreshold, the container is restarted.
- Increase liveness initial delay.: The initialDelaySeconds field on the liveness probe specifies how long to wait before the first check. Increasing it gives the application time to start; the first check happens only after a reasonable startup time has passed.
- Disable liveness if startup is the only issue.: Some applications do not benefit from liveness probes. If the probe fails only during startup and the application otherwise recovers on its own (or exits when it cannot), the liveness probe produces nothing but unnecessary restarts; removing it eliminates the loop.
- Test the fix.: The team verifies the fix works. The application starts correctly; the liveness probe does not produce restarts; the pod stays up.
- Document the pattern.: The team's runbooks include this pattern. Future encounters of the same symptom produce faster diagnosis; the institutional knowledge is preserved.
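As an illustration of the startup-probe fix, here is a hedged sketch of a container spec, written as a Python dict that mirrors the manifest fields and can be dumped to JSON. The container name, image, port, and the /healthz and /ready paths are hypothetical; the probe field names (startupProbe, livenessProbe, readinessProbe, httpGet, periodSeconds, failureThreshold) are the standard ones. The helper startup_budget_seconds is an assumption of this sketch, not a Kubernetes API.

```python
import json

# Hypothetical container spec; names, image, port, and paths are placeholders.
container = {
    "name": "api",
    "image": "example.com/api:1.0",
    # Gives the application time to start; other probes wait until this succeeds.
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 30,
    },
    # Checks only the process itself, not downstream dependencies.
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
    # Gates traffic; a failure here removes the pod from service without a restart.
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "periodSeconds": 5,
    },
}


def startup_budget_seconds(spec: dict) -> int:
    """Approximate time the startup probe allows before the container is restarted."""
    probe = spec["startupProbe"]
    return probe.get("initialDelaySeconds", 0) + (
        probe.get("failureThreshold", 3) * probe.get("periodSeconds", 10)
    )


if __name__ == "__main__":
    print(json.dumps(container, indent=2))
    print("startup budget (s):", startup_budget_seconds(container))
```

With these sample values the startup probe allows roughly 30 × 10 = 300 seconds for startup, while the liveness probe keeps its short failure window for detecting hangs once the application is running. The right numbers depend on the application's measured startup time.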
Liveness restart loops are one of those Kubernetes operational patterns that pay off when recognized. Nova AI Ops integrates with cluster pod telemetry, surfaces restart patterns, and supports the team's diagnosis discipline.