Readiness Probe Failure Modes
Misconfigured readiness probes cause traffic-routing problems that affect users directly. Each failure mode has a characteristic signature; the discipline is recognizing it and applying the right fix. The two modes covered here are flapping and dependent readiness, followed by the design principle that prevents both.
Flapping
What flapping looks like:
- Probe passes and fails rapidly: The probe alternates between pass and fail, so the pod cycles quickly between Ready and NotReady and cannot be relied on.
- Pod cycles in and out of endpoints: While the pod is Ready, it is listed in the Service's endpoints; when it turns NotReady, it is removed. The churn disrupts traffic: some requests reach the pod, and others fail.
How to fix it:
- Tune the timeout: The probe's timeoutSeconds may be too short for the endpoint's real response time. A network blip causes a timeout and the probe fails; if failureThreshold is low, the pod is removed at once. The next probe succeeds, the pod returns, and the cycle repeats.
- Tune the threshold: failureThreshold sets how many consecutive probe failures mark the pod NotReady. Raising it makes the probe less sensitive, so transient failures no longer cause instability.
- Investigate in order: Check the probe configuration, measure the application's actual response time on the probe endpoint, identify the mismatch, and apply the fix.
Flapping is the most visible failure mode. The cycle is recognizable; the fix is configuration tuning.
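The effect of failureThreshold on flapping can be illustrated with a small simulation. This is a simplified sketch of consecutive-failure and consecutive-success counting, not the kubelet's actual implementation; the function names and the probe history are invented for illustration.

```python
from typing import List


def simulate_ready_states(
    probe_results: List[bool],
    failure_threshold: int,
    success_threshold: int = 1,
    initially_ready: bool = True,
) -> List[bool]:
    """Return the Ready state after each probe result.

    Simplified model: the pod turns NotReady after `failure_threshold`
    consecutive failures and Ready again after `success_threshold`
    consecutive successes.
    """
    ready = initially_ready
    consecutive_failures = 0
    consecutive_successes = 0
    states = []
    for passed in probe_results:
        if passed:
            consecutive_successes += 1
            consecutive_failures = 0
            if not ready and consecutive_successes >= success_threshold:
                ready = True
        else:
            consecutive_failures += 1
            consecutive_successes = 0
            if ready and consecutive_failures >= failure_threshold:
                ready = False
        states.append(ready)
    return states


def count_transitions(states: List[bool], initially_ready: bool = True) -> int:
    """Count how often the pod moved in or out of the endpoints."""
    transitions = 0
    previous = initially_ready
    for state in states:
        if state != previous:
            transitions += 1
        previous = state
    return transitions


# Hypothetical probe history: isolated timeouts (False) among passes (True).
history = [True, False, True, True, False, True, False, True, True, True]

sensitive = simulate_ready_states(history, failure_threshold=1)
tolerant = simulate_ready_states(history, failure_threshold=3)

print(count_transitions(sensitive))  # prints 6
print(count_transitions(tolerant))   # prints 0
```

With a threshold of 1, every isolated timeout removes the pod and the next success restores it. With a threshold of 3, the same history never changes the pod's state. The trade-off is that a genuinely unhealthy pod stays in the endpoints for more probe periods before it is removed.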
Dependent
The dependent failure mode occurs when the readiness probe checks a downstream service. The pod's readiness then depends on something outside the pod; when that dependency fails, every pod becomes unready at once and the failure cascades.
- Readiness checks a downstream service: The readiness probe verifies that the downstream service is reachable. The intent is reasonable: do not serve traffic if the downstream is unavailable.
- Cascade when the downstream goes down: Every pod's readiness probe fails at the same time. All pods are removed from the endpoints, and the service becomes completely unavailable.
- Prefer serving degraded: The right choice depends on the use case, but serving with degraded functionality is often better than serving nothing, and the readiness probe should reflect that.
- Use circuit breakers instead: If the team wants to fail fast on downstream issues, circuit breakers in the application code are a better tool than readiness probes. The pod stays Ready, and the application handles the downstream issue.
- Recognize the signature: When all pods become unready simultaneously, suspect a downstream dependency. Once the investigation confirms the cascade, the fix is to remove the dependency from the readiness check.
Dependent readiness probes are a recurring anti-pattern. The discipline is keeping readiness self-focused.
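A minimal circuit breaker might look like the following. This is a sketch under stated assumptions rather than a production implementation: the class, its parameters (failure_limit, reset_after), and the fetch_recommendations downstream call are all hypothetical, and a real service would more likely use an established resilience library.

```python
import time
from typing import Any, Callable, Optional


class CircuitBreaker:
    """Stop calling a failing downstream for a cool-down period."""

    def __init__(
        self,
        failure_limit: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_limit = failure_limit
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def is_open(self) -> bool:
        if self.opened_at is None:
            return False
        # After the cool-down, allow one trial call through.
        return self.clock() - self.opened_at < self.reset_after

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.is_open():
            # Fail fast: skip the downstream and serve the fallback.
            return fallback()
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_limit:
                self.opened_at = self.clock()
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def fetch_recommendations() -> list:
    """Hypothetical downstream call that is currently failing."""
    raise ConnectionError("downstream unavailable")


def readiness() -> int:
    """Readiness reports on this process only; it makes no downstream call."""
    return 200


breaker = CircuitBreaker(failure_limit=2, reset_after=30.0)
for _ in range(3):
    # Each request gets an empty list (degraded) instead of an error.
    response = breaker.call(fetch_recommendations, fallback=lambda: [])

print(response, breaker.is_open(), readiness())
```

The downstream failure is absorbed inside the request path: after two failures the breaker opens and later requests get the fallback immediately, while the readiness handler keeps returning 200 and the pod stays in the endpoints.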
Design
The design principle is that readiness should reflect the pod's own state, not the state of its dependencies. Liveness and readiness describe the pod itself; downstream concerns belong elsewhere.
- Readiness reflects this pod: The pod's readiness means "I am ready to serve traffic." Whether downstream services are healthy is a separate question, and readiness should not conflate the two.
- Not dependencies: Dependency health is observable separately. Application metrics, downstream service monitoring, and distributed tracing all surface dependency issues; readiness is not the right place for them.
- Liveness and readiness for self only: Both probes are about the pod itself. Liveness asks "should I be restarted?" Readiness asks "should I receive traffic?" Neither should depend on external systems.
- Document the rule: The team's standards should include this rule. New services follow it, and existing services that violate it are remediated, so the discipline stays consistent.
- Test the design: The team should test what happens when downstream services fail. Pods should stay Ready, the application should handle the failure, and users should see degradation rather than complete unavailability.
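The "test the design" rule can be expressed as an automated check. The sketch below assumes a hypothetical Service class whose readiness depends only on local initialization; a real test would exercise the service's actual probe endpoint with the downstream stubbed out or blocked.

```python
class Service:
    """Hypothetical service with self-focused readiness."""

    def __init__(self, downstream):
        self.downstream = downstream  # callable standing in for a client
        self.initialized = False

    def start(self) -> None:
        # Local startup work only (config loaded, caches warmed, and so on).
        self.initialized = True

    def readiness(self) -> bool:
        # Reports on this process; never calls the downstream.
        return self.initialized

    def handle_request(self) -> dict:
        try:
            return {"status": "ok", "data": self.downstream()}
        except Exception:
            # Downstream failure degrades the response instead of failing it.
            return {"status": "degraded", "data": None}


def broken_downstream():
    raise TimeoutError("downstream timed out")


def test_pod_stays_ready_when_downstream_fails() -> None:
    service = Service(downstream=broken_downstream)
    service.start()
    assert service.readiness() is True
    assert service.handle_request()["status"] == "degraded"


test_pod_stays_ready_when_downstream_fails()
```

A test like this fails as soon as someone adds a downstream call to the readiness handler, which keeps the documented rule enforceable.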
Recognizing readiness failure modes is one of those Kubernetes operational disciplines that pays off in correctly designed readiness probes. Nova AI Ops integrates with cluster pod telemetry, surfaces readiness patterns, and supports the team's probe-design discipline.