Kubernetes · By Samson Tanimawo, PhD · Published May 14, 2025 · 9 min read

Kubernetes Probes Deep Dive: Liveness, Readiness, Startup, What Breaks and Why

Most Kubernetes outages involving pod restarts trace back to a misconfigured probe. Here is the difference between the three probe types and the two mistakes that break them.

The three probes, briefly

Each probe answers a different question and triggers a different action:

Liveness: is the process alive?

Liveness is the probe that causes restarts. Keep it simple and cheap: a TCP check on the main port is usually enough. An HTTP check that touches a database is a disaster waiting to happen: the database has a bad afternoon, every pod fails liveness, the kubelet restarts them all, and now you have a stampede against an already slow database.

Liveness probes should fail only when the process itself is wedged. If the dependency is down, that is readiness's job.
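A minimal liveness block in that spirit might look like the following; the port and timings are illustrative, not a recommendation for any particular app.

```yaml
# Hypothetical container spec fragment: a cheap TCP liveness check.
livenessProbe:
  tcpSocket:
    port: 8080          # the container's main serving port (assumed)
  initialDelaySeconds: 10
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3   # ~30s of consecutive failures before a restart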

Readiness: can it handle traffic?

Readiness can be expensive and dependency-aware. It is fine for it to check “can I reach my primary cache, my database, my auth service?” because failing readiness only removes the pod from the Service's endpoints, taking it out of the load balancer; it does not kill the process.

A common pattern is to have the readiness endpoint flip to failing 30 seconds before shutdown to let in-flight connections drain. Combined with a terminationGracePeriodSeconds that matches, this eliminates the shutdown-related 502 spikes almost every service sees.
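One common way to sketch that drain window in a pod spec is a preStop sleep that keeps the container serving while endpoints deregister; the endpoint path, port, and durations below are assumptions, and the "readiness flips to failing" variant additionally requires support in the app itself.

```yaml
# Hypothetical pod spec fragment: dependency-aware readiness plus a drain window.
terminationGracePeriodSeconds: 45   # must cover the preStop sleep plus shutdown
containers:
  - name: app
    readinessProbe:
      httpGet:
        path: /readyz               # assumed endpoint; may check DB, cache, auth
        port: 8080
      periodSeconds: 5
      failureThreshold: 2
    lifecycle:
      preStop:
        exec:
          command: ["sleep", "30"]  # keep serving while the endpoint is removed
```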

Startup: is it still booting?

Added in Kubernetes 1.16 and stable since 1.20. If your app takes more than 30 seconds to start (JVM warmup, migration replay, large cache hydration), use a startup probe so that liveness doesn't trigger restarts during boot.

The startup probe runs only until it first succeeds for a given container start; after that, liveness and readiness kick in as normal. No more “the pod crashloops for the first two minutes, then works fine” nonsense.
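A sketch of a startup probe budgeting five minutes of boot before liveness takes over; the endpoint and numbers are placeholders to size against your own warmup.

```yaml
# Hypothetical fragment: allow up to 5 minutes of boot before liveness applies.
startupProbe:
  httpGet:
    path: /healthz      # assumed endpoint
    port: 8080
  periodSeconds: 10
  failureThreshold: 30  # 30 × 10s = 300s budget for JVM warmup, migrations, etc.
```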

The two mistakes that cause restarts

  1. Liveness probe depends on the database. The database slows, every pod restarts, cascading failure. Fix: liveness checks the process only; readiness checks dependencies.
  2. Liveness timeout too aggressive for GC pauses. A JVM app with a 2-second probe timeout hits a 3-second stop-the-world GC pause, and the kubelet kills the container. Fix: tune periodSeconds and timeoutSeconds to your app's 99.9th-percentile pause, plus a startup probe for warmup.

Audit every deployment spec this quarter. These two mistakes account for more than half the pod-restart incidents most teams have.


3 probe types · 50%+ of restart incidents trace to 2 root causes

The audit checklist

Walk every deployment manifest in the cluster. For each one, answer two questions in writing.

Does the liveness probe touch a dependency? If yes, move that check to readiness and make liveness a simple TCP or in-process check.

Is the liveness timeout larger than your 99.9th-percentile garbage-collection pause? If not, raise it, or add a startup probe to cover warmup.
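The audit can be partly mechanized. The sketch below walks a parsed container spec (the dict shape matches `spec.template.spec.containers[]` in a Deployment manifest) and flags the two mistakes; the heuristics are assumptions — an httpGet liveness probe *may* touch a dependency and still needs a human look, and the 2-second constant stands in for your measured GC pause.

```python
# Rough audit sketch: flag the two probe mistakes in a parsed container spec.
# Heuristics are assumptions: an httpGet liveness probe *may* be dependency-aware,
# and GC_PAUSE_P999_SECONDS stands in for your app's measured 99.9th-percentile pause.

GC_PAUSE_P999_SECONDS = 2.0  # replace with your measured value

def audit_container(container: dict) -> list[str]:
    findings = []
    liveness = container.get("livenessProbe")
    if liveness is None:
        return findings
    # Mistake 1: an HTTP liveness endpoint may check a database or other dependency.
    if "httpGet" in liveness:
        findings.append(
            f"{container['name']}: liveness is httpGet "
            f"({liveness['httpGet'].get('path', '/')}) - verify it does not "
            "touch a dependency; prefer tcpSocket or an in-process check"
        )
    # Mistake 2: timeout not comfortably above the worst-case GC pause.
    timeout = liveness.get("timeoutSeconds", 1)  # Kubernetes default is 1s
    if timeout <= GC_PAUSE_P999_SECONDS:
        findings.append(
            f"{container['name']}: liveness timeoutSeconds={timeout} is not above "
            f"the {GC_PAUSE_P999_SECONDS}s GC pause budget - raise it or add a startupProbe"
        )
    return findings

# Example: a container that exhibits both mistakes.
container = {
    "name": "api",
    "livenessProbe": {
        "httpGet": {"path": "/health", "port": 8080},
        "timeoutSeconds": 2,
    },
}
for finding in audit_container(container):
    print(finding)
```

Feeding it real manifests is then a matter of parsing each Deployment (e.g. with a YAML library) and looping over its containers.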