LB Health Check Tuning

Frequency, threshold.

Overview

Load balancer health check tuning matches probe frequency, threshold, and timeout to actual backend behavior. Default settings work for some workloads and produce flapping or slow failover for others. The discipline is in distinguishing liveness (should the pod restart) from readiness (should this pod receive traffic), tuning thresholds against actual response patterns, and documenting the per-service rationale so the choices are reviewable.

Frequency and threshold. How often the probe fires and how many failures before unhealthy; tune against workload behavior, not framework defaults.
Probe path matters. /health for general health, /ready for "should I receive traffic", /live for "should I be restarted"; the path encodes the semantics.
Timeout matters. Probe timeout less than interval prevents overlapping probes; otherwise the probes pile up under stress.
Healthy/unhealthy thresholds plus per-backend semantics. N consecutive checks before state change avoids flapping; liveness/readiness/startup probes match different lifecycle questions.

The approach

The practical approach is short interval (5 to 15 seconds catches failures fast without taxing the backend), deliberate thresholds (2 unhealthy to mark down, 3 healthy to bring back, prevents flapping), separate liveness and readiness probes (liveness restarts the pod, readiness removes it from rotation), probe timeout strictly less than interval (no overlapping probes), and per-service health-check rationale documented in the service repo so the tuning is reviewable.

Short interval. 5 to 15 second interval; catches failures fast without taxing the backend with probe overhead.
Deliberate thresholds. 2 unhealthy to mark down, 3 healthy to bring back; the asymmetry prevents flapping under transient blips.
Separate liveness and readiness. Liveness restarts the pod; readiness removes it from LB rotation; conflating them produces wrong remediation.
Probe timeout less than interval plus documented policy. Probe timeout strictly less than interval to prevent overlap; per-service rationale committed for operational review.

Why this compounds

Health check discipline compounds across services. Each tuned check produces reliable failover where the default would flap; each per-service rationale survives team turnover; the team builds intuition for probe tuning that pays off on every new service. Without the discipline, every service ships with framework defaults and produces the same flapping incidents.

Failure detection. Right thresholds catch real failures fast; the LB removes the failing pod before the user sees it.
Release safety. Readiness probes support rolling deploys; new pods do not receive traffic until they are actually ready.
Reduced false positives. Right thresholds reduce flapping; the LB does not churn pods on transient blips.
Institutional knowledge. Each tuning teaches LB patterns; the team learns where probe defaults match their workload and where they do not.

Health check discipline is an operational discipline that pays off across years. Nova AI Ops integrates with LB telemetry, surfaces probe patterns, and supports the team’s reliability discipline.