LB Health Check Tuning

Probe frequency, thresholds, and timeouts.

Overview

Load balancer health check tuning matches probe frequency, thresholds, and timeouts to actual backend behavior. Default settings work for some workloads and produce flapping or slow failover for others. The discipline lies in distinguishing liveness (should the pod be restarted?) from readiness (should this pod receive traffic?), tuning thresholds against observed response patterns, and documenting the per-service rationale so the choices are reviewable.
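The liveness versus readiness distinction can be illustrated with a minimal Python sketch. The ServiceState fields and the two function names are hypothetical, chosen only for illustration, and the 500 and 503 status codes are one common convention rather than a requirement of any particular load balancer.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Hypothetical snapshot of what the service knows about itself."""
    event_loop_responsive: bool = True  # the process can still do work
    dependencies_ready: bool = True     # e.g. a database pool is connected
    draining: bool = False              # graceful shutdown in progress


def liveness_status(state: ServiceState) -> int:
    """Liveness: fail only when a restart would help (the process is wedged)."""
    return 200 if state.event_loop_responsive else 500


def readiness_status(state: ServiceState) -> int:
    """Readiness: fail whenever the pod should not receive traffic."""
    ready = (
        state.event_loop_responsive
        and state.dependencies_ready
        and not state.draining
    )
    return 200 if ready else 503
```

In this sketch a pod whose dependency is briefly unavailable stays live but reports not ready, so it leaves rotation without being restarted.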

The approach

The practical approach has five parts. Keep the probe interval short: 5 to 15 seconds catches failures quickly without taxing the backend. Set thresholds deliberately: 2 consecutive failures to mark a backend down and 3 consecutive successes to bring it back, which prevents flapping. Separate liveness and readiness probes: a failed liveness probe restarts the pod, while a failed readiness probe only removes it from rotation. Keep the probe timeout strictly less than the interval so probes never overlap. Finally, document the health-check rationale for each service in its repo so the tuning is reviewable.
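How asymmetric thresholds damp flapping can be sketched as a small state machine in Python. This is an illustrative model under stated assumptions, not any load balancer's actual implementation: ProbeConfig, BackendHealth, and the default values (10 second interval, 2 second timeout) are hypothetical names and numbers picked to match the ranges above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProbeConfig:
    """Hypothetical probe settings; defaults follow the ranges in the text."""
    interval_s: float = 10.0
    timeout_s: float = 2.0
    unhealthy_threshold: int = 2  # consecutive failures to mark down
    healthy_threshold: int = 3    # consecutive successes to bring back

    def validate(self) -> None:
        # A timeout at or above the interval would let probes overlap.
        if self.timeout_s >= self.interval_s:
            raise ValueError("timeout must be strictly less than interval")
        if self.unhealthy_threshold < 1 or self.healthy_threshold < 1:
            raise ValueError("thresholds must be at least 1")


class BackendHealth:
    """Tracks one backend and flips state only after a full streak."""

    def __init__(self, config: ProbeConfig, healthy: bool = True) -> None:
        config.validate()
        self.config = config
        self.healthy = healthy
        self._streak = 0  # consecutive results that contradict current state

    def record(self, probe_ok: bool) -> bool:
        """Record one probe result and return the resulting health state."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with state, so reset the streak
            return self.healthy
        self._streak += 1
        needed = (
            self.config.healthy_threshold
            if probe_ok
            else self.config.unhealthy_threshold
        )
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


if __name__ == "__main__":
    backend = BackendHealth(ProbeConfig())
    for ok in [True, False, True, False, False, True, True, True]:
        print(ok, "->", backend.record(ok))
```

With these settings a single failed probe does not remove the backend from rotation, two consecutive failures do, and three consecutive successes are needed before it returns.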

Why this compounds

Health check discipline compounds across services. Each tuned check produces reliable failover where the default would flap; each per-service rationale survives team turnover; the team builds intuition for probe tuning that pays off on every new service. Without the discipline, every service ships with framework defaults and produces the same flapping incidents.

Health check tuning is an operational discipline that pays off over years. Nova AI Ops integrates with LB telemetry, surfaces probe patterns, and supports the team’s reliability practice.