HPA Tuning for Real Workloads

Default HPA settings are conservative. These notes cover the tuning that catches bursts: metric choice, thresholds, and the failure modes to avoid.

Metrics

HPA scales only as well as the metric pointed at it. CPU is the default and the wrong choice for most user-facing services. RPS, queue depth, in-flight requests, or p95 latency predict real load far better than CPU does.

CPU by default. CPU-utilisation autoscaling per deployment. Cheapest to wire up; weakest predictor of real load.
Custom metrics. RPS, queue depth, in-flight requests per service. Better predictors of actual load.
Latency-based HPA. p95 latency target per user-facing service. Scales when user experience degrades, not when CPU does.
Source of truth per metric. Prometheus adapter or external metrics API. Choose the source deliberately rather than by default.
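
As a rough sketch of what pointing HPA at a custom metric looks like, the Python below builds the metrics stanza of an autoscaling/v2 HorizontalPodAutoscaler as plain dictionaries and prints it as JSON. The metric names http_requests_per_second and queue_depth, the target values, and the helper names are assumptions for illustration; which metric names actually exist depends on how the Prometheus adapter or external metrics API is configured in a given cluster.

```python
import json


def pods_metric(name: str, average_value: str) -> dict:
    """Per-pod custom metric (for example RPS), averaged across pods."""
    return {
        "type": "Pods",
        "pods": {
            "metric": {"name": name},
            "target": {"type": "AverageValue", "averageValue": average_value},
        },
    }


def external_metric(name: str, average_value: str) -> dict:
    """Metric from outside the cluster (for example queue depth), divided per pod."""
    return {
        "type": "External",
        "external": {
            "metric": {"name": name},
            "target": {"type": "AverageValue", "averageValue": average_value},
        },
    }


# Hypothetical metric names and targets; substitute whatever the adapter exposes.
metrics = [
    pods_metric("http_requests_per_second", "100"),
    external_metric("queue_depth", "30"),
]

print(json.dumps(metrics, indent=2))
```

When several metrics are listed, HPA evaluates each one and acts on the largest replica count proposed, so adding a custom metric alongside CPU does not weaken the CPU safeguard.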

Thresholds

Threshold tuning is asymmetric. Add capacity fast at 60-70 percent utilisation to keep headroom; remove capacity slowly with a 5-minute stabilisation window so transient dips do not cause flapping.

Scale-up at 60-70 percent. Headroom buffer per deployment. Triggers before saturation.
Scale-down delay. 5-minute minimum stabilisation per deployment. Prevents flapping on transient dips.
Asymmetric posture. Eager-up, conservative-down per deployment. Capacity is cheaper than tail latency.
Load-test calibration. Threshold validated under synthetic load per service. Catches over-tight thresholds before prod.
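
The asymmetry above can be illustrated with a small model of the documented HPA calculation: desired replicas are the ceiling of current replicas times the ratio of observed to target metric, and on scale-down the controller acts on the highest recommendation seen inside the stabilisation window. This is a simplified sketch, not the controller's implementation: it ignores scaling policies, min and max clamping, and unready pods. The function names and sample numbers are assumptions, and the 0.1 tolerance reflects the usual cluster default, which may be configured differently.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Core HPA formula: ceil(currentReplicas * currentValue / targetValue).

    No change is made while the ratio stays within the tolerance band around 1.0.
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


def stabilised_scale_down(history, now, window_seconds=300):
    """Return the highest recommendation recorded within the window.

    history is a list of (timestamp_seconds, recommended_replicas) pairs and
    must contain at least one entry inside the window.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    return max(recent)


# 10 pods at 90% CPU against a 65% target: scale up before saturation.
print(desired_replicas(10, 90, 65))  # 14

# 10 pods at 68% against a 65% target: inside the tolerance band, no change.
print(desired_replicas(10, 68, 65))  # 10

# A transient dip to 6 at t=120 is ignored while 14 is still inside the window.
history = [(0, 14), (60, 14), (120, 6), (180, 13)]
print(stabilised_scale_down(history, now=180))  # 14
```

Replaying recorded load-test metrics through a model like this is one way to sanity-check a threshold before trying it on a live deployment.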

Avoid

Three failure modes recur. Aggressive scale-down causes flapping; untuned stabilisation defaults break bursty workloads; min and max replicas that were never set deliberately produce scale-to-zero or runaway scale-up surprises.

Aggressive scale-down. No-fast-shrink rule per deployment. Flapping wastes capacity and churns pods needlessly.
Stabilisation windows tuned per service. Bursty workloads need different windows than steady-state ones.
Explicit min and max replicas. Per deployment, not defaults. Catches scale-to-zero or runaway scale.
At-cap alarm. Alert per deployment when max is hit. Catches saturation before customers do.
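
A minimal sketch of the guardrails above, assuming a Deployment named checkout (hypothetical) and illustrative numbers throughout. It builds an autoscaling/v2 manifest with explicit bounds and an eager-up, conservative-down behavior block, then runs two small checks. The helpers validate_bounds and at_cap are invented for this example and are not part of any Kubernetes client library; in practice the at-cap check would more likely live in an alerting rule than in a script.

```python
import json

hpa = {
    "apiVersion": "autoscaling/v2",
    "kind": "HorizontalPodAutoscaler",
    "metadata": {"name": "checkout"},  # hypothetical service name
    "spec": {
        "scaleTargetRef": {
            "apiVersion": "apps/v1",
            "kind": "Deployment",
            "name": "checkout",
        },
        # Explicit bounds, chosen per deployment (values are illustrative).
        "minReplicas": 3,
        "maxReplicas": 30,
        "metrics": [
            {
                "type": "Resource",
                "resource": {
                    "name": "cpu",
                    "target": {"type": "Utilization", "averageUtilization": 65},
                },
            }
        ],
        "behavior": {
            # Eager up: no stabilisation, allow doubling every 15 seconds.
            "scaleUp": {
                "stabilizationWindowSeconds": 0,
                "policies": [{"type": "Percent", "value": 100, "periodSeconds": 15}],
            },
            # Conservative down: 5-minute window, at most 2 pods removed per minute.
            "scaleDown": {
                "stabilizationWindowSeconds": 300,
                "policies": [{"type": "Pods", "value": 2, "periodSeconds": 60}],
            },
        },
    },
}


def validate_bounds(spec: dict) -> list:
    """Return a list of problems with the replica bounds; empty means OK."""
    problems = []
    if "minReplicas" not in spec:
        problems.append("minReplicas not set explicitly")
    elif spec["minReplicas"] < 1:
        problems.append("minReplicas below 1 allows scale-to-zero")
    if "maxReplicas" not in spec:
        problems.append("maxReplicas not set")
    elif spec["maxReplicas"] < spec.get("minReplicas", 1):
        problems.append("maxReplicas is below minReplicas")
    return problems


def at_cap(status: dict, spec: dict) -> bool:
    """True when the HPA is running at its configured maximum."""
    return status["currentReplicas"] >= spec["maxReplicas"]


print(validate_bounds(hpa["spec"]))
print(at_cap({"currentReplicas": 30}, hpa["spec"]))
print(json.dumps(hpa, indent=2))
```

The JSON printed at the end can be saved to a file and applied with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.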