Autoscaling That Doesn't Oscillate
An autoscaler that scales up at minute 1, down at minute 3, up at minute 5 is worse than no autoscaler. The cooldown, hysteresis, and metric-window settings that prevent the oscillation.
Why autoscalers oscillate
An oscillating autoscaler scales up, then scales down, then scales up again, every few minutes, all day. Capacity churns; instances boot only to be terminated; warm-up time wastes money; users see latency spikes during the churn. The pattern has three causes; understanding them tells you which knobs to turn.
Cause 1: target utilisation set too tight. Scaler is told to keep CPU between 60% and 65%. Real CPU is bursty; it crosses both thresholds in normal operation. Each crossing triggers a scale event. The fix is a wider band (50-75%) so normal noise doesn't trigger action.
Cause 2: noisy metrics. Scaler reads CPU every 10 seconds; raw values are spiky; one bad sample triggers a scale-up; the next sample looks fine and triggers scale-down. The fix is smoothing: average over a longer window so transient spikes don't cross thresholds.
Cause 3: scale-up drives the metric down (or vice versa). Adding a pod to a service immediately drops per-pod CPU because load is now spread across more pods. Without cooldown, the scaler sees "CPU dropped, scale down" and removes the pod it just added. Cooldown periods prevent this self-defeating cycle.
The diagnostic. Plot replica count over 24 hours. A healthy autoscaler shows smooth ramps and flat plateaus. An oscillating one shows a sawtooth pattern, and that's the smoking gun for one of the three causes above.
Three knobs every autoscaler exposes
Cooldown, hysteresis, and metric window. Every cloud provider's autoscaler has these (named differently). Tune them; don't just accept the defaults.
Defaults matter because vendors choose them for "average workload", which is no real workload. Spend 30 minutes reading your autoscaler's docs and find each knob. The investment pays back in stable scaling for years.
Cooldown
Time between scale events. After scaling up, wait N minutes before scaling again. Prevents thrash when scale-up takes time to "show" in metrics.
How to set it. Cooldown should be at least as long as your warm-up time plus one metric window. If pods take 60 seconds to become ready and your metric window is 60 seconds, cooldown should be at least 120 seconds. Shorter cooldown means the scaler reacts to metrics that haven't yet absorbed the previous scale's effect.
Asymmetric cooldown. Scale-up and scale-down often need different cooldowns. Scale up after a short cooldown (capacity matters during a traffic surge); scale down after a longer cooldown (capacity is cheap; thrashing is expensive). A 60-second scale-up cooldown and 300-second scale-down cooldown is a reasonable starting point.
The k8s HPA wrinkle. Kubernetes HPA's stabilisation window (`behavior.scaleDown.stabilizationWindowSeconds`) is the cooldown analog. Default is 300s for scale-down, 0s for scale-up. Most teams should leave scale-up at 0 (or set 30s) and bump scale-down to 600s for production workloads.
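As a concrete sketch, here is what asymmetric settings look like in an HPA v2 manifest. The Deployment name `web`, the CPU target, and the exact numbers are illustrative, not a prescription:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0     # react immediately when load arrives
      policies:
        - type: Percent
          value: 100                    # at most double the replica count...
          periodSeconds: 60             # ...per minute
    scaleDown:
      stabilizationWindowSeconds: 600   # sit on 10 minutes of "low" before shrinking
      policies:
        - type: Pods
          value: 1                      # drop at most one pod...
          periodSeconds: 120            # ...every two minutes
```

The `policies` entries are a second, independent brake: they cap how fast each direction is allowed to move, regardless of what the metric says.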
Hysteresis
Different thresholds for up and down. Scale up when CPU > 70%; scale down when CPU < 50%. The gap (20 points here) is the hysteresis; it prevents oscillation around a single threshold.
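The Kubernetes HPA doesn't expose two thresholds directly (its tolerance and stabilisation window do similar damping), but classic threshold-driven scalers do. A hedged CloudFormation-style sketch of the 70%-up / 50%-down pattern on an EC2 Auto Scaling group; the resource names are illustrative and the ASG (`WebAsg`) is assumed to be defined elsewhere in the template:

```yaml
Resources:
  ScaleUpPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebAsg
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: 1
      Cooldown: "60"                    # short scale-up cooldown
  ScaleDownPolicy:
    Type: AWS::AutoScaling::ScalingPolicy
    Properties:
      AutoScalingGroupName: !Ref WebAsg
      AdjustmentType: ChangeInCapacity
      ScalingAdjustment: -1
      Cooldown: "300"                   # longer scale-down cooldown
  HighCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref WebAsg
      Statistic: Average
      Period: 60                        # 60s granularity assumes detailed monitoring
      EvaluationPeriods: 2              # two high samples in a row before acting
      Threshold: 70                     # scale up above 70%...
      ComparisonOperator: GreaterThanThreshold
      AlarmActions: [!Ref ScaleUpPolicy]
  LowCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: AutoScalingGroupName
          Value: !Ref WebAsg
      Statistic: Average
      Period: 60
      EvaluationPeriods: 5              # slower to believe "load is gone"
      Threshold: 50                     # ...scale down only below 50%
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref ScaleDownPolicy]
```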
How to set the gap. Wider gap = more stable, slower to right-size. Narrower gap = faster to right-size, more oscillation risk. Start with a 20-30 point gap; narrow it once you've confirmed stability over a week.
The cost trade-off. Wider hysteresis means you run "over-provisioned" relative to a tighter scaler. For a fleet running at 75% target with 50% scale-down threshold, you're paying for capacity that sometimes sits at 50% utilisation. The cost of stability; usually worth it.
Working with predictive scaling. Some autoscalers (AWS Predictive Scaling, Azure's predictive autoscale) precompute capacity from historical patterns. They reduce reliance on hysteresis because they don't react to live spikes; they pre-scale. For predictable daily/weekly patterns, predictive scaling pairs well with looser hysteresis on top.
Metric window
How far back the scaler looks. A 60-second window with 10-second samples averages 6 samples before deciding. Smooths spikes; introduces lag.
Window length trade-offs. Short window (30s): scaler reacts fast; vulnerable to spikes. Long window (300s): stable; slow to scale during real traffic ramps. 60-120 seconds works for most web workloads. Background job processors benefit from longer windows (300-600s) because their work is naturally bursty.
The percentile choice. Some autoscalers can scale on p95 instead of mean. Mean is responsive to overall load; p95 is responsive to tail latency. For latency-sensitive services where p95 matters, scaling on p95 keeps tails honest at the cost of slightly more capacity.
Custom metrics. CPU and memory are the defaults; they're often the wrong metric. Queue depth, request rate, or active connections are usually better predictors of "do I need more capacity?" Most autoscalers (k8s HPA, AWS Application Auto Scaling) support custom metrics via Prometheus or CloudWatch.
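A hedged sketch of queue-depth scaling through the HPA's external metrics API. It assumes an adapter (Prometheus Adapter, for example) already serves a metric named `queue_depth`; the names and the target value are illustrative:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: worker
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: External
      external:
        metric:
          name: queue_depth
        target:
          type: AverageValue
          averageValue: "30"   # aim for ~30 queued jobs per worker pod
```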
The metric NOT to scale on
Average response time. Sounds reasonable; it's a trap. Adding capacity doesn't reduce response time linearly; sometimes it doesn't reduce it at all (DB-bound work, downstream-API-bound work). Scaling on response time creates feedback loops that don't converge.
The DB-bound failure mode. App pods scale up because response time is high. Response time is high because the database is slow. More app pods means more concurrent DB queries; database gets slower; response time gets worse; scaler adds more pods. The autoscaler is making the problem worse.
The right mental model. Scale on a metric that capacity actually fixes. CPU? Yes, more CPUs fix CPU saturation. Queue depth? Yes, more workers drain the queue. Response time? No, because response time depends on downstream resources you can't fix by adding pods.
The exception. If you have a profiled bottleneck and know it scales linearly with replicas, scaling on response-time-derived metrics can work. The exception requires actual measurement, not "we tried response time and it seemed to work."
The warm-up problem
JVM apps need to JIT. Caches need to fill. Even Go binaries need to load TLS configs. New instances spend 30-300 seconds being slower than warm ones. If autoscaler counts new instances as "fully contributing", it under-scales during ramps.
Kubernetes' answer. Readiness probes: pods don't take traffic until they're ready. The HPA shouldn't count not-ready pods toward capacity. Most modern HPA implementations get this right; older versions did not.
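For reference, a minimal readiness probe on the pod template; the path, port, and timings are assumptions about the app, not recommendations:

```yaml
# Fragment of a Deployment's pod template; path, port, and timings are illustrative.
containers:
  - name: app
    image: example/app:1.2.3
    ports:
      - containerPort: 8080
    readinessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15   # don't start probing before the app can plausibly be up
      periodSeconds: 5
      failureThreshold: 3       # ~15s of failed probes flips the pod back to not-ready
```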
The warm-up period setting. AWS EC2 Auto Scaling has `EstimatedInstanceWarmup` (seconds before a new instance's metrics count toward scaling decisions). Kubernetes has no per-HPA field for this; the controller-manager flags `--horizontal-pod-autoscaler-cpu-initialization-period` and `--horizontal-pod-autoscaler-initial-readiness-delay` play the equivalent role. Set the warm-up to your real warm-up time plus a 30-second buffer; under-setting it makes the scaler thrash during ramps.
The pre-warming strategy. For predictable spikes (Monday 9am traffic surge, Black Friday), trigger scale-up 5-10 minutes before the spike. Custom CronJob in k8s, or AWS Predictive Scaling. Pre-warming converts a chaotic scale-up into a planned one; users don't see the warm-up latency.
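One hedged way to do the k8s version: a CronJob that raises the HPA's floor shortly before the known spike. It assumes a ServiceAccount with RBAC permission to patch HPAs, and a mirror-image job (not shown) that lowers the floor again mid-morning:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: prewarm-web
spec:
  schedule: "50 8 * * 1-5"        # 08:50, Monday-Friday, controller's timezone
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: hpa-prewarmer   # assumed to have RBAC to patch HPAs
          restartPolicy: Never
          containers:
            - name: patch
              image: bitnami/kubectl:latest   # any image with kubectl works
              command:
                - kubectl
                - patch
                - hpa
                - web
                - --type=merge
                - -p
                - '{"spec":{"minReplicas":10}}'   # raise the floor ahead of the surge
```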
Common antipatterns
Scaling on a metric the workload doesn't actually depend on. Memory-scaling a CPU-bound workload (or vice versa). Confirm the metric is the binding constraint before tuning the scaler around it.
No min-replica floor. Workload scales to zero at night; cold-start hits the first morning request. Set min-replicas to 2-3 for any production workload; the saved cost isn't worth the cold-start UX hit.
No max-replica ceiling. A bug spikes traffic; scaler runs to 1,000 replicas; budget alarm fires. Set max-replicas to 3-5x your normal peak: high enough to absorb real surges, low enough to cap a runaway bug.
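Both bounds together, as a hedged fragment of an HPA spec; the numbers assume a normal peak of roughly ten replicas:

```yaml
# Fragment of an HPA spec; numbers are illustrative.
spec:
  minReplicas: 3    # rides out quiet hours without a cold start
  maxReplicas: 40   # ~4x normal peak: absorbs real surges, caps a runaway bug
```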
Custom metric pipeline that lags 5+ minutes. The scaler is acting on stale data; behaviour is unpredictable. Verify your metric's end-to-end latency is under 60 seconds.
What to do this week
Three moves. (1) Plot replica count for your highest-traffic service over 7 days. If you see a sawtooth, your autoscaler is oscillating; start by widening the hysteresis gap. (2) Find the warm-up time for your service (boot to first successful request) and set the autoscaler's warm-up parameter to match, plus a 30-second buffer. (3) Audit min/max replicas. Both should be set explicitly; defaults of 1 and infinity are wrong for production.