Right-Sizing Cloud Compute: A Data-Driven Quarterly Cadence
A one-time right-sizing exercise saves money for a quarter. A repeating cadence saves money forever. The discipline is small; the payback compounds.
Why right-sizing rots
You right-size in Q1 and save 18% on compute. By Q3 the savings have eroded to 4%. By Q1 next year, you are back where you started. Not because anything failed, because traffic patterns changed, workloads were added, autoscalers got tweaked, and nobody re-checked.
The discipline that holds: a quarterly review with a fixed scope and an owner. Without it, drift wins.
The four data points per workload
p99 CPU utilization over 30 days. Not average, average always looks fine. p99 tells you whether you have headroom for spikes.
Memory working set p99. Same logic. Average memory rarely catches the OOM-killer event.
Burst frequency. Times per week the workload exceeded 80% utilization. Frequent bursts mean a smaller instance will trip latency SLOs.
Cost. Current monthly spend per workload. Not just instance cost, include storage, network, and any per-resource fees.
The 90-day loop
Week 1: Pull the four metrics for every workload above $500/month. Smaller workloads add to noise; ignore them.
Week 2: Identify candidates. Rule of thumb: p99 CPU under 40% AND p99 memory under 50% AND zero bursts above 80% in 30 days = downsize one instance tier.
Week 3: Plan downsize batches. Group by team, batch into change windows, write the rollback plan.
Week 4-6: Execute, one batch per week. Watch for 7 days post-change before moving to the next batch.
Week 12: Retro. What saved? What broke? What policy needs updating?
Downsize without paging on-call
The single most important practice: change one workload at a time and watch it for 7 days before moving on. Cluster downsizes always look fine for 48 hours; reality kicks in on the next traffic spike or weekly batch job.
The rollback discipline. Before downsizing, write a rollback runbook (one command, ideally). Test it on a non-production instance. The downsize is reversible by design.
The off-hours rule. Do not downsize the night before peak traffic. Sounds obvious; gets violated all the time.
Antipatterns
One-time right-sizing. The discipline matters; the snapshot does not. Cadence beats heroics.
Optimising for average utilization. Average is fine; p99 is what kills you. Always look at the percentile.
Right-sizing without an SLO. Without a reliability target, you have no defence against "we should not have downsized that." Set the SLO; defend the decision against it.
What to do this week
Three moves. (1) Pull p99 CPU + memory for your top-10 most expensive workloads. (2) Pick the most-overprovisioned candidate; plan the downsize for next week. (3) Add the 90-day cadence to a calendar with an owner; the cadence dies without one.