Capacity Planning Without Spreadsheets
The annual capacity-planning spreadsheet is one of those rituals everyone hates and nobody questions. The teams that handle capacity well have replaced it with a smaller, continuous process.
The annual ritual and what it actually delivers
The annual capacity spreadsheet collects estimates from every team, multiplies them by a fudge factor, goes to leadership, and gets approved. By Q3 it is wrong because traffic moved differently, and by Q4 it is irrelevant. The team did the work; the work did not deliver capacity decisions.
The structural failure. Annual spreadsheets assume traffic is predictable and team needs are stable. Both are false. The forecast that's accurate at quarter 1 is stale at quarter 2 and wrong at quarter 3. By the time the spreadsheet is produced, the world has moved.
The cultural cost. Engineers spend 2-3 weeks in October producing capacity estimates that nobody trusts. Leadership reviews the spreadsheet and approves a fraction of what was requested. Engineers feel their effort was wasted; leadership feels engineers always ask for too much. Both are right; the process is the problem.
Three rolling forecasts
Replace it with three smaller, continuously updated forecasts: 30-day capacity, 90-day capacity, and 12-month directional. Each answers a different question. The 30-day asks "do we need to provision in the next sprint?" The 90-day asks "is the trajectory worth a quarterly review?" The 12-month is hand-wavy, and that's fine.
The horizon trade-off. Short horizons are accurate but only useful for tactical decisions. Long horizons are inaccurate but needed for strategic decisions (next year's budget, next year's hiring). Three horizons match three different decision types.
The continuous-update discipline. Each forecast updates monthly (or more often for the 30-day). The monthly review is small (an hour); it adjusts based on the previous month's actuals. Compounded over a year, the rolling forecast is dramatically more accurate than the annual spreadsheet because it gets to incorporate every month's data.
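A minimal sketch of what the 30-day forecast can look like in practice, assuming you can export a daily-peak utilisation series from your metrics store. The linear trend and the 85% breach threshold are illustrative assumptions, not features of any particular tool:

```python
from datetime import date, timedelta

def rolling_30_day_forecast(daily_peak_utilisation: list[float],
                            breach_threshold: float = 0.85) -> dict:
    """Project utilisation 30 days out from a simple linear trend.

    daily_peak_utilisation: one value per day (0.0-1.0), oldest first,
    e.g. the last 90 days of daily-peak CPU or connection utilisation.
    """
    n = len(daily_peak_utilisation)
    xs = list(range(n))
    mean_x = sum(xs) / n
    mean_y = sum(daily_peak_utilisation) / n
    # Least-squares slope: average utilisation growth per day.
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, daily_peak_utilisation))
             / sum((x - mean_x) ** 2 for x in xs))
    current = daily_peak_utilisation[-1]

    breach_date = None
    if slope > 0 and current < breach_threshold:
        breach_date = date.today() + timedelta(
            days=round((breach_threshold - current) / slope))

    return {
        "current_utilisation": round(current, 3),
        "trend_per_day": round(slope, 4),
        "projected_in_30_days": round(current + slope * 30, 3),
        "projected_breach_date": breach_date,
    }
```

Even a crude linear trend beats the annual guess, because its inputs refresh every month.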
Peak vs sustained
Most failures are at peak, not at sustained load. Plan for the daily peak (typically 2-4x sustained) plus enough headroom that the auto-scaler can keep up without panic. Forecasting only sustained load is how teams get caught at lunch on a Tuesday.
The peak math. A service with 1000 RPS sustained typically peaks at 2000-4000 RPS during the daily high. If capacity is sized for sustained load alone, the peak means 2-4x oversubscription. Even with 30% headroom over sustained (capacity for roughly 1300 RPS), the daily peak demands roughly 150-300% of what the system can serve: a failure mode either way.
The leverage move. Plot p99 of the per-minute RPS over a representative week. The peak isn't theoretical; it's there in the data. Size for this peak plus 30%; the auto-scaler handles the rest.
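A sketch of that sizing step, assuming you can export per-minute request rates for a representative week and have a measured per-replica throughput figure from a load test. The function name and the 30% default are illustrative:

```python
import math

def replicas_for_peak(per_minute_rps: list[float],
                      rps_per_replica: float,
                      headroom: float = 0.30) -> int:
    """Size replica count from the observed weekly peak, not sustained load.

    per_minute_rps: per-minute request rates over a representative week
    (~10,080 samples). rps_per_replica: measured single-replica capacity.
    """
    ordered = sorted(per_minute_rps)
    weekly_p99 = ordered[int(0.99 * (len(ordered) - 1))]  # the peak is in the data
    target = weekly_p99 * (1 + headroom)                  # peak plus 30%
    return math.ceil(target / rps_per_replica)
```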
The reorder-point model
From supply chains. For each tier of capacity, define the threshold at which you order more. When utilisation exceeds the threshold, the procurement is automatic. No deliberation, no exception. The decision was already made; the threshold is the contract.
The pattern in operation. Set the reorder point, say at 70% utilisation. When sustained utilisation stays above 70% for more than three days, automation provisions more capacity (or pages an engineer to do so). The team doesn't debate "should we add capacity?"; the threshold made the decision in advance.
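A minimal sketch of that trigger, assuming a daily job that can read a sustained-utilisation series. The 70%/3-day values are the thresholds from the example above; the helper calls in the comment are hypothetical stand-ins for your metrics store and provisioning workflow:

```python
REORDER_POINT = 0.70   # the threshold agreed in advance
SUSTAINED_DAYS = 3     # how long it must hold before the trigger fires

def reorder_point_hit(daily_sustained_utilisation: list[float]) -> bool:
    """True when utilisation has sat at or above the reorder point long enough.

    daily_sustained_utilisation: one value per day, most recent last.
    The decision was made when the threshold was set; this only detects
    that the trigger condition now holds.
    """
    recent = daily_sustained_utilisation[-SUSTAINED_DAYS:]
    return len(recent) == SUSTAINED_DAYS and all(u >= REORDER_POINT for u in recent)

# Example daily job (both helpers are hypothetical):
#   if reorder_point_hit(load_utilisation_history("checkout-db")):
#       open_provisioning_ticket("checkout-db", reason="reorder point hit")
```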
The discipline of pre-deciding. Most capacity disasters happen because the team hesitates at exactly the moment it should act. "Are we sure we need to order more?" The reorder point removes the hesitation; the team committed to the trigger when it set the threshold.
Headroom by tier
Not all capacity is equal. Stateless compute can grow in minutes; databases take days. Set headroom by lead time. Compute: 30% headroom is fine. Databases: 50% or more, because procurement plus migration plus warming takes longer than your traffic spike.
The lead-time framing. Compute scales in 60-180 seconds (cloud APIs are fast). Databases scale in hours-to-days (provisioning, replication catch-up). Persistent volumes can take days to migrate. Each tier needs enough headroom to cover its own lead time during a traffic spike.
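One way to make "headroom covers lead time" concrete is a rough rule of thumb, sketched below under the assumption that traffic can grow at some worst-case rate during a spike. The 5%/hour figure and the function name are illustrative; measure your own spike behaviour:

```python
def headroom_for_lead_time(lead_time_hours: float,
                           spike_growth_per_hour: float = 0.05) -> float:
    """Rough rule: headroom must absorb whatever traffic growth can occur
    while new capacity is still being provisioned. The 5%/hour spike
    growth rate is an assumption, not a measurement."""
    return lead_time_hours * spike_growth_per_hour

# A database with a 12-hour lead time at 5%/hour spike growth needs ~60%
# headroom; a compute tier with a 3-minute lead time needs almost none from
# lead time alone (its 30% is really margin for forecast error).
print(round(headroom_for_lead_time(12), 2))   # 0.6
```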
The expensive lesson. Teams set 30% headroom across the board, assuming all tiers behave like compute. The first major traffic spike, compute auto-scales fine but the database hits 100% utilisation and starts dropping queries. Compute had headroom; the database didn't have time to grow. Match headroom to lead time.
Monthly review, not annual
One hour a month, not one week a year. Look at the rolling forecasts, the reorder-point hits, and the headroom by tier. Decide if any thresholds should move. Document. Done.
The discipline of compactness. The temptation in the monthly meeting is to rebuild the spreadsheet from scratch. Resist it. The meeting is for adjustments, not redos. Reviewing the previous month's actuals against the previous forecast, then adjusting forward, takes minutes.
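A sketch of that adjustment step, assuming the forecast is a single utilisation number per service. The damping factor is an illustrative choice, not a prescription:

```python
def adjusted_forecast(previous_forecast: float, actual: float,
                      damping: float = 0.5) -> float:
    """One step of the monthly review: compare last month's forecast with
    the actual, then move the next forecast part-way toward the error
    rather than rebuilding it from scratch."""
    return previous_forecast + damping * (actual - previous_forecast)

# The forecast said 62% utilisation, the month came in at 68%:
# next month's baseline becomes 65% and the trend gets re-examined.
print(round(adjusted_forecast(0.62, 0.68), 2))   # 0.65
```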
The cumulative effect. Twelve monthly reviews of an hour each take 12 hours per year, far less than the 80+ hours an annual planning cycle consumes, and the result is more accurate because it incorporates each month's data continuously.
Common antipatterns
The "we'll provision when we need it" approach. Reactive provisioning means engineers debate during outages whether to scale. The reorder point removes the debate; without it, decisions slow down at the wrong moment.
Capacity planning ignored by deployment teams. The capacity team produces forecasts; the teams deploying features ignore them and ship changes that double the load anyway. Capacity planning has to feed into product/feature planning, not be a separate document.
Single capacity number for all dimensions. "Service A needs 4 more pods." But Service A's bottleneck is database connections, not pods; adding pods doesn't help. Decompose capacity into actual bottlenecks: connections, queries, queue depth, disk IOPS. Each has its own forecast.
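A sketch of what that decomposition can look like, assuming each service tracks per-dimension usage against a known limit. The dimension names and numbers are illustrative:

```python
# Each bottleneck dimension gets its own utilisation and its own forecast;
# "4 more pods" only matters for the dimension that pods actually relieve.
service_a = {
    "pods":           {"used": 12,    "limit": 20},
    "db_connections": {"used": 188,   "limit": 200},   # the real bottleneck
    "queue_depth":    {"used": 4000,  "limit": 50000},
    "disk_iops":      {"used": 2100,  "limit": 6000},
}

def tightest_dimension(capacity: dict) -> str:
    """The dimension closest to its limit is the one to forecast and fix."""
    return max(capacity, key=lambda d: capacity[d]["used"] / capacity[d]["limit"])

print(tightest_dimension(service_a))   # db_connections
```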
The "we have unlimited cloud" mindset. Cloud is elastic but not free. A team that auto-scales without limits gets cloud bills that dwarf the engineering team. Set both upper bounds (max capacity) and cost alarms (alert when bill exceeds budget).
What to do this week
Three moves. (1) Pick your most critical service. Define its rolling 30-day capacity forecast: current utilisation, trend, projected breach date. The first attempt is rough; iterate monthly. (2) Set a reorder point. "When sustained utilisation exceeds X%, we provision more." Document the threshold; commit to it. (3) Cancel next year's annual capacity-planning week. The team's calendar is the most visible signal that the planning has changed; freeing up that week is the move.