Warm Spare vs Cold Spare: Recovery Time Tradeoffs

Spare resources cost money. The decision between a warm spare (running) and a cold spare (provisioned but not running) trades recovery time against that cost.

Warm spare

Warm spares run continuously, ready to take traffic in seconds. Fastest failover, highest cost. The right choice when seconds of downtime are unacceptable; the wrong choice when minutes are fine and the cost of the spare adds up across the year.

Running and ready. Live capacity. Failover in seconds when the primary fails.
Cost is full and continuous. Always-on resource cost. A full-size warm spare for every primary roughly doubles capacity spend.
Synthetic traffic per spare. Periodic exercise. Catches the “warm but actually broken” failure mode that hides until failover.
Quarterly failover drill. Actual cutover test. Catches latent failover bugs before a real incident does.
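
The synthetic-traffic check above can be sketched as a small function that a scheduler runs periodically against each spare. This is a minimal illustration under stated assumptions, not a prescribed implementation: `run_synthetic_check`, the `send_request` callable, and the default thresholds are hypothetical. In practice `send_request` would issue a real request to the spare and validate the response.

```python
from typing import Callable


def run_synthetic_check(
    send_request: Callable[[], bool],
    attempts: int = 5,
    max_failures: int = 0,
) -> bool:
    """Return True if the spare handled the synthetic requests well enough.

    send_request is a hypothetical callable that sends one synthetic request
    to the spare and returns True on a correct response. An exception counts
    as a failure, so a spare that is up but erroring is reported as not ready.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            ok = send_request()
        except Exception:
            ok = False
        if not ok:
            failures += 1
    return failures <= max_failures


if __name__ == "__main__":
    # Stand-in for a real request; it always succeeds in this demo.
    print("spare ready:", run_synthetic_check(lambda: True))
```

Passing the request logic in as a callable keeps the check independent of any particular HTTP client or protocol, and makes it easy to test with fakes.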

Cold spare

Cold spares are provisioned but not running. Failover takes minutes (boot, warm caches, register with load balancer). Significantly cheaper. The right choice when minutes of downtime are acceptable.

Provisioned but not running. Dormant capacity. Failover in minutes.
Cost is storage and config only. Significantly cheaper than warm. The savings add up across the year.
Documented bring-up runbook. Boot-and-warm-up steps documented. Catches the “cold spare we never actually started” failure mode.
IaC parity with production. Same Terraform definition as production. Prevents configuration drift between primary and spare.
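
One way to keep the bring-up runbook honest is to record each step with an expected duration and check the total against the downtime the path can tolerate. The sketch below assumes the steps run one after another. The step names follow the bring-up sequence described above, while `RunbookStep`, `COLD_BRING_UP`, and every duration are hypothetical placeholders to be replaced with timings measured in a real drill.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RunbookStep:
    name: str
    expected_seconds: float  # hypothetical estimate, not a measured value


# Illustrative durations only; replace with timings from an actual bring-up.
COLD_BRING_UP: List[RunbookStep] = [
    RunbookStep("boot instance", 120.0),
    RunbookStep("warm caches", 180.0),
    RunbookStep("register with load balancer", 30.0),
]


def estimated_recovery_seconds(steps: List[RunbookStep]) -> float:
    """Sum the step durations, assuming the steps run sequentially."""
    return sum(step.expected_seconds for step in steps)


def meets_target(steps: List[RunbookStep], tolerable_downtime_seconds: float) -> bool:
    """True if the sequential bring-up fits within the tolerable downtime."""
    return estimated_recovery_seconds(steps) <= tolerable_downtime_seconds


if __name__ == "__main__":
    total = estimated_recovery_seconds(COLD_BRING_UP)
    print(f"estimated cold bring-up: {total:.0f} s")
    print("fits a 10-minute target:", meets_target(COLD_BRING_UP, 600))
```

If the estimate does not fit the target, either the runbook needs to get faster or the path is a candidate for a warm spare.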

Decide

The decision is criticality-driven. Warm for paths where seconds matter; cold for everything else; audit regularly to catch the over-provisioning that creeps in when teams default to warm.
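
The rule can be written down as a comparison between how long a path can be down and how long a cold spare takes to come up. This is a simplified sketch: `choose_spare` and its parameters are hypothetical, and a real decision would also weigh cost, dependencies, and how trustworthy the recovery estimate is.

```python
def choose_spare(tolerable_downtime_seconds: float, cold_recovery_seconds: float) -> str:
    """Pick the cheaper spare type whose recovery time fits the tolerance.

    Returns "cold" when a cold bring-up fits within the tolerable downtime,
    otherwise "warm".
    """
    if tolerable_downtime_seconds < 0 or cold_recovery_seconds < 0:
        raise ValueError("durations must be non-negative")
    if cold_recovery_seconds <= tolerable_downtime_seconds:
        return "cold"
    return "warm"


if __name__ == "__main__":
    # Illustrative numbers: a 5.5-minute cold bring-up.
    print(choose_spare(tolerable_downtime_seconds=600, cold_recovery_seconds=330))
    print(choose_spare(tolerable_downtime_seconds=5, cold_recovery_seconds=330))
```

Defaulting to cold whenever it fits reflects the cost argument: warm has to be justified by a downtime tolerance that cold cannot meet.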

Critical paths get warm. Where seconds of downtime are unacceptable, warm is required.
Less-critical paths get cold. Where minutes are acceptable, cold delivers real cost savings.
Quarterly warm-spare audit. Many teams over-provision warm. Downgrade to cold where minutes of downtime are genuinely acceptable.
Named owner per spare. Responsible team explicit. Catches the “everyone’s-and-no-one’s” failure mode.
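
The audit and ownership checks lend themselves to a small script run over an inventory of spares. The sketch below is illustrative only: the `Spare` record, its fields, and the example inventory are hypothetical, and a real audit would pull this data from wherever the team tracks its infrastructure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Spare:
    name: str
    kind: str  # "warm" or "cold"
    tolerable_downtime_seconds: float
    cold_recovery_seconds: float  # estimated or measured cold bring-up time
    owner: Optional[str] = None


def downgrade_candidates(spares: List[Spare]) -> List[str]:
    """Warm spares whose path could tolerate a cold bring-up."""
    return [
        s.name
        for s in spares
        if s.kind == "warm" and s.cold_recovery_seconds <= s.tolerable_downtime_seconds
    ]


def unowned(spares: List[Spare]) -> List[str]:
    """Spares with no named owner."""
    return [s.name for s in spares if not s.owner]


if __name__ == "__main__":
    # Hypothetical inventory for illustration.
    inventory = [
        Spare("checkout-db", "warm", 5, 330, owner="payments"),
        Spare("reporting-api", "warm", 900, 330, owner="analytics"),
        Spare("batch-worker", "cold", 3600, 330),
    ]
    print("downgrade candidates:", downgrade_candidates(inventory))
    print("missing owner:", unowned(inventory))
```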