Spot Instances at Scale: When the Savings Are Real

Spot pricing delivers large savings for the right workloads. On the wrong workload, those savings vanish into operational toil.

Why spot is so cheap

Cloud providers sell unused capacity at a deep discount and keep the right to take it back on short notice: two minutes on AWS, about 30 seconds on Google Cloud and Azure. The discount is real (typically 60-90% off on-demand). The catch is the interruption.

For workloads that tolerate interruption, the savings are pure. For workloads that do not, spot creates expensive incidents.
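
To make the trade-off concrete, here is a rough cost model in Python. It is a sketch under simple assumptions: the only cost of an interruption is compute that has to be redone, and the function name, parameters, and prices are illustrative rather than taken from any provider. It ignores the cost of incidents and engineer time, which is often the larger number for intolerant workloads.

```python
def effective_hourly_cost(on_demand_price, spot_discount, rework_fraction):
    """Cost per hour of *useful* work on spot.

    on_demand_price: on-demand price per instance-hour
    spot_discount:   fraction off on-demand, e.g. 0.70 for 70% off
    rework_fraction: share of paid compute hours lost to interruptions
                     and redone (0.0 up to, but not including, 1.0)
    """
    if not 0.0 <= rework_fraction < 1.0:
        raise ValueError("rework_fraction must be in [0, 1)")
    spot_price = on_demand_price * (1.0 - spot_discount)
    # Only (1 - rework_fraction) of each paid hour yields useful work.
    return spot_price / (1.0 - rework_fraction)


# Illustrative numbers only: on-demand at 1.00/hour, 70% spot discount.
tolerant = effective_hourly_cost(1.00, 0.70, rework_fraction=0.05)
fragile = effective_hourly_cost(1.00, 0.70, rework_fraction=0.50)
```

With these made-up inputs the tolerant workload pays roughly 0.32 per useful hour and the fragile one 0.60. In this model the savings disappear entirely once the rework fraction equals the discount.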

Workloads where spot wins

Stateless web tiers. Behind a load balancer; pod gets killed; another spins up.
Batch processing. Naturally checkpointable; restarts cheap.
CI/CD runners. Jobs are idempotent; reschedule is fine.
Bursty analytics. Get a cluster cheap; do the work; release.

Workloads where spot is a trap

Stateful databases. Two minutes is not enough to drain a primary safely.

Long-running jobs without checkpoints. A 6-hour ML training run that must restart from scratch on every interruption throws away all the work done so far.

Anything with a strict SLO. Even diversified, spot has tail interruption events.
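
The long-running job case is the one most easily fixed in code. Below is a minimal checkpoint-and-resume sketch, assuming the job's state fits in a small JSON document and the work can be split into numbered steps. All names here (save_checkpoint, load_checkpoint, run, stop_after) are hypothetical. On real spot capacity the checkpoint path would need to live on storage that survives the instance, such as an object store or network volume.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so an interruption mid-write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh one if no checkpoint exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"next_step": 0, "total": 0}


def run(path, total_steps, checkpoint_every, stop_after=None):
    """Process steps [0, total_steps), resuming from the last checkpoint.

    stop_after simulates an interruption after that many steps in this run.
    """
    state = load_checkpoint(path)
    done_this_run = 0
    while state["next_step"] < total_steps:
        if stop_after is not None and done_this_run >= stop_after:
            return state  # simulated interruption: unsaved progress is lost
        state["total"] += state["next_step"]  # placeholder for real work
        state["next_step"] += 1
        done_this_run += 1
        if state["next_step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

An interrupted run loses at most checkpoint_every steps instead of the whole job, which is what turns a 6-hour restart into a few minutes of rework.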

Diversification math

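Diversify across instance families and AZs. If interruptions in different spot pools were independent, the probability of losing every pool at once would be the product of the individual probabilities, which is very low when diversified. In practice pools are correlated (a regional capacity crunch can hit many of them together), so treat the product as an optimistic estimate.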
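
The arithmetic can be sketched in a few lines of Python. This assumes pools fail independently apart from one optional shared event that reclaims all of them. The function name, the p_shared_event parameter, and the probabilities are illustrative assumptions, not measured interruption rates.

```python
from math import prod


def p_all_interrupted(pool_probs, p_shared_event=0.0):
    """Probability that every pool is interrupted in the same window.

    pool_probs:     per-pool interruption probability in the window,
                    assumed independent of each other
    p_shared_event: probability of a correlated event (for example a
                    region-wide capacity crunch) that reclaims all pools
    """
    independent = prod(pool_probs)
    # Either the shared event happens, or it does not and all pools
    # happen to be interrupted independently.
    return p_shared_event + (1.0 - p_shared_event) * independent


one_pool = p_all_interrupted([0.05])
four_pools = p_all_interrupted([0.05] * 4)
four_pools_correlated = p_all_interrupted([0.05] * 4, p_shared_event=0.001)
```

With these made-up inputs, going from one pool to four drops the figure from 0.05 to about 0.000006. Adding even a 0.1% shared event pushes it back up to about 0.001, so the correlated term dominates once the pools are diversified.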

Modern tooling (Karpenter, EC2 Auto Scaling mixed instances policies, EC2 Fleet) handles diversification automatically; the manual era is over.
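
As one illustration of what "automatic" looks like on AWS, the sketch below builds the MixedInstancesPolicy structure that EC2 Auto Scaling accepts. The helper name, the launch template name, and the chosen instance types are assumptions for the example. The function only builds a dictionary and makes no AWS calls.

```python
def build_mixed_instances_policy(launch_template_name, instance_types,
                                 on_demand_base=1, spot_percentage=100):
    """Return a MixedInstancesPolicy dict spread across instance types.

    The result is shaped for the MixedInstancesPolicy argument of the
    boto3 Auto Scaling client's create_auto_scaling_group call. AZ spread
    is configured separately, through the subnets given to the group.
    """
    if len(set(instance_types)) < 2:
        raise ValueError("list at least two instance types to diversify")
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Each override is another spot pool the group may draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Keep a small on-demand floor; everything above it is spot.
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": 100 - spot_percentage,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


# "web-tier" is a hypothetical launch template name.
policy = build_mixed_instances_policy(
    "web-tier", ["m5.large", "m5a.large", "m6i.large", "c5.large"]
)
```

The on-demand base is a design choice: a small floor of non-interruptible capacity keeps the service up during the tail events that diversification alone does not cover.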

Antipatterns

One instance type, one AZ. Maximum interruption pain.
Spot for the database. Even with hourly checkpoints, an interruption can lose up to an hour of work.
Ignoring interruption notifications. Two minutes is enough to drain a stateless service gracefully, but only if you have wired up a handler.
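
Wiring up the notification is not much code. The sketch below assumes AWS EC2, where the instance metadata service exposes a spot/instance-action document once an interruption is scheduled and returns 404 before that. The drain callback and the poll_once helper are hypothetical; a real handler would deregister from the load balancer and stop taking new work.

```python
import json
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS = "http://169.254.169.254/latest"


def fetch_instance_action(timeout=1.0):
    """Return the raw instance-action document, or None if no notice yet."""
    # IMDSv2: obtain a short-lived session token first.
    token_req = urllib.request.Request(
        f"{IMDS}/api/token",
        method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"},
    )
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS}/meta-data/spot/instance-action",
        headers={"X-aws-ec2-metadata-token": token},
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read().decode()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # no interruption scheduled
            return None
        raise


def seconds_until_interruption(raw, now):
    """Parse the notice and return seconds left before the action time."""
    notice = json.loads(raw)
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    when = when.replace(tzinfo=timezone.utc)
    return (when - now).total_seconds()


def poll_once(fetch, drain, now=None):
    """Check for a notice once; call drain(seconds_left) if one exists."""
    raw = fetch()
    if raw is None:
        return False
    now = now or datetime.now(timezone.utc)
    drain(seconds_until_interruption(raw, now))
    return True
```

In use, something like poll_once(fetch_instance_action, drain) would run every few seconds from a sidecar or background thread. Passing fetch in as a parameter keeps the parsing and drain logic testable off the instance.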

What to do this week

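Three moves. (1) Pick the workload in your environment that is most exposed to one of the antipatterns above. (2) Apply the lightest fix, such as adding a second instance family or wiring up the interruption handler, and measure for one week. (3) Schedule a quarterly review so the discipline does not rot.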