Running Production on Spot Instances Without Pain
Spot instances cut compute cost 70-90%. They also disappear with two minutes' notice. The patterns that let you run real production on spot without 3am pages.
Four workload classes
Not every workload survives spot. Categorise yours before deploying; picking wrong is how teams discover their batch jobs are actually critical at 3am.
The value of classification. Each class has a different tolerance for interruption: stateless services absorb it with minor disruption, stateful services with replicas rebuild quickly, and single-instance critical services cannot tolerate any interruption at all. Putting the wrong class on spot is how teams get burned.
The exercise. Inventory workloads; for each, ask: how long does it take to restart? What's the data-loss risk? What's the customer-visible impact of a 30-second interruption? The answers map workloads to the four classes below.
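One way to make the exercise concrete is a tiny script over the inventory. This is a sketch only; the workload names, thresholds, and answers below are made up for illustration and should be replaced with your own.

```python
# Toy classification helper mirroring the three questions above.
# Entries and thresholds are illustrative, not prescriptive.

def classify(drain_seconds: int, data_loss_on_kill: bool, single_instance: bool) -> str:
    if single_instance:
        return "spot-impossible"   # no peer to fail over to
    if data_loss_on_kill:
        return "spot-hostile"      # needs real engineering before spot
    if drain_seconds <= 5:
        return "spot-ready"        # effectively stateless
    if drain_seconds <= 120:
        return "spot-tolerable"    # fits inside the 2-minute warning
    return "spot-hostile"

inventory = {
    # name: (seconds to drain/restart, data loss if killed?, single instance?)
    "web-frontend":     (2,   False, False),
    "event-consumer":   (30,  False, False),
    "postgres-primary": (300, True,  False),
    "billing-cron":     (60,  False, True),
}

for name, answers in inventory.items():
    print(f"{name:18s} -> {classify(*answers)}")
```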
Spot-ready
Stateless web servers, batch processors, ML training. Reschedule somewhere else when interrupted; latency tolerant.
The defining trait. The workload's state is external (database, object store) or recomputable (batch job from inputs). Interrupting one instance and starting another elsewhere is a cheap operation. Customer impact: minimal, at most a single retried request.
The savings. Spot pricing is 60-90% off on-demand. For a stateless web fleet of 50 instances at $200/month each on-demand, that's $10k/month. On spot at 80% savings, $2k/month. Annualised, $96k saved with no operational downside.
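The same arithmetic in code, with the fleet size, unit price, and discount taken from the example above rather than from any real bill:

```python
# Worked example from above: figures are illustrative assumptions, not quotes.
instances = 50
on_demand_per_instance = 200          # $/month
spot_discount = 0.80                  # 80% off on-demand

on_demand_monthly = instances * on_demand_per_instance      # $10,000
spot_monthly = on_demand_monthly * (1 - spot_discount)      # $2,000
annual_savings = (on_demand_monthly - spot_monthly) * 12    # $96,000

print(f"on-demand ${on_demand_monthly:,}/mo, spot ${spot_monthly:,.0f}/mo, "
      f"saves ${annual_savings:,.0f}/yr")
```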
Spot-tolerable
Stateful caches, event consumers with replay capability. Need ~30 seconds to drain or rebuild; brief disruption is OK.
The pattern. The workload has state, but the state is recoverable: a cache loses some hits temporarily; an event consumer replays from a checkpoint. A ~30-second drain fits comfortably inside the 2-minute warning, enough to shut down gracefully before reclamation.
The implementation. Subscribe to the spot-interruption signal (the 2-minute warning). On the warning, the workload begins draining: the cache writes a snapshot, the consumer commits its checkpoint. By the 2-minute mark the workload is in a clean state to be reclaimed. A new instance starts elsewhere and resumes from the checkpoint.
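If the node runs a termination handler (covered below), the warning reaches the workload as a SIGTERM with time to drain. A minimal sketch of a consumer that checkpoints and exits cleanly; poll_batch and commit_checkpoint here are stand-ins for your own event source and offset store:

```python
import signal
import sys
import time

draining = False

def start_drain(signum, frame):
    """Flip the drain flag when SIGTERM arrives (the 2-minute spot warning)."""
    global draining
    draining = True

signal.signal(signal.SIGTERM, start_drain)

def poll_batch():
    """Placeholder for fetching the next batch of events."""
    time.sleep(1)
    return ["event"]

def commit_checkpoint():
    """Placeholder for durably committing the consumer's position."""
    pass

while not draining:
    batch = poll_batch()
    # ... process the batch ...
    commit_checkpoint()   # checkpoint after every batch, not only on shutdown

commit_checkpoint()       # final commit so the replacement resumes cleanly
sys.exit(0)
```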
Spot-hostile
Database primaries, in-flight payment processors, anything with a long startup. Possible on spot but only with significant engineering effort.
The trade-off. The savings are still attractive at scale, but the engineering cost is high. A spot-aware database primary requires: synchronous replication to a hot standby; automatic failover on the 2-minute warning; client retry logic. Each component is significant work; the failover mechanism is itself a reliability liability.
When to attempt. Only when the spot savings are large enough to justify months of engineering investment. Most teams don't reach this threshold; they keep databases on on-demand and live with the cost.
Spot-impossible
Single-instance workloads with no peer to fail over to. Anything customer-facing without a queue / replica behind it. Don't try.
The structural reason. Spot reclamation is non-negotiable. The instance will be gone in 2 minutes. If the workload can't tolerate that, spot is impossible. No amount of engineering changes the fundamental constraint.
The pragmatic response. Either re-architect the workload to have a peer (move to spot-tolerable class) or keep it on on-demand. Trying to make a single-instance critical workload spot-friendly through "clever engineering" is how teams produce reliability disasters.
Diversification
Spot capacity comes in pools keyed by instance type and AZ. Spread across 3-5 instance families and 2-3 AZs. Reclamations in different pools are largely independent of each other; diversifying drops the blended interruption rate dramatically.
The math. Single instance type, single AZ: ~10% monthly interruption rate (varies by region and family). Diversified across 5 families and 3 AZs, the blended rate drops to ~2% monthly. The improvement is real because reclamations are correlated within a pool (same family and AZ) but largely independent across pools, so losing one pool takes out only a fraction of the fleet.
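A toy model of why this helps, not a claim about any provider's actual reclamation behaviour: assume each pool is reclaimed in a given window independently with 10% probability and the fleet is spread evenly across pools. The pool counts and probabilities are illustrative.

```python
from math import comb

def tail_prob(pools: int, p: float, min_hit: int) -> float:
    """P(at least `min_hit` of `pools` independent pools are reclaimed),
    assuming each pool is reclaimed in the window with probability p."""
    return sum(comb(pools, k) * p**k * (1 - p)**(pools - k)
               for k in range(min_hit, pools + 1))

p = 0.10  # illustrative per-pool reclamation probability per window

# One pool: a reclamation takes the whole fleet.
print(f"1 pool,   lose 100% of fleet: {tail_prob(1, p, 1):.1%}")

# 15 pools (5 families x 3 AZs): losing a third of the fleet needs 5+ pools hit at once.
print(f"15 pools, lose >=1/3 of fleet: {tail_prob(15, p, 5):.2%}")
```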
The implementation in Kubernetes. Use the spot.io plugin or AWS Karpenter to manage diversified spot fleets. Configure 5+ instance families per spot fleet; auto-scaler balances across them. The complexity is in the configuration; once set up, it runs.
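Outside Kubernetes, the same diversification maps directly onto an EC2 fleet request. A minimal boto3 sketch, assuming a launch template already exists; the template ID, instance types, and subnet IDs below are placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

# Placeholder IDs -- substitute your own launch template and subnets.
overrides = [
    {"InstanceType": itype, "SubnetId": subnet}
    for itype in ["m5.xlarge", "m5a.xlarge", "m6i.xlarge", "c5.xlarge", "r5.xlarge"]
    for subnet in ["subnet-aaa", "subnet-bbb", "subnet-ccc"]   # one per AZ
]

ec2.create_fleet(
    Type="maintain",
    SpotOptions={"AllocationStrategy": "price-capacity-optimized"},
    TargetCapacitySpecification={
        "TotalTargetCapacity": 50,
        "DefaultTargetCapacityType": "spot",
    },
    LaunchTemplateConfigs=[{
        "LaunchTemplateSpecification": {
            "LaunchTemplateId": "lt-0123456789abcdef0",
            "Version": "$Latest",
        },
        "Overrides": overrides,
    }],
)
```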
The 2-minute warning
AWS sends a metadata signal 2 minutes before reclamation. Most teams don't subscribe. Watch for the interruption notice, drain the pod or process before the kill, and you turn a hard interruption into a graceful one.
The mechanism. AWS publishes the warning on the instance metadata service (and on EventBridge). A daemon on the instance polls the metadata; on warning, it cordons the node and evicts its pods. Each pod receives SIGTERM and drains; replacements start elsewhere.
The implementation. Run aws-node-termination-handler (or equivalent) on every node. It handles the warning subscription, the SIGTERM propagation, and the cordon/drain. Open-source; battle-tested; install once.
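Stripped to its essentials, that handler is a loop against the instance metadata service. A sketch of the shape only; the real project also deals with IMDSv2 tokens, EventBridge events, and the actual cordon and drain:

```python
import time
import requests

IMDS = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_notice():
    """Return the notice JSON if a reclamation is scheduled, else None.
    (Assumes IMDSv1 for brevity; with IMDSv2 you'd fetch a session token first.)"""
    try:
        r = requests.get(IMDS, timeout=2)
        return r.json() if r.status_code == 200 else None
    except requests.RequestException:
        return None

while True:
    notice = interruption_notice()
    if notice:
        print(f"spot interruption at {notice.get('time')}: draining node")
        # A real handler cordons the node and evicts pods here, which is what
        # delivers SIGTERM to each workload so it can drain gracefully.
        break
    time.sleep(5)
```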
The payoff of graceful shutdown. Without the handler: pods are killed mid-request; clients see errors; data is potentially lost. With the handler: pods drain over 30-60 seconds; clients see a brief retry but no errors; data is preserved. The handler turns spot from risky to routine.
Common antipatterns
Spot for everything because it's cheap. Stateless web → fine. Database primary → disaster. Match workload class to spot-tolerance.
No interruption handling. Spot interruptions kill pods abruptly; clients see errors. Always run the termination handler.
Single instance type. One bad capacity day for that family interrupts a large share of the fleet at once. Diversify across 5+ families minimum.
The "we'll move it back to on-demand if it doesn't work" plan. Plan after the first incident. Better: classify before deploying; only put spot-ready and spot-tolerable on spot from the start.
What to do this week
Three moves. (1) Inventory workloads against the four classes. Most teams find some "we shouldn't be on spot" candidates. (2) For workloads that should be on spot but aren't, run the math: monthly savings vs. engineering cost to make them spot-tolerable. The number is usually compelling. (3) Install aws-node-termination-handler (or equivalent) on every spot node. Cheap insurance.