The Spot Fleet Diversification Strategy
Single-instance-type spot fleets get hit hard during interruptions. Diversify across types and AZs to keep capacity stable.
Instance type diversity
Spot fleet diversification is the discipline of distributing spot capacity across many instance types and availability zones (AZs). Diversification reduces interruption risk: any single spot pool can be interrupted, but spreading capacity across many pools means no single interruption affects much of the fleet. Done well, a spot fleet runs at near-on-demand reliability with significant cost savings.
What instance type diversity provides:
- Pick 3 or more instance types of similar size and capability: The fleet specifies multiple compatible instance families, for example m5, m5a, m5n, and m6i. Each is roughly equivalent (general-purpose, similar memory and CPU); spot capacity flows from whichever pools have availability.
- Interruption patterns differ per type: Each instance type has its own spot pool with its own interruption pattern. An m5 interruption does not imply an m5n interruption; the patterns are largely independent, although a broad demand surge can still affect several pools at once.
- Diversification reduces aggregate risk: A fleet that uses one instance type takes the full impact when that pool is interrupted. A fleet spread evenly across 5 types loses roughly a fifth of its capacity instead. The impact of any single interruption shrinks as the number of pools grows.
- Mix sizes if the workload tolerates it: Some workloads tolerate variable instance counts: 4 large instances or 8 instances of half that size both meet capacity. The fleet specifies multiple sizes; the cluster autoscaler handles the variance.
- Mix architectures (Graviton): ARM-based instances (m7g, c7g) can be added to the diversification mix if the workload supports both architectures. The Graviton spot pools are separate from the x86 pools, so the diversification benefit extends to architecture.
Instance type diversity is the foundation of spot fleet stability. The more diverse the fleet, the more resilient it is to pool-specific interruptions.
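The effect of adding pools can be sketched with a little arithmetic. The example below is a simplified model, not a description of how AWS reclaims capacity: it assumes capacity is spread evenly across pools and that each pool is interrupted independently with the same probability in a given window. The function names and the 5% figure are illustrative assumptions.

```python
from math import comb


def capacity_lost_fraction(num_pools: int, pools_interrupted: int = 1) -> float:
    """Fraction of fleet capacity lost when some pools are reclaimed.

    Assumes capacity is spread evenly across `num_pools` pools.
    """
    if num_pools < 1 or not 0 <= pools_interrupted <= num_pools:
        raise ValueError("need num_pools >= 1 and 0 <= pools_interrupted <= num_pools")
    return pools_interrupted / num_pools


def prob_at_least_k_interrupted(num_pools: int, k: int, p: float) -> float:
    """Probability that k or more pools are interrupted in one window.

    Binomial tail; assumes pools are independent and share the same
    per-window interruption probability `p` (a simplification).
    """
    return sum(
        comb(num_pools, i) * p**i * (1 - p) ** (num_pools - i)
        for i in range(k, num_pools + 1)
    )


if __name__ == "__main__":
    p = 0.05  # hypothetical per-pool interruption probability per window

    # One instance type: a single interruption takes the whole fleet.
    print("1 pool, capacity lost:", capacity_lost_fraction(1))
    # Five instance types: a single interruption takes a fifth of it.
    print("5 pools, capacity lost:", capacity_lost_fraction(5))

    # Chance of losing more than half the fleet in one window.
    print("1 pool :", prob_at_least_k_interrupted(1, 1, p))
    print("5 pools:", prob_at_least_k_interrupted(5, 3, p))
```

Under these assumptions, losing most of a five-pool fleet requires three simultaneous interruptions, which is far less likely than the single interruption that takes out a one-pool fleet. Real pools are not perfectly independent, so treat the numbers as an upper bound on the benefit.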
AZ diversity
Spot pools are AZ-specific. Each AZ has its own pool of each instance type with its own pricing and interruption pattern. AZ diversification spreads risk across the AZ dimension as well.
- Across all AZs in the region: The fleet is eligible to launch in every AZ of the region. AWS publishes spot pricing per AZ; the fleet draws from whichever AZs its allocation strategy favors at the time.
- Spot pricing differs by AZ: Different AZs have different demand patterns, so pricing varies. An AZ-balanced fleet captures these variations; its blended price sits between the cheapest and the most expensive AZ and avoids being pinned to one AZ when that AZ's price rises.
- AZ-balanced fleet smooths cost: The fleet's blended cost is the capacity-weighted average of the AZs it runs in. The smoothing reduces volatility; a price spike in any single AZ has bounded impact.
- Some workloads have AZ-affinity needs: Workloads with significant AZ-local data access (data on EBS volumes pinned to an AZ, services that consume from same-AZ caches) may need to be constrained to specific AZs. The diversification pattern adapts to these constraints.
- Plan accordingly: Where AZ affinity is required, diversification happens within fewer AZs (or across instance types within one AZ). The discipline still applies; the dimension shifts.
AZ diversity is the second dimension of fleet diversification. Combined with instance type diversity, it produces broad coverage across the spot landscape.
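A small sketch can make the pool arithmetic concrete. Each (instance type, AZ) pair is one spot pool, and the fleet's blended price is the capacity-weighted average over the pools it actually uses. The prices, instance sizes, and allocation below are made-up illustrations, not real spot prices, and the function names are hypothetical.

```python
from itertools import product


def enumerate_pools(instance_types, zones):
    """Every (instance type, AZ) pair is a separate spot pool."""
    return list(product(instance_types, zones))


def blended_hourly_price(allocation, prices):
    """Capacity-weighted average price across the pools in use.

    allocation: {(instance_type, az): instance_count}
    prices:     {(instance_type, az): price in $/hour}
    Assumes all instances are the same size, so counts act as weights.
    """
    total = sum(allocation.values())
    if total == 0:
        raise ValueError("allocation must contain at least one instance")
    return sum(count * prices[pool] for pool, count in allocation.items()) / total


if __name__ == "__main__":
    pools = enumerate_pools(
        ["m5.large", "m6i.large"], ["us-east-1a", "us-east-1b", "us-east-1c"]
    )
    print(len(pools), "pools available")  # 2 types x 3 AZs

    # Hypothetical prices and a hypothetical allocation across three pools.
    prices = {
        ("m5.large", "us-east-1a"): 0.04,
        ("m5.large", "us-east-1b"): 0.05,
        ("m6i.large", "us-east-1c"): 0.06,
    }
    allocation = {
        ("m5.large", "us-east-1a"): 4,
        ("m5.large", "us-east-1b"): 4,
        ("m6i.large", "us-east-1c"): 2,
    }
    print("blended $/hour:", round(blended_hourly_price(allocation, prices), 4))
```

Because the blended figure is a weighted average, it always falls between the cheapest and the most expensive pool in use; a spike in one pool moves it only in proportion to that pool's share of the fleet.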
Allocation strategy
The allocation strategy tells the fleet how to choose among the available pools. The strategy determines the trade-off between cost and stability; different workloads benefit from different strategies.
- capacity-optimized, best for stability: The fleet allocates from the pools that have the deepest available capacity. Pools with deep capacity are less likely to be interrupted; the strategy favors stability over cost. Production workloads typically use capacity-optimized.
- lowest-price, best for cost: The fleet allocates from the cheapest available pool. The strategy minimizes cost but may pull from shallow pools that are interrupted more often. Batch workloads with high tolerance for interruption can use lowest-price.
- Higher interruption rate with lowest-price: The strategy ignores how much capacity a pool has left, so it can concentrate the fleet in pools that are nearly exhausted. It trades stability for additional savings.
- price-capacity-optimized as the middle ground: A newer strategy that balances both signals. It allocates from pools that are both cheap and have available capacity. The middle ground produces good cost with reasonable stability; many production teams default to this.
- Test the strategy: Run the fleet with each strategy in non-production, observe interruption rates and costs, and choose the strategy that matches the workload's tolerance. The right strategy is workload-specific.
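To show how the type, AZ, and strategy choices come together, here is a minimal sketch that builds the parameters for an EC2 Fleet request. It assumes the EC2 Fleet `create_fleet` call in boto3, which uses the hyphenated strategy names above (the Spot Fleet `request_spot_fleet` API spells them in camelCase, such as `priceCapacityOptimized`). The helper name, launch template ID, subnet IDs, and instance sizes are placeholders for illustration. The code only constructs the request; it does not call AWS.

```python
VALID_STRATEGIES = {
    "lowest-price",
    "diversified",
    "capacity-optimized",
    "price-capacity-optimized",
}


def build_fleet_request(
    launch_template_id,
    instance_types,
    subnet_ids,
    target_capacity,
    allocation_strategy="price-capacity-optimized",
):
    """Build keyword arguments for an EC2 Fleet request (hypothetical helper).

    One override is emitted per (instance type, subnet) pair, so the fleet
    can draw from every pool in the type x AZ grid. Each subnet is assumed
    to live in a different AZ.
    """
    if allocation_strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown allocation strategy: {allocation_strategy}")

    overrides = [
        {"InstanceType": itype, "SubnetId": subnet}
        for itype in instance_types
        for subnet in subnet_ids
    ]
    return {
        "Type": "maintain",  # keep replacing interrupted capacity
        "LaunchTemplateConfigs": [
            {
                "LaunchTemplateSpecification": {
                    "LaunchTemplateId": launch_template_id,
                    "Version": "$Latest",
                },
                "Overrides": overrides,
            }
        ],
        "SpotOptions": {"AllocationStrategy": allocation_strategy},
        "TargetCapacitySpecification": {
            "TotalTargetCapacity": target_capacity,
            "DefaultTargetCapacityType": "spot",
        },
    }


if __name__ == "__main__":
    request = build_fleet_request(
        launch_template_id="lt-0123456789abcdef0",  # placeholder ID
        instance_types=["m5.large", "m5a.large", "m5n.large", "m6i.large"],
        subnet_ids=["subnet-aaaa1111", "subnet-bbbb2222"],  # placeholders
        target_capacity=20,
    )
    print(len(request["LaunchTemplateConfigs"][0]["Overrides"]), "pools requested")
    # With credentials configured, the request could be submitted as:
    #   import boto3
    #   boto3.client("ec2").create_fleet(**request)
```

Switching `allocation_strategy` between runs in a non-production account is one way to compare interruption rates and cost for the same set of pools.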
Spot fleet diversification is one of the highest-leverage cost optimizations available for fault-tolerant workloads. Nova AI Ops integrates with EC2 spot interruption data and fleet metrics, surfaces diversification gaps, and helps teams identify fleets that are over-concentrated in vulnerable pools.