Progressive Rollout Stages
The specific stage progressions used to ramp traffic to a new version, and how to calibrate them.
Stages
Progressive rollout is the deploy pattern of routing increasing fractions of traffic to a new version while monitoring for problems. The structure of the stages (what percentages, in what order, with what gates between them) determines how quickly a regression is caught. The stages should be calibrated to the service's risk profile, not picked arbitrarily.
What standard rollout stages look like:
- 1%, 5%, 25%, 50%, 100%.: The conventional five-stage rollout. The first stage (1%) catches catastrophic regressions while the blast radius is tiny. Each subsequent stage expands exposure several-fold (roughly two to five times), with metric gates between stages to confirm the previous stage was healthy.
- Standard for most services.: The 1-5-25-50-100 progression matches the risk reduction most teams want: very small initial exposure, rapid expansion once confidence is established, full rollout before too long. Modern canary controllers (Argo Rollouts, Flagger, Spinnaker) ship this as a default and most teams accept it.
- Two-stage for low-risk services.: Internal admin tools, dev-environment services, and other low-impact workloads can use shorter progressions: 50%, then 100%. The risk of partial rollout pain is small enough that the simpler progression saves operational time without meaningful safety loss.
- Six or seven stages for highest-risk services.: Payment services, identity infrastructure, the data plane that everything depends on. These warrant 0.5%, 1%, 5%, 10%, 25%, 50%, 100%. The extra stages increase confidence at the cost of longer rollout duration; for these services the trade is worth it.
- Per-region progression.: Multi-region services typically progress through stages in one region before starting any percentage in the next. The region dimension adds another layer of progressive rollout on top of the percentage dimension.
The stages are not magic numbers. They are deliberately chosen to balance speed of rollout against risk of partial-fleet exposure. Each service should pick the progression that matches its specific risk profile.
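To make the calibration concrete, here is a minimal sketch that encodes the three progressions above as data and expands them per region. The tier names, region names, and the `rollout_plan` helper are illustrative assumptions for this sketch, not defaults of any particular rollout controller.

```python
# Illustrative stage progressions keyed by risk tier. The tier names and
# percentages mirror the examples above; they are assumptions for this
# sketch, not defaults shipped by any rollout controller.
STAGE_PROGRESSIONS = {
    "low_risk":     [50, 100],                     # internal tools, dev services
    "standard":     [1, 5, 25, 50, 100],           # most services
    "highest_risk": [0.5, 1, 5, 10, 25, 50, 100],  # payments, identity, data plane
}

def rollout_plan(risk_tier, regions):
    """Expand a progression across regions: complete every stage in one
    region before starting any percentage in the next."""
    stages = STAGE_PROGRESSIONS[risk_tier]
    return [(region, pct) for region in regions for pct in stages]

# A standard-tier service in two regions yields ten (region, percent) steps,
# all of us-east-1 before any of eu-west-1.
print(rollout_plan("standard", ["us-east-1", "eu-west-1"]))
```

Treating the progression as per-tier data keeps each service's choice explicit and reviewable rather than copied from whatever the last rollout used.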
Per-stage metrics
The stages alone do not provide protection. The metric gates between stages are what actually catch regressions. Each gate is a deliberate check against the canary's behavior; failing the check pauses or rolls back the rollout.
- Each stage has gate metrics.: Latency p99, error rate, saturation, business KPIs (where applicable). The metrics are computed for the canary versus the baseline (the production traffic not on the canary). A statistically significant degradation fails the gate.
- Pause if any metric breaches.: When a gate fails, the rollout pauses at the current stage. It does not advance until either the metric recovers or the on-call manually intervenes. The default behavior is conservative; advancing requires clean signal.
- Roll back if breach is severe.: Some breaches are pause-worthy; some are roll-back-worthy. A 10% increase in p99 latency is a pause; a 10x increase is a rollback. The thresholds are tuned per service.
- Different metrics at different stages.: The 1% stage is too small for some metrics to be statistically meaningful (specifically business KPIs that need volume). At 1%, latency and error-rate gates are useful; conversion-rate gates are not. The metric set narrows or widens by stage.
- SLO-aware gates.: The gate thresholds reference the SLO. A canary that would push the SLO budget into burn fails the gate, even if the absolute numbers look acceptable. This ties deploy-time decisions directly to the reliability commitment.
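As one concrete reading of an SLO-aware gate, the sketch below fails the gate when the traffic-weighted error rate during the stage would exceed what the SLO allows. The 99.9% target and the simple blending formula are assumptions for illustration, not a prescribed method.

```python
def slo_gate_fails(stage_percent, canary_error_rate, baseline_error_rate,
                   slo_target=0.999):
    """Fail the gate if the blended error rate while this stage runs would
    exceed the SLO's allowed error rate, i.e. the canary would push the
    budget into burn even though its absolute numbers look small.
    The 99.9% slo_target is an assumed example."""
    weight = stage_percent / 100.0
    blended = weight * canary_error_rate + (1 - weight) * baseline_error_rate
    allowed = 1.0 - slo_target
    return blended > allowed

# At the 25% stage, a 0.4% canary error rate blended with a 0.02% baseline
# gives roughly 0.115% overall, breaching a 99.9% SLO's 0.1% allowance.
print(slo_gate_fails(25, 0.004, 0.0002))  # True
```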
The metric gates are the active intelligence in progressive rollout. Without them, the rollout is just "deploy slowly"; with them, it is "deploy slowly and check at each step."
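Putting the gate behavior together, here is a minimal sketch of the per-stage decision, assuming hypothetical metric names and thresholds. Controllers such as Argo Rollouts and Flagger express the same logic declaratively, with the analysis and thresholds defined per service.

```python
from dataclasses import dataclass
from enum import Enum

class GateDecision(Enum):
    ADVANCE = "advance"
    PAUSE = "pause"
    ROLLBACK = "rollback"

@dataclass
class MetricReading:
    name: str        # e.g. "p99_latency_ms" or "error_rate"
    canary: float    # value observed on the canary slice
    baseline: float  # value observed on the rest of production

# Illustrative thresholds: a modest regression pauses, a severe one rolls back.
PAUSE_RATIO = 1.10     # canary 10% or more worse than baseline -> pause
ROLLBACK_RATIO = 10.0  # canary 10x or more worse than baseline -> roll back

# The gate metric set narrows at small stages where business KPIs lack volume.
GATE_METRICS_BY_STAGE = {
    1:   {"p99_latency_ms", "error_rate", "saturation"},
    5:   {"p99_latency_ms", "error_rate", "saturation"},
    25:  {"p99_latency_ms", "error_rate", "saturation", "conversion_rate"},
    50:  {"p99_latency_ms", "error_rate", "saturation", "conversion_rate"},
    100: {"p99_latency_ms", "error_rate", "saturation", "conversion_rate"},
}

def evaluate_gate(stage_percent, readings):
    """Compare canary to baseline for every metric gated at this stage.
    A severe breach rolls back immediately; a mild breach pauses; a clean
    signal advances the rollout to the next stage."""
    decision = GateDecision.ADVANCE
    gated = GATE_METRICS_BY_STAGE[stage_percent]
    for m in readings:
        if m.name not in gated:
            continue
        ratio = m.canary / m.baseline if m.baseline > 0 else float("inf")
        if ratio >= ROLLBACK_RATIO:
            return GateDecision.ROLLBACK
        if ratio >= PAUSE_RATIO:
            decision = GateDecision.PAUSE
    return decision

# A 10% p99 regression at the 5% stage pauses; a ~12x error-rate spike rolls back.
print(evaluate_gate(5, [MetricReading("p99_latency_ms", 220.0, 200.0)]))  # PAUSE
print(evaluate_gate(5, [MetricReading("error_rate", 0.05, 0.004)]))       # ROLLBACK
```

A statistical-significance test on the canary-versus-baseline comparison would sit where the plain ratio is computed; the ratio is only a stand-in here.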
Time per stage
The duration of each stage is the third axis. Longer stages produce more confidence; shorter stages produce faster rollout. The right answer depends on how quickly regressions surface in the metrics.
- 10 to 30 minutes default.: Most stateless services soak each stage for 10 to 30 minutes: long enough for the metric gates to collect a statistically meaningful sample, short enough that a full rollout completes within an hour or two. The sketch after this list shows how that minimum soak follows from traffic volume.
- Tuned per service.: Services with low traffic need longer stages to gather enough samples. Services with cached responses need longer stages because cache hit rates take time to stabilize. Services with infrequent business events (rare API calls, batch operations) need longer stages because the events have to occur within the window.
- Hours for stateful services.: Services with caches, replication lag, or operational warm-up phases may need each stage to run for hours, stretching the full rollout to a day or longer. The slower rollout is a real cost, but the stateful behavior takes that long to surface in the metrics.
- Shorter for hot fixes.: When deploying a fix to a known production incident, the full progressive rollout is too slow. The team can collapse stages: 5% for 5 minutes, 50% for 5 minutes, 100%. The reduced safety is balanced against the cost of extending the incident.
- Time-of-day matters.: Stages that span the daily traffic peak collect more samples than stages that span the overnight low. Some teams pause rollouts overnight and resume in the morning; others run continuously. The choice depends on whether the team values safety or speed more.
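As a rough illustration of how soak time follows from traffic volume, the sketch below derives a minimum stage duration from the request rate and the number of canary samples a gate needs. The sample target, floor, and cap are placeholder assumptions, not recommendations.

```python
import math

def min_soak_minutes(requests_per_minute, stage_percent,
                     samples_needed=1000, floor_minutes=10, cap_minutes=240):
    """Soak a stage at least long enough for the canary slice to see
    `samples_needed` requests, bounded below by a floor and above by a cap.
    All three defaults are illustrative assumptions."""
    canary_rpm = requests_per_minute * (stage_percent / 100.0)
    if canary_rpm <= 0:
        return cap_minutes
    minutes = math.ceil(samples_needed / canary_rpm)
    return max(floor_minutes, min(minutes, cap_minutes))

# A 1% stage on a 10,000 rpm service reaches 1,000 canary samples in 10 minutes;
# the same stage on a 200 rpm service hits the 240-minute cap instead.
print(min_soak_minutes(10_000, 1))  # 10
print(min_soak_minutes(200, 1))     # 240
```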
Progressive rollout stages, with per-stage metric gates and well-tuned durations, are the deploy pattern that turns risk into a managed quantity. Nova AI Ops integrates with progressive rollout controllers (Argo Rollouts, Flagger), evaluates SLO-aware metric gates between stages, and surfaces the per-stage decision history so the team can see which kinds of changes routinely fail at which gates.