Progressive vs Rolling: Decision Math
Cost vs safety in deployments.
Progressive and rolling are different
Rolling and progressive look similar but differ on traffic control. Rolling replaces pods batch by batch; progressive gates how much traffic the new version sees. Progressive is rolling plus a traffic-shifting layer that gives you the cancel button.
- Rolling. Batch-by-batch pod replacement; no traffic distinction; each new pod takes its full share of production traffic as soon as it passes readiness.
- Progressive (canary). Fractional traffic split between new and old versions; new gets 1%, then 10%, then 50% as health gates pass.
- Both can coexist. Progressive is rolling plus traffic gating; same underlying replacement primitive with extra control on top.
- Documented mode per service. Record the choice explicitly for each service; the discipline catches "we always use progressive" applied where it does not earn its keep.
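The difference in traffic exposure can be sketched in a few lines. This is a minimal illustration, not a model of any particular controller: it assumes requests are spread evenly across ready pods, the function names and the `gate_passed` callback are hypothetical, and the final 100% step is added here to represent full promotion.

```python
from collections.abc import Callable, Sequence


def rolling_exposure(total_pods: int, batch_size: int) -> list[float]:
    """Share of traffic on the new version after each batch is replaced.

    Assumes requests are spread evenly across ready pods, so exposure
    equals the fraction of pods already replaced.
    """
    if total_pods <= 0 or batch_size <= 0:
        raise ValueError("total_pods and batch_size must be positive")
    exposure = []
    replaced = 0
    while replaced < total_pods:
        replaced = min(replaced + batch_size, total_pods)
        exposure.append(replaced / total_pods)
    return exposure


def canary_exposure(
    gate_passed: Callable[[float], bool],
    steps: Sequence[float] = (0.01, 0.10, 0.50, 1.0),
) -> list[float]:
    """Traffic weights the new version actually reached.

    The rollout moves to the next weight only if the health gate passes
    at the current one; a failed gate stops the rollout at that weight.
    """
    reached = []
    for weight in steps:
        reached.append(weight)
        if not gate_passed(weight):
            break  # halt here; higher weights are never applied
    return reached


if __name__ == "__main__":
    print(rolling_exposure(total_pods=10, batch_size=3))
    # Hypothetical gate that fails once the canary reaches 10% of traffic.
    print(canary_exposure(lambda weight: weight < 0.10))
```

With 10 pods and batches of 3, the first rolling batch already puts 30% of traffic on the new version, while a canary whose gate fails at the 10% step never goes beyond 10%.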
When rolling is enough
Rolling fits stateless, well-tested, lower-risk services. Partial-traffic gating does not add safety when the test suite already catches the regressions, and the operational simplicity is worth the trade.
- Stateless with strong tests. Test-suite confidence makes partial-traffic gating redundant for many services.
- Internal tools, low-impact services. A few minutes of bad pods is acceptable when the user surface is small.
- Canary infrastructure absent. Rolling works with vanilla Kubernetes; progressive needs Argo Rollouts or a service mesh you have to operate.
- Documented "rolling fits" rationale per service. Named driver per service; without it, "we should have used progressive but did not" surfaces only as an incident.
When progressive wins
Progressive fits customer-facing services, high-blast-radius changes, and services prone to rare bugs. Automated rollback at the canary stage halts the release before most customers feel the impact, which is the entire point.
- Customer-facing services with SLOs. Auto-rollback before customer impact; progressive lets you halt at 1% rather than discover at 100%.
- High-blast-radius changes. Database migrations, auth changes, payment-path updates; progressive caps the population that can be hurt.
- Services with rare bugs. Surface to a small fraction first; rolling reaches full traffic within a few batches with no health gate in between, and the rare bug becomes a major incident.
- Documented "progressive required" tag per service. Explicit policy per service; the tag catches premature simplification when teams switch to rolling for speed.
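The blast-radius argument above reduces to simple arithmetic. The sketch below is a back-of-the-envelope estimate under assumed inputs; the function name, parameters, and example numbers are illustrative and not taken from any real service.

```python
def affected_requests(
    requests_per_second: float,
    exposure_fraction: float,
    bug_trigger_rate: float,
    seconds_until_halt: float,
) -> float:
    """Expected number of requests that hit the bug before the rollout is halted."""
    if not 0.0 <= exposure_fraction <= 1.0:
        raise ValueError("exposure_fraction must be between 0 and 1")
    if not 0.0 <= bug_trigger_rate <= 1.0:
        raise ValueError("bug_trigger_rate must be between 0 and 1")
    return (
        requests_per_second
        * exposure_fraction
        * bug_trigger_rate
        * seconds_until_halt
    )


if __name__ == "__main__":
    # Assumed example: 1000 req/s, bug triggers on 0.1% of requests,
    # rollout halted after 5 minutes in both cases.
    at_canary = affected_requests(1000, 0.01, 0.001, 300)
    at_full = affected_requests(1000, 1.0, 0.001, 300)
    print(f"1% canary: {at_canary:.1f} requests, full traffic: {at_full:.1f} requests")
```

With these assumed numbers the canary limits the damage to roughly 3 requests against roughly 300 at full traffic. The estimate holds detection time equal in both cases; in practice a rare bug can take longer to show up in metrics at 1% traffic, so `seconds_until_halt` may need to be larger for the canary.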
Infrastructure cost
Progressive has real infrastructure cost. Service mesh or load balancer with traffic shifting, per-cohort metrics, and an operational learning curve all add up; do not adopt it where rolling already works.
- Service mesh or LB with shifting. Istio, Linkerd, or Argo Rollouts plus Ingress; required infrastructure for the traffic-gating layer.
- Per-cohort metrics. Canary-versus-baseline error and latency split; without per-cohort metrics, progressive adds no safety.
- Operational learning curve. The progressive abstractions (analysis runs, gates, rollback policies) carry a real teaching cost for the team.
- Named owner per cluster. Maintaining team for the progressive infrastructure; stale or misconfigured rollouts produce false confidence.
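The per-cohort metrics requirement can be illustrated with a small gate that compares canary and baseline error rates. This is a hedged sketch, not the analysis logic of Argo Rollouts or any mesh: the `CohortStats` type, the `gate_decision` function, and all thresholds (`max_ratio`, `min_requests`, `floor`) are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortStats:
    """Request and error counts for one cohort over the analysis window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def gate_decision(
    canary: CohortStats,
    baseline: CohortStats,
    max_ratio: float = 2.0,    # assumed: canary may err at most 2x baseline
    min_requests: int = 500,   # assumed: minimum canary sample before judging
    floor: float = 0.001,      # assumed: tolerated rate when baseline is near zero
) -> str:
    """Return 'wait', 'promote', or 'rollback' for the current canary step."""
    if canary.requests < min_requests:
        return "wait"  # too little canary traffic to compare cohorts
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    baseline = CohortStats(requests=100_000, errors=400)
    print(gate_decision(CohortStats(requests=1_000, errors=5), baseline))
    print(gate_decision(CohortStats(requests=1_000, errors=30), baseline))
    print(gate_decision(CohortStats(requests=100, errors=0), baseline))
```

The gate only works because the two cohorts are measured separately; if canary and baseline errors were merged into one metric, a failing canary at 1% of traffic would barely move the aggregate.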
Decision rule
The decision is service-tier driven. Customer-facing tier-1 services run progressive; backend services with strong tests run rolling; migrations and shard rollouts run progressive per region.
- Customer-facing tier-1: progressive. User-impact-prevention rule; the canary catches what tests miss before customers see it.
- Backend with tests: rolling. Test-confidence rule; operational simplicity wins where the test suite already catches regressions.
- Database migrations: progressive per region, per shard. Controlled blast-radius rule for irreversible changes.
- Do not make everything progressive. Operational-cost-versus-benefit rule; mandating progressive on low-risk services pays infrastructure cost without the safety return.
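The rule above can be written down as a small lookup that also returns the documented rationale. This is one possible encoding under stated assumptions: the `Service` fields, the `Mode` names, the precedence order (irreversible changes first), and the `UNDECIDED` fallback for services the rule does not cover are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    ROLLING = "rolling"
    PROGRESSIVE = "progressive"
    PROGRESSIVE_PER_REGION_SHARD = "progressive per region, per shard"
    UNDECIDED = "undecided"  # assumption: no rule matched, needs an explicit choice


@dataclass(frozen=True)
class Service:
    name: str
    customer_facing_tier1: bool = False
    irreversible_change: bool = False  # e.g. a database migration
    strong_tests: bool = False


def choose_mode(svc: Service) -> tuple[Mode, str]:
    """Map service attributes to a deployment mode and its rationale."""
    if svc.irreversible_change:
        return (Mode.PROGRESSIVE_PER_REGION_SHARD,
                "controlled blast radius for irreversible changes")
    if svc.customer_facing_tier1:
        return (Mode.PROGRESSIVE,
                "user-impact prevention: canary catches what tests miss")
    if svc.strong_tests:
        return (Mode.ROLLING,
                "test confidence: operational simplicity wins")
    return (Mode.UNDECIDED,
            "no rule matched: document an explicit rationale")


if __name__ == "__main__":
    for svc in (
        Service("checkout", customer_facing_tier1=True),
        Service("orders-migration", irreversible_change=True),
        Service("batch-worker", strong_tests=True),
        Service("new-internal-tool"),
    ):
        mode, why = choose_mode(svc)
        print(f"{svc.name}: {mode.value} ({why})")
```

Returning the rationale alongside the mode keeps the per-service documentation next to the decision itself, and the explicit `UNDECIDED` result avoids silently defaulting a service to either mode.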