Progressive vs Rolling: Decision Math

Weighing infrastructure cost against safety when choosing a deployment strategy.

Progressive and rolling are different

Rolling and progressive look similar but differ on traffic control. Rolling replaces pods batch by batch; progressive gates how much traffic the new version sees. Progressive is rolling plus a traffic-shifting layer that gives you the cancel button.

Rolling. Batch-by-batch pod replacement; no traffic split; each new pod takes its full share of production traffic as soon as it passes readiness.
Progressive (canary). Fractional traffic split between new and old versions; new gets 1%, then 10%, then 50% as health gates pass.
Both can coexist. Progressive is rolling plus traffic gating; same underlying replacement primitive with extra control on top.
Documented mode per deploy. Explicit choice per service; the discipline catches "we always use progressive" applied where it does not earn its keep.
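The difference in exposure can be sketched numerically. This is a minimal illustration, assuming the load balancer spreads requests evenly across ready pods; the function names and the pod counts are hypothetical, and only the 1%, 10%, 50% steps come from the text above.

```python
def rolling_exposure(total_pods: int, batch_size: int) -> list[float]:
    """Share of traffic hitting the new version after each rolling batch.

    Assumes requests are spread evenly across ready pods, so the new
    version's traffic share equals its share of pods.
    """
    if total_pods <= 0 or batch_size <= 0:
        raise ValueError("total_pods and batch_size must be positive")
    shares = []
    replaced = 0
    while replaced < total_pods:
        replaced = min(replaced + batch_size, total_pods)
        shares.append(replaced / total_pods)
    return shares


def canary_exposure(step_percents: list[int]) -> list[float]:
    """Share of traffic hitting the new version at each canary step.

    The share is set by the traffic-shifting layer, not by pod count.
    """
    return [percent / 100 for percent in step_percents]


if __name__ == "__main__":
    # Hypothetical service with 4 pods, replaced one at a time.
    print(rolling_exposure(total_pods=4, batch_size=1))  # [0.25, 0.5, 0.75, 1.0]
    # Canary steps taken from the text: 1%, 10%, 50%.
    print(canary_exposure([1, 10, 50]))  # [0.01, 0.1, 0.5]
```

With four pods, the first rolling batch already puts a quarter of traffic on the new version, while the first canary step holds it at 1% until a health gate passes.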

When rolling is enough

Rolling fits stateless, well-tested, lower-risk services. Partial-traffic gating does not add safety when the test suite already catches the regressions, and the operational simplicity is worth the trade.

Stateless with strong tests. Test-suite confidence makes partial-traffic gating redundant for many services.
Internal tools, low-impact services. A few minutes of bad pods is acceptable when the user surface is small.
Canary infrastructure absent. Rolling works with vanilla Kubernetes; progressive needs Argo Rollouts or a service mesh you have to operate.
Documented "rolling fits" rationale per service. Named driver per service; without it, the gap surfaces as "we should have been using progressive" in the middle of an incident.

When progressive wins

Progressive fits customer-facing, high-blast-radius, rare-bug-prone services. Automated rollback at the canary stage halts before customers feel the impact, which is the entire point.

Customer-facing services with SLOs. Auto-rollback before customer impact; progressive lets you halt at 1% rather than discover at 100%.
High-blast-radius changes. Database migrations, auth changes, payment-path updates; progressive caps the population that can be hurt.
Services with rare bugs. Surface to a small fraction first; rolling has no gate between batches, so exposure climbs to everyone and the rare bug becomes a major incident.
Documented "progressive required" tag per service. Explicit policy per service; the tag catches premature simplification when teams switch to rolling for speed.
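The "halt at 1% rather than discover at 100%" argument reduces to simple arithmetic. The sketch below is a rough model under stated assumptions: a constant request rate, a fixed time to detect the regression, and every request routed to the bad version counted as exposed. The function name and all numbers are hypothetical.

```python
def requests_exposed(
    traffic_percent: float,
    requests_per_second: float,
    seconds_to_detect: float,
) -> float:
    """Requests served by the bad version before it is detected and halted.

    Rough model: constant request rate, fixed detection time, and a fixed
    share of traffic routed to the new version during that window.
    """
    if not 0 <= traffic_percent <= 100:
        raise ValueError("traffic_percent must be between 0 and 100")
    return requests_per_second * seconds_to_detect * traffic_percent / 100


if __name__ == "__main__":
    # Hypothetical service: 200 requests/s, regression detected after 120 s.
    print(requests_exposed(1, 200, 120))    # 240.0 at the 1% canary step
    print(requests_exposed(100, 200, 120))  # 24000.0 once fully rolled out
```

Under these assumptions the canary step cuts exposure by the same factor as the traffic split, here 100x. A real rolling deploy sits somewhere between the two figures, because exposure grows batch by batch rather than starting at 100%.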

Infrastructure cost

Progressive has real infrastructure cost. Service mesh or load balancer with traffic shifting, per-cohort metrics, and an operational learning curve all add up; do not adopt it where rolling already works.

Service mesh or LB with shifting. Istio, Linkerd, or Argo Rollouts plus Ingress; required infrastructure for the traffic-gating layer.
Per-cohort metrics. Canary-versus-baseline error and latency split; without per-cohort metrics, progressive adds no safety.
Operational learning curve. The progressive abstractions (analysis runs, gates, rollback policies) carry a real teaching cost for the team.
Named owner per cluster. Maintaining team for the progressive infrastructure; stale or misconfigured rollouts produce false confidence.
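A health gate built on per-cohort metrics might look like the following sketch. It is not the logic of Argo Rollouts or any specific mesh; the class, the function, and every threshold (max_ratio, min_requests, error_floor) are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortStats:
    """Request and error counts for one cohort (canary or baseline)."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def gate_decision(
    canary: CohortStats,
    baseline: CohortStats,
    max_ratio: float = 2.0,      # assumed: canary may err at most 2x baseline
    min_requests: int = 500,     # assumed: minimum sample before judging
    error_floor: float = 0.001,  # assumed: tolerance when baseline is near zero
) -> str:
    """Return "wait", "promote", or "rollback" for the current canary step."""
    if canary.requests < min_requests:
        return "wait"  # too little canary traffic to compare cohorts yet
    allowed = max(baseline.error_rate * max_ratio, error_floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


if __name__ == "__main__":
    baseline = CohortStats(requests=100_000, errors=1_000)  # 1% error rate
    print(gate_decision(CohortStats(1_000, 50), baseline))  # rollback
    print(gate_decision(CohortStats(1_000, 10), baseline))  # promote
    print(gate_decision(CohortStats(100, 0), baseline))     # wait
```

The gate can only compare cohorts if metrics are labeled by version. With a single blended error rate, a canary at 1% of traffic barely moves the aggregate and the regression passes unnoticed.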

Decision rule

The decision is driven by service tier. Customer-facing tier-1 services run progressive; backend services with strong tests run rolling; database migrations and shard rollouts go out progressively, region by region and shard by shard.

Customer-facing tier-1: progressive. User-impact-prevention rule; the canary catches what tests miss before customers see it.
Backend with tests: rolling. Test-confidence rule; operational simplicity wins where the test suite already catches regressions.
Database migrations: progressive per region, per shard. Controlled blast-radius rule for irreversible changes.
Do not make everything progressive. Operational-cost-versus-benefit rule; mandating progressive on low-risk services pays the infrastructure cost without the safety return.
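The rules above can be written down as a small lookup so the choice and its rationale are recorded per service. This is a sketch only: the Service fields, the mode strings, and the rule precedence (migrations first, then tier-1, then tested backends) are assumptions, and the "undecided" fallback is an addition for services that match no rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical per-service attributes used to pick a deploy mode."""
    name: str
    customer_facing_tier1: bool = False
    database_migration: bool = False
    strong_tests: bool = False


def choose_mode(service: Service) -> tuple[str, str]:
    """Return (mode, rationale); the first matching rule wins."""
    if service.database_migration:
        return ("progressive per region, per shard",
                "controlled blast radius for irreversible changes")
    if service.customer_facing_tier1:
        return ("progressive",
                "canary catches what tests miss before customers see it")
    if service.strong_tests:
        return ("rolling",
                "test suite already catches regressions; simplicity wins")
    # Not covered by the rules above: force an explicit, documented choice.
    return ("undecided", "no rule matched; document a rationale")


if __name__ == "__main__":
    for svc in (
        Service("checkout", customer_facing_tier1=True),
        Service("report-builder", strong_tests=True),
        Service("orders-schema", database_migration=True),
    ):
        mode, why = choose_mode(svc)
        print(f"{svc.name}: {mode} ({why})")
```

Returning the rationale alongside the mode keeps the documented per-service driver next to the decision itself.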