Progressive Delivery: Canary, Blue-Green, and Feature Flags Compared
All three are sold as the same thing. They are not. Each fails in a different way; each fits a different kind of change. Here is the matrix.
Three patterns, three different jobs
The shorthand "progressive delivery" lumps three patterns together: canary (gradual percentage rollout), blue-green (instant cutover with a fallback), and feature flags (in-app conditional logic). They all reduce blast radius, but they reduce different kinds of blast radius. Mixing them up leads to picking the wrong one and being surprised when it fails.
Canary deployment
One percent of traffic to the new version. Watch metrics. If healthy, ramp to 5%, 25%, 50%, 100%. If errors, abort.
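A minimal sketch of that ramp loop, assuming hypothetical `set_traffic_split` and `error_rate` stand-ins backed by your mesh and metrics store (neither is a real API):

```python
import time

RAMP_STEPS = [1, 5, 25, 50, 100]   # percent of traffic to the new version
ERROR_BUDGET = 0.01                # abort if error rate exceeds 1%
SOAK_SECONDS = 300                 # watch each step for 5 minutes

def set_traffic_split(canary_percent: int) -> None:
    """Stand-in for your mesh's traffic-split API (Istio, Linkerd, etc.)."""
    print(f"routing {canary_percent}% of traffic to canary")

def error_rate() -> float:
    """Stand-in for a query against your metrics store."""
    return 0.0

def run_canary() -> bool:
    for step in RAMP_STEPS:
        set_traffic_split(step)
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if error_rate() > ERROR_BUDGET:
                set_traffic_split(0)   # abort: all traffic back to stable
                return False
            time.sleep(10)
    return True                        # healthy at 100%
```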
Best for: infrastructure-level changes (new dependency version, refactored hot path, runtime upgrade) where the change might cause systemic failure visible in error rate or latency.
Failure mode: blind to changes that hit only specific users (a tenant whose data triggers the bug). The 1% sample might miss it; you ship to 100% and the affected tenant's traffic finally hits the bug at 5pm Friday.
Tooling: Argo Rollouts, Flagger, Istio's traffic-splitting, Linkerd's traffic-split. All do the same thing, with varying ergonomics.
Blue-green deployment
Stand up the new version (green) alongside the old (blue). Swap the load balancer atomically. If errors, swap back.
Best for: changes that cannot run side-by-side (database schema migrations that require all-or-nothing, breaking API changes that need a clean cut).
Failure mode: doubles infrastructure cost during the cutover window. Schema migrations that span both versions need backward-compatible expansion-then-contraction (the dance is well-documented and worth doing right).
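The shape of that dance, sketched against SQLite purely for self-containment (real migrations run as separate deploys, with both app versions live between the expand and contract steps; `DROP COLUMN` needs SQLite 3.35+):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, fullname TEXT)")
db.execute("INSERT INTO users (fullname) VALUES ('Ada Lovelace')")

# Expand: add the new column. Old code keeps writing fullname; new code
# writes both. The schema now supports blue AND green simultaneously.
db.execute("ALTER TABLE users ADD COLUMN display_name TEXT")

# Backfill: copy old data into the new column while both versions run.
db.execute("UPDATE users SET display_name = fullname WHERE display_name IS NULL")

# Contract: only after every live version reads display_name, drop the old
# column. This step is a separate, later deploy -- never the same one.
db.execute("ALTER TABLE users DROP COLUMN fullname")

print(db.execute("SELECT id, display_name FROM users").fetchall())
```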
Tooling: any load balancer plus careful orchestration. Modern platforms (Kubernetes Services with selectors, AWS ELB target groups) make the swap a metadata change.
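In the Kubernetes case, the "metadata change" is literally one patch to the Service selector. A sketch with the official Python client (the Service, namespace, and label names are illustrative):

```python
from kubernetes import client, config

def swap_to(color: str, service: str = "myapp", namespace: str = "prod") -> None:
    """Point the Service at the 'blue' or 'green' Deployment's pods.

    Rollback is the same call with the other color.
    """
    config.load_kube_config()
    v1 = client.CoreV1Api()
    patch = {"spec": {"selector": {"app": "myapp", "color": color}}}
    v1.patch_namespaced_service(service, namespace, patch)

# swap_to("green")   # cutover
# swap_to("blue")    # instant rollback
```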
Feature flags
Code paths gated by runtime configuration. The new feature ships dark to 100% of users, then is enabled for 1%, 10%, and so on. Independent of the deploy.
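The core mechanism in every flag SDK is deterministic bucketing: hash the user ID with the flag key so the same user gets the same answer at any percentage. A minimal sketch (not any vendor's actual algorithm):

```python
import hashlib

def flag_enabled(flag_key: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a (flag, user) pair into [0, 100).

    The same pair always lands in the same bucket, so ramping 1% -> 10%
    only ever adds users; nobody flaps between variants.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent

# Ship dark, then ramp:
# flag_enabled("new-pricing", uid, 0)    -> always False (dark)
# flag_enabled("new-pricing", uid, 10)   -> True for ~10% of users
```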
Best for: product changes (new UI, new pricing, new flow) where you want to vary exposure by user, segment, or geo. Also for kill-switches on risky features that have already shipped.
Failure mode: flag debt. The dark code paths accumulate; cleanup never happens; someone six months later flips a flag they forgot was wired to dead code. Strict expiry policies and quarterly reviews are non-negotiable.
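One cheap enforcement mechanism, sketched: register every flag with an owner and an expiry, and fail CI when one lapses. The registry format here is an assumption, not any vendor's feature:

```python
from datetime import date
import sys

# Every flag must declare an owner and an expiry when it is created.
FLAG_REGISTRY = {
    "new-pricing": {"owner": "growth",   "expires": date(2025, 3, 1)},
    "checkout-v2": {"owner": "payments", "expires": date(2025, 6, 15)},
}

def check_flag_expiry() -> int:
    """Return nonzero (failing the CI job) if any flag is past its TTL."""
    expired = [name for name, meta in FLAG_REGISTRY.items()
               if meta["expires"] < date.today()]
    for name in expired:
        print(f"flag '{name}' expired {FLAG_REGISTRY[name]['expires']}: "
              f"remove it or renew the TTL")
    return 1 if expired else 0

if __name__ == "__main__":
    sys.exit(check_flag_expiry())
```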
Tooling: LaunchDarkly, Statsig, Unleash, GrowthBook. All work; the choice is more about ecosystem fit than features.
The matrix: which to pick when
Hot-path code change with no API surface change: canary. The metrics will tell you within 5 minutes whether it is broken.
Schema migration or breaking API change: blue-green plus expansion-then-contraction. Side-by-side does not work.
New product feature, want to vary exposure: feature flag. Decouples the deploy (whenever) from the launch (when product is ready).
Combo case (most real launches): ship behind a feature flag using a canary deploy. The deploy is risk-managed by canary; the launch is risk-managed by flag. They are not redundant.
Antipatterns
Canary as the only mechanism for product launches. A canary at 1% traffic is not a product test. The user-experience question deserves a feature flag.
Blue-green for routine deploys. The infrastructure cost is real. Reserve for changes that need it.
Feature flags without expiry dates. The flag was meant for a 2-week launch; it is still in the codebase 14 months later. Set TTLs.
What to do this week
Three moves. (1) Audit your last 10 deploys. Categorise each as "should have used canary / blue-green / flag / mix." Align your tooling investment with that distribution. (2) Set a quarterly flag-cleanup cadence; pick one stale flag and remove it this week. (3) Add automated rollback triggers to your canary tooling: if error rate exceeds X for Y minutes, roll back without human input (a sketch follows).
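A sketch of move (3), assuming a metrics query you supply; Argo Rollouts and Flagger ship this logic natively, so treat this as the fallback for home-grown pipelines:

```python
import time

ERROR_THRESHOLD = 0.05   # X: 5% error rate
WINDOW_MINUTES = 5       # Y: breach must be sustained this long

def current_error_rate() -> float:
    """Stand-in for a metrics query, e.g. a PromQL ratio of errors to requests."""
    return 0.0

def rollback() -> None:
    """Stand-in: whatever aborts the canary (e.g. traffic split back to 0%)."""
    print("rolling back without human input")

def watch_canary() -> None:
    breach_started = None
    while True:
        if current_error_rate() > ERROR_THRESHOLD:
            breach_started = breach_started or time.time()
            if time.time() - breach_started >= WINDOW_MINUTES * 60:
                rollback()
                return
        else:
            breach_started = None   # breach must be sustained, not cumulative
        time.sleep(15)
```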