Canary vs Feature Flag
Two ways to reduce deploy risk.
Canary
Canary deploys and feature flags are both ways to reduce the blast radius of a change, but they operate on different layers and catch different classes of bugs. Treating them as alternatives leads to picking one and missing what the other would have caught. The right framing is that they are complementary, not redundant, and a serious deploy practice uses both.
What canary actually catches:
- Infrastructure-level traffic split: A canary releases the new artifact to a small percentage of production traffic (5%, then 25%, then 50%, then 100%). The split happens at the load balancer, the service mesh, or the ingress. The same code that will eventually serve 100% of traffic runs against 5% of users first.
- Catches infrastructure and runtime issues: Memory leaks, startup failures, dependency incompatibilities, cluster scheduling problems. These are bugs that emerge from how the binary runs, not from the logic it implements. Canary surfaces them within minutes of the deploy starting.
- Catches deploy-time regressions: The new code interacts with existing data, existing schema, existing caches. Canary surfaces incompatibilities (a query that runs fine against test data but slows on real production data) within the first traffic step.
- Metric-gated promotion: Each canary step is gated on metric analysis: error rate, latency, saturation. If any metric degrades during the soak window, the canary halts and rolls back. This property is what makes canary an automated safety net rather than a manual procedure.
- Granularity is traffic percentage: Canary cannot route specific users to the new version; it routes a random percentage of traffic, so any user might land on either version on any given request. That is fine for backend services where users do not perceive version differences in the data, and a problem for user-facing changes that need to look consistent within a session.
Canary is the right tool when the risk you are mitigating is at the binary or runtime level. It catches the bugs that are independent of which user happens to be making the request.
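The metric-gated promotion loop above can be sketched in a few lines. This is a minimal illustration, not a real controller: `set_canary_weight()` and `read_metrics()` are hypothetical stand-ins for adjusting the traffic split at the load balancer or mesh and for querying a metrics store for the canary pods.

```python
import time

# Hypothetical sketch of metric-gated canary promotion. The traffic steps
# and thresholds are illustrative; a real controller (Argo Rollouts,
# Flagger) drives the mesh and queries Prometheus for the analysis.

TRAFFIC_STEPS = [5, 25, 50, 100]   # percentage of traffic on the new artifact

def set_canary_weight(percent: int) -> None:
    print(f"routing {percent}% of traffic to the canary")

def read_metrics() -> dict:
    # Placeholder: a real system queries the canary pods' metrics here.
    return {"error_rate": 0.001, "p99_latency_ms": 180}

def metrics_healthy(m: dict) -> bool:
    # Gate each step on error rate and latency (and, in practice, saturation).
    return m["error_rate"] < 0.01 and m["p99_latency_ms"] < 500

def run_canary(soak_seconds: int = 300) -> bool:
    for step in TRAFFIC_STEPS:
        set_canary_weight(step)
        time.sleep(soak_seconds)        # soak window at this traffic level
        if not metrics_healthy(read_metrics()):
            set_canary_weight(0)        # halt: all traffic back to stable
            return False                # caller rolls back the artifact
    return True                         # promoted to 100%
```

The fail-closed shape is the point: any degraded metric at any step sends all traffic back to the stable version without human intervention.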
Feature flag
Feature flags operate inside the application code. They wrap a code path in a conditional that checks a runtime flag value. The new behavior runs only for users for whom the flag is on; everyone else gets the old behavior. The same binary produces both behaviors based on the flag.
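The conditional described above can be as small as one branch. This sketch uses a hypothetical in-memory flag client; real platforms expose a similar per-request evaluation call that takes a flag key and user context.

```python
# Minimal sketch of a flag-wrapped code path. FlagClient is a hypothetical
# stand-in: real flag services evaluate targeting rules from a managed
# ruleset rather than a hardcoded set of user IDs.

class FlagClient:
    def __init__(self, enabled_users: set):
        self._enabled = enabled_users

    def is_on(self, flag_key: str, user_id: str) -> bool:
        # Evaluated per request, inside the running service.
        return user_id in self._enabled

flags = FlagClient(enabled_users={"alice"})

def price_quote(user_id: str, subtotal: float) -> float:
    if flags.is_on("new-pricing-logic", user_id):
        return subtotal * 0.9   # new behavior, only for flagged users
    return subtotal             # old behavior for everyone else
```

The same binary serves both branches; which one runs is decided per request by the flag value, not by which instance the load balancer picked.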
- Application-level gating: The flag check happens inside the running service, after the request has been routed. The flag service knows about user identities, account types, geographic regions, and any other dimension you have configured. The flag value is evaluated per request.
- Catches feature-level issues: Flags catch bugs in the new feature itself: business logic regressions, UX problems, data shape mismatches that only appear when the feature is exercised. An infrastructure-level canary cannot detect these, because until the flag is on, nothing looks different from the infrastructure's perspective.
- Per-user, per-account, per-segment: Roll out the feature to internal employees first, then beta customers, then 1% of paid customers, then 100%. Each cohort can be defined precisely. This granularity is impossible with canary, which is random per request.
- Stable for the user across requests: A user who has the flag on sees the new behavior on every request, not randomly. Their session is consistent. This matters for any feature where the user can notice version differences (UI changes, workflow changes, anything stateful).
- Independent of deploy: Flags can be toggled without deploying. A feature can ship dark (flag off) for weeks, then be enabled for a small cohort, then expanded. Disabling a buggy feature is a flag flip, not a code rollback; mean time to mitigate drops from minutes to seconds.
Feature flags are the right tool when the risk you are mitigating is at the feature behavior level. They catch the bugs that depend on which user is making the request.
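The "stable for the user" property usually comes from deterministic bucketing rather than per-request randomness. A common technique, sketched here with illustrative names (not any vendor's API), is to hash the flag key and user ID together into a stable bucket:

```python
import hashlib

# Sketch of deterministic percentage bucketing, the usual mechanism behind
# "1% of users" flag rollouts. Hashing (flag, user) gives each user a stable
# bucket, so the same user gets the same variant on every request -- unlike
# canary, which re-rolls the dice per request at the load balancer.

def bucket(flag_key: str, user_id: str) -> int:
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100    # stable bucket in [0, 100)

def is_on(flag_key: str, user_id: str, rollout_percent: int) -> bool:
    # Raising rollout_percent only ever adds users; nobody flips back off.
    return bucket(flag_key, user_id) < rollout_percent
```

Including the flag key in the hash also decorrelates rollouts: a user in the first 10% for one flag is not automatically in the first 10% for every flag.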
Both
The mature deploy practice uses canary and feature flags together. Each catches a different failure mode, and the cost of running both is small once both are in place.
- Different layers, different bugs: Canary catches "the new build crashes on startup," "memory usage doubled," "the GC pause is longer." Feature flags catch "the new pricing logic produces wrong totals for tier-2 customers," "the redesigned UI hides the save button on Safari." Neither tool catches the other's category.
- Layered defense: A new feature ships behind a flag, off by default. The deploy of the new code goes through canary. Once canary completes (the binary is healthy), the flag rollout begins (feature behavior is verified per cohort). Both layers fail closed; either can roll back independently.
- Different mean time to mitigate: A canary rollback takes minutes (redeploy the previous artifact). A flag flip takes seconds. Critical features that need fast disable get flag protection on top of canary protection. Together the two layers shorten mitigation time end to end.
- Different audit trail: Canary records the deploy event; flags record the rollout cohorts. Together they tell the full story of how a change reached customers. For incident retros, both timelines are valuable.
- Cost is bounded: Once you have a canary controller and a feature flag service, using both for every meaningful change is operationally cheap. The cost is in setting up the infrastructure once, not in using it per change.
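The layered sequence described above, compressed into a sketch. Every function here is an illustrative stand-in for the corresponding controller or flag-service call:

```python
# Hypothetical sketch of the layered rollout: ship dark behind a flag,
# gate the binary through canary, then ramp the flag per cohort.

def deploy_with_canary(artifact: str) -> bool:
    # Layer 1: binary-level risk. Staged traffic, metric-gated promotion.
    print(f"canary: promoting {artifact} through 5% -> 25% -> 50% -> 100%")
    return True   # True once every step passed its metric gate

def ramp_flag(flag_key: str, cohorts: list) -> None:
    # Layer 2: feature-level risk. Per-cohort, reversible in seconds.
    for cohort in cohorts:
        print(f"flag {flag_key}: enabled for {cohort}")

def release(artifact: str, flag_key: str) -> bool:
    if not deploy_with_canary(artifact):    # layer 1 fails closed
        return False                        # flag never turns on
    ramp_flag(flag_key, ["employees", "beta customers",
                         "1% of paid customers", "everyone"])
    return True
```

The ordering matters: the flag ramp starts only after the canary proves the binary healthy, so a feature-level rollback (flag off) never has to double as an infrastructure-level one.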
Canary and feature flags are not competing tools. They are complementary defenses against different bug classes. Nova AI Ops integrates with canary controllers (Argo Rollouts, Flagger) and feature flag platforms (LaunchDarkly, Unleash, Statsig), correlates the deploy events with the flag rollout events on a single timeline, and surfaces the cross-layer signal that distinguishes a canary-caught regression from a flag-caught feature bug.