Canary Metric Gates
Per-metric gates that decide whether a canary deploy advances or rolls back.
Metrics
The point of canary deploys is to catch regressions before they affect all users. The mechanism that catches them is the metric gate: an automated check that compares the canary's behavior to the baseline and decides whether to advance. Without metric gates, canary is just "deploy slowly and hope someone is watching"; with them, canary becomes a real safety system.
The metrics that actually matter at the gate:
- Latency. p50, p95, and p99 on the canary versus the baseline. A regression of 50 ms at p99 sounds small but is a real degradation that users will perceive. The gate compares the two populations directly; a statistically significant difference fails the gate.
- Error rate. 5xx rate, 4xx rate (depending on whether 4xx is a service issue or a client issue), and specific error types if instrumented. The canary's error rate is compared to the baseline's. A spike in errors that only the canary sees is the strongest signal of regression.
- Business KPIs. The metrics that tie technical changes to business outcomes: conversion rate, add-to-cart rate, successful payment rate. These are the metrics most teams forget to gate on, and they catch the bugs that look fine technically but break the user flow in subtle ways.
- Per-canary baseline. The baseline for comparison is the production traffic NOT served by the canary, sampled at the same time (see the query sketch after this list). Comparing the canary to historical data introduces time-of-day and day-of-week noise; comparing it to the live baseline cancels that noise out.
- Saturation metrics. CPU, memory, GC, and connection pool utilization on the canary. A regression that pushes the canary toward saturation surfaces here even when it has not yet caused a user-visible failure. Catching these early prevents the canary from becoming the thing that breaks production.
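As a concrete illustration of the live-baseline comparison, here is a minimal sketch of paired Prometheus queries. The metric name and the `track` label values are assumptions about how this particular deployment tags canary versus stable pods, not a fixed convention:

```python
WINDOW = "5m"  # both populations sampled over the same live window

def p99_query(track: str) -> str:
    # `track` distinguishes canary pods from the stable baseline pods.
    return (
        "histogram_quantile(0.99, sum(rate("
        f'http_request_duration_seconds_bucket{{track="{track}"}}[{WINDOW}]'
        ")) by (le))"
    )

canary_p99_query = p99_query("canary")
baseline_p99_query = p99_query("stable")
```

Because both queries cover the same five-minute window, time-of-day and day-of-week effects affect both populations equally and cancel out of the comparison.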
The metric set is small, specific, and per-service. Generic gates (5xx less than 1%) are weaker than service-specific gates (checkout success rate within 0.5 percentage points of baseline). The investment in calibrating metrics per service is what makes the gate trustworthy.
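What a calibrated, service-specific gate set might look like, expressed as data. The names, schema, and numbers here are illustrative, lifted from the examples in the surrounding text, not a canonical format:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gate:
    metric: str       # name of the series in the metrics store
    comparison: str   # how the canary is compared to the live baseline
    threshold: float  # calibrated per service, reviewed over time

# Hypothetical gate set for a checkout service. Each gate is
# calibrated separately rather than inheriting a generic default.
CHECKOUT_GATES = (
    Gate("p99_latency", "max_relative_increase", 0.05),         # no more than 5% over baseline
    Gate("http_5xx_rate", "max_absolute_increase", 0.001),      # no more than 0.1 pp increase
    Gate("checkout_success_rate", "max_absolute_drop", 0.005),  # within 0.5 pp of baseline
)
```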
Thresholds
Each metric has a threshold and a comparison method. The threshold is what separates "the canary is fine" from "the canary is regressing." Setting thresholds correctly is the core work of calibrating a canary; setting them wrong produces either false passes (regressions slip through) or false failures (good deploys get blocked).
- Each metric has a threshold. Not a single global threshold; per-metric, per-service. The latency threshold might be "no more than a 5% increase in p99"; the error threshold might be "no more than a 0.1 percentage point increase in 5xx rate"; the business KPI threshold might be "no more than a 1% drop in conversion." Each is calibrated separately.
- Statistical significance, not just absolute difference. A 0.5% drop in conversion rate could be noise on a small canary. The threshold uses statistical tests (Mann-Whitney U for distributions, a two-proportion z-test for rates) to confirm the difference is real before failing the gate; see the sketch after this list. Without statistical significance, gates flap on noise.
- Threshold breach pauses the ramp. When any metric breaches its threshold, the canary stops advancing. It does not necessarily roll back immediately; the system holds at the current ramp percentage to let the team investigate. If the breach persists past the soak window, rollback fires.
- Automated, not advisory. The threshold check runs continuously during the canary. The deploy pipeline reads the result. The decision to advance, hold, or roll back is automatic. A human can override, but the default is the automation, not the manual call.
- Tuned over time. Thresholds drift. As the system evolves, the right threshold for "fine" shifts. The team reviews thresholds quarterly: which gates fired falsely, which let real regressions through, which need adjusting. The discipline is calibration, not set-and-forget.
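A minimal sketch of the two significance tests named above, using SciPy. Sample collection, windowing, and the alpha values are assumptions for illustration:

```python
import math
from scipy import stats

def latency_regressed(canary_ms, baseline_ms, alpha=0.01):
    """One-sided Mann-Whitney U: are canary latencies stochastically
    greater than the live baseline's? Inputs are raw latency samples."""
    _, p_value = stats.mannwhitneyu(canary_ms, baseline_ms, alternative="greater")
    return p_value < alpha

def error_rate_regressed(canary_errors, canary_total,
                         base_errors, base_total, alpha=0.01):
    """One-sided two-proportion z-test: is the canary's error rate
    significantly higher than the live baseline's?"""
    p_canary = canary_errors / canary_total
    p_base = base_errors / base_total
    pooled = (canary_errors + base_errors) / (canary_total + base_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / canary_total + 1 / base_total))
    z = (p_canary - p_base) / se
    p_value = stats.norm.sf(z)  # upper tail: canary worse than baseline
    return p_value < alpha
```

A gate would combine a check like this with the absolute threshold: it fails only when the difference is both statistically significant and larger than the calibrated limit.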
Thresholds done right are why canary is trusted enough to run unattended. Thresholds done wrong produce a system the team mistrusts, which leads to manual override, which defeats the whole point.
Decide
The decision phase is where the gate either advances the canary or rolls it back. The discipline is to keep this fast, automated, and trustworthy. A slow or human-bottlenecked decision phase reintroduces all the problems canary was supposed to solve.
- Pass: ramp. When all metric gates pass for the configured soak window, the canary advances to the next traffic percentage. The advancement is automatic; no human approval is required. The next stage starts immediately and the gates re-evaluate against the new baseline.
- Fail: rollback. When any gate fails for a sustained duration past the soak window, the canary rolls back. The previous version takes the traffic; the canary version is removed. The deploy is marked failed, and the engineering team is notified with the gate that fired.
- No human in the inner loop. The advance and rollback decisions happen without waiting for human approval (see the control-loop sketch after this list). Humans set the policy (which metrics, what thresholds, how long the soak); the system enforces it. The human approves the start of the deploy and the final promotion to 100%, but not every intermediate step.
- Override available, not default. The on-call can manually advance or roll back if they see something the gates do not. The override path is documented and audited. But the override is the exception, not the routine. Routine canaries flow through automation.
- Postmortem on every failed canary. When a canary rolls back, the team writes a brief retro: which gate fired, what the regression was, what changed in the code, how to prevent it next time. These retros build institutional knowledge of which kinds of changes need extra care.
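Put together, the decision phase is a small control loop. A minimal sketch under assumptions: the ramp steps, soak window, breach budget, and the four callbacks are hypothetical, standing in for whatever the canary controller actually provides:

```python
import time
from enum import Enum, auto

class Verdict(Enum):
    PASS = auto()
    BREACH = auto()

RAMP_STEPS = (1, 5, 25, 50)  # intermediate steps; 100% is a human-approved promotion
SOAK_SECONDS = 600           # each step must hold a clean soak before advancing
MAX_BREACH_SECONDS = 900     # a breach sustained this long triggers rollback

def run_canary(evaluate_gates, set_traffic, rollback, notify, poll_seconds=30):
    """Advance on a clean soak, hold on breach, roll back when a breach persists.
    `evaluate_gates` compares the canary to the live baseline and returns a Verdict."""
    for pct in RAMP_STEPS:
        set_traffic(pct)
        clean_since = time.monotonic()
        breach_since = None
        while True:
            now = time.monotonic()
            if evaluate_gates() is Verdict.BREACH:
                breach_since = breach_since or now  # hold: ramp stops advancing
                clean_since = None                  # soak clock resets
                if now - breach_since > MAX_BREACH_SECONDS:
                    rollback()
                    notify(f"canary rolled back at {pct}% traffic")
                    return False
            else:
                breach_since = None
                clean_since = clean_since or now
                if now - clean_since >= SOAK_SECONDS:
                    break  # clean soak complete: advance to the next step
            time.sleep(poll_seconds)
    return True  # all intermediate steps passed; ready for promotion to 100%
```

Note the asymmetry the text calls for: a breach pauses the ramp and holds at the current percentage, and only a breach that outlasts its budget converts into an automatic rollback.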
Canary metric gates are the property that makes progressive delivery actually safe. Without them, canary is theater; with them, canary is the safety net that lets the team ship fearlessly. Nova AI Ops integrates with canary controllers (Argo Rollouts, Flagger, Spinnaker) to evaluate metric gates against the SLO definitions you already use, and surfaces the per-canary decision history so the team can see which kinds of changes routinely fail at which gates.