Canary Noise Tolerance
Don't fail a canary on minor noise.
Noise floor thresholds
Canary noise tolerance is the discipline of distinguishing real degradation from natural variance. Without tolerance, every traffic blip causes rollback; with tolerance, real issues trigger rollback while noise does not.
What thresholds look like:
- 1-2% deviation accepted as noise: Below this threshold, the canary does not fail. Some variance is normal, so small deviations do not trigger rollback.
- Per-metric tuning: Different metrics have different normal variance, so each threshold is set to match the variance actually observed for that metric.
- Latency tolerates more noise than error rate: A small latency spike is normal; a small error-rate spike often indicates a real issue. The thresholds reflect this asymmetry.
- Business hours vs off-hours: Off-hours traffic is lower, so percentage variance is higher because the absolute numbers are smaller. The thresholds adjust for this.
- Document per-metric thresholds: The team's documentation records why each threshold has its value, so future tuning has context.
The thresholds are the foundation. Without them, canary alerts fire on noise.
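The threshold rules above can be sketched as a small lookup table plus one comparison. This is a minimal illustration, not a prescribed implementation: the names `NoiseThreshold`, `THRESHOLDS`, and `exceeds_noise_floor` are hypothetical, the numeric values are placeholders, and the sketch assumes "deviation" means relative change of the canary against the baseline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NoiseThreshold:
    """Relative deviation (as a fraction) tolerated before a metric counts against the canary."""
    business_hours: float
    off_hours: float
    rationale: str  # documented reason for the chosen values


# Hypothetical values for illustration only; tune from each metric's observed variance.
THRESHOLDS = {
    "error_rate": NoiseThreshold(0.01, 0.02, "Small error-rate spikes often indicate real issues."),
    "p99_latency": NoiseThreshold(0.02, 0.04, "Latency varies more than error rate under normal load."),
}


def exceeds_noise_floor(metric: str, baseline: float, canary: float, business_hours: bool) -> bool:
    """Return True when the canary is worse than baseline by more than the noise floor."""
    threshold = THRESHOLDS[metric]
    allowed = threshold.business_hours if business_hours else threshold.off_hours
    if baseline == 0:
        # No baseline signal to scale against; treat any non-zero canary value as a deviation.
        return canary > 0
    deviation = (canary - baseline) / baseline
    # Only degradation counts; a canary that is better than baseline never fails here.
    return deviation > allowed


# A 1.5% latency increase during business hours stays under the 2% floor.
print(exceeds_noise_floor("p99_latency", 200.0, 203.0, business_hours=True))
```

Keeping the rationale next to the numbers is one way to make sure the documentation travels with the threshold itself.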
Statistical significance
Beyond simple thresholds, a statistical significance test prevents random spikes from triggering rollback.
- Need stat-sig delta to fail canary: The delta between canary and baseline must be statistically significant to trigger rollback. Random spikes that do not reach significance do not.
- Random spikes do not trigger rollback: Real issues persist across the canary window while noise averages out, so the rollback rate tracks the actual issue rate.
- Sample size matters: Low-traffic services need longer canary windows to accumulate enough samples for a significance test. The window size is calibrated to traffic.
- 95% confidence threshold for rollback: The standard threshold is 95% confidence. The team is highly certain before rolling back, so false rollbacks are rare.
- Tunable per criticality: Critical services may use a higher confidence level (99%); less critical services may use a lower one (90%). The threshold matches the cost of a false rollback against the cost of a missed real issue.
Statistical significance supplies the rigor: rollback decisions are driven by data.
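One common way to test an error-rate delta is a one-sided two-proportion z-test. The sketch below is an assumption about how such a check could look, not the method any particular platform uses; the function name, the `min_requests` guard, and its default of 1000 are hypothetical. It uses only the Python standard library.

```python
import math
from statistics import NormalDist


def canary_error_rate_is_worse(
    baseline_errors: int,
    baseline_requests: int,
    canary_errors: int,
    canary_requests: int,
    confidence: float = 0.95,
    min_requests: int = 1000,  # hypothetical floor; calibrate to the service's traffic
) -> bool:
    """One-sided two-proportion z-test: is the canary error rate significantly higher?"""
    if min(baseline_requests, canary_requests) < min_requests:
        # Too few samples to judge; keep the canary window open instead of deciding on noise.
        return False
    p_base = baseline_errors / baseline_requests
    p_canary = canary_errors / canary_requests
    pooled = (baseline_errors + canary_errors) / (baseline_requests + canary_requests)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / baseline_requests + 1 / canary_requests))
    if std_err == 0:
        # Pooled rate is exactly 0 or 1, so there is no measurable delta between the groups.
        return False
    z = (p_canary - p_base) / std_err
    critical = NormalDist().inv_cdf(confidence)  # about 1.645 at 95%, about 2.326 at 99%
    return z > critical
```

With 100 errors in 100,000 baseline requests and 10 errors in 5,000 canary requests, the z-score is roughly 2.13: significant at the 95% level, not at 99%. The same delta can therefore roll back one service and not another, depending on the confidence level chosen for its criticality.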
Auto-tuning baselines
Static thresholds are starting points. Auto-tuned baselines learn from historical variance, which makes the thresholds adaptive.
- Track baseline noise per metric: The team tracks each metric's normal variance over time. The historical data builds a noise model against which future deviations are compared.
- Tune thresholds based on historical variance: The thresholds adjust to the noise model. Metrics with high natural variance get looser thresholds; quiet metrics get tighter ones.
- Detect deviations from learned baseline: Beyond fixed thresholds, the system detects deviations from the learned normal pattern. For some metrics this detection is more sensitive than a fixed threshold.
- Beats fixed thresholds: Auto-tuned baselines outperform fixed thresholds in many cases, and the alerts improve over time as the model learns.
- Re-baseline on intentional changes: A new service version that intentionally changes behavior invalidates the old baseline. The team triggers a re-baseline, and the system learns the new normal.
Auto-tuning makes the thresholds adaptive: they match the workload's actual behavior.
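A learned baseline can be as simple as a rolling mean and standard deviation per metric. The sketch below assumes that model; real systems may also account for seasonality or trends. The class name `LearnedBaseline` and its parameters (`window`, `sigmas`, `min_samples`) are hypothetical.

```python
from collections import deque
from statistics import mean, stdev


class LearnedBaseline:
    """Rolling noise model for one metric (hypothetical helper, not a real library class)."""

    def __init__(self, window: int = 288, sigmas: float = 3.0, min_samples: int = 30):
        self.samples = deque(maxlen=window)  # oldest observations fall out automatically
        self.sigmas = sigmas
        self.min_samples = min_samples

    def observe(self, value: float) -> None:
        """Record an observation taken while the service was healthy."""
        self.samples.append(value)

    def is_deviation(self, value: float) -> bool:
        """Return True when value sits above the learned mean by more than sigmas * stdev."""
        if len(self.samples) < self.min_samples:
            # The model has not learned enough yet; do not flag anything.
            return False
        return value > mean(self.samples) + self.sigmas * stdev(self.samples)

    def rebaseline(self) -> None:
        """Discard history after an intentional change so the new normal can be learned."""
        self.samples.clear()


baseline = LearnedBaseline()
for i in range(40):
    baseline.observe(100.0 if i % 2 == 0 else 102.0)  # a metric that wobbles between 100 and 102

print(baseline.is_deviation(103.0))  # inside the learned band
print(baseline.is_deviation(110.0))  # well outside it
```

A noisy metric widens its own band and a quiet metric narrows it, which is the looser/tighter behavior described in the list above.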
Manual override
Automation is not the only path. Engineers can promote or fail canaries manually based on their own judgment.
- Engineer can promote canary early if confident: The engineer's judgment can outweigh the metrics. When the engineer is confident in the change and the rollout is going smoothly, they can accelerate promotion.
- Skip remaining stages: Manual promotion skips the remaining canary stages. This is allowed when justified, and the override is logged.
- Engineer can fail canary manually: An engineer who observes issues the metrics do not capture, such as customer reports or qualitative observations, can fail the canary and trigger rollback.
- All overrides logged: Every manual override is logged. The log provides an audit trail that later analysis can reference.
- Quarterly review of override patterns: The team reviews overrides quarterly. Recurring patterns are candidates to fold into future automation.
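The logging and review steps above could be backed by a structure like the following. It is an in-memory sketch under stated assumptions: `Override`, `OverrideLog`, and their fields are hypothetical names, and a real deployment would persist entries to durable storage rather than a list.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Override:
    engineer: str
    service: str
    action: str  # "promote" or "fail"
    reason: str
    at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class OverrideLog:
    """Append-only record of manual canary overrides (illustrative, in-memory only)."""

    def __init__(self) -> None:
        self._entries: list[Override] = []

    def record(self, engineer: str, service: str, action: str, reason: str) -> Override:
        if action not in ("promote", "fail"):
            raise ValueError("action must be 'promote' or 'fail'")
        if not reason.strip():
            # An override without a stated reason is useless at review time.
            raise ValueError("an override needs a written reason")
        entry = Override(engineer, service, action, reason)
        self._entries.append(entry)
        return entry

    def summary(self, since: datetime) -> Counter:
        """Count overrides per (service, action) since the given time, for the periodic review."""
        return Counter((e.service, e.action) for e in self._entries if e.at >= since)
```

A service that shows up repeatedly under the same action in `summary` is a hint that its automated thresholds may need retuning.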
Canary noise tolerance is one of those deployment safety disciplines that pays off in deploy reliability. Nova AI Ops integrates with deployment platforms, surfaces canary patterns, and supports the team's deployment discipline.