Canary Time Window
How long to bake the canary.
Why time matters in canary
Canary duration matters because some bugs only surface under specific traffic patterns. A 10-minute window during a quiet hour catches nothing useful; a canary that runs through a typical peak catches the load-dependent regressions that would otherwise ship to production.
- Bugs need traffic patterns. Some failures show up only under a specific load shape: the peak hour, a business-hours batch job, or an end-of-month dependency.
- Quiet-window canaries miss. An under-loaded canary risks a false pass; 10 minutes in a quiet window catches nothing because nothing is happening.
- Match window to traffic shape. Choose a duration that covers the workload the change actually touches, so the canary produces meaningful signal instead of running as a ritual.
- Documented goal per canary. Name what each canary is meant to surface; a canary with no stated target can pass without proving anything.
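One way to guard against a quiet-window false pass is to refuse a verdict until the canary has seen enough traffic. The sketch below is illustrative only: `CanaryPlan`, `verdict_is_meaningful`, and the `min_requests` floor are hypothetical names and values, not part of any particular deployment tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryPlan:
    """Hypothetical record of what a canary is meant to prove."""
    goal: str          # the named "what we want to surface" target
    min_requests: int  # assumed traffic floor below which a pass is not trusted


def verdict_is_meaningful(plan: CanaryPlan, observed_requests: int) -> bool:
    """Return True only if the canary saw enough traffic to trust its result."""
    if not plan.goal.strip():
        # A canary with no documented goal is a ritual canary.
        raise ValueError("canary has no documented goal")
    return observed_requests >= plan.min_requests


plan = CanaryPlan(goal="surface peak-hour latency regressions", min_requests=50_000)
quiet_window_ok = verdict_is_meaningful(plan, observed_requests=1_200)
peak_window_ok = verdict_is_meaningful(plan, observed_requests=80_000)
```

Here the quiet-window run is rejected even if every metric looked healthy, because too little happened for the result to mean anything. The right floor depends on the service and would need tuning.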
Canary duration guidance
Duration scales with blast radius. Low-risk changes bake for minutes; medium-risk changes cover a workload peak in tens of minutes; high-blast-radius changes need hours, and database migrations sometimes need days.
- Low blast radius: 10-15 minutes. Rollback is cheap and fast, so a brief canary suffices.
- Medium: 30-60 minutes. Long enough to capture a typical workload spike before promoting.
- High blast radius: 2-24 hours. Cover a full peak-traffic window; database migrations may need 7+ days for replication-lag problems to surface fully.
- SLO check window per canary. Define how long metrics are observed for each canary; latent regressions take time to surface.
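The tiers above can be written down as a small lookup table so the minimum bake time is checked rather than remembered. This is a minimal sketch: the durations come from the list above, while `BlastRadius`, `BAKE_WINDOWS`, and `bake_is_sufficient` are hypothetical names chosen for illustration.

```python
from datetime import timedelta
from enum import Enum


class BlastRadius(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


# (minimum, maximum) bake windows, copied from the guidance above.
BAKE_WINDOWS = {
    BlastRadius.LOW: (timedelta(minutes=10), timedelta(minutes=15)),
    BlastRadius.MEDIUM: (timedelta(minutes=30), timedelta(minutes=60)),
    BlastRadius.HIGH: (timedelta(hours=2), timedelta(hours=24)),
}


def bake_is_sufficient(radius: BlastRadius, elapsed: timedelta) -> bool:
    """True once the canary has baked for at least the tier's minimum."""
    minimum, _maximum = BAKE_WINDOWS[radius]
    return elapsed >= minimum


too_short = bake_is_sufficient(BlastRadius.HIGH, timedelta(minutes=45))
long_enough = bake_is_sufficient(BlastRadius.MEDIUM, timedelta(minutes=45))
```

The same 45 minutes passes for a medium-risk change and fails for a high-blast-radius one. The multi-day database migration case is deliberately left out of the table; it would need its own entry.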
When to run the canary
Schedule canaries when the team is around to react. Avoid Friday afternoons, weekends without coverage, and quiet windows that produce false-pass results because nothing is happening.
- Avoid Friday afternoon. If the canary fails at 5pm, the team is gone and the rollback falls to whoever is still around.
- Avoid weekends without coverage. Check on-call coverage before scheduling; time to discovery grows when the on-call rotation is thin.
- Match to workload peak. Align the canary with the real peak, such as evenings for e-commerce or business hours for banking, to get realistic signal.
- Explicit start time per canary. Document the schedule for each canary; improvised timing produces inconsistent signal.
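The scheduling rules above are simple enough to check automatically before a canary starts. The following is a sketch under stated assumptions: `start_time_allowed` is a hypothetical helper, "Friday afternoon" is assumed to begin at 12:00, and `start` is assumed to be in the team's local time.

```python
from datetime import datetime

FRIDAY = 4           # datetime.weekday(): Monday is 0, Sunday is 6
AFTERNOON_HOUR = 12  # assumed cutoff for "Friday afternoon"


def start_time_allowed(start: datetime, weekend_coverage: bool) -> bool:
    """Reject starts on Friday afternoon or on a weekend without coverage."""
    weekday = start.weekday()
    if weekday == FRIDAY and start.hour >= AFTERNOON_HOUR:
        return False
    if weekday > FRIDAY and not weekend_coverage:
        return False
    return True


# 2024-03-13 is a Wednesday, 2024-03-15 a Friday, 2024-03-16 a Saturday.
midweek = start_time_allowed(datetime(2024, 3, 13, 10, 0), weekend_coverage=False)
friday_late = start_time_allowed(datetime(2024, 3, 15, 16, 0), weekend_coverage=False)
saturday = start_time_allowed(datetime(2024, 3, 16, 10, 0), weekend_coverage=False)
```

This checks only the start time. A long canary that begins on Thursday can still run into Friday evening, so a fuller check would also test the planned end time.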
Rollback timing
Rollback discipline matches canary discipline. Auto-rollback should fire within 5 minutes of an SLO regression; a rollback that itself takes 15+ minutes indicates a process problem worth fixing through practice.
- Auto-rollback within 5 minutes. Trigger on SLO regression so rollback fires fast rather than waiting for human eyes.
- Slow rollback signals trouble. A rollback that takes more than 15 minutes points to a process problem; fix it in drills, not during incidents.
- Do not extend on ambiguous results. The choice is promote or roll back; ambiguous does not mean "wait longer," it means "roll back and investigate."
- Named decision authority per canary. Assign a responsible engineer or an automated rule to each canary so the decision never stalls because no one made it.
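The promote-or-rollback rule has no third branch, which is easiest to see when written as code. This sketch uses hypothetical names (`Verdict`, `Action`, `decide`, `rollback_needs_drill`). It treats a rollback of exactly 15 minutes as slow, following the "15+ minutes" wording above.

```python
from datetime import timedelta
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    FAIL = "fail"
    AMBIGUOUS = "ambiguous"


class Action(Enum):
    PROMOTE = "promote"
    ROLLBACK = "rollback"


SLOW_ROLLBACK = timedelta(minutes=15)  # threshold taken from the text above


def decide(verdict: Verdict) -> Action:
    """Promote only on a clear pass; anything else, including ambiguous, rolls back."""
    return Action.PROMOTE if verdict is Verdict.PASS else Action.ROLLBACK


def rollback_needs_drill(duration: timedelta) -> bool:
    """Flag rollbacks that took 15 minutes or longer as candidates for practice drills."""
    return duration >= SLOW_ROLLBACK


ambiguous_action = decide(Verdict.AMBIGUOUS)
slow = rollback_needs_drill(timedelta(minutes=22))
```

Because `Action` has no "extend" member, waiting longer cannot be expressed as an outcome. An automated rule built this way can also serve as the named decision authority.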
How to set windows by deploy type
Window choice scales with deploy type. Hot config changes need 15 minutes minimum; code deploys cover a peak window; data migrations need a day or more for replication and downstream effects to surface fully.
- Hot config change. 15 minutes minimum, long enough to include one full cycle of the metrics you care about.
- Code deploy. 1-4 hours, covering a peak; the standard pattern for non-trivial changes.
- Data migration. 24-48 hours; replication lag, read paths, and write paths each need observation through real workload cycles.
- Documented window per deploy. Record an explicit canary-time decision for each deploy; this catches "we just used the default" choices made without thought.
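A deploy pipeline could enforce the documented-window rule by rejecting any deploy that arrives without an explicit window, instead of silently applying a default. The sketch below assumes hypothetical deploy-type keys and a hypothetical `validate_window` helper; the minimums are the lower bounds from the list above.

```python
from datetime import timedelta
from typing import Optional

# Lower bounds per deploy type, taken from the list above.
MIN_WINDOW = {
    "hot_config": timedelta(minutes=15),
    "code_deploy": timedelta(hours=1),
    "data_migration": timedelta(hours=24),
}


def validate_window(deploy_type: str, window: Optional[timedelta]) -> timedelta:
    """Return the window if it was stated explicitly and meets the type's minimum."""
    if window is None:
        # No silent default: a missing window means nobody made the decision.
        raise ValueError(f"{deploy_type}: no canary window documented")
    minimum = MIN_WINDOW[deploy_type]
    if window < minimum:
        raise ValueError(f"{deploy_type}: window {window} is below minimum {minimum}")
    return window


approved = validate_window("code_deploy", timedelta(hours=2))
```

Raising on a missing window is a deliberate choice here: it forces someone to make the canary-time decision for each deploy.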