The Alert Canary Pattern
A simple canary alert verifies the alerting pipeline.
What an alert canary is
An alert canary is a synthetic alert that fires on a fixed schedule, every 30 or 60 minutes, by design. If the canary stops firing, the alerting pipeline itself is broken. Because the canary travels the whole chain (metric source, evaluator, Alertmanager, paging integration, on-call schedule), a single canary validates all of it. Without a canary, a broken pipeline fails silently, and you discover it when a real outage doesn’t page.
- Synthetic periodic alert. Fires every 30-60 minutes by design; the heartbeat.
- Chain validation. Metric source, evaluator, Alertmanager, paging integration, on-call schedule.
- Silent pipeline without canary. Broken pipeline discovered when outage doesn’t page.
- Single canary covers chain. Whole-chain test; the cheapest validation.
How to wire it up
The wiring is straightforward. Either run a cron job that flips a boolean metric every 60 minutes and add an alerting rule that fires on each flip, or use the alerting tool’s built-in heartbeat (PagerDuty heartbeats, Opsgenie heartbeat API). Route the canary to a low-priority channel, not the on-call rotation, because the signal is the absence of the canary, not the canary itself. Then alert when no canary has been seen in 90 minutes, which is one 60-minute cycle plus 30 minutes of slack.
- Cron a flip metric. Plus alerting rule fires on flip; the simplest implementation.
- Vendor heartbeat APIs. PagerDuty heartbeats, Opsgenie heartbeat; built-in.
- Low-priority channel. The canary itself isn’t the signal; absence is.
- 90-minute gap threshold. No canary in 90 minutes triggers the meta-alert.
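The flip-and-gap logic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names `canary_value` and `canary_is_missing` are hypothetical, and the 60-minute flip and 90-minute gap are the only values taken from the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

FLIP_PERIOD = timedelta(minutes=60)  # canary flips once per period (from the text)
MAX_GAP = timedelta(minutes=90)      # meta-alert threshold (from the text)


def canary_value(now: datetime) -> int:
    """Return 0 or 1, flipping once every FLIP_PERIOD counted from the Unix epoch."""
    periods = int(now.timestamp() // FLIP_PERIOD.total_seconds())
    return periods % 2


def canary_is_missing(last_seen: Optional[datetime], now: datetime) -> bool:
    """True when no canary has been seen within MAX_GAP, or none has ever been seen."""
    if last_seen is None:
        return True
    return now - last_seen > MAX_GAP
```

A cron job could export `canary_value(datetime.now(timezone.utc))` as the metric, while `canary_is_missing` is the check that the meta-alert would run.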
Watching the canary watcher
The watcher must be independent. Whatever detects a missing canary has to run on different infrastructure from the alerting pipeline, because the same outage that kills alerts would also kill the canary check. A cheap pattern is a scheduled GitHub Action or a tiny external uptime monitor that checks PagerDuty’s API for a heartbeat in the last 90 minutes. When the canary stops, page on the meta-alert; this is the only time meta-alerting is justified.
- Independent infrastructure. Different from alerting pipeline; same outage doesn’t kill both.
- GitHub Action or external monitor. Tiny, scheduled; queries PagerDuty API for heartbeat.
- Page on meta-alert. Only justified meta-alerting; canary loss is high-stakes.
- Documented watcher placement. Record where each watcher runs so later migrations don’t move it onto the pipeline’s own infrastructure.
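One way to structure such a watcher is sketched below, under stated assumptions. It takes the heartbeat lookup and the paging call as parameters (`fetch_last_heartbeat` and `send_page`, both hypothetical names) rather than calling any specific vendor endpoint, since the text does not name one. Paging when the lookup itself fails is a design assumption of this sketch, not something the text specifies.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

MAX_GAP = timedelta(minutes=90)  # gap threshold from the text


def check_canary(
    fetch_last_heartbeat: Callable[[], Optional[datetime]],
    send_page: Callable[[str], None],
    now: datetime,
    max_gap: timedelta = MAX_GAP,
) -> bool:
    """Page if the canary is stale or its status cannot be read.

    Returns True when a page was sent, False when the canary looks healthy.
    """
    try:
        last_seen = fetch_last_heartbeat()
    except Exception as exc:
        # Assumption: an unreadable heartbeat is treated like a missing one.
        send_page(f"canary watcher could not read heartbeat status: {exc}")
        return True
    if last_seen is None or now - last_seen > max_gap:
        send_page(f"no alert canary seen since {last_seen}; alerting pipeline may be broken")
        return True
    return False
```

In a scheduled job, `fetch_last_heartbeat` would wrap whatever query the paging vendor offers, and `send_page` should use a delivery path that does not depend on the pipeline being watched.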
What the canary catches
Three failure classes show up: rule loading, routing, and delivery. Prometheus rule files with syntax errors silently fail to load; Alertmanager routes drop alerts because of misconfigured matchers; and delivery breaks when integration keys have rotated, on-call schedules have lapsed without a backup, or Slack webhooks return 401.
- Prometheus rule syntax errors. Silent load failures; rules don’t fire.
- Alertmanager route misconfig. Misconfigured matchers drop alerts silently.
- Integration key rotation. PagerDuty keys rotate; alerts stop reaching paging.
- Schedule lapses and webhook 401s. On-call without backup; Slack webhook auth expires.
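The routing failure is the easiest of these to picture with a toy model. The sketch below is a deliberately simplified first-match router with exact-match labels; it is not Alertmanager's actual routing implementation, and the route, label, and receiver names are all hypothetical. It shows how a single misspelled matcher value sends the canary to a fallback receiver that nobody watches.

```python
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Route = Tuple[Labels, str]  # (required label values, receiver name)


def pick_receiver(alert: Labels, routes: List[Route], default: str) -> str:
    """Return the receiver of the first route whose matchers all equal the alert's labels."""
    for matchers, receiver in routes:
        if all(alert.get(key) == value for key, value in matchers.items()):
            return receiver
    # No route matched: the alert falls through to the default receiver.
    return default


canary = {"alertname": "AlertCanary", "team": "platform"}
good_routes = [({"alertname": "AlertCanary"}, "canary-heartbeat")]
typo_routes = [({"alertname": "AlertCanray"}, "canary-heartbeat")]  # misspelled value

print(pick_receiver(canary, good_routes, default="unmonitored-default"))  # canary-heartbeat
print(pick_receiver(canary, typo_routes, default="unmonitored-default"))  # unmonitored-default
```

With the misspelled matcher, nothing errors; the canary simply stops arriving where the watcher looks for it, which is exactly the condition the meta-alert detects.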
How to roll it out this week
Three steps roll out the canary. First, add one canary per critical alerting path; if Datadog and Prometheus alert separately, each needs its own canary. Second, document the canary in the on-call runbook, so new on-call engineers know that a canary going silent is the dangerous case. Third, test it by deliberately disabling the canary and confirming the meta-alert pages within 90 minutes.
- One canary per alerting path. Separate Datadog and Prometheus alerting need one each.
- Runbook documentation. New on-call engineers learn that a silent canary is the dangerous case.
- Disable-and-test. Confirm meta-alert pages within 90 minutes.
- Quarterly retest. Rerun the disable test each quarter to confirm the meta-alert still pages.
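The rollout checks can be captured as two small helpers, sketched here as an illustration only. The names `paths_without_canary` and `drill_passed`, and the shape of the inputs, are assumptions made for this example; the 90-minute window is the one stated in the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

MAX_GAP = timedelta(minutes=90)  # paging window from the text


def paths_without_canary(paths: List[str], canaries: Dict[str, str]) -> List[str]:
    """Return alerting paths with no canary assigned.

    `canaries` maps a canary name to the alerting path it exercises.
    """
    covered = set(canaries.values())
    return [path for path in paths if path not in covered]


def drill_passed(
    disabled_at: datetime,
    paged_at: Optional[datetime],
    max_gap: timedelta = MAX_GAP,
) -> bool:
    """True when the meta-alert paged after the canary was disabled and within max_gap."""
    if paged_at is None:
        return False
    return timedelta(0) <= paged_at - disabled_at <= max_gap
```

For example, `paths_without_canary(["datadog", "prometheus"], {"prom-canary": "prometheus"})` would flag `datadog` as uncovered, and `drill_passed` gives a pass or fail result to record after each disable test.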