The Alert Canary Pattern

A simple canary alert verifies the alerting pipeline.

What an alert canary is

An alert canary is a synthetic alert that fires every 30 or 60 minutes by design. If the canary stops firing, the alerting pipeline itself is broken; it validates the chain (metric source, evaluator, alertmanager, paging integration, on-call schedule) and a single canary covers all of it. Without a canary, a broken pipeline is silent and you discover it when an outage doesn’t page.

How to wire it up

The wiring is straightforward. Cron a metric that flips a boolean every 60 minutes plus an alerting rule that fires on the flip; or use the alerting tool’s built-in heartbeat (PagerDuty heartbeats, Opsgenie heartbeat API); route the canary to a low-priority channel not the on-call rotation because the signal is the absence of the canary not the canary itself; alert when no canary has been seen in 90 minutes.

Watching the canary watcher

The watcher must be independent. The thing that detects a missing canary must run on different infrastructure from the alerting pipeline because the same outage that kills alerts kills the canary check; cheap pattern is a scheduled GitHub Action or tiny external uptime monitor that checks PagerDuty’s API for a heartbeat in the last 90 minutes; when the canary stops, page on the meta-alert (the only time meta-alerting is justified).

What the canary catches

Three failure classes show up. Prometheus rule files with syntax errors that silently fail to load; Alertmanager routes that drop alerts due to misconfigured matchers; integration keys that have rotated, on-call schedules that lapsed without a backup, Slack webhooks that 401.

How to roll it out this week

Three steps roll out the canary. Add one canary per critical alerting path (separate Datadog and Prometheus alerting need one canary on each); document the canary in the on-call runbook (new on-calls need to know the silent fire is the dangerous one); test it by deliberately disabling the canary and confirming the meta-alert pages within 90 minutes.