The Alert Canary Pattern
A simple canary alert verifies that the alerting pipeline works end to end.
What an alert canary is
A synthetic alert that fires every 30 or 60 minutes by design. If the canary stops firing, the alerting pipeline itself is broken.
It validates the chain: metric source, evaluator, alertmanager, paging integration, on-call schedule. A single canary covers all of it.
Without a canary, a broken pipeline is silent. You discover it when an outage doesn't page.
How to wire it up
Run a cron job that flips a boolean metric every 60 minutes, and add an alerting rule that fires on each flip. Or use the alerting tool's built-in heartbeat (PagerDuty heartbeats, Opsgenie heartbeat API).
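A minimal sketch of the cron side in Python, using the prometheus_client library and assuming a Prometheus Pushgateway; the Pushgateway address, metric name, and job label are illustrative placeholders, not a prescribed setup.

```python
#!/usr/bin/env python3
"""Canary heartbeat pusher. Schedule hourly, e.g.:
crontab: 0 * * * * /usr/local/bin/canary_push.py
"""
import time

from prometheus_client import CollectorRegistry, Gauge, push_to_gateway

# Assumption: a Pushgateway reachable at this (hypothetical) address.
PUSHGATEWAY = "pushgateway.internal:9091"

registry = CollectorRegistry()
flip = Gauge(
    "alert_canary_flip",
    "Boolean that flips on every canary run",
    registry=registry,
)

# Hour parity gives a deterministic 0/1 alternation for an hourly cadence,
# so a missed run never pushes the same value twice in a row.
flip.set(int(time.time() // 3600) % 2)
push_to_gateway(PUSHGATEWAY, job="alert_canary", registry=registry)

# A matching canary rule (PromQL) that fires briefly after each flip,
# then resolves, so every cycle produces a fresh notification:
#   changes(alert_canary_flip[15m]) > 0
```

Each firing of that rule traverses the full path from evaluator to pager, which is exactly the chain the canary exists to exercise.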
Route the canary to a low-priority channel, not the on-call rotation. The signal is the absence of the canary, not the canary itself.
Alert when no canary has been seen in 90 minutes. That gap is the SLA on alert-pipeline health.
Watching the canary watcher
The thing that detects a missing canary must run on different infrastructure from the alerting pipeline. Otherwise the same outage that kills alerts kills the canary check.
Cheap pattern: a scheduled GitHub Action or a tiny external uptime monitor checks PagerDuty's API for a heartbeat in the last 90 minutes.
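A hedged sketch of that external check in Python, assuming the canary fires as incidents on a dedicated PagerDuty service; the service ID and the token environment variable are hypothetical placeholders. A non-zero exit fails the scheduled job, and the job's own failure notification is the meta-alert.

```python
#!/usr/bin/env python3
"""Meta-alert check: fail if no canary incident reached PagerDuty
within the SLA window. Run from infrastructure outside the alerting
pipeline, e.g. a scheduled GitHub Action."""
import os
import sys
from datetime import datetime, timedelta, timezone

import requests

SERVICE_ID = "PXXXXXX"        # assumption: the canary service's PagerDuty ID
SLA = timedelta(minutes=90)   # the gap that defines alert-pipeline health

since = (datetime.now(timezone.utc) - SLA).isoformat()
resp = requests.get(
    "https://api.pagerduty.com/incidents",
    headers={
        # Assumption: a read-only REST API token in PD_API_TOKEN.
        "Authorization": f"Token token={os.environ['PD_API_TOKEN']}",
        "Accept": "application/vnd.pagerduty+json;version=2",
    },
    params={"since": since, "service_ids[]": SERVICE_ID},
    timeout=30,
)
resp.raise_for_status()

if not resp.json()["incidents"]:
    print(f"No canary incident since {since}: alerting pipeline may be down")
    sys.exit(1)  # failed job -> meta-alert pages
print("Canary heartbeat present; alerting pipeline looks healthy")
```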
When the canary stops, page on the meta-alert. This is the only time meta-alerting is justified.
What the canary catches
Prometheus rule files with syntax errors that silently fail to load.
Alertmanager routes that drop alerts due to misconfigured matchers.
PagerDuty integration keys that have rotated.
On-call schedules that lapsed without a backup.
Slack webhooks that 401.
How to roll it out this week
Add one canary per critical alerting path. If you have separate Datadog and Prometheus alerting, you need one canary on each.
Document the canary in the on-call runbook. New on-calls need to know that the dangerous signal is the canary going silent, not the canary firing.
Test it. Disable the canary deliberately and confirm the meta-alert pages within 90 minutes.