Alert Test-Fire Pattern
Synthetically fire alerts to verify the pipeline.
Why test-fire alerts
Most alerts have never fired in production, so their wiring (rule, receiver, escalation, runbook link) is unproven. A test fire confirms the path end-to-end, from metric to page to acknowledged human, in under 5 minutes. Without test fires, the first real fire is also the first integration test, and an outage is the wrong place to run integration tests.
- Most alerts unproven. Never fired in production; rule, receiver, escalation, runbook link untested.
- End-to-end confirmation. Test fire goes metric to page to acknowledged human in under 5 minutes.
- Don’t test during outages. When the first real fire doubles as the first integration test, a wiring break compounds the incident.
- Per-alert wiring confidence. Test fire is the primary confidence mechanism for an alert that hasn’t fired naturally.
How to test-fire
Inject a synthetic metric that crosses the threshold. Prometheus scrapes targets rather than accepting pushes, so expose a test target (or push through the Pushgateway) that returns a threshold-crossing value for 5 minutes. Use a labeled test rule (env=test in the matcher) so receivers route test fires to a non-prod PagerDuty service. Then verify acknowledgment, escalation, and resolution end-to-end.
- Inject synthetic metric. Cross the threshold; for Prometheus, expose a test target that returns a threshold-crossing value for 5 minutes.
- Labeled test rule. env=test in the matcher; receivers route to a non-prod PagerDuty service.
- End-to-end verification. Acknowledgment, escalation, resolution; the full chain validated.
- Per-test artifact. The test outcome captured in the alert catalog; supports later audit.
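The injection step can be sketched as a small scrape target that holds a value above the threshold for 5 minutes and then drops back. This is a minimal sketch under stated assumptions, not a definitive implementation: the metric name, threshold, and port are hypothetical, and putting env=test on the sample itself is one possible way to get the label onto the resulting alert (it could equally be set on the test rule).

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical values chosen for illustration only.
METRIC = "test_fire_queue_depth"
THRESHOLD = 100.0
FIRE_SECONDS = 5 * 60  # hold the value above the threshold for 5 minutes


def render_metric(name, value, labels):
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} gauge\n{name}{{{label_str}}} {value}\n"


def current_value(started_at, now):
    """Above the threshold during the fire window, zero afterwards."""
    return THRESHOLD + 1 if now - started_at < FIRE_SECONDS else 0.0


def make_handler(started_at):
    """Build a request handler that serves the synthetic sample on every GET."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            value = current_value(started_at, time.time())
            body = render_metric(METRIC, value, {"env": "test"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(port=9105):
    """Start the test target; blocks until interrupted."""
    HTTPServer(("0.0.0.0", port), make_handler(time.time())).serve_forever()
```

Calling `serve()` starts the target; a scrape job pointed at that port would then see the value cross the threshold for the fire window and recover afterwards, which also exercises the resolution leg.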
How often
The cadence has three triggers. At rule creation: mandatory; the rule is not merged until a test fire confirms the path. Quarterly: rotate through the rule list, firing 10% per week, so every paging-tier rule is tested within 90 days (ten weekly batches cover the full list in 70 days, leaving slack for retries). After any change to receivers, escalation, or PagerDuty config: re-test the affected rules.
- At rule creation. Mandatory; rule not merged until a test fire confirms the path.
- Quarterly rotation. Fire 10% per week; every paging-tier rule sees a test within 90 days.
- On config change. Re-test rules affected by receiver, escalation, or PagerDuty config changes.
- Per-rule last-tested timestamp. Tracked in the alert catalog; stale tests visible to all.
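The rotation and the staleness check can be sketched as two small functions. This is an illustrative sketch: the function names and the idea of keeping last-tested timestamps in a plain dict are assumptions standing in for whatever the alert catalog actually stores.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

MAX_AGE = timedelta(days=90)  # every paging-tier rule tested within 90 days


def weekly_batch(rules: List[str], week: int, percent: int = 10) -> List[str]:
    """Return the slice of rules to test-fire in the given week (0-based)."""
    ordered = sorted(rules)
    if not ordered:
        return []
    # Integer ceiling of percent% of the list, at least one rule per week.
    size = max(1, (len(ordered) * percent + 99) // 100)
    batches = (len(ordered) + size - 1) // size
    start = (week % batches) * size
    return ordered[start:start + size]


def stale_rules(
    last_tested: Dict[str, Optional[datetime]], now: datetime
) -> List[str]:
    """Rules never tested, or last tested more than MAX_AGE ago."""
    return sorted(
        name
        for name, ts in last_tested.items()
        if ts is None or now - ts > MAX_AGE
    )
```

Sorting the rule names keeps the batches stable from week to week, so a rule's slot only shifts when rules are added or removed.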
Automation
Automation makes the test cost sustainable. GitHub Actions or Argo Workflows runs the test injector on a schedule, and the job fails if the page isn’t acknowledged within the window. Datadog’s API-driven monitors and synthetic tests, managed via Terraform, cover the SaaS path. Maintain last-tested timestamps in the alert catalog so stale tests are visible.
- Scheduled CI runner. GitHub Actions or Argo Workflows; the test injector on schedule.
- Job fails on no-ack. If the page isn’t acknowledged within the window, the job fails; the failure surfaces a wiring break.
- Datadog API path. Synthetic tests via Terraform; SaaS path covered.
- Last-tested in catalog. Stale tests visible to all; supports the freshness discipline.
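The fail-on-no-ack behaviour can be sketched as a polling loop whose result becomes the job's exit code. This is a minimal sketch: `inject` and `is_acked` are hypothetical callables wrapping whatever the injector and the paging provider expose, and the 300-second window and 10-second poll interval are assumed defaults.

```python
import time
from typing import Callable


def wait_for_ack(
    is_acked: Callable[[], bool],
    window_seconds: float,
    poll_seconds: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the page is acknowledged or the window elapses."""
    deadline = clock() + window_seconds
    while True:
        if is_acked():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


def run_test_fire(
    inject: Callable[[], None],
    is_acked: Callable[[], bool],
    window_seconds: float = 300.0,
    **poll_options,
) -> int:
    """Fire the test alert and return a process exit code (0 = acked in time)."""
    inject()
    return 0 if wait_for_ack(is_acked, window_seconds, **poll_options) else 1
```

In a scheduled job, passing the return value to `sys.exit` makes a missed acknowledgment fail the step, since CI runners treat a non-zero exit code as a failure. The injectable `clock` and `sleep` exist only so the loop can be tested without waiting in real time.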
Do it for paging tier
Test-firing targets the paging tier. Skip ticket and email tiers because the cost-benefit isn’t there. Don’t test-fire, even during business hours, unless the test channel is clearly labeled, so a test never pages the real on-call. The first quarter is the expensive one; the test cost amortises after it.
- Paging tier only. Skip ticket and email tiers; cost-benefit isn’t there.
- Avoid paging real on-call. Don’t test-fire, even during business hours, unless the test channel is clearly labeled.
- First quarter expensive. Test cost amortises after; the discipline pays back.
- Per-tier test policy. Documented per tier; supports consistent investment.