Alert Test-Fire Pattern

Synthetically fire alerts to verify the pipeline.

Why test-fire alerts

Most alerts have never fired in production, so their wiring (rule, receiver, escalation, runbook link) is unproven. A test fire confirms the whole path, from metric to page to an acknowledging human, in under 5 minutes. Without test fires, the first real incident is also the first integration test, and an incident is the wrong place to run one.

How to test-fire

Inject a synthetic metric that crosses the threshold. For Prometheus, which pulls rather than receives metrics, point a scrape job at a test target that returns the breaching value for 5 minutes. Label the test rule (env=test in the matcher) so that receivers route test fires to a non-prod PagerDuty service. Then verify acknowledgment, escalation, and resolution end-to-end.
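As a rough illustration of the injection step, the sketch below serves one synthetic gauge in the Prometheus text exposition format and mimics the env=test routing decision. The metric name, its value, the port, and the receiver names are all hypothetical placeholders; a real setup would express the routing in the alerting system's own configuration rather than in Python.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical values: none of these names come from a real configuration.
TEST_LABELS = {"env": "test"}


def render_metric(name: str, value: float, labels: dict[str, str]) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} gauge\n{name}{{{label_text}}} {value}\n"


def pick_receiver(alert_labels: dict[str, str]) -> str:
    """Mimic the routing decision: env=test goes to the non-prod service."""
    if alert_labels.get("env") == "test":
        return "pagerduty-nonprod"
    return "pagerduty-prod"


class SyntheticTarget(BaseHTTPRequestHandler):
    """Serve a value above the (assumed) threshold for Prometheus to scrape."""

    def do_GET(self) -> None:
        body = render_metric("synthetic_error_ratio", 0.99, TEST_LABELS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Leave this running for the duration of the test, then stop it by hand.
    HTTPServer(("127.0.0.1", 9900), SyntheticTarget).serve_forever()
```

Keeping the env=test label on the synthetic series itself means the label travels with the alert, so the routing decision does not depend on anyone remembering which rule is the test rule.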

How often

Three triggers set the cadence. At rule creation: mandatory; the rule is not merged until a test fire confirms the path. Quarterly: rotate through the rule list, firing 10% per week, so every paging-tier rule is tested within 90 days. After any change to receivers, escalation, or PagerDuty config: re-test the affected rules.
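The quarterly rotation can be sketched as follows. At 10% per week a full pass takes 10 weeks, or 70 days, which leaves about 20 days of slack inside the 90-day window. The function names and the shape of the last-tested record are assumptions made for illustration.

```python
import math
from datetime import date, timedelta


def weekly_batch(rules: list[str], week: int, percent: int = 10) -> list[str]:
    """Return the slice of rules to test-fire in a given week (0-based)."""
    if not rules:
        return []
    size = max(1, math.ceil(len(rules) * percent / 100))
    cycle = math.ceil(len(rules) / size)  # weeks needed to cover every rule
    start = (week % cycle) * size
    return rules[start:start + size]


def overdue(last_tested: dict[str, date], today: date, max_age_days: int = 90) -> list[str]:
    """List rules whose last test fire is older than the allowed window."""
    limit = today - timedelta(days=max_age_days)
    return sorted(name for name, when in last_tested.items() if when < limit)
```

Running overdue against the catalog each week gives a cheap check that the rotation is actually keeping every paging-tier rule inside the window.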

Automation

Automation keeps the cost of testing sustainable. A GitHub Actions or Argo Workflows job runs the test injector on a schedule and fails if the page is not acknowledged within the window. For the SaaS path, Datadog’s API-driven monitors and synthetic tests can be managed via Terraform. Record last-tested timestamps in the alert catalog so stale tests are visible.
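A minimal sketch of the "fail the job if the page is not acknowledged in time" check is shown below. The get_status callable is hypothetical: in practice it would wrap whatever incident API the paging service exposes, and the "acknowledged" status string is assumed here rather than taken from any particular API. The clock and sleep functions are injectable so the logic can be exercised without waiting.

```python
import time
from typing import Callable


def wait_for_ack(
    get_status: Callable[[], str],
    window_s: float = 300.0,
    poll_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the page is acknowledged or the window elapses."""
    deadline = clock() + window_s
    while True:
        if get_status() == "acknowledged":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def run_check(get_status: Callable[[], str], **kwargs) -> int:
    """Return a process exit code: 0 when acked in time, 1 otherwise."""
    return 0 if wait_for_ack(get_status, **kwargs) else 1
```

A scheduled job could end with sys.exit(run_check(fetch_status)), where fetch_status is the hypothetical API wrapper, so that a missed acknowledgment turns the workflow run red.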

Do it for paging tier

Test-firing is aimed at the paging tier. Skip the ticket and email tiers; the cost-benefit isn’t there. Don’t test-fire outside business hours unless the test channel is clearly labeled, so that a test never wakes the real on-call. The first quarter is the expensive one; after that, the cost of testing amortises.