Alert Test-Fire Pattern
Synthetically fire alerts to verify the pipeline.
Why test-fire alerts
Most alerts have never fired in production. The wiring (rule, receiver, escalation, runbook link) is unproven.
A test fire confirms the path end-to-end: from metric to page to human acknowledgment, in under 5 minutes.
Without test fires, the first real fire is also the first integration test. Don't run integration tests during outages.
How to test-fire
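Inject a synthetic metric that crosses the threshold. Prometheus pulls rather than accepts pushes, so expose a test target for it to scrape (or push to a Pushgateway) and report the threshold-crossing value for 5 minutes. That must be longer than the rule's for: duration, or the alert never fires.
Label the test rule env=test and match on that label in the routing config. The alert then travels through the real receivers but lands on a non-prod PagerDuty service.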
Verify acknowledgment, escalation, and resolution end-to-end.
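The injection step above can be sketched as a tiny scrape target. This is a minimal illustration, not a definitive injector: the metric name synthetic_error_ratio, the values, and the port are hypothetical, and it assumes Prometheus is configured separately to scrape this target and that the rule under test matches env="test".

```python
import time
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_exposition(name: str, value: float, labels: dict) -> str:
    """Render one gauge sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"# TYPE {name} gauge\n{name}{{{label_str}}} {value}\n"


def current_value(elapsed_s: float, fire_seconds: float,
                  firing_value: float, idle_value: float) -> float:
    """Report the threshold-crossing value only while the test window is open."""
    return firing_value if elapsed_s < fire_seconds else idle_value


def make_handler(name: str, firing_value: float, idle_value: float,
                 fire_seconds: float, clock=time.monotonic):
    """Build a request handler whose test window starts when it is created."""
    started = clock()

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            value = current_value(clock() - started, fire_seconds,
                                  firing_value, idle_value)
            body = render_exposition(name, value, {"env": "test"}).encode()
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return Handler


def serve(port: int = 9999) -> None:
    """Serve the synthetic metric (hypothetical name and values) until stopped."""
    handler = make_handler("synthetic_error_ratio", 0.9, 0.0, fire_seconds=300)
    HTTPServer(("0.0.0.0", port), handler).serve_forever()
```

Calling serve() starts the target. After 300 seconds the value drops back to the idle value, which also exercises the resolution leg of the path.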
How often
At rule creation: mandatory. The rule isn't merged until a test fire confirms the path.
Quarterly: rotate through the rule list, firing 10% per week. At that rate the full list takes 10 weeks, so every paging-tier rule sees a test within 90 days.
After any change to receivers, escalation, or PagerDuty service config: re-test the affected rules.
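The quarterly rotation can be sketched as a deterministic batch picker. This is one possible approach, assuming rules are identified by name and that the scheduler passes in a week counter; the function name weekly_batch is hypothetical.

```python
def weekly_batch(rules: list, week: int, percent: int = 10) -> list:
    """Return the rules to test-fire in the given week (0-based counter)."""
    if not rules:
        return []
    ordered = sorted(rules)  # deterministic order so repeated runs agree
    # Integer ceiling of n * percent / 100, avoiding float rounding.
    batch_size = (len(ordered) * percent + 99) // 100
    # Number of weeks needed to cover every rule once.
    cycle_weeks = -(-len(ordered) // batch_size)
    start = (week % cycle_weeks) * batch_size
    return ordered[start:start + batch_size]
```

Rounding the batch size up means a full pass never takes more than 10 weeks at 10%. Adding or removing rules changes the sorted order, so a rule can shift between batches mid-cycle; the last-tested timestamps in the catalog are the safer source of truth for coverage.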
Automation
GitHub Actions or Argo Workflows runs the test injector on a schedule. The job fails if the page isn't acknowledged within the window.
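The pass/fail check in such a job might look like the sketch below. It assumes a caller-supplied get_status function (hypothetical) that returns the incident's current status as a string, for example by querying the paging provider's API, and it treats "acknowledged" or "resolved" as success.

```python
import time
from typing import Callable


def wait_for_ack(get_status: Callable[[], str], window_s: float,
                 poll_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the test page is acknowledged or the window expires."""
    deadline = clock() + window_s
    while True:
        # Status strings are an assumption; adapt them to the provider in use.
        if get_status() in ("acknowledged", "resolved"):
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A scheduled job would end with something like sys.exit(0 if wait_for_ack(fetch, 300) else 1), so a missed acknowledgment shows up as a failed run. The clock and sleep parameters exist only so the check can be tested without waiting.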
Datadog's API-driven monitors and synthetic tests cover the SaaS path. Manage them in Terraform and trigger runs through the API.
Maintain a list of last-tested timestamps in the alert catalog. Stale tests are visible to all.
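A staleness report over those timestamps can be sketched as follows. The catalog shape (rule name mapped to a last-tested datetime, or None if never tested) is an assumption, and the 90-day default mirrors the rotation target stated earlier.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


def stale_rules(last_tested: Dict[str, Optional[datetime]], now: datetime,
                max_age_days: int = 90) -> List[str]:
    """Return rule names never tested or last tested before the cutoff."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(
        name for name, ts in last_tested.items()
        if ts is None or ts < cutoff
    )
```

For example, stale_rules(catalog, datetime.now(timezone.utc)) gives the list to publish alongside the catalog. Timestamps should all be timezone-aware (or all naive) so the comparison is valid.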
Do it for paging tier
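Skip the ticket and email tiers. The cost-benefit isn't there.
Don't test-fire during business hours unless the test channel is clearly labeled, so nobody mistakes a test for a real incident. Keep test pages off the real on-call rotation so a test never wakes anyone.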
After the first quarter, the test cost amortizes. The first quarter is the expensive one.