Alert Acceptance Test Tracking
Track which alerts have passed acceptance tests.
Why alerts need acceptance tests
Most alerts are never proven to fire. They sit in Prometheus or Datadog config for years until the day they should page and silently don't.
An acceptance test confirms the rule actually triggers under the failure mode it claims to detect. Without it, you have a config file, not a detector.
Test once at creation, then once per quarter. Drift in metric labels or query semantics breaks rules without warning.
How to run the test
Inject the failure. For a high-error-rate alert, route 5% of traffic to a chaos endpoint that returns 500s for 10 minutes.
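As a rough sketch of the injection step, the function below decides per request whether to send it to the chaos endpoint. The name should_inject, the request_id argument, and the hash-based sampling are illustrative assumptions, not part of any particular proxy or chaos tool. Only the 5% fraction and the 10-minute duration come from the text above.

```python
import hashlib

INJECT_FRACTION = 0.05        # 5% of traffic
INJECT_DURATION_S = 10 * 60   # 10 minutes


def should_inject(request_id: str, started_at: float, now: float,
                  fraction: float = INJECT_FRACTION,
                  duration_s: float = INJECT_DURATION_S) -> bool:
    """Return True if this request should go to the (hypothetical) chaos endpoint."""
    # Only inject while the experiment window is open.
    if not (0 <= now - started_at < duration_s):
        return False
    # Hash the request ID into a stable bucket so the same request
    # always gets the same decision during the experiment.
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < fraction
```

Hashing rather than calling a random number generator keeps the decision reproducible, which makes it easier to correlate injected requests with the errors the alert should have seen.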
Confirm the page lands in the right channel within the expected window. Burn-rate alerts have explicit windows; verify that both the fast-burn and slow-burn alerts fire.
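The confirmation step can be expressed as a small check over the pages actually received. This is a minimal sketch: the Expectation and Page types, the alert and channel names, and the delay limits in the usage example are all hypothetical, and how you collect received pages depends on your paging tool.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Expectation:
    alert: str          # rule name (hypothetical)
    channel: str        # where the page should land
    max_delay_s: float  # expected detection window, in seconds


@dataclass(frozen=True)
class Page:
    alert: str
    channel: str
    received_at: float  # seconds since epoch


def verify_pages(injected_at: float,
                 expectations: Iterable[Expectation],
                 pages: Iterable[Page]) -> Dict[str, bool]:
    """Map each expected alert to True if it paged the right channel in time."""
    pages = list(pages)
    results = {}
    for exp in expectations:
        # Delays of pages that match this alert and channel after injection.
        delays = [p.received_at - injected_at for p in pages
                  if p.alert == exp.alert
                  and p.channel == exp.channel
                  and p.received_at >= injected_at]
        results[exp.alert] = bool(delays) and min(delays) <= exp.max_delay_s
    return results


# Usage: the fast-burn page arrives in time, the slow-burn page hits the wrong channel.
expected = [Expectation("ErrorBudgetFastBurn", "#oncall-pages", 300),
            Expectation("ErrorBudgetSlowBurn", "#oncall-pages", 3600)]
received = [Page("ErrorBudgetFastBurn", "#oncall-pages", 1120.0),
            Page("ErrorBudgetSlowBurn", "#team-noise", 2800.0)]
outcome = verify_pages(1000.0, expected, received)
```

A page that lands in the wrong channel counts as a failure here, since the test is about the whole path from rule to responder.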
Tag the alert metadata: last_tested, tested_by, test_method. Surface this in the alert catalog so reviewers can see staleness at a glance.
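One way to carry these three fields is as string annotations on the rule, so any catalog that can read annotations can show them. The sketch below assumes ISO dates and a plain dict; the helper names build_annotations and days_since_test are made up for illustration.

```python
from datetime import date
from typing import Dict, Optional


def build_annotations(last_tested: date, tested_by: str,
                      test_method: str) -> Dict[str, str]:
    """Produce the metadata fields as string annotations."""
    return {
        "last_tested": last_tested.isoformat(),
        "tested_by": tested_by,
        "test_method": test_method,
    }


def days_since_test(annotations: Dict[str, str], today: date) -> Optional[int]:
    """Return the age of the last test in days, or None if never tested."""
    raw = annotations.get("last_tested")
    if raw is None:
        return None
    return (today - date.fromisoformat(raw)).days


# Usage with made-up values.
meta = build_annotations(date(2024, 3, 1), "alice", "staging chaos injection")
age = days_since_test(meta, date(2024, 3, 31))
```

Returning None for a missing last_tested keeps "never tested" distinct from "tested long ago", which the catalog may want to display differently.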
What to track
Total alerts, tested alerts, untested alerts, alerts whose last test is older than 90 days. Publish weekly.
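The four counts can be derived from a single mapping of alert name to last test date. This is a sketch under the assumption that untested alerts are recorded as None; the function name and the input shape are illustrative.

```python
from datetime import date, timedelta
from typing import Dict, Optional

STALE_AFTER = timedelta(days=90)


def summarize(last_tested: Dict[str, Optional[date]], today: date) -> Dict[str, int]:
    """Count total, tested, untested, and stale (older than 90 days) alerts."""
    tested = [d for d in last_tested.values() if d is not None]
    # "Older than 90 days" is strict: a test exactly 90 days old is not stale.
    stale = [d for d in tested if today - d > STALE_AFTER]
    return {
        "total": len(last_tested),
        "tested": len(tested),
        "untested": len(last_tested) - len(tested),
        "stale": len(stale),
    }


# Usage with made-up alert names.
report = summarize(
    {
        "HighErrorRate": date(2024, 6, 1),
        "DiskFull": date(2024, 1, 1),
        "QueueBacklog": date(2024, 4, 1),
        "CertExpiry": None,
    },
    today=date(2024, 6, 30),
)
```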
Failed tests. An alert that was supposed to fire and didn't is a higher-priority defect than a flaky test.
Drift events. When a rule's underlying metric is renamed or relabeled, mark all its tests stale until re-run.
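Marking tests stale on drift needs a mapping from each rule to the metrics it queries. The sketch below assumes that mapping already exists (extracting metric names from query text is out of scope here), and the status strings "passed" and "stale" are hypothetical.

```python
from typing import Dict, Set


def mark_stale_on_drift(rule_metrics: Dict[str, Set[str]],
                        changed_metric: str,
                        test_status: Dict[str, str]) -> Dict[str, str]:
    """Return a copy of test_status with every rule that uses changed_metric set to 'stale'."""
    updated = dict(test_status)
    for rule, metrics in rule_metrics.items():
        if changed_metric in metrics:
            updated[rule] = "stale"
    return updated


# Usage: a rename of http_requests_total invalidates only the rule that queries it.
status = mark_stale_on_drift(
    rule_metrics={
        "HighErrorRate": {"http_requests_total"},
        "DiskFull": {"node_filesystem_avail_bytes"},
    },
    changed_metric="http_requests_total",
    test_status={"HighErrorRate": "passed", "DiskFull": "passed"},
)
```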
Tooling that helps
promtool test rules covers the unit-test layer for Prometheus. Pair with synthetic load injection in staging.
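A thin wrapper makes the unit-test layer easy to call from a scheduled job. This sketch assumes promtool is installed on PATH and that the test file paths are yours to supply; the function names are illustrative. It relies only on the promtool test rules subcommand and its non-zero exit code on failure.

```python
import shutil
import subprocess
from typing import List, Sequence


def promtool_command(test_files: Sequence[str]) -> List[str]:
    """Build the argument list for `promtool test rules <files...>`."""
    return ["promtool", "test", "rules", *test_files]


def run_rule_tests(test_files: Sequence[str]) -> bool:
    """Run the rule unit tests; return True if promtool exits with status 0."""
    if shutil.which("promtool") is None:
        raise RuntimeError("promtool not found on PATH")
    result = subprocess.run(promtool_command(test_files),
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Surface promtool's own failure report for the job log.
        print(result.stdout + result.stderr)
    return result.returncode == 0
```

Unit tests of this kind exercise the rule expression against canned series, so they complement rather than replace the end-to-end injection in staging.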
Datadog's monitor test notifications (which cover the alert and recovery cases) and Synthetic API tests cover the SaaS path. Define them in Terraform and run them from CI.
Argo Workflows or GitHub Actions can run quarterly validations. Open a Jira ticket automatically when a test fails.
Adopt incrementally
Start with paging-tier alerts only. Critical pages get tested first; ticket-only and email alerts get the same treatment once the critical tier is green.
Tie acceptance status to the incident review template. A post-incident finding that the alert was untested becomes its own action item.
Skip this if you have fewer than 30 alert rules. The overhead is real; tiny shops should focus on culling rules instead.