Alert Acceptance Test Tracking
Track which alerts have passed acceptance tests.
Why alerts need acceptance tests
Most alerts are never proven to fire. They sit in Prometheus or Datadog config for years until the day they are supposed to page and silently don’t. The discipline below treats every alert rule the way you treat code under unit test: untested code does not ship, and untested alerts should not page.
- The detector contract. An acceptance test confirms the rule actually triggers under the failure mode it claims to detect. Without it, you have a config file, not a detector.
- Drift is silent. Metric labels rename, query semantics change, exporters version up. Rules break without warning because nothing in the alerting pipeline tells you they did.
- Cadence. Test once at creation and once per quarter at minimum. Critical pages on infrastructure that changes often deserve monthly cadence.
- Audit value. A tested-alert ratio is the cleanest proxy for “will the on-call hear about the next outage?” Treat it as a top-line reliability metric.
How to run the test
The test injects the failure mode the alert claims to detect, then verifies the page lands where it should within the window the SLO requires. Everything else is bookkeeping.
- Inject the failure. For a high-error-rate alert, route 5 percent of traffic to a chaos endpoint that returns 500s for 10 minutes. Use a feature flag scoped to the test.
- Verify the page. Confirm it lands in the right channel within the expected window. Burn-rate alerts have explicit windows; verify that both the fast-burn and the slow-burn alerts fire.
- Tag the metadata. Stamp the rule with last_tested, tested_by, and test_method. Surface these in the alert catalog so reviewers see staleness at a glance.
- Recover cleanly. The test should auto-disable the chaos endpoint after the window closes. A test that leaves the system half-broken is worse than no test.
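The four steps above can be sketched as a small harness. This is a minimal illustration, not a definitive implementation: the function name run_acceptance_test, the AcceptanceResult type, and the inject, recover, and page_received callbacks are all hypothetical stand-ins for whatever your feature-flag and paging systems expose. Only the metadata keys last_tested, tested_by, and test_method come from the text.

```python
import time
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Dict, Optional


@dataclass
class AcceptanceResult:
    alert_name: str
    fired: bool
    seconds_to_page: Optional[float]
    metadata: Dict[str, str] = field(default_factory=dict)


def run_acceptance_test(
    alert_name: str,
    inject: Callable[[], None],         # hypothetical: turns the chaos flag on
    recover: Callable[[], None],        # hypothetical: turns the chaos flag off
    page_received: Callable[[], bool],  # hypothetical: checks the paging channel
    window_seconds: float,
    tested_by: str,
    test_method: str,
    poll_seconds: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> AcceptanceResult:
    """Inject a failure, wait for the page, and always recover."""
    started = clock()
    fired = False
    elapsed: Optional[float] = None
    try:
        inject()
        # Poll until the page arrives or the allowed window closes.
        while clock() - started <= window_seconds:
            if page_received():
                fired = True
                elapsed = clock() - started
                break
            sleep(poll_seconds)
    finally:
        # Recovery runs even if injection or polling raised.
        recover()

    metadata: Dict[str, str] = {}
    if fired:
        # Assumption: only a passing run refreshes the staleness stamp.
        metadata = {
            "last_tested": datetime.now(timezone.utc).date().isoformat(),
            "tested_by": tested_by,
            "test_method": test_method,
        }
    return AcceptanceResult(alert_name, fired, elapsed, metadata)
```

Putting the recovery call in a finally block is the design choice that matters here: the chaos endpoint is switched off whether the page arrived, the window expired, or the harness itself crashed.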
What to track
You cannot improve a number you do not publish. The four counts below go on the reliability dashboard and into the weekly review.
- Coverage. Total alerts, tested alerts, untested alerts. Plot the ratio over weeks; the trend matters more than the absolute number.
- Staleness. Alerts whose last test is older than 90 days. Mark them red and assign owners.
- Failed tests. An alert that was supposed to fire and didn’t is a higher-priority defect than a flaky test. Page on the failure.
- Drift events. When a rule’s underlying metric is renamed or relabeled, automatically mark all its tests stale until re-run.
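As a rough sketch of how these counts could be derived from a catalog export, the snippet below computes coverage, staleness, and failed tests, and flags rules after a metric rename. The AlertRecord shape, the field names, and the summarize and mark_drift helpers are assumptions made for illustration; only the 90-day threshold comes from the text. It also assumes that "tested" means the rule has any recorded test run, pass or fail.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, List, Optional

STALE_AFTER = timedelta(days=90)  # threshold taken from the text


@dataclass
class AlertRecord:  # hypothetical catalog entry
    name: str
    metric: str
    last_tested: Optional[date] = None
    last_test_passed: Optional[bool] = None
    drifted: bool = False  # set when the underlying metric changed


def summarize(alerts: List[AlertRecord], today: date) -> Dict[str, object]:
    """Compute the dashboard counts for one point in time."""
    total = len(alerts)
    tested = [a for a in alerts if a.last_tested is not None]
    # Stale: last test older than 90 days, or invalidated by drift.
    stale = [
        a.name for a in tested
        if a.drifted or today - a.last_tested > STALE_AFTER
    ]
    failed = [a.name for a in tested if a.last_test_passed is False]
    return {
        "total": total,
        "tested": len(tested),
        "untested": total - len(tested),
        "coverage": len(tested) / total if total else 0.0,
        "stale": stale,
        "failed": failed,
    }


def mark_drift(alerts: List[AlertRecord], changed_metric: str) -> List[str]:
    """Flag every rule built on a renamed or relabeled metric."""
    flagged = []
    for alert in alerts:
        if alert.metric == changed_metric:
            alert.drifted = True
            flagged.append(alert.name)
    return flagged
```

A re-run that passes would clear the drifted flag and refresh last_tested; that step is left out to keep the sketch short.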
Tooling that helps
You do not need a custom platform. The pieces below glue together in a long weekend and cover the most common alerting stacks.
- promtool test rules. Covers the unit-test layer for Prometheus alerting rules. Pair with synthetic load injection in staging.
- Datadog monitor tests. Monitor recovery tests and synthetic API tests cover the SaaS path. Trigger via Terraform from CI to keep the test definitions in source.
- Workflow runners. Argo Workflows or GitHub Actions schedule quarterly validations and open a Jira ticket automatically when a test fails.
- Catalog surface. Backstage or a flat Markdown index works. The point is one place where every reviewer can see test status without clicking through tools.
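For the Prometheus unit-test layer, a workflow runner can shell out to promtool and treat a non-zero exit code as a failed test. The wrapper below is a minimal sketch: the helper names promtool_command and run_rule_unit_tests and the example file path are hypothetical, and it assumes promtool is installed on the runner's PATH.

```python
import shutil
import subprocess
from typing import List, Sequence


def promtool_command(test_files: Sequence[str]) -> List[str]:
    """Build the argv for `promtool test rules <file>...`."""
    if not test_files:
        raise ValueError("at least one rule test file is required")
    return ["promtool", "test", "rules", *test_files]


def run_rule_unit_tests(test_files: Sequence[str]) -> bool:
    """Run the rule unit tests; return True when promtool exits 0."""
    if shutil.which("promtool") is None:
        raise RuntimeError("promtool not found on PATH")
    completed = subprocess.run(
        promtool_command(test_files),
        capture_output=True,
        text=True,
    )
    if completed.returncode != 0:
        # Surface promtool's own report so the CI log explains the failure.
        print(completed.stdout + completed.stderr)
    return completed.returncode == 0
```

A scheduled job might call run_rule_unit_tests(["tests/high_error_rate_test.yml"]) and open a ticket when it returns False.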
Adopt incrementally
Most teams that try to test every alert at once stall before they ship the first one. Sequence matters.
- Paging tier first. Critical pages get tested before anything else. Ticket-only and email alerts wait until the critical tier is green.
- Tie to incident review. A post-incident finding that the alert was untested becomes its own action item with an owner and a deadline.
- Skip below threshold. If your total alert volume is under 30 rules, the overhead of the testing rig outweighs the benefit. Cull the catalog instead.
- Promote when stable. Once a tier sustains 90 percent tested coverage for a quarter, expand to the next tier. Expanding sooner invites backsliding in the tier you just finished.
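The two numeric gates in this list can be written down as simple checks. The sketch below assumes coverage is sampled weekly and that a quarter is 13 consecutive weekly samples; the function names are hypothetical, while the 30-rule and 90 percent thresholds come from the text.

```python
from typing import Sequence

MIN_RULES_FOR_RIG = 30      # below this, cull the catalog instead
PROMOTION_COVERAGE = 0.90   # tested coverage a tier must sustain
WEEKS_PER_QUARTER = 13      # assumption: one coverage sample per week


def worth_building_rig(total_rules: int) -> bool:
    """True when the catalog is large enough to justify a testing rig."""
    return total_rules >= MIN_RULES_FOR_RIG


def ready_to_promote(weekly_coverage: Sequence[float]) -> bool:
    """True when the last full quarter of samples all meet the threshold."""
    recent = list(weekly_coverage)[-WEEKS_PER_QUARTER:]
    if len(recent) < WEEKS_PER_QUARTER:
        return False  # not enough history to call the tier stable
    return all(sample >= PROMOTION_COVERAGE for sample in recent)
```

Requiring every sample in the quarter to clear the bar, rather than the average, means a single bad week resets the clock, which is a stricter reading of "sustains" than the text strictly requires.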