Alert Acceptance Criteria
Each alert should pass acceptance criteria before launch.
The acceptance checklist
Every new alert must pass a fixed checklist before it can page humans: owning team, runbook URL, severity, expected fire rate, and a one-sentence statement of customer impact.
Reject any alert in PR review that fails the list. The cost of a noisy alert is paid every week by on-call; the cost of rejecting it is paid once by the author.
Encode the checklist in CI. A linter on the alerting repo blocks merges if required fields are missing.
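A minimal sketch of that linter, assuming alerts are checked in as JSON files under an alerts/ directory with the field names below (the layout and keys are assumptions, not any particular tool's schema):

```python
#!/usr/bin/env python3
"""CI gate: fail the build if any alert definition is missing a required field."""
import json
import sys
from pathlib import Path

REQUIRED_FIELDS = [
    "owner_team",               # who takes the page first
    "runbook_url",              # must point at a reviewed runbook in git
    "severity",                 # page vs ticket
    "expected_fires_per_week",  # from the 30-day backtest
    "customer_impact",          # one sentence describing what users see
]

def lint(alert_dir: str = "alerts") -> int:
    failures = []
    for path in sorted(Path(alert_dir).glob("*.json")):
        alert = json.loads(path.read_text())
        missing = [f for f in REQUIRED_FIELDS if not alert.get(f)]
        if missing:
            failures.append(f"{path.name}: missing {', '.join(missing)}")
    for line in failures:
        print(line, file=sys.stderr)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(lint())
```

Run it as a merge-blocking step; an empty field fails the same as a missing one.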
The runbook gate
An alert without a runbook is not deployable. The runbook needs at least: how to confirm the issue is real, the first 3 diagnostic commands, and the escalation path.
Stub runbooks ("investigate the issue") fail the gate. Reviewers reject them.
Treat runbooks as code: Markdown in git, reviewed, versioned. Link the commit SHA in the alert payload so the responder opens the exact runbook revision that shipped with the alert.
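One way to enforce the stub check and pin the revision, assuming runbooks are Markdown files in the same git repo; the required section headings and the URL format are illustrative, not a standard:

```python
import subprocess
from pathlib import Path

# Minimum structure a runbook must have; stubs fail this check.
REQUIRED_SECTIONS = ["## Confirm", "## Diagnose", "## Escalate"]

def runbook_gate(runbook_path: str) -> dict:
    """Reject stub runbooks and return a payload annotation pinning the
    runbook to the commit it was reviewed at."""
    text = Path(runbook_path).read_text()
    missing = [s for s in REQUIRED_SECTIONS if s not in text]
    if missing:
        raise ValueError(f"{runbook_path} is a stub: missing {missing}")
    sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()
    # The base URL is a placeholder; point it at your own git host.
    return {"runbook": f"https://git.example.com/runbooks/blob/{sha}/{runbook_path}"}
```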
Predict the fire rate
Backtest the alert over 30 days of metrics. Predicted fire count goes in the PR description.
More than one fire per week per alert is presumptive noise. Either tighten the condition or move it off paging.
Track actual versus predicted fire rate after launch. If reality is double the prediction, the threshold is wrong; pull the alert back to shadow mode.
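A sketch of the backtest, assuming 30 days of samples have already been pulled from the metrics store as (timestamp, value) pairs; the threshold-and-duration condition stands in for whatever the alert actually evaluates:

```python
def backtest_fire_count(samples, threshold, min_duration_s=300, step_s=60):
    """Count how many times the alert would have fired over the sample window.

    `samples` is an ordered list of (timestamp, value) pairs at `step_s`
    resolution; the alert fires when the value stays above `threshold`
    for at least `min_duration_s`.
    """
    fires = 0
    breach_run = 0
    firing = False
    for _, value in samples:
        if value > threshold:
            breach_run += step_s
            if breach_run >= min_duration_s and not firing:
                fires += 1
                firing = True
        else:
            breach_run = 0
            firing = False
    return fires

# Example: backtest 30 days of samples, convert to a weekly rate for the PR.
# fires = backtest_fire_count(samples_last_30d, threshold=0.95)
# predicted_per_week = fires / (30 / 7)
```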
Ownership and review
An alert without a named owning team is orphaned the day it fires at 3am. The on-call engineer has nowhere to escalate.
The owning team takes the page first, even if the cause is in a dependency, then routes to the dependency team after triage.
Each team reviews its alerts quarterly: retire alerts that never fired, and alerts that fired without any action being taken.
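A sketch of that sweep, assuming the quarter's firing history is available as records with alert and action_taken fields (the record shape is an assumption):

```python
from collections import defaultdict

def quarterly_review(alert_names, firings):
    """Split alerts into keep / retire buckets from one quarter of history.

    `firings` is an iterable of dicts with 'alert' and 'action_taken' keys.
    Retire alerts that never fired, and alerts whose fires never led to action.
    """
    fired = defaultdict(list)
    for f in firings:
        fired[f["alert"]].append(f)

    never_fired = [a for a in alert_names if a not in fired]
    fired_no_action = [
        a for a, events in fired.items()
        if not any(e.get("action_taken") for e in events)
    ]
    retire = set(never_fired + fired_no_action)
    return {"retire": sorted(retire),
            "keep": [a for a in alert_names if a not in retire]}
```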
Operational rollout
Phase 1: shadow mode for 7 days, logging only. Phase 2: ticket-only for 7 days. Phase 3: paging.
Each phase is a separate change. If the alert is too noisy in phase 2, it never reaches phase 3.
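One way to keep each promotion an explicit, gated change, assuming the observed fire rate from the current phase is available; the field names and the one-fire-per-week budget mirror the criteria above:

```python
PHASES = ["shadow", "ticket", "page"]  # each promotion is its own change

def next_phase(current: str, days_in_phase: int, fires_per_week: float,
               max_fires_per_week: float = 1.0, min_days: int = 7) -> str:
    """Return the phase the alert may be promoted to, or the current one.

    An alert only advances after at least `min_days` in its current phase
    and only if its observed fire rate is within budget; a noisy alert in
    the ticket phase never reaches paging.
    """
    if days_in_phase < min_days or fires_per_week > max_fires_per_week:
        return current
    i = PHASES.index(current)
    return PHASES[min(i + 1, len(PHASES) - 1)]
```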
Document the criteria in the runbook repo so every team uses the same bar. The criteria become a shared contract, not a per-team preference.