Alert Acceptance Criteria
Each alert should pass acceptance criteria before launch.
The acceptance checklist
Every new alert must pass a fixed checklist before it can page humans. Treating the checklist as a hard CI gate is the difference between an alerting catalog you trust and one that pages on every dashboard hiccup.
- Required fields. Owner team, runbook URL, severity, expected fire rate, and a customer-impact sentence. Anything missing fails the gate.
- Reject in PR review. Reviewers turn down alerts that miss the bar. The cost of a noisy alert is paid every week by on-call; the cost of rejecting it is paid once by the author.
- CI enforcement. Encode the checklist in a linter on the alerting repo. A missing required field blocks the merge automatically.
- Versioned bar. Treat the checklist itself as code. Changes to the bar go through review just like an alert; the policy lives in the repo.
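A minimal sketch of the required-fields linter described above. The field keys (`owner_team`, `runbook_url`, and so on) and the dict-based alert representation are assumptions made for illustration; the checklist names the concepts, not a schema.

```python
from typing import Any

# Hypothetical key names for the five required fields in the checklist.
REQUIRED_FIELDS = (
    "owner_team",
    "runbook_url",
    "severity",
    "expected_fires_per_week",
    "customer_impact",
)


def missing_fields(alert: dict[str, Any]) -> list[str]:
    """Return the required fields that are absent or blank in one alert."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = alert.get(field)
        # A numeric 0 is a valid value (e.g. expected fire rate); only
        # None and whitespace-only strings count as missing.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


def lint(alerts: dict[str, dict[str, Any]]) -> list[str]:
    """Return one error line per missing field; an empty list passes the gate."""
    errors = []
    for name, alert in sorted(alerts.items()):
        for field in missing_fields(alert):
            errors.append(f"{name}: missing required field '{field}'")
    return errors
```

In CI, a wrapper would load the alert definitions from the repo, call `lint`, print the errors, and exit non-zero when the list is non-empty, which is what blocks the merge.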
The runbook gate
An alert without a runbook is not deployable. The runbook is what converts the page from a notification into an actionable instruction.
- Minimum runbook content. How to confirm the issue is real, the first three diagnostic commands, and the escalation path. Anything less fails the gate.
- Stub rejection. Runbooks that say only “Investigate the issue” are stubs, and reviewers reject them. Stubs are paging debt.
- Runbooks as code. Markdown in git, reviewed, versioned. Link the runbook's commit SHA in the alert payload so the responder sees the version that matches the alert.
- Drift audit. Periodically check the linked runbook against the current system. Stale instructions burn responder time.
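The minimum-content rule can be checked mechanically. The sketch below assumes a convention the text does not specify: each runbook is Markdown with headings named `Confirm`, `Diagnostics`, and `Escalation`, and diagnostic commands are written on lines starting with `$ `. Both conventions are hypothetical.

```python
import re

# Hypothetical heading names for the three required parts of a runbook.
REQUIRED_SECTIONS = ("confirm", "diagnostics", "escalation")
MIN_DIAGNOSTIC_COMMANDS = 3

HEADING = re.compile(r"^#{1,6}\s+(.*\S)\s*$")


def split_sections(markdown: str) -> dict[str, str]:
    """Map each lowercased heading to the text that follows it."""
    sections: dict[str, list[str]] = {}
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = match.group(1).lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    return {name: "\n".join(body) for name, body in sections.items()}


def runbook_problems(markdown: str) -> list[str]:
    """Return the reasons a runbook fails the gate; an empty list passes."""
    sections = split_sections(markdown)
    problems = []
    for name in REQUIRED_SECTIONS:
        if not sections.get(name, "").strip():
            problems.append(f"missing or empty section: {name}")
    # Assumed convention: a diagnostic command is a line starting with "$ ".
    diagnostics = sections.get("diagnostics", "")
    commands = sum(
        1 for line in diagnostics.splitlines() if line.lstrip().startswith("$ ")
    )
    if commands < MIN_DIAGNOSTIC_COMMANDS:
        problems.append(
            f"diagnostics lists {commands} commands, "
            f"need at least {MIN_DIAGNOSTIC_COMMANDS}"
        )
    return problems
```

A structural check like this catches stubs that lack the required parts; it cannot judge whether the instructions are correct, so reviewer judgment and the drift audit still apply.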
Predict the fire rate
Predicting fire rate before launch is the most underused signal in alert design. Backtesting on historical metrics catches noisy alerts before they reach on-call.
- 30-day backtest. Replay the alert against 30 days of historical metrics. The predicted fire count goes in the PR description.
- Noise threshold. More than one fire per week per alert is presumptive noise. Either tighten the condition or move the alert off paging.
- Reality check. Track actual versus predicted fire rate after launch. If the actual rate is at least double the prediction, pull the alert back to shadow mode.
- Burn-rate variants. For SLO-based alerts, simulate the multi-window burn-rate firing on the same backtest. A single-window prediction does not reflect how a multi-window burn-rate alert actually pages.
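As a rough illustration of the backtest, noise threshold, and reality check, the sketch below replays a simple threshold rule over evenly spaced historical samples. The rule shape (value above a threshold for a number of consecutive samples) and all function names are assumptions; a real backtest would replay the alert's actual expression against the metrics store.

```python
def count_fires(samples: list[float], threshold: float, for_samples: int) -> int:
    """Count distinct firings of a 'value > threshold for N samples' rule.

    A firing starts when the value has exceeded the threshold for
    `for_samples` consecutive samples and ends when it drops back.
    """
    fires = 0
    streak = 0
    for value in samples:
        if value > threshold:
            streak += 1
            if streak == for_samples:
                fires += 1  # count once per continuous breach
        else:
            streak = 0
    return fires


def fires_per_week(fires: int, days: int = 30) -> float:
    """Normalize a fire count over a backtest window to a weekly rate."""
    return fires * 7 / days


def is_presumptive_noise(fires: int, days: int = 30) -> bool:
    """More than one fire per week counts as presumptive noise."""
    return fires_per_week(fires, days) > 1.0


def should_return_to_shadow(predicted_rate: float, actual_rate: float) -> bool:
    """True when the actual rate is at least double the prediction."""
    return actual_rate > 0 and actual_rate >= 2 * predicted_rate
```

For example, 5 fires in a 30-day backtest is about 1.17 per week and is flagged as noise, while 4 fires is about 0.93 per week and passes. This sketch covers single-window threshold rules only; a multi-window burn-rate alert would need its own simulation on the same data.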
Ownership and review
Every alert needs a named owning team. Orphan alerts that fire at 3am have nowhere to escalate, and the on-call engineer pays.
- Named owner. An alert without a named owning team is orphaned. Page-time triage is the wrong moment to discover ownership.
- Page-first then route. The owning team takes the page first even if the cause is in a dependency. They route to the dependency team after triage.
- Quarterly review. Each team retires alerts that did not fire, or that fired without prompting any action. The catalog stays curated rather than accumulating dead rules.
- Bus-factor check. If only one engineer on the team understands the alert, that is a single point of failure; spread the runbook knowledge before approving merge.
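The quarterly review and orphan check lend themselves to a small report. This sketch assumes per-alert statistics are available in the form shown; the `AlertStats` fields are hypothetical, and "fired without action" is read here as "no firing in the quarter led to a responder action".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertStats:
    # All fields are hypothetical names for data the review would need.
    name: str
    owner_team: Optional[str]  # None means the alert is orphaned
    fires: int                 # firings during the quarter
    actioned_fires: int        # firings that led to a responder action


def orphaned(stats: list[AlertStats]) -> list[str]:
    """Alerts with no named owning team."""
    return [s.name for s in stats if not s.owner_team]


def retirement_candidates(stats: list[AlertStats]) -> list[str]:
    """Alerts that never fired, or fired but never led to action."""
    return [s.name for s in stats if s.fires == 0 or s.actioned_fires == 0]
```

The output is a list of candidates for the owning team to discuss, not an automatic deletion list; an alert that never fired may still guard against a rare but severe failure.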
Operational rollout
New alerts ship through three phases, not directly to paging. Each phase is a separate change so the team can stop the rollout when the alert misbehaves.
- Phase 1: shadow mode. Seven days, logging only. Confirms the rule fires when expected without paging anyone.
- Phase 2: ticket-only. Seven days of creating tickets without paging. Catches false-positive volume before it reaches on-call.
- Phase 3: paging. The alert pages. Promote only if phase 2 showed the predicted fire rate; otherwise hold or roll back.
- Shared contract. Document the criteria in the runbook repo so every team uses the same bar. The criteria become a shared contract, not a per-team preference.
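The three-phase rollout can be expressed as a small promotion rule. The sketch below is one possible encoding: the seven-day minimums come from the phases above, while the `tolerance` parameter and the decision to apply the same fire-rate check when leaving shadow mode are assumptions added for illustration.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = "shadow"
    TICKET = "ticket"
    PAGING = "paging"


# Minimum days in each pre-paging phase, per the rollout phases.
MIN_DAYS = {Phase.SHADOW: 7, Phase.TICKET: 7}
NEXT = {Phase.SHADOW: Phase.TICKET, Phase.TICKET: Phase.PAGING}


def decide(
    phase: Phase,
    days_in_phase: int,
    predicted_fires: float,
    observed_fires: float,
    tolerance: float = 0.5,  # hypothetical: allowed relative deviation
) -> Phase:
    """Return the phase the alert should be in after this evaluation.

    The alert is promoted only after the minimum time in the phase and
    only if observed fires are close to the prediction; otherwise it holds.
    """
    if phase is Phase.PAGING:
        return phase
    if days_in_phase < MIN_DAYS[phase]:
        return phase  # too early to judge
    allowed = tolerance * max(predicted_fires, 1.0)
    if abs(observed_fires - predicted_fires) <= allowed:
        return NEXT[phase]
    return phase  # hold; rolling back is left to the owning team
```

Because each promotion is a separate change, the function only recommends the next phase; the owning team still opens the change that moves the alert forward or rolls it back.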