Alert Scope Creep Prevention
Alerts that get tweaked repeatedly drift away from their original purpose. Here is how to prevent it.
The problem
Alerts are created with a clear purpose and drift over time. A new condition is added, then another, until the alert fires on a different problem than the one it was written for. Six months later the on-call engineer sees the alert fire and has no idea what to do, because the original purpose is lost. Scope creep is silent: it happens through small PRs, each defensible in isolation.
- Drift over time. New conditions added incrementally; original purpose lost.
- Six-month forgetting. The on-call sees the alert and doesn’t know what to do; this is the cumulative cost of drift.
- Silent through small PRs. Each defensible in isolation; the aggregate is the problem.
- Per-alert purpose erosion. Each alert starts with a clear purpose and loses it through additions.
How it happens
Three patterns produce scope creep. First, an incident teaches you that condition X also matters, and you add it to an existing alert instead of creating a new one. Second, a team member who didn’t write the alert tweaks the threshold to fix a single false positive, so the alert now fires for a different reason than it was designed for. Third, aggregate operators (max, p99) get added incrementally until nobody knows which signal is actually being measured.
- Add condition to existing alert. Incident teaches X matters; team adds to existing alert.
- Threshold tweak to fix a false positive. The alert now fires for a different reason than designed, so it drifts.
- Aggregate operators added. max, p99 incrementally; signal becomes unclear.
- Per-pattern lint. A CI check can catch each of these patterns at PR time, which supports prevention.
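A per-pattern lint can be sketched as a comparison of an alert's conditions before and after a PR. This is a minimal illustration, not a real alerting API: the `Condition` shape, the `lint_alert_change` function, and the metric names are all hypothetical, and the sketch assumes at most one condition per metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """One clause of an alert rule (hypothetical shape, not a real alerting API)."""
    metric: str       # e.g. "checkout_latency_seconds"
    aggregate: str    # e.g. "p99", "max", "avg"
    threshold: float


def lint_alert_change(old: list[Condition], new: list[Condition]) -> list[str]:
    """Compare an alert's conditions before and after a PR and flag creep patterns."""
    warnings: list[str] = []
    # Assumption: at most one condition per metric, so the metric name is a usable key.
    old_by_metric = {c.metric: c for c in old}

    for cond in new:
        before = old_by_metric.get(cond.metric)
        if before is None:
            # Pattern 1: a condition added to an existing alert.
            warnings.append(f"added condition on {cond.metric}: consider a new alert")
            continue
        if cond.aggregate != before.aggregate:
            # Pattern 3: the aggregate operator changed, so the measured signal changed.
            warnings.append(
                f"aggregate on {cond.metric} changed {before.aggregate} -> {cond.aggregate}"
            )
        if cond.threshold != before.threshold:
            # Pattern 2: a threshold tweak; the author should say why it fires now.
            warnings.append(
                f"threshold on {cond.metric} changed {before.threshold} -> {cond.threshold}"
            )
    return warnings


old_rule = [Condition("checkout_latency_seconds", "p99", 0.5)]
new_rule = [
    Condition("checkout_latency_seconds", "max", 0.8),
    Condition("db_connections", "avg", 90.0),
]
for warning in lint_alert_change(old_rule, new_rule):
    print(warning)
```

The lint only raises warnings. Whether a flagged change is acceptable is still a decision for the reviewer.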
Prevention
Three mechanisms prevent scope creep. First, every alert carries a one-line purpose statement (“detect customer-facing checkout latency above SLO”) in the alert payload. Second, code review requires any change to the alert logic to explain whether the purpose statement still holds. Third, a quarterly alert review walks the catalog with the owning team and splits or deletes any alert whose purpose has drifted.
- One-line purpose statement. For example, “detect customer-facing checkout latency above SLO”, carried in the payload.
- Review checks that the purpose holds. Any logic change must explain why the purpose statement still applies.
- Quarterly catalog walk. Alerts whose purpose has drifted get split into two alerts or deleted; this is the recurring discipline.
- Per-alert purpose persistence. The statement lives with the alert, which supports future review.
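The first two mechanisms might look like the sketch below: the purpose travels in the payload, and a review gate blocks logic changes until someone confirms the purpose still holds. The `AlertDefinition` fields, the function names, and the expression syntax in `expr` are assumptions made for illustration and do not come from any specific alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertDefinition:
    """Hypothetical alert record; field names are illustrative."""
    name: str
    purpose: str   # the one-line purpose statement
    expr: str      # alert logic, written here in a made-up expression syntax


def build_payload(alert: AlertDefinition, observed: float) -> dict:
    """Carry the purpose in every notification so the on-call sees it when paged."""
    return {"alert": alert.name, "purpose": alert.purpose, "observed": observed}


def review_passes(old: AlertDefinition, new: AlertDefinition,
                  purpose_confirmed: bool) -> bool:
    """Allow a logic change only if the PR confirms the purpose still holds."""
    if old.expr == new.expr:
        return True  # no logic change, nothing to defend
    return purpose_confirmed


before = AlertDefinition(
    name="checkout-latency",
    purpose="detect customer-facing checkout latency above SLO",
    expr="p99(checkout_latency_seconds) > 0.5",
)
after = AlertDefinition(
    name="checkout-latency",
    purpose=before.purpose,
    expr="p99(checkout_latency_seconds) > 0.8",
)

print(build_payload(before, observed=0.73))
print(review_passes(before, after, purpose_confirmed=False))
```

In practice the `purpose_confirmed` flag would come from something like a PR checklist item. How it is collected is left open here.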
Split vs extend
The default is splitting. Two alerts with clear purposes are easier to operate than one alert with two purposes. Extend an existing alert only when the underlying signal is genuinely the same: “p99 from Prometheus” plus “p99 from Datadog APM” is one alert with redundancy, not two purposes. If a single alert needs two runbooks, it’s two alerts.
- Default to splitting. Two alerts with clear purposes beat one alert with two purposes.
- Extend on same signal. p99 from Prometheus plus Datadog APM is redundancy, not two purposes.
- Two runbooks = two alerts. The runbook count is the test for splitting.
- Per-extension justification. Each extension is defended in its PR, which supports the discipline.
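The runbook-count test can be expressed as a small grouping step. The sketch below assumes each condition in an alert can be tagged with the runbook an on-call would follow for it; the function names and runbook paths are hypothetical.

```python
from collections import defaultdict


def propose_split(conditions: list[tuple[str, str]]) -> dict[str, list[str]]:
    """Group (condition, runbook) pairs by runbook; each group is one candidate alert."""
    groups: dict[str, list[str]] = defaultdict(list)
    for condition, runbook in conditions:
        groups[runbook].append(condition)
    return dict(groups)


def should_split(conditions: list[tuple[str, str]]) -> bool:
    """More than one distinct runbook means more than one purpose."""
    return len(propose_split(conditions)) > 1


# Same signal from two sources: redundancy, one runbook, one alert.
redundant = [
    ("p99 latency from Prometheus", "runbooks/checkout-latency"),
    ("p99 latency from Datadog APM", "runbooks/checkout-latency"),
]

# A second purpose has crept in: it needs its own runbook, so it needs its own alert.
crept = redundant + [("db connection pool exhausted", "runbooks/db-pool")]

print(should_split(redundant))
print(should_split(crept))
print(propose_split(crept))
```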
Apply this quarter
Three concrete steps apply this. Audit your top 10 paging alerts and confirm each has a single, clear purpose. Split any alert with two purposes, migrating carefully: don’t delete the original until the new alerts have run for 2 weeks. Add the purpose-statement field to your alert template and reject new alerts without one.
- Top 10 paging audit. Confirm single clear purpose per alert.
- Split alerts with two purposes. Run new alerts for 2 weeks before deleting original.
- Purpose-statement field required. Reject new alerts without one; the template enforces.
- Per-alert audit record. Document the result of each audit to support continued discipline.
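Two of these steps lend themselves to simple checks: rejecting alert definitions that lack a purpose, and guarding the 2-week soak before the original alert is deleted. The sketch below assumes alert definitions are plain dictionaries with a `purpose` key; that key, the function names, and the dates in the usage lines are illustrative.

```python
from datetime import date, timedelta

# The 2-week soak period from the split guidance above.
SOAK_PERIOD = timedelta(weeks=2)


def validate_new_alert(definition: dict) -> list[str]:
    """Return reasons to reject a new alert; an empty list means it passes."""
    errors: list[str] = []
    purpose = (definition.get("purpose") or "").strip()
    if not purpose:
        errors.append("missing purpose statement")
    elif "\n" in purpose:
        errors.append("purpose statement must be one line")
    return errors


def original_can_be_deleted(new_alerts_live_since: date, today: date) -> bool:
    """The original alert stays until the replacements have run for the soak period."""
    return today - new_alerts_live_since >= SOAK_PERIOD


print(validate_new_alert({"name": "checkout-latency"}))
print(validate_new_alert({
    "name": "checkout-latency",
    "purpose": "detect customer-facing checkout latency above SLO",
}))
print(original_can_be_deleted(date(2024, 1, 1), today=date(2024, 1, 10)))
```

Running `validate_new_alert` in CI against the alert template is one way to make the template enforce the rule instead of relying on reviewers to remember it.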