Alert on Anomalies, Not Norms
Don't alert on normal behaviour. Alert on deviations.
The rule
The rule is uncompromising. Don’t alert on what is normal; alert on what is unusual. “Service is up” is not an alert; “service is down” is. Many teams set alert thresholds inside the healthy range without checking whether a breach represents an anomaly or just daily traffic. On the same signal, anomaly-based alerts produce 5-10x less noise than fixed-threshold alerts.
- Don’t alert on normal. Alert on the unusual; this is the rule that bounds noise.
- Healthy-state thresholds are noise. Many teams set them without considering whether they represent anomalies.
- 5-10x noise reduction. Anomaly-based alerts vs threshold-based on the same signal.
- Per-alert anomaly check. The discipline asks “is this unusual?” before each alert ships.
Detecting normal
Detecting normal is statistical. Pull 14 days of data and compute p5, median, p95, and p99 per minute-of-day. Plot the distribution: the band from p5 to p95 is “normal”, and anything outside it is candidate alert territory. Use the same distribution to set alert thresholds; p99 plus 2 standard deviations is a defensible upper bound.
- 14-day pull. p5, median, p95, p99 per minute-of-day; the data window.
- Plot distribution. The p5-p95 band is normal; outside is candidate alert territory.
- Threshold from distribution. p99 plus 2 standard deviations is a defensible upper bound; readings below it should not fire.
- Per-signal normal definition. The bands documented per signal; supports investigation when the band changes.
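The band computation above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names (`percentile`, `build_bands`, `classify`), the `(minute_of_day, value)` input shape, and the choice of linear interpolation and population standard deviation are all assumptions made for the example. With only 14 days of data each minute-of-day has 14 samples, so the tail percentiles are coarse; pooling neighbouring minutes is one possible refinement.

```python
from collections import defaultdict
from statistics import median, pstdev


def percentile(sorted_vals, q):
    """Linear-interpolated percentile (q in 0..100) of a pre-sorted list."""
    if not sorted_vals:
        raise ValueError("no data")
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def build_bands(samples):
    """Build per-minute-of-day bands from (minute_of_day, value) pairs."""
    by_minute = defaultdict(list)
    for minute, value in samples:
        by_minute[minute].append(value)

    bands = {}
    for minute, vals in by_minute.items():
        vals.sort()
        p99 = percentile(vals, 99)
        bands[minute] = {
            "p5": percentile(vals, 5),
            "median": median(vals),
            "p95": percentile(vals, 95),
            "p99": p99,
            # Upper alert bound from the text: p99 plus 2 standard deviations.
            "upper": p99 + 2 * pstdev(vals),
        }
    return bands


def classify(bands, minute, value):
    """Return 'normal', 'candidate', or 'alert' for one reading."""
    band = bands[minute]  # raises KeyError for a minute with no history
    if value > band["upper"]:
        return "alert"
    if value < band["p5"] or value > band["p95"]:
        return "candidate"
    return "normal"
```

Storing the resulting `bands` dictionary alongside the alert definition is one way to keep the per-signal definition of normal documented.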
Detecting anomalies
Anomaly detection has a hierarchy. Use statistical methods first: Holt-Winters, Prophet, and Datadog seasonal monitors handle 80% of cases. Reach for ML models second, and only if the signal has hidden structure (multivariate or conditional relationships; most signals have none). Always include a rate-of-change check, because a signal that moves from p50 to p99 in 2 minutes is anomalous even if p99 is “normal”.
- Statistical methods first. Holt-Winters, Prophet, Datadog seasonal monitors; cover 80% of cases.
- ML second, only if there is hidden structure. Multivariate, conditional; most signals don’t need ML.
- Rate-of-change check. p50 to p99 in 2 minutes is anomalous even if p99 is in the normal band.
- Per-method scope. Statistical for periodic, ML for hidden structure, rate-of-change always.
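The rate-of-change check is simple enough to sketch directly. The example below is a hedged illustration: `rapid_rise` is a hypothetical helper, the readings are assumed to be one value per minute in chronological order, and `p50` and `p99` are assumed to come from the baseline for that signal.

```python
def rapid_rise(readings, p50, p99, window=2):
    """Return True if the signal climbs from at or below p50 to at or
    above p99 within `window` minutes.

    readings: per-minute values, oldest first (an assumed input shape).
    """
    for i, start in enumerate(readings):
        if start > p50:
            continue  # only consider climbs that begin in the lower half
        # Look at the next `window` readings after the starting point.
        for later in readings[i + 1 : i + 1 + window]:
            if later >= p99:
                return True
    return False


# A jump from 10 to 30 in two minutes trips the check;
# a slow climb to the same level over four minutes does not.
fast = rapid_rise([10, 11, 30], p50=12, p99=28)
slow = rapid_rise([10, 15, 20, 25, 30], p50=12, p99=28)
```

The check deliberately ignores whether the final value sits inside the normal band; it only measures how quickly the signal got there.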
Common traps
Three traps catch teams. The first is alerting on absolute zero traffic: zero is not anomalous on Saturday night for an internal admin tool, so use rate-of-change instead. The second is alerting on a single high reading: one-minute spikes are noise, while 3 consecutive minutes is signal. The third is forgetting to update baselines after launches: post-launch traffic can double the baseline, and old thresholds become permanent false positives.
- Zero-traffic alerts. Zero is not anomalous on Saturday night for internal admin tools; use rate-of-change.
- Single-spike alerts. One-minute spikes are noise; 3 consecutive minutes is signal.
- Stale baselines after launches. Post-launch traffic doubles the baseline; old thresholds become permanent false positives.
- Per-launch baseline reset. Reset the baseline after major changes; supports continued accuracy.
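The single-spike trap can be avoided with a consecutive-breach rule. The sketch below assumes per-minute readings in chronological order; `sustained_breach` and its `required` parameter are hypothetical names chosen for the example, with the default of 3 taken from the text.

```python
def sustained_breach(readings, threshold, required=3):
    """Return True if `required` consecutive readings exceed `threshold`.

    readings: per-minute values, oldest first (an assumed input shape).
    """
    streak = 0
    for value in readings:
        # A reading at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= required:
            return True
    return False


# Isolated one-minute spikes do not fire; three minutes in a row do.
spiky = sustained_breach([1, 9, 1, 9, 9, 1], threshold=5)
sustained = sustained_breach([1, 9, 9, 9], threshold=5)
```

Most monitoring tools expose an equivalent setting (an evaluation window or "for" duration), so in practice this logic usually lives in alert configuration rather than custom code.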
How to apply
The application is iterative. Audit your noisiest 5 alerts and replace their fixed thresholds with anomaly-based logic. Test each replacement in shadow mode for 2 weeks and promote it when the false-positive rate is under 5%. Then repeat for the next 5; the noise budget shrinks each cycle.
- Audit noisiest 5. Replace fixed thresholds with anomaly-based logic; the highest-leverage targets.
- Shadow mode for 2 weeks. Promote when false-positive rate is under 5%.
- Repeat next 5. Rinse, repeat; the noise budget shrinks each cycle.
- Per-cycle improvement record. Documented per audit; supports continued investment.
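The promotion decision at the end of a shadow period can be written down as a small check. This sketch makes two assumptions the text does not state: the false-positive rate is taken to be the fraction of shadow firings that did not correspond to a real problem, and an alert that never fired during the shadow period is held back for lack of evidence. `shadow_verdict` is a hypothetical helper name.

```python
def shadow_verdict(firings, max_fp_rate=0.05):
    """Return (false_positive_rate, promote) for a shadow-mode alert.

    firings: one boolean per shadow firing; True means the firing
    matched a real problem, False means it was a false positive.
    """
    if not firings:
        # Assumption: no firings means no evidence, so do not promote yet.
        return 0.0, False
    false_positives = sum(1 for was_real in firings if not was_real)
    rate = false_positives / len(firings)
    # "Under 5%" is read as strictly less than the limit.
    return rate, rate < max_fp_rate


# 1 false positive in 40 firings is 2.5%, so the alert is promoted.
rate, promote = shadow_verdict([True] * 39 + [False])
```

Recording the returned rate for each audited alert gives the per-cycle improvement record a concrete number to track.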