Alert Tuning Cadence
Alerts rot. Tune them on a schedule.
Alerts need scheduled tuning
Threshold drift is real: traffic grows, code changes, dependencies swap, and last year’s threshold becomes this year’s noise. Tuning needs a cadence (quarterly floor, monthly for high-traffic teams) because reactive tuning after a noisy weekend skips the analysis and produces worse thresholds than proactive tuning.
- Threshold drift. Traffic grows, code changes, dependencies swap; last year’s threshold becomes this year’s noise.
- Quarterly floor. Quarterly is the minimum cadence; monthly for high-traffic teams.
- Reactive worse than proactive. Tuning after a noisy weekend skips analysis; produces worse thresholds.
- Per-team tuning calendar. Put the cadence on each team’s calendar; a scheduled slot protects tuning from being displaced by urgent work.
Data-driven tuning
Tuning should be data-driven. Pull 30-day metrics; compute candidates (99th percentile, 95th percentile, 3-sigma above mean); backtest each against the last 90 days; pick the threshold that maximises true positives and caps false positives at the team’s noise budget (e.g., 1 false positive per week per alert).
- Pull 30-day metrics. The data window for candidate computation.
- Threshold candidates. 99th percentile, 95th percentile, 3-sigma above mean; multiple candidates worth comparing.
- Backtest 90 days. How many alerts would have fired, how many real incidents would have been caught.
- Pick by noise budget. Maximise true positives, cap false positives at 1 per week per alert.
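The steps above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function names, the use of daily peak values, and the "a fire on an incident day counts as a true positive" rule are all assumptions made for the example.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of values; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def threshold_candidates(samples):
    """Compute the three candidate thresholds from a 30-day sample window."""
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return {
        "p95": percentile(samples, 95),
        "p99": percentile(samples, 99),
        "3sigma": mean + 3 * sigma,
    }


def backtest(threshold, daily_peaks, incident_days):
    """Replay a threshold over daily peak values (assumed granularity).

    daily_peaks: one peak value per day, index 0 = oldest day.
    incident_days: set of day indexes on which a real incident occurred.
    Returns (true_positives, false_positives).
    """
    fired = {day for day, peak in enumerate(daily_peaks) if peak > threshold}
    return len(fired & incident_days), len(fired - incident_days)


def pick_threshold(candidates, daily_peaks, incident_days, max_fp_per_week=1.0):
    """Pick the candidate with the most true positives within the noise budget."""
    weeks = len(daily_peaks) / 7
    best = None
    for name, value in candidates.items():
        tp, fp = backtest(value, daily_peaks, incident_days)
        if fp / weeks > max_fp_per_week:
            continue  # exceeds the team's noise budget
        key = (tp, -fp)  # more true positives first, then fewer false positives
        if best is None or key > best[0]:
            best = (key, name, value)
    return None if best is None else (best[1], best[2])
```

In practice, `daily_peaks` would cover the 90-day backtest window and `incident_days` would come from the team's incident records. A return value of `None` means no candidate fits the noise budget and the candidate set needs widening.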
Automate where possible
Adaptive alerts handle most metrics automatically. Tools such as Datadog Watchdog, Prometheus predict_linear, and SLO burn-rate alerts can replace fixed thresholds for traffic-shape metrics. Keep fixed thresholds for explicit SLA values like 99.9% availability, where the contract is the threshold. Don’t fully automate: human review of any threshold change that affects paging is the safety net.
- Adaptive alerting tools. Datadog Watchdog, Prometheus predict_linear, SLO burn-rate alerts; cover traffic-shape metrics.
- Replace most fixed thresholds. Adaptive alerts work for traffic-shape; fixed thresholds are the exception.
- Keep fixed for SLAs. Explicit SLA values like 99.9% availability stay fixed; the contract is the threshold.
- Human in the loop. Threshold changes that affect paging need human review; the safety net.
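To make the burn-rate idea concrete, here is a small Python sketch of the arithmetic behind an SLO burn-rate alert. The burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The function names, the two-window check, and the default factor of 14.4 (a commonly cited fast-burn value for a 30-day SLO) are assumptions for illustration; a real deployment would normally express this in the monitoring system's own query language.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(long_window, short_window, slo_target, factor=14.4):
    """Page only if both windows burn faster than `factor` (assumed default).

    Each window is an (errors, requests) tuple. Requiring the short window
    as well lets the alert stop firing soon after the problem is fixed.
    """
    return (
        burn_rate(*long_window, slo_target) >= factor
        and burn_rate(*short_window, slo_target) >= factor
    )
```

For a 99.9% target, 20 errors in 1,000 requests is a burn rate of about 20, so `should_page((20, 1000), (3, 100), 0.999)` pages, while a short window of `(1, 100)` (burn rate about 10) does not.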
Tuning review checklist
Each tuned alert passes a three-question checklist: did it fire in the last quarter (if not, why is it still here); of the fires, how many were actioned (an action rate below 50% means the threshold is wrong); and has the underlying service changed shape (traffic, dependencies, code)?
- Fire frequency. Did the alert fire this quarter; if no, why is it still here.
- Action rate. Of the fires, how many were actioned; below 50% means the threshold is wrong.
- Service shape change. Traffic, dependencies, code; if yes, threshold is suspect.
- Per-alert review record. Each tuned alert documented; supports later audit.
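The checklist can be encoded as a small function so that every review produces the same record. This is a sketch under assumptions: the `AlertQuarter` fields and the wording of the findings are hypothetical, and the fire and action counts would come from whatever paging tool the team uses.

```python
from dataclasses import dataclass


@dataclass
class AlertQuarter:
    """One alert's numbers for the quarter under review (hypothetical shape)."""
    name: str
    fires: int
    actioned: int
    service_changed: bool


def review(alert, min_action_rate=0.5):
    """Apply the three checklist questions; return a list of findings."""
    findings = []
    if alert.fires == 0:
        findings.append("did not fire this quarter: justify or delete")
    elif alert.actioned / alert.fires < min_action_rate:
        rate = alert.actioned / alert.fires
        findings.append(f"action rate {rate:.0%} is below {min_action_rate:.0%}: threshold is wrong")
    if alert.service_changed:
        findings.append("service shape changed: threshold is suspect")
    return findings
```

An empty list means the alert passed all three questions. Storing the returned findings alongside the alert definition gives the per-alert review record used for later audit.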
Where to start
Start with high-leverage moves. Tune the 5 noisiest alerts this quarter with backtested thresholds; move 3 fixed-threshold alerts to SLO burn-rate alerts; add tuning to the on-call review meeting at 15 minutes per alert with same-day loop closure.
- 5 noisiest alerts first. Tune with backtested thresholds; the highest-leverage starting point.
- 3 fixed-threshold migrations. Move to SLO burn-rate alerts; the structural improvement.
- 15 minutes per alert in on-call review. Close the loop the same day; supports the cadence discipline.
- Per-quarter delta tracking. Documented improvement quarter-over-quarter; supports continued investment.
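Finding the noisiest alerts can be as simple as counting fires that nobody acted on. The sketch below assumes a hypothetical fire log of `(alert_name, was_actioned)` pairs and defines noise as unactioned fires; other definitions (total fires, pages outside business hours) are equally reasonable.

```python
from collections import Counter


def noisiest(fire_log, top_n=5):
    """Rank alerts by number of unactioned fires, noisiest first.

    fire_log: iterable of (alert_name, was_actioned) pairs (assumed format).
    Returns a list of (alert_name, unactioned_count) of length <= top_n.
    """
    noise = Counter(name for name, actioned in fire_log if not actioned)
    return noise.most_common(top_n)
```

Running the same count each quarter and comparing the totals is one way to produce the quarter-over-quarter delta.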