Alert Tuning Cadence

Alerts rot. Tune them on a schedule.

Alerts need scheduled tuning

Thresholds drift: traffic grows, code changes, dependencies are swapped out, and last year’s threshold becomes this year’s noise. Tune on a fixed cadence, quarterly at minimum and monthly for high-traffic teams. Reactive tuning after a noisy weekend tends to skip the analysis and produce worse thresholds than proactive tuning.

Data-driven tuning

Base each threshold on data rather than intuition. Pull 30 days of metrics; compute candidate thresholds (95th percentile, 99th percentile, mean plus three standard deviations); backtest each candidate against the last 90 days; then pick the threshold that maximises true positives while keeping false positives within the team’s noise budget (e.g., one false positive per week per alert).
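The steps above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: the function names, the flat list of samples, and the set of labelled incident indices are all assumptions made for the example, and a real backtest would read from your metrics store and incident tracker.

```python
import statistics

def candidate_thresholds(samples):
    """Compute the three candidate thresholds from a list of metric samples."""
    # quantiles(n=100) returns 99 cut points; index 94 is p95, index 98 is p99.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    mean = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return {"p95": cuts[94], "p99": cuts[98], "3-sigma": mean + 3 * sigma}

def backtest(history, incident_indices, threshold):
    """Count (true positives, false positives) for one threshold.

    history: metric samples, one per evaluation interval (hypothetical layout)
    incident_indices: indices where a real incident was in progress
    """
    tp = fp = 0
    for i, value in enumerate(history):
        if value > threshold:
            if i in incident_indices:
                tp += 1
            else:
                fp += 1
    return tp, fp

def pick_threshold(candidates, history, incident_indices, max_false_positives):
    """Pick the candidate with the most true positives within the noise budget.

    Returns (name, threshold, tp, fp), or None if every candidate is too noisy.
    """
    best = None
    for name, threshold in candidates.items():
        tp, fp = backtest(history, incident_indices, threshold)
        if fp > max_false_positives:
            continue  # over the noise budget
        # Prefer more true positives; break ties on fewer false positives.
        if best is None or tp > best[2] or (tp == best[2] and fp < best[3]):
            best = (name, threshold, tp, fp)
    return best
```

With a noise budget of one false positive per week, max_false_positives for a 90-day backtest would be roughly 13. Note that this sketch counts every interval above the threshold as a separate fire; grouping consecutive intervals into one fire is left out for brevity.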

Automate where possible

Adaptive alerts handle most metrics automatically. For traffic-shape metrics, replace fixed thresholds with tools such as Datadog Watchdog, Prometheus predict_linear, or SLO burn-rate alerts. Keep fixed thresholds for explicit SLA values such as 99.9% availability. Don’t automate fully: human review of any threshold change that affects paging is the safety net.
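To make the burn-rate idea concrete, here is a small Python sketch of the arithmetic behind an SLO burn-rate alert. The function names and the (errors, requests) tuples are hypothetical. The 14.4 factor paired with a one-hour and a five-minute window is a commonly cited starting point for a 30-day SLO, used here as an assumption rather than a recommendation for any particular service.

```python
def burn_rate(errors, requests, slo_target):
    """Burn rate = observed error ratio divided by the error budget (1 - SLO)."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget

def should_page(long_window, short_window, slo_target=0.999, factor=14.4):
    """Page only when both windows burn the budget faster than `factor`.

    long_window / short_window: (errors, requests) tuples, e.g. 1h and 5m.
    The short window stops the alert from firing long after recovery.
    """
    return (burn_rate(*long_window, slo_target) > factor
            and burn_rate(*short_window, slo_target) > factor)
```

A burn rate of 1 means the service is spending its error budget exactly as fast as the SLO allows; the alert scales with traffic because it compares ratios rather than raw counts.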

Tuning review checklist

Each tuned alert goes through a three-question checklist. Did it fire in the last quarter (if not, why does it still exist)? Of the fires, how many were actioned (an action rate below 50% means the threshold is wrong)? Has the underlying service changed shape (traffic, dependencies, code)?
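The checklist is simple enough to run as a script ahead of the review. The sketch below assumes you can export per-alert counts for the quarter; the AlertStats fields and the review function are hypothetical names for illustration, and whether a service "changed shape" is still a human judgement recorded as a flag.

```python
from dataclasses import dataclass

@dataclass
class AlertStats:
    name: str
    fires: int             # times the alert fired in the last quarter
    actioned: int          # fires that led to someone taking action
    service_changed: bool  # traffic, dependencies, or code changed shape

def review(stats, min_action_rate=0.5):
    """Return the checklist findings for one alert (empty list means it passes)."""
    findings = []
    if stats.fires == 0:
        findings.append("never fired: justify keeping it or remove it")
    else:
        rate = stats.actioned / stats.fires
        if rate < min_action_rate:
            findings.append(f"action rate {rate:.0%} is low: threshold is likely wrong")
    if stats.service_changed:
        findings.append("service changed shape: re-derive the threshold")
    return findings
```

For example, review(AlertStats("disk-latency", fires=10, actioned=2, service_changed=False)) returns a single finding about the 20% action rate.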

Where to start

Start with the highest-leverage moves: tune the five noisiest alerts this quarter using backtested thresholds; move three fixed-threshold alerts to SLO burn-rate alerts; and add tuning to the on-call review meeting, budgeting 15 minutes per alert and closing the loop the same day.