Alert Investment Priorities

Where to invest alert engineering time, ranked by return on investment.

The question

Alert engineering time is finite, so the question is where the next hour of tuning produces the biggest return. Most teams default to whichever alert fired last night, which is reactive rather than optimal. Instead, rank alerts by page volume times severity times customer impact, because the top 3 alerts usually drive 60% of on-call pain.
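The ranking can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the AlertStats fields and the 1-to-3 scales for severity and customer impact are assumptions chosen for the example, and any consistent scale would work.

```python
from dataclasses import dataclass


@dataclass
class AlertStats:
    name: str
    pages: int            # pages in the review window
    severity: int         # assumed scale: 1 (low) to 3 (high)
    customer_impact: int  # assumed scale: 1 (none) to 3 (direct)


def score(alert: AlertStats) -> int:
    """Page volume times severity times customer impact."""
    return alert.pages * alert.severity * alert.customer_impact


def rank(alerts: list[AlertStats]) -> list[AlertStats]:
    """Highest score first."""
    return sorted(alerts, key=score, reverse=True)


def top_share(alerts: list[AlertStats], n: int = 3) -> float:
    """Fraction of the total score carried by the top n alerts."""
    ranked = rank(alerts)
    total = sum(score(a) for a in ranked)
    if total == 0:
        return 0.0
    return sum(score(a) for a in ranked[:n]) / total
```

Running top_share on real paging data is a quick way to check whether the top 3 dominate on-call pain in your own rotation.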

Priority ranking

Three tiers cover most cases. Tier 1: alerts that page more than once per shift on average; fix these first, because this is the noise that destroys rotations. Tier 2: alerts with a low mean time to acknowledge (MTTA) but no clear customer impact; they get answered quickly yet waste cognitive cycles. Tier 3: alerts that fire about once a month but always require novel debugging; invest in runbooks here, not detection.
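The tiers can be expressed as a small classifier. The sketch below is illustrative only: the AlertProfile fields are hypothetical, the 5-minute cutoff for "low MTTA" is an assumed value the text does not specify, and "fires monthly" is approximated as at most one page per month.

```python
from dataclasses import dataclass
from typing import Optional

LOW_MTTA_MINUTES = 5.0  # assumption: what counts as "low" MTTA is team-specific


@dataclass
class AlertProfile:
    name: str
    pages_per_shift: float   # average pages per on-call shift
    mtta_minutes: float      # mean time to acknowledge
    customer_impact: bool    # is there a clear customer impact?
    pages_per_month: float
    novel_debugging: bool    # does every firing need fresh investigation?


def tier(alert: AlertProfile) -> Optional[int]:
    """Return 1, 2, or 3 for the matching tier, or None if no tier applies."""
    if alert.pages_per_shift > 1:
        return 1  # noisy: fix first
    if alert.mtta_minutes <= LOW_MTTA_MINUTES and not alert.customer_impact:
        return 2  # acknowledged fast, but nobody is hurt
    if alert.pages_per_month <= 1 and alert.novel_debugging:
        return 3  # rare but expensive: write a runbook
    return None
```

The checks run in priority order, so an alert that matches more than one tier lands in the most urgent one.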

Where to spend the time

Three patterns yield most of the return. Tighten thresholds first, because 60% of noise comes from thresholds that were set during the panic of an old incident and never revisited. Add customer-impact context next, which reduces triage time even when the alert keeps firing. Build dependency suppression last: it is high leverage but expensive to set up.
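To make dependency suppression concrete, here is a minimal sketch of the idea: an alert is suppressed when anything upstream of it is also firing, so only the root cause pages. The function name, the depends_on mapping, and the alert names in the usage are hypothetical, and the sketch assumes the dependency graph has no cycles. Much of the setup cost mentioned above lies in building and maintaining that mapping, not in the lookup itself.

```python
def suppress_downstream(firing, depends_on):
    """Split firing alerts into (to_page, suppressed).

    firing: iterable of alert names currently firing.
    depends_on: dict mapping an alert name to the alerts directly upstream of it.
    An alert is suppressed if any direct or transitive upstream alert is firing.
    """
    firing = set(firing)
    to_page, suppressed = [], []
    for alert in sorted(firing):
        stack = list(depends_on.get(alert, ()))
        seen = {alert}
        upstream_firing = False
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in firing:
                upstream_firing = True
                break
            # Keep walking upstream through dependencies that are not firing.
            stack.extend(depends_on.get(dep, ()))
        (suppressed if upstream_firing else to_page).append(alert)
    return to_page, suppressed
```

For example, if checkout_latency depends on api_errors and api_errors depends on db_down, and all three fire together, only db_down pages.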

Where not to spend

Three traps waste alert engineering time: ML anomaly detection on a fundamentally noisy signal (cleaning the signal source pays better than smarter detection); custom dashboards for alerts that are about to be deleted (don’t gold-plate doomed pages); and tooling that requires a vendor migration to deliver (the migration cost almost always exceeds the alert ROI).

Apply this week

The application is concrete. Pull the last 30 days of pages and rank by volume; pick the top 3; allocate 4 hours per top alert (tune the threshold, add impact text, link a runbook, measure for 2 weeks); repeat monthly, because the top 3 rotate but the discipline does not.
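The first two steps fit in a short script. The sketch below assumes the page history has already been exported as (alert name, fired-at timestamp) pairs; that export format and the function name top_alerts are assumptions for illustration, since every paging tool exposes its history differently.

```python
from collections import Counter
from datetime import datetime, timedelta


def top_alerts(pages, now, window_days=30, n=3):
    """Return the n most frequent alerts in the window as (name, count) pairs.

    pages: iterable of (alert_name, fired_at) tuples, with fired_at a datetime.
    """
    cutoff = now - timedelta(days=window_days)
    counts = Counter(
        name for name, fired_at in pages if cutoff <= fired_at <= now
    )
    return counts.most_common(n)
```

Rerunning the same script each month shows whether the tuned alerts dropped out of the top 3 and which ones replaced them.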