Alert Classification Engine
Auto-classify alerts as actionable or noise.
The problem
Most alerting systems treat every signal the same. A disk warning, a customer-impacting outage, and a one-off blip all flow through the same pipeline. Without classification, the first 90 seconds of a page are often spent asking “is this real?”, which is triage the system could do automatically. A classification engine buckets each alert as actionable, informational, suppressed, or escalated before it reaches a human.
- One pipeline for all signals. Disk warning, outage, blip; same flow.
- Wasted on-call triage. First 90 seconds: “is this real?”.
- Four buckets. Actionable, informational, suppressed, escalated.
- Pre-human classification. Engine triages before reaching the on-call.
Rule-based classification
Start with hand-written rules. Tag alerts by service, severity, and customer impact, then route on those tags: SEV1 to PagerDuty, SEV3 to Slack, SEV4 to a daily digest. PagerDuty event orchestration, Opsgenie integrations, and incident.io workflows all support rule trees 4 or 5 levels deep. Rule-based classification handles about 80% of cases at low cost, with a moderate maintenance burden and predictable behaviour.
- Hand-written rule trees. Tag by service, severity, customer impact.
- Vendor support. PagerDuty event orchestration, Opsgenie, incident.io.
- 4-5 level depth. Sufficient for most rule trees.
- 80% case coverage. Low cost, moderate maintenance, predictable.
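As a rough illustration of the routing described above, here is a minimal first-match rule table in Python. The `Alert` fields, the destination labels, and `DEFAULT_ROUTE` are hypothetical names chosen for this sketch; none of them is a vendor API. The text does not say where SEV2 or unmatched alerts go, so the default below is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    service: str
    severity: str          # "SEV1" .. "SEV4"
    customer_impact: bool


# Ordered rules: the first matching predicate wins.
# Destinations are plain labels, not real integrations.
RULES = [
    (lambda a: a.severity == "SEV1", "pagerduty"),
    (lambda a: a.severity == "SEV3", "slack"),
    (lambda a: a.severity == "SEV4", "daily_digest"),
]

# Assumption: unmatched alerts (for example SEV2) fall back to Slack.
DEFAULT_ROUTE = "slack"


def route(alert: Alert) -> str:
    """Return the destination label for an alert."""
    for predicate, destination in RULES:
        if predicate(alert):
            return destination
    return DEFAULT_ROUTE
```

Keeping the rules in an ordered list makes the behaviour predictable: reading the list top to bottom tells you exactly where any alert ends up.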
ML-based classification
ML helps at scale, with caveats. Vendors such as Moogsoft, BigPanda, and Nova AI Ops use ML to cluster, dedupe, and score alerts, which becomes useful once alert volume exceeds 10k per day. The models need training data: without 6 months of labelled history the system cannot distinguish noise from signal, so don’t enable it on day one. Black-box scoring breaks trust quickly, so pick a vendor that explains why an alert was suppressed.
- Vendor ML: cluster, dedupe, score. Moogsoft, BigPanda, Nova AI Ops.
- 10k/day volume threshold. Below that, rule-based is enough.
- 6 months training data. Without it, can’t distinguish noise from signal.
- Explainable suppression. Black-box breaks trust; explanations preserve it.
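To show what an explainable suppression decision can look like, here is a small Python sketch. It is not an ML model and does not reflect how any vendor scores alerts. It is a deterministic time-window dedupe used as a stand-in, and the point is that every decision carries a human-readable reason. `Decision`, `Deduper`, the fingerprint format, and the 300-second window are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    suppressed: bool
    reason: str  # shown to on-call so the suppression can be checked


class Deduper:
    """Suppress repeats of the same fingerprint inside a time window."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._last_seen = {}  # fingerprint -> timestamp of last occurrence

    def decide(self, fingerprint: str, timestamp: float) -> Decision:
        previous = self._last_seen.get(fingerprint)
        # Record this occurrence so the window slides with each repeat.
        self._last_seen[fingerprint] = timestamp
        if previous is not None and timestamp - previous <= self.window_seconds:
            age = timestamp - previous
            return Decision(True, f"duplicate of '{fingerprint}' seen {age:.0f}s ago")
        return Decision(False, "first occurrence in window")
```

Whatever model sits behind the score, returning the reason alongside the verdict is what lets on-call verify a suppression instead of trusting it blindly.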
The feedback loop
The feedback loop keeps the classifier honest. On-call should be able to mark an alert as “this was noise” or “this was real” with one click, and the classifier should learn from that signal. Without feedback the engine drifts: suppressed alerts that turn out to be real never feed back, and the model degrades silently. Audit weekly by pulling a sample of 20 suppressed alerts and confirming they were genuinely noise; mistakes here are how outages slip through.
- One-click feedback. “Noise” or “real”; classifier learns from signal.
- Without feedback, drift. Suppressed-but-real never feeds back; model degrades silently.
- Weekly 20-sample audit. Confirm genuine noise; outages slip through here.
- Per-week feedback rate. Track how many alerts get labelled each week; a falling rate means the classifier is learning from less signal.
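The feedback and audit steps above can be sketched in a few lines of Python. The function names, the in-memory log, and the `missed_real_rate` metric are hypothetical and chosen for illustration; a real system would persist the labels and feed them to the classifier.

```python
import random
from collections import Counter

VALID_LABELS = ("noise", "real")


def record_feedback(log: list, alert_id: str, label: str) -> None:
    """Append a one-click on-call label to the feedback log."""
    if label not in VALID_LABELS:
        raise ValueError(f"label must be one of {VALID_LABELS}")
    log.append((alert_id, label))


def weekly_audit_sample(suppressed_ids, sample_size: int = 20, seed=None) -> list:
    """Pick up to sample_size suppressed alerts, without replacement, for review."""
    ids = list(suppressed_ids)
    rng = random.Random(seed)
    return rng.sample(ids, min(sample_size, len(ids)))


def missed_real_rate(reviewed_labels) -> float:
    """Fraction of audited suppressed alerts that reviewers marked as real."""
    labels = list(reviewed_labels)
    if not labels:
        return 0.0
    return Counter(labels)["real"] / len(labels)
```

Sampling at random instead of reviewing the most recent suppressions avoids auditing the same noisy source every week. Any non-zero `missed_real_rate` is worth a look, since each miss is a real alert that nobody saw.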
Pick by scale
The pick is volume-driven. Under 1k alerts per day: rule-based via PagerDuty event rules is enough. 1k to 10k alerts per day: rule-based plus dedup and grouping, tune with monthly audits. Above 10k: an ML-backed classification engine pays for itself, but only if the feedback loop is wired and audited.
- Under 1k/day: rule-based. PagerDuty event rules; no ML needed.
- 1k-10k/day: rules plus dedup. Monthly audits keep tuning current.
- Above 10k: ML-backed. Pays for itself; feedback loop required.
- Per-volume scaling decision. Document the choice and revisit it as alert volume changes.
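The volume thresholds above translate directly into a small decision helper. This is only a sketch: `recommend_approach` and its return labels are hypothetical, and treating exactly 1k and exactly 10k alerts per day as part of the middle tier is an assumption, since the text gives the tiers as "under 1k", "1k to 10k", and "above 10k".

```python
def recommend_approach(alerts_per_day: int) -> str:
    """Map daily alert volume to the classification approach from the text."""
    if alerts_per_day < 0:
        raise ValueError("alerts_per_day must be non-negative")
    if alerts_per_day < 1_000:
        return "rule-based"
    if alerts_per_day <= 10_000:
        # Assumption: both boundary values belong to the middle tier.
        return "rules + dedup and grouping"
    # Only worthwhile if the feedback loop is wired and audited.
    return "ml-backed"
```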