Fan-In and Fan-Out Alert Patterns

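Some alerts aggregate many signals into one; others split one signal across several teams. Both patterns are covered here, along with when to use each.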

Two patterns

Fan-in and fan-out are different routing problems. Fan-in: many signals collapse to one alert (50 instances each fire “high CPU”, but on-call needs only one page about the cluster). Fan-out: one signal triggers alerts to multiple teams (a database alert reaches the DB team, the platform team, and the on-call rotation). Both are about routing, but they are not opposites.

When to fan in

Fan-in fits when multiple sources signal the same root cause: cluster-wide CPU spikes, region-wide error increases, fleet-wide deploy failures. Group by service, region, or deployment using PagerDuty event orchestration or Nova AI Ops grouping windows. Set the group window to 5-15 minutes: a shorter window misses related signals, and a longer one holds back real escalation.
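The grouping-window idea above can be sketched in a few lines of Python. This is a tool-agnostic illustration, not how PagerDuty or Nova AI Ops implement grouping; the `Alert` and `Group` types, the `fan_in` function, and the 10-minute window (picked from inside the 5-15 minute range) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    service: str
    region: str
    name: str
    timestamp: float  # seconds since epoch


@dataclass
class Group:
    key: tuple
    first_seen: float
    alerts: list = field(default_factory=list)


# Assumed value inside the 5-15 minute range discussed above.
WINDOW_SECONDS = 10 * 60


def fan_in(alerts, window=WINDOW_SECONDS):
    """Collapse alerts that share (service, region, name) into one group per window."""
    open_groups = {}  # grouping key -> most recent group for that key
    groups = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = (alert.service, alert.region, alert.name)
        group = open_groups.get(key)
        # Start a new group when none exists or the window has elapsed
        # since the first alert of the current group.
        if group is None or alert.timestamp - group.first_seen > window:
            group = Group(key=key, first_seen=alert.timestamp)
            open_groups[key] = group
            groups.append(group)
        group.alerts.append(alert)
    return groups


# Hypothetical data: 50 instances fire the same alert within one minute.
burst = [Alert("api", "us-east", "high_cpu", 1000.0 + i) for i in range(50)]
pages = fan_in(burst)
print(len(pages), "page(s) for", len(burst), "alerts")
```

The window here is measured from the first alert in a group, so a long-running incident produces a fresh page every window instead of being held back indefinitely.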

When to fan out

Fan-out fits when one alert genuinely affects multiple teams. A database outage hits the DB team (who fix it), the platform team (who need capacity context), and on-call (who get paged). Use distinct routing rules per audience and tailor the payload to each rather than sending the same payload to all three. Avoid fan-out for political reasons (“manager wants to be notified of everything”); build a digest instead.
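A minimal sketch of per-audience routing with tailored payloads follows. The route names, payload fields, and sample alert values are hypothetical; a real setup would express the same idea in the routing rules of whatever alerting tool is in use.

```python
def db_team_payload(alert):
    # The DB team fixes the problem: give them the host and the runbook.
    return {"summary": alert["summary"], "host": alert["host"], "runbook": alert["runbook"]}


def platform_payload(alert):
    # The platform team needs capacity context, not remediation steps.
    return {"summary": alert["summary"], "capacity": alert["capacity"]}


def on_call_payload(alert):
    # On-call needs just enough to decide how urgently to respond.
    return {"summary": alert["summary"], "severity": alert["severity"]}


# One routing rule per audience (all names are assumptions for this example).
ROUTES = [
    ("db-team", db_team_payload),
    ("platform-team", platform_payload),
    ("on-call", on_call_payload),
]


def fan_out(alert, routes=ROUTES):
    """Build one tailored payload per audience from a single alert."""
    return {audience: build(alert) for audience, build in routes}


# Hypothetical alert; every value below is illustrative.
alert = {
    "summary": "primary database unreachable",
    "host": "db-01",
    "runbook": "runbooks/db-failover",
    "capacity": "connections at 98%",
    "severity": "critical",
}

for audience, payload in fan_out(alert).items():
    print(audience, payload)
```

Keeping each payload builder separate makes it obvious when an audience is receiving fields it has no use for, which is usually the first sign that a route exists for the wrong reasons.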

Anti-patterns

Three anti-patterns survive too long. Fanning out everything: every team gets every alert, and fatigue spreads across the org. Fanning in too aggressively: one mega-alert per region per hour suppresses real signals under “something happened somewhere”. Fanning in on service-name pattern matching: the grouping breaks when a service is renamed.
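The third anti-pattern is easy to reproduce. The sketch below contrasts a grouping key derived from a name pattern with one derived from a stable identifier; the `service_id` label and the service names are assumptions for illustration, and using a stable label is one possible remedy rather than something every alerting stack provides.

```python
import re


def group_key_by_name_pattern(alert):
    # Fragile: the key depends on the service name matching a fixed prefix.
    match = re.match(r"^(checkout)-", alert["service"])
    return match.group(1) if match else None


def group_key_by_label(alert):
    # Sturdier under renames, assuming a stable identifier is attached as a label.
    return alert["labels"].get("service_id")


# The same service before and after a hypothetical rename.
before_rename = {"service": "checkout-api", "labels": {"service_id": "svc-1042"}}
after_rename = {"service": "payments-api", "labels": {"service_id": "svc-1042"}}

print(group_key_by_name_pattern(before_rename), group_key_by_name_pattern(after_rename))
print(group_key_by_label(before_rename), group_key_by_label(after_rename))
```

After the rename, the pattern-based key silently stops matching, so alerts from the renamed service fall out of their group without any error to flag the change.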

Pick by use case

The decision is use-case driven. Same root cause, many sources: fan in with grouping windows. Different audiences, one signal: fan out with tailored routes. When in doubt, add neither; most alerts work fine as one signal to one team.
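The decision rule above is small enough to write down as a function. This is only a restatement of the rule in code, with hypothetical names; an empty result means the default of one signal routed to one team.

```python
def choose_patterns(many_sources_same_cause: bool, multiple_audiences: bool) -> list[str]:
    """Return the routing patterns that apply; an empty list means route directly."""
    patterns = []
    if many_sources_same_cause:
        patterns.append("fan-in with a grouping window")
    if multiple_audiences:
        patterns.append("fan-out with tailored routes")
    return patterns


# A single alert for a single team needs neither pattern.
print(choose_patterns(many_sources_same_cause=False, multiple_audiences=False))
```

Because the two conditions are checked independently, an alert can qualify for both patterns, which matches the point that fan-in and fan-out are not opposites.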