Fan-In and Fan-Out Alert Patterns
Some alerts aggregate many signals into one; others split one signal across several audiences. This article covers both patterns.
Two patterns
Fan-in and fan-out are different routing problems. Fan-in: many signals collapse to one alert (50 instances each fire “high CPU”, on-call needs one page about the cluster). Fan-out: one signal triggers alerts to multiple teams (database alert reaches DB team, platform team, on-call rotation). Both are about routing; they are not opposites.
- Fan-in: many to one. 50 instances of “high CPU” collapse to one cluster page.
- Fan-out: one to many. Database alert reaches DB team, platform team, on-call rotation.
- Both about routing. Not opposites; different problems with different solutions.
- Document the choice per pattern. Record which pattern each alert uses; the record supports investigation when an alert behaves unexpectedly.
When to fan in
Fan-in fits multiple sources signaling the same root cause: cluster-wide CPU spikes, region-wide error increases, fleet-wide deploy failures. Group by service, region, or deployment using PagerDuty event orchestration or Nova AI Ops grouping windows. Set the group window to 5-15 minutes: a shorter window misses related signals, and a longer one holds back real escalation.
- Same root cause, many sources. Cluster-wide CPU spikes, region-wide errors, fleet deploy failures.
- Group by service, region, deployment. PagerDuty event orchestration, Nova AI Ops grouping windows.
- 5-15 minute group window. Shorter misses related signals; longer holds back real escalation.
- Define the scope per fan-in. Commit the grouping label set for each rule so that related alerts collapse without over-merging unrelated ones.
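The grouping described above can be sketched in a few lines. This is a minimal illustration, not the PagerDuty or Nova AI Ops implementation: the `Alert` and `Page` types, the `GROUP_KEYS` label set, the 10-minute window, and the `group_alerts` function are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

# Hypothetical grouping label set; this is the scope committed per fan-in rule.
GROUP_KEYS = ("service", "region", "deployment")
# Assumed window, chosen from inside the 5-15 minute range.
GROUP_WINDOW = timedelta(minutes=10)


@dataclass
class Alert:
    name: str
    labels: dict
    fired_at: datetime


@dataclass
class Page:
    key: tuple
    opened_at: datetime
    alerts: list = field(default_factory=list)


def group_alerts(alerts):
    """Collapse alerts that share a name and group labels into one page per window."""
    pages = []
    open_pages = {}  # group key -> most recent Page opened for that key
    for alert in sorted(alerts, key=lambda a: a.fired_at):
        key = (alert.name,) + tuple(alert.labels.get(k) for k in GROUP_KEYS)
        page = open_pages.get(key)
        # Open a new page when none exists or the window has elapsed.
        if page is None or alert.fired_at - page.opened_at > GROUP_WINDOW:
            page = Page(key=key, opened_at=alert.fired_at)
            open_pages[key] = page
            pages.append(page)
        page.alerts.append(alert)
    return pages
```

Labels outside `GROUP_KEYS`, such as an instance ID, are ignored when building the key, which is what lets 50 per-instance alerts collapse into one cluster page. Adding a label to `GROUP_KEYS` narrows the collapse; removing one widens it and raises the risk of over-merging.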
When to fan out
Fan-out fits one alert that genuinely affects multiple teams. A database outage hits the DB team (fix), the platform team (capacity context), and on-call (paging). Use distinct routing rules per audience and tailor the payload rather than sending the same payload to all three. Avoid fan-out for political reasons (“manager wants to be notified of everything”); build a digest instead.
- Genuine multi-team impact. Database outage: DB team fixes, platform sees capacity context, on-call pages.
- Distinct routing rules. Per audience; tailor the payload.
- Avoid political fan-out. “Manager wants notification” calls for a digest, not a fan-out.
- Per-audience payload. Each route gets text written for its audience, which supports a correct response.
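One way to picture per-audience routing is a list of routes, each with its own renderer. The sketch below is illustrative only: the `Route` type, the `DB_OUTAGE_ROUTES` list, the channel names, and the alert fields (`summary`, `host`, `connections`) are assumptions made for the example, not part of any real routing product.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Route:
    audience: str
    channel: str  # hypothetical delivery channel name
    render: Callable[[dict], str]  # builds the audience-specific text


# Hypothetical routes for a database outage; each audience gets its own payload.
DB_OUTAGE_ROUTES = [
    Route("db-team", "ticket",
          lambda a: f"Fix needed: {a['summary']} on {a['host']}"),
    Route("platform-team", "chat",
          lambda a: f"Capacity context: {a['host']} at {a['connections']} connections"),
    Route("on-call", "page",
          lambda a: f"PAGE: {a['summary']}"),
]


def fan_out(alert, routes):
    """Produce one tailored notification per route for a single alert."""
    return [
        {"audience": r.audience, "channel": r.channel, "text": r.render(alert)}
        for r in routes
    ]
```

Keeping the renderer on the route makes the "same payload to all three" shortcut harder to take by accident: adding an audience means writing its text.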
Anti-patterns
Three anti-patterns survive too long: fanning out everything (every team gets every alert, and fatigue spreads across the org); fanning in too aggressively (one mega-alert per region per hour, with real signals suppressed under “something happened somewhere”); and fanning in by service-name pattern matching (which breaks when a service is renamed).
- Fan-out everything. Every team gets every alert; alert fatigue spreads across the org.
- Fan-in too aggressively. One mega-alert per region per hour; real signals suppressed.
- Pattern-match fan-in. Breaks when a service is renamed; the routing fails silently.
- Lint for each anti-pattern. Have CI catch the common anti-patterns so that the discipline lives in the linter.
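A lint of this kind could look roughly like the following. The rule schema (`name`, `receivers`, `group_window_minutes`, `matchers`), the thresholds, and the finding names are all assumptions for illustration; a real check would read whatever format your routing rules are stored in.

```python
# Assumed limits; tune them to your own routing policy.
MAX_GROUP_WINDOW_MINUTES = 15
MAX_RECEIVERS = 3


def lint_rule(rule):
    """Return anti-pattern findings for one routing rule (hypothetical schema)."""
    findings = []
    # Fan-out everything: a single rule notifying a long list of teams.
    if len(rule.get("receivers", [])) > MAX_RECEIVERS:
        findings.append("fan-out-everything")
    # Over-aggressive fan-in: a window beyond the 5-15 minute range.
    if rule.get("group_window_minutes", 0) > MAX_GROUP_WINDOW_MINUTES:
        findings.append("fan-in-too-aggressive")
    # Pattern-match fan-in: the rule depends on a regex over the service name.
    for matcher in rule.get("matchers", []):
        if matcher.get("label") == "service" and matcher.get("op") == "regex":
            findings.append("pattern-match-fan-in")
    return findings


def lint_rules(rules):
    """Map rule name -> findings, keeping only rules that have findings."""
    report = {}
    for rule in rules:
        findings = lint_rule(rule)
        if findings:
            report[rule["name"]] = findings
    return report
```

In CI, a non-empty report from `lint_rules` would fail the build, so a renamed service or an hour-long window is caught at review time instead of during an incident.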
Pick by use case
The decision is use-case driven. Same root cause, many sources: fan in with grouping windows. Different audiences, one signal: fan out with tailored routes. When in doubt, add neither: most alerts work fine as one signal to one team.
- Many sources, one cause: fan in. Use grouping windows; collapse to one page.
- One signal, many audiences: fan out. Tailor each route; respect the audience.
- When in doubt: neither. Most alerts work fine as one signal to one team.
- Document each decision. Commit the fan-in or fan-out choice per alert so that it can be reviewed later.
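The decision rule above is small enough to write down as a function. This is a sketch under one stated assumption: the two questions are checked independently, so an alert could in principle qualify for both patterns. The function and parameter names are hypothetical.

```python
def choose_patterns(source_count, audience_count, same_root_cause):
    """Return the routing patterns an alert qualifies for; an empty list means neither."""
    patterns = []
    # Many sources signaling one cause: collapse them with a grouping window.
    if source_count > 1 and same_root_cause:
        patterns.append("fan-in")
    # One signal that genuinely affects several audiences: tailor a route for each.
    if audience_count > 1:
        patterns.append("fan-out")
    return patterns
```

For example, `choose_patterns(50, 1, True)` selects fan-in, `choose_patterns(1, 3, False)` selects fan-out, and `choose_patterns(1, 1, False)` returns an empty list, the default case of one signal to one team.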