Smart Alert Routing: Why Round-Robin Wakes the Wrong People
Round-robin sounds fair and isn’t. Routing on signal beats routing on rotation.
Why round-robin fails
Round-robin treats every alert the same. A database alert and a CDN alert both wake whoever’s next in the rotation, regardless of who owns the system. Half the pages reach someone who must then page someone else.
That second hop costs 3-5 minutes and burns trust.
Four routing signals
1. Service tag on the alert. Routes to the team that owns the service.
2. Severity. High-severity alerts page a person; low-severity alerts become tickets.
3. Time of day. Follow-the-sun routing for global teams.
4. Skill required. A k8s alert routes to the platform team, not the application team.
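The four signals can be combined in one routing function. The sketch below is illustrative only: the `Alert` fields, the team names, the severity labels, and the UTC hour bands are all assumptions, not values from any particular paging tool. It also assumes that a skill tag, when present, overrides the service owner.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Alert:
    service: str              # signal 1: service tag
    severity: str             # signal 2: e.g. "critical", "warning"
    fired_at: datetime        # signal 3: timezone-aware timestamp
    skill: Optional[str] = None  # signal 4: optional skill tag, e.g. "k8s"

# Hypothetical mappings; in practice these would come from a service catalog.
SERVICE_OWNERS = {"checkout-db": "data-team", "cdn-edge": "edge-team"}
SKILL_OWNERS = {"k8s": "platform-team"}
PAGING_SEVERITIES = {"critical", "high"}
# Follow-the-sun bands as (start_hour, end_hour, region) in UTC. Assumed values.
REGION_BY_UTC_HOUR = [(0, 8, "apac"), (8, 16, "emea"), (16, 24, "amer")]

def route(alert: Alert) -> dict:
    """Pick a team, a delivery channel, and a region for one alert."""
    # Skill beats service ownership; unknown services fall into a triage queue.
    team = SKILL_OWNERS.get(alert.skill) or SERVICE_OWNERS.get(alert.service, "unrouted")
    # Only high-severity alerts wake someone; the rest become tickets.
    channel = "page" if alert.severity in PAGING_SEVERITIES else "ticket"
    hour = alert.fired_at.astimezone(timezone.utc).hour
    region = next(r for lo, hi, r in REGION_BY_UTC_HOUR if lo <= hour < hi)
    return {"team": team, "channel": channel, "region": region}
```

An alert that falls through to "unrouted" is itself useful signal: it marks a service with no recorded owner.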
Owner-of-record
Every service has a documented owner-of-record team. The team owns the service’s alerts and runs its own internal on-call rotation. The router knows which team to page without anyone having to look it up.
Service catalogs (Backstage, Port) make this structural. Without a catalog, the mapping rots.
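One way to catch a rotting mapping is to check every alerting service against the catalog on a schedule. The sketch below uses a plain in-memory dictionary as a stand-in for a catalog export. The `CATALOG` shape and the function names are hypothetical and do not reflect the Backstage or Port APIs.

```python
# Hypothetical catalog export: service name -> metadata.
CATALOG = {
    "checkout-db": {"owner": "data-team"},
    "cdn-edge": {"owner": "edge-team"},
    "legacy-batch": {},  # entry exists, but no owner was ever recorded
}

def owner_of_record(service, catalog=CATALOG):
    """Return the owning team, or None if the service is missing or unowned."""
    entry = catalog.get(service)
    if not entry or not entry.get("owner"):
        return None
    return entry["owner"]

def unowned_services(alerting_services, catalog=CATALOG):
    """List services that fire alerts but have no owner-of-record."""
    return sorted(
        s for s in set(alerting_services) if owner_of_record(s, catalog) is None
    )
```

Running `unowned_services` over the services seen in last week's alerts gives a short list of gaps to close before they cause a bounced page.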
Escalation trees
Primary on-call → secondary → manager → broader team Slack. Each escalation is triggered when the page goes unacknowledged for N minutes.
The escalation tree is documented in the alert routing config, not in tribal memory.
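As a rough illustration, the tree can live in config as an ordered list of steps, each with its own acknowledgment timeout. The step names and the minute values below are placeholders for N, not recommendations.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Step:
    target: str
    ack_timeout_min: int  # minutes to wait for an ack before escalating

# Placeholder timeouts; choose N per step to suit your own response targets.
ESCALATION = [
    Step("primary-on-call", 5),
    Step("secondary-on-call", 10),
    Step("manager", 15),
    Step("team-slack", 0),  # last step: nothing further to escalate to
]

def current_target(minutes_unacked, tree=ESCALATION):
    """Return who should be holding the page after this long without an ack."""
    elapsed = 0
    for step in tree[:-1]:
        elapsed += step.ack_timeout_min
        if minutes_unacked < elapsed:
            return step.target
    return tree[-1].target
```

With these placeholder values, a page unacknowledged for 5 minutes moves to the secondary, at 15 minutes to the manager, and at 30 minutes to the team channel. Because the tree is data, it can be reviewed and versioned like any other config.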
Antipatterns
- One Slack channel for all alerts. Signal drowns in volume.
- No owner-of-record. Pages bounce; nobody owns; trust degrades.
- Manual escalation. Forgotten in the moment.
What to do this week
Three moves. (1) Apply this pattern to your noisiest alert. (2) Measure pages-per-shift before/after for one week. (3) Schedule the quarterly review so the discipline survives team turnover.
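For step (2), a small script over exported page timestamps is enough. The sketch below assumes 12-hour shifts and a seven-day window, and the function name is hypothetical. Adjust both parameters to match your rotation.

```python
from datetime import datetime, timedelta

def pages_per_shift(page_times, week_start, shift_hours=12, days=7):
    """Count pages in each shift of the window starting at week_start."""
    n_shifts = days * 24 // shift_hours
    window = timedelta(hours=n_shifts * shift_hours)
    counts = [0] * n_shifts
    for t in page_times:
        offset = t - week_start
        # Ignore pages outside the measurement window.
        if timedelta(0) <= offset < window:
            counts[int(offset.total_seconds() // (shift_hours * 3600))] += 1
    return counts

def mean_pages_per_shift(counts):
    """Average pages per shift, for a simple before/after comparison."""
    return sum(counts) / len(counts)
```

Run it once on the week before the routing change and once on the week after. The per-shift list also shows whether the pages cluster in a few bad shifts or are spread evenly.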