Smart Alert Routing: Why Round-Robin Wakes the Wrong People

Round-robin sounds fair and isn’t. Routing on signal beats routing on rotation.

Why round-robin fails

Round-robin treats every alert the same. A database alert and a CDN alert wake whoever’s next, regardless of who owns the system. Half the pages reach someone who must page someone else.

That second hop costs 3 to 5 minutes and burns trust.

Four routing signals

1. Service tag on the alert. Routes to the team that owns the service.
2. Severity. High-severity alerts page a person; low-severity alerts become tickets.
3. Time of day. Follow-the-sun routing for global teams.
4. Skill required. A k8s alert routes to the platform team, not the application team.
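
The four signals above can be combined in one routing function. The following is a minimal sketch, not a reference implementation: the service names, team names, the Alert fields, and the UTC hour boundaries for each region are all hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical ownership and skill maps; a real setup would load these
# from a service catalog rather than hard-coding them.
SERVICE_OWNERS = {"checkout-api": "payments", "edge-cdn": "platform"}
SKILL_TEAMS = {"k8s": "platform"}


@dataclass
class Alert:
    service: str                 # service tag on the alert
    severity: str                # "page" or "ticket"
    hour_utc: int                # hour the alert fired, 0-23
    skill: Optional[str] = None  # skill required, if any


def region_for_hour(hour_utc: int) -> str:
    """Follow-the-sun: pick the region on shift (boundaries are assumed)."""
    if 0 <= hour_utc < 8:
        return "apac"
    if 8 <= hour_utc < 16:
        return "emea"
    return "amer"


def route(alert: Alert) -> dict:
    # Signal 4 (skill) overrides signal 1 (service tag) when both apply.
    team = SKILL_TEAMS.get(alert.skill) or SERVICE_OWNERS.get(alert.service, "unowned")
    # Signal 2: severity decides whether a person is paged or a ticket is filed.
    channel = "pager" if alert.severity == "page" else "ticket-queue"
    # Signal 3: time of day decides which regional shift receives it.
    return {"team": team, "channel": channel, "region": region_for_hour(alert.hour_utc)}
```

Letting the skill signal override the service tag is a design choice made for this sketch; some teams may prefer to page the service owner and copy the platform team instead.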

Owner-of-record

Every service has a documented owner-of-record team. That team owns the service's alerts and runs its own on-call rotation internally. The router can then pick the team to page from the alert itself, without anyone having to look it up.

Service catalogs (Backstage, Port) make this structural. Without a catalog, the mapping rots.
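
One way to keep the mapping from rotting is to check it automatically. The sketch below assumes the catalog has been exported to a plain dictionary of service name to owning team; the CATALOG contents and both function names are hypothetical, and this is not the Backstage or Port API.

```python
# Hypothetical catalog export: service name -> owner-of-record team.
CATALOG = {
    "checkout-api": "payments",
    "edge-cdn": "platform",
    "legacy-batch": None,  # no owner recorded
}


def owner_of_record(service: str, catalog: dict) -> str:
    """Return the owning team, or raise so unowned services surface early."""
    owner = catalog.get(service)
    if not owner:
        raise LookupError(f"no owner-of-record for service {service!r}")
    return owner


def unowned_services(alerting_services: list, catalog: dict) -> list:
    """List services that emit alerts but have no owner in the catalog."""
    return sorted(s for s in alerting_services if not catalog.get(s))
```

Running unowned_services against the list of services that currently emit alerts, for example in CI, would flag gaps before a page bounces.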

Escalation trees

Primary on-call → secondary → manager → broader team Slack. Each step triggers when the previous one goes unacknowledged for N minutes.

The escalation tree is documented in the alert routing config, not in tribal memory.
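
As an illustration of keeping the tree in config, the sketch below expresses the four steps as data and derives who should hold the page after a given time without acknowledgment. The target names and wait times are hypothetical stand-ins for N, and the structure is not taken from any particular paging product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    target: str
    wait_minutes: int  # unacknowledged minutes before moving to the next step


# Order follows the tree in the text; the wait times are placeholders for N.
ESCALATION = [
    Step("primary-on-call", 5),
    Step("secondary-on-call", 5),
    Step("manager", 10),
    Step("team-slack", 0),  # last step, nothing further to escalate to
]


def current_target(minutes_unacked: int, tree: list = ESCALATION) -> str:
    """Return who should be holding the page after this long without an ack."""
    elapsed = 0
    for step in tree[:-1]:
        elapsed += step.wait_minutes
        if minutes_unacked < elapsed:
            return step.target
    # Every earlier step timed out, so the final step holds the page.
    return tree[-1].target
```

Because the tree is plain data, changing N or adding a step is a reviewed config change rather than something a responder has to remember.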

Antipatterns

One Slack channel for all alerts. Signal drowns in volume.
No owner-of-record. Pages bounce between teams, nobody owns the alert, and trust degrades.
Manual escalation. It gets forgotten in the middle of an incident.

What to do this week

Three moves. (1) Give your noisiest alert a service tag and an owner-of-record, and route it on those. (2) Measure pages-per-shift for one week before the change and one week after. (3) Schedule a quarterly review of the routing config so the discipline survives team turnover.
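
For the measurement step, pages-per-shift can be computed from a list of page timestamps. This is a rough sketch that assumes fixed-length shifts (12 hours by default) and timestamps exported from your paging tool; the function name and parameters are hypothetical.

```python
from datetime import datetime, timedelta


def pages_per_shift(page_times: list, start: datetime, end: datetime,
                    shift_hours: int = 12) -> float:
    """Average pages per shift over the window [start, end)."""
    # Dividing two timedeltas yields a float count of shifts in the window.
    total_shifts = (end - start) / timedelta(hours=shift_hours)
    in_window = [t for t in page_times if start <= t < end]
    return len(in_window) / total_shifts
```

Call it once with the week before the routing change and once with the week after, using the same shift length for both, so the two numbers are comparable.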