Alert Routing Patterns: From Severity to Service Owner
Where an alert lands matters more than how loudly it rings. The wrong routing wakes the wrong engineer at 3 a.m., adds 20 minutes of context-switching to MTTR, and corrodes on-call morale. Here are five routing patterns that actually work.
Why Routing Is the Highest-Leverage Alert Decision
An alert that pages the right person in the right context resolves in minutes. The same alert routed to the wrong rotation, or to a generic "ops" channel where nobody owns it, can sit unacknowledged for hours. The cost is not theoretical: every minute of misrouted MTTR is a minute of customer impact, a minute of revenue loss, and another increment of on-call burnout accumulating across the team.
Most teams design alerting in this order: define the metrics, set the thresholds, configure the on-call schedule, and treat routing as an afterthought. This is backwards. Routing should be designed first, because it determines what kind of alert volume each engineer can plausibly handle and what kind of context each page arrives with. The five patterns below cover 95% of real production scenarios.
Pattern 1: Service-Owner Routing
What it is: Each alert is routed to the on-call rotation for the team that owns the affected service. A payment-svc alert pages the payments team's on-call; an auth-svc alert pages the auth team's on-call.
When to use it: This is the default for any organization with more than 20 engineers. Service ownership is the foundation of effective on-call: the engineer paged is the one who can actually fix the problem.
How to implement: Maintain a service registry (sometimes called a service catalog or a CMDB) that maps every service to its owning team. Every alert should carry a service identifier and owning team in its labels or annotations, and your alerting tool routes on those labels.
# Prometheus alert with service routing
- alert: HighErrorRate
  expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
  labels:
    severity: critical
    service: payment-svc
    team: payments
  annotations:
    runbook: "https://wiki/runbooks/payment-svc-5xx"
The team: payments label drives the routing rule in Alertmanager or PagerDuty. The runbook annotation tells the on-call engineer where to start.
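To make the routing side concrete, here is a minimal Alertmanager sketch that routes on the team label; the receiver names and PagerDuty keys are placeholders, and the catch-all receiver anticipates the ownership gotcha below.

# Alertmanager routing sketch: route on the team label, with a catch-all
# receiver so nothing falls through unowned. Names and keys are placeholders.
route:
  receiver: platform-oncall          # catch-all for alerts with no team match
  group_by: ['alertname', 'service']
  routes:
    - matchers:
        - team="payments"
      receiver: payments-oncall
    - matchers:
        - team="auth"
      receiver: auth-oncall

receivers:
  - name: payments-oncall
    pagerduty_configs:
      - routing_key: '<payments-pagerduty-key>'
  - name: auth-oncall
    pagerduty_configs:
      - routing_key: '<auth-pagerduty-key>'
  - name: platform-oncall
    pagerduty_configs:
      - routing_key: '<platform-pagerduty-key>'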
Common gotcha: Services without a clear owner. Every service registry has a long tail of "platform tools," "shared libraries," and "owned by the org" entries whose alerts fire with no team to receive them. Solution: assign every service to a real team, including a "platform" or "SRE" catch-all team for the genuinely shared ones. No service may be unowned.
Pattern 2: Severity Ladders
What it is: Alerts are tagged with a severity (typically SEV1, SEV2, SEV3) that determines both the routing destination and the notification channel. SEV1 wakes someone via phone call. SEV2 sends a push notification. SEV3 posts to a Slack channel without paging anyone.
When to use it: Any team whose alerts differ in urgency, which in practice is every team. The biggest mistake teams make is marking every alert "critical" and then wondering why on-call is exhausted.
How to implement: Define explicit, documented severity criteria. SEV1 = customer-impacting outage requiring immediate action. SEV2 = customer-impacting degradation requiring action within 30 minutes. SEV3 = anomaly requiring investigation but not paging. Every alert rule names its severity in the labels.
The routing tool maps severity to notification channel:
- SEV1: Phone call + SMS + push + Slack mention. Acknowledgment required within 5 minutes or escalates.
- SEV2: Push notification + Slack mention. Acknowledgment required within 15 minutes.
- SEV3: Slack channel post. No paging, but tracked for response within business hours.
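In Alertmanager terms, the ladder is a set of routes keyed on the severity label. A minimal sketch, assuming label values sev1/sev2/sev3 and placeholder receiver names; acknowledgment timers and escalation policies live in the paging tool, not here:

# Severity-ladder sketch: the severity label picks the notification channel.
# Receiver names are assumptions; escalation is configured in the paging tool.
route:
  receiver: slack-sev3               # default to the quietest channel
  routes:
    - matchers:
        - severity="sev1"
      receiver: page-sev1            # phone + SMS + push + Slack mention
      repeat_interval: 5m            # keep re-notifying while the alert fires
    - matchers:
        - severity="sev2"
      receiver: push-sev2            # push notification + Slack mention
      repeat_interval: 30m
    - matchers:
        - severity="sev3"
      receiver: slack-sev3           # Slack post only, no paging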
Common gotcha: Severity inflation. Every alert author thinks their alert is critical. Without governance, SEV1 becomes meaningless within 6 months. Solution: review the SEV1 catalog quarterly and aggressively demote any alert that did not page for a genuine customer-impacting outage during that period.
Pattern 3: Follow-the-Sun
What it is: The on-call rotation hands off across geographic regions during the local team's working hours. A pager that rings at 3 a.m. in San Francisco gets routed to a London engineer at 11 a.m. local instead.
When to use it: Organizations with engineering presence in two or three time zones that can spread on-call coverage around the clock. The benefits are massive: nobody gets paged at night, and on-call shifts stay within working hours.
How to implement: Define multiple regional rotations (US-West, EMEA, APAC) and configure your routing tool to switch the active rotation based on the time of day. PagerDuty calls this "schedule layers"; Opsgenie calls it "rotation overrides"; incident.io provides explicit follow-the-sun primitives.
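If the handoff lives in routing config rather than in the paging tool's schedules, Alertmanager's time intervals can approximate it. A sketch with assumed working-hours windows and receiver names; note that hours outside both windows are a coverage gap to close with a third region or a 24/7 SEV1 route:

# Follow-the-sun sketch: each regional route matches every alert but only
# notifies during its own working hours. Hours and receivers are assumptions.
time_intervals:
  - name: emea-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '08:00'
            end_time: '16:00'
        location: 'Europe/London'
  - name: us-west-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '08:00'
            end_time: '16:00'
        location: 'America/Los_Angeles'

route:
  receiver: us-west-oncall
  routes:
    # Both routes match all alerts; active_time_intervals decides which
    # regional rotation is actually notified at any given hour.
    - receiver: emea-oncall
      active_time_intervals: ['emea-hours']
      continue: true                 # keep evaluating the next regional route
    - receiver: us-west-oncall
      active_time_intervals: ['us-west-hours']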
The hard part is not the tooling, it is the staffing. Real follow-the-sun requires at least one engineer per region with the operational knowledge to triage and resolve incidents independently. Without that, follow-the-sun degrades into "the night shift acknowledges and escalates," which is worse than just paging the on-duty engineer in their home time zone.
Common gotcha: Handoff context loss. The London engineer waking up to an incident that started at 2 a.m. SF time may have no idea what is going on. Solution: require explicit handoff messages in Slack at the rotation boundary, even when no incidents are active. "Past 8 hours: nothing happening" is information.
Pattern 4: Time-of-Day Overrides
What it is: Routing rules differ between business hours and off-hours. During business hours, a low-severity alert posts to a team Slack channel. After hours, the same alert is suppressed entirely until the next business day.
When to use it: When the business genuinely tolerates a delay between detection and response for low-severity issues. Most internal tools, batch processing failures, and non-customer-facing systems fit here.
How to implement: Define business hours per region in your routing tool, and write rules that change behavior based on the time. The simplest formulation:
- SEV1 alerts: page 24/7, no time-of-day override.
- SEV2 alerts: page during business hours, push to Slack outside business hours (no page).
- SEV3 alerts: Slack only, suppressed outside business hours.
Common gotcha: Holiday calendars and weekends. Most tools handle business hours but mishandle holidays. Solution: explicitly configure your team's holiday calendar in the routing tool, including company-wide holidays and any team-specific time off.
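A minimal Alertmanager sketch of this formulation, with a holiday calendar folded into the paging rules; the hours, time zone, dates, and receiver names are all assumptions:

# Time-of-day override sketch: SEV1 pages 24/7; SEV2 pages only during
# business hours (and never on holidays) but always lands in Slack; SEV3 is
# Slack-only and silent off hours and on holidays.
time_intervals:
  - name: business-hours
    time_intervals:
      - weekdays: ['monday:friday']
        times:
          - start_time: '09:00'
            end_time: '18:00'
        location: 'America/Los_Angeles'
  - name: company-holidays
    time_intervals:
      - months: ['december']
        days_of_month: ['25', '26']
      - months: ['january']
        days_of_month: ['1']

route:
  receiver: team-slack
  routes:
    - matchers:
        - severity="sev1"
      receiver: team-page                          # no override: pages 24/7
    - matchers:
        - severity="sev2"
      receiver: team-page
      active_time_intervals: ['business-hours']    # page only in business hours
      mute_time_intervals: ['company-holidays']    # never page on holidays
      continue: true                               # also evaluate the Slack route
    - matchers:
        - severity="sev2"
      receiver: team-slack                         # SEV2 always posts to Slack
    - matchers:
        - severity="sev3"
      receiver: team-slack
      active_time_intervals: ['business-hours']    # suppressed off hours
      mute_time_intervals: ['company-holidays']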
Pattern 5: AI-Driven Smart Routing
What it is: Instead of static routing rules, an AI system analyzes the incoming alert (its symptoms, recent change events, similar past incidents) and routes it to the best-matched engineer or team. The routing decision adapts as the system learns which engineers resolved which kinds of incidents fastest in the past.
When to use it: Larger organizations where service ownership boundaries are blurry, where incidents frequently span multiple teams, or where the routing rule maintenance burden has become a real cost.
How to implement: AI-native incident platforms like Nova AI Ops handle this out of the box. The platform ingests alerts, identifies the root cause (which often spans multiple services), determines which team's expertise matches the actual problem (not just the surface symptom), and routes the page accordingly. The same AI suggests the right runbook to attempt and provides historical context from similar past incidents.
Smart routing is also useful for handling the long tail of unowned alerts. Instead of falling through to a generic "platform" rotation, the AI examines the alert content and routes to the team most likely to be able to investigate, even if no static ownership rule exists.
Common gotcha: Trust calibration. Engineers initially distrust AI routing decisions and want to override them, which defeats the purpose. Solution: start with AI routing in suggestion mode (the AI proposes a destination, the human confirms) for the first month, build confidence, then move to autonomous routing for routine alerts while keeping suggestion mode for edge cases.
Anti-Patterns to Avoid
Three routing patterns that look reasonable and almost always backfire:
Anti-pattern 1: The "Ops" catch-all rotation. Every alert that nobody else owns goes to the platform team. This rotation grows linearly with the org and becomes a burnout machine. Fix: assign every service to a real owning team, including ones inside the platform org.
Anti-pattern 2: Email-only alerts. Email is not a paging channel. Engineers do not check email at 3 a.m., and even during business hours email gets buried. If an alert is worth firing, it is worth firing through Slack, push, or phone. Fix: phase out email alerting entirely.
Anti-pattern 3: One-engineer rotations. The "single point of failure" rotation where one person is on-call without backup. When they go on vacation, the company goes blind. Fix: every rotation has at least three people and a documented escalation path.
How to Combine the Patterns
The strongest production setups layer most of these patterns. A reasonable default for a 100-engineer org:
- Service-owner routing as the foundation: every service belongs to a team, every team has a rotation.
- Severity ladders on top: SEV1 phones, SEV2 pushes, SEV3 Slacks.
- Follow-the-sun if you have engineering presence in multiple regions, and the staff to make it real.
- Time-of-day overrides for SEV2/SEV3 to protect off-hours sleep.
- Smart routing as an evolution path: start with static rules, layer AI routing on top once volume justifies it.
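Expressed as a single Alertmanager route tree (static patterns only; smart routing would layer on separately), with placeholder team, receiver, and interval names:

# Combined sketch: service-owner routing at the top level, a severity ladder
# and time-of-day overrides inside each team's subtree. All names are
# placeholders; 'business-hours' is a time interval defined as in Pattern 4.
route:
  receiver: platform-oncall                        # catch-all: nothing is unowned
  group_by: ['alertname', 'service']
  routes:
    - matchers:
        - team="payments"
      receiver: payments-slack                     # team default: SEV3 and anything unlabeled
      routes:
        - matchers:
            - severity="sev1"
          receiver: payments-page                  # phone/push, 24/7
        - matchers:
            - severity="sev2"
          receiver: payments-page
          active_time_intervals: ['business-hours']
          continue: true
        - matchers:
            - severity="sev2"
          receiver: payments-slack
          mute_time_intervals: ['business-hours']  # Slack-only off hours
    # ...one subtree per owning team (auth, checkout, platform, ...)

Because the whole tree is plain YAML, it can live in version control and be reviewed and tested like any other code; Alertmanager's amtool config routes test subcommand, for example, shows which receiver a given label set would reach before a change ships.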
The goal is a system where the right engineer wakes up with the right context and the wrong engineer never gets paged at all. Every routing decision should be reviewable in retrospect, and the rules should be code (in version control, peer-reviewed, tested) rather than UI clicks. See how AI-native routing works in Nova AI Ops or start free.