Alert Routing by Data

Routing alerts by the data they carry (labels such as env, service, and severity) rather than by static, per-rule receivers.

Routing by alert content

Data-driven routing inspects the alert payload and picks a receiver from its labels: env, region, service, severity, team. Alertmanager’s tree of routes plus matchers covers most cases; Datadog’s monitor tags serve the same purpose. The reason to bother: one rule definition can serve many teams without being copied once per receiver.
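As a minimal sketch of the idea, assuming a flat, first-match list of routes (the Route class, pick_receiver function, and all receiver names are hypothetical and not part of Alertmanager or Datadog), label-based receiver selection could look like this:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    matchers: dict  # label name -> required value (exact match only)
    receiver: str   # hypothetical receiver name


def pick_receiver(labels: dict, routes: list, default: str) -> str:
    """Return the receiver of the first route whose matchers all match the labels."""
    for route in routes:
        if all(labels.get(name) == value for name, value in route.matchers.items()):
            return route.receiver
    return default


# One alert rule, many teams: the team and severity labels decide where it goes.
ROUTES = [
    Route({"team": "payments", "severity": "sev1"}, "payments-pagerduty"),
    Route({"team": "payments"}, "payments-slack"),
    Route({"team": "search"}, "search-slack"),
]

alert = {
    "alertname": "HighErrorRate",
    "team": "payments",
    "severity": "sev1",
    "env": "production",
}
print(pick_receiver(alert, ROUTES, default="catch-all"))
```

The same HighErrorRate rule fired with team=search would land in search-slack without a second rule definition.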

Labels to route on

Pick the labels with care. Required: severity, service, env. Optional: region, cluster, team, customer_tier. Avoid volatile labels such as instance or pod_name, because they multiply routes and receivers beyond what anyone can manage. Validate label names as well as values, because a mismatch such as env=staging vs environment=staging silently misroutes pages.
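The checks above can be sketched as a small lint step. This is an illustration under assumptions: the function names are hypothetical, and the allowed env values are invented for the example (only the sev1 to sev3 severities and the label lists come from the text).

```python
REQUIRED = {"severity", "service", "env"}
OPTIONAL = {"region", "cluster", "team", "customer_tier"}
VOLATILE = {"instance", "pod_name"}

# Assumed value sets; replace with whatever your organisation actually uses.
ALLOWED_VALUES = {
    "severity": {"sev1", "sev2", "sev3"},
    "env": {"production", "staging", "dev"},
}


def check_alert_labels(labels: dict) -> list:
    """Report missing required labels and unexpected values on one alert."""
    problems = [f"missing required label: {name}" for name in sorted(REQUIRED - set(labels))]
    for name, allowed in sorted(ALLOWED_VALUES.items()):
        if name in labels and labels[name] not in allowed:
            problems.append(f"unexpected value for {name}: {labels[name]!r}")
    return problems


def check_route_matchers(matcher_names: set) -> list:
    """Report matcher labels that are volatile or not approved for routing."""
    problems = []
    for name in sorted(matcher_names):
        if name in VOLATILE:
            problems.append(f"volatile label in matcher: {name}")
        elif name not in REQUIRED | OPTIONAL:
            problems.append(f"label not approved for routing: {name}")
    return problems


# The environment= vs env= mismatch shows up as a missing required label.
print(check_alert_labels({"severity": "sev2", "service": "checkout", "environment": "staging"}))
print(check_route_matchers({"team", "pod_name"}))
```

Running the alert check against rule definitions in CI, and the matcher check against the routing config, catches both kinds of mistake before they can misroute a page.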

Routing tree pattern

The routing tree usually splits in three layers. The top layer splits on severity (sev1 to PagerDuty, sev2 to Slack, sev3 to an email queue); the middle layer splits on team (each team has its own PagerDuty service or Slack channel); the leaf layer splits on env (production to on-call, non-prod to a daytime channel).
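A rough model of that three-layer tree, assuming first-match descent with receivers inherited from the parent (the Node class, resolve function, and every receiver name are hypothetical, and the model ignores features such as Alertmanager's continue flag):

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    match: dict                     # label name -> required value; {} matches everything
    receiver: Optional[str] = None  # None means "inherit from the parent"
    children: list = field(default_factory=list)


def resolve(node: Node, labels: dict, inherited: str = "catch-all") -> str:
    """Descend into the first matching child at each level; return the deepest receiver."""
    receiver = node.receiver or inherited
    for child in node.children:
        if all(labels.get(name) == value for name, value in child.match.items()):
            return resolve(child, labels, receiver)
    return receiver


def team_branch(team: str, prod_receiver: str, nonprod_receiver: str) -> Node:
    """Middle layer (team) with a leaf layer (env): production overrides the non-prod default."""
    return Node({"team": team}, nonprod_receiver, [Node({"env": "production"}, prod_receiver)])


TREE = Node({}, "catch-all", [
    # Top layer: severity.
    Node({"severity": "sev1"}, "pagerduty-default", [
        team_branch("payments", "pagerduty-payments-oncall", "slack-payments-daytime"),
        team_branch("search", "pagerduty-search-oncall", "slack-search-daytime"),
    ]),
    Node({"severity": "sev2"}, "slack-default", [
        team_branch("payments", "slack-payments-alerts", "slack-payments-daytime"),
    ]),
    Node({"severity": "sev3"}, "email-queue"),
])

print(resolve(TREE, {"severity": "sev1", "team": "payments", "env": "production"}))
print(resolve(TREE, {"severity": "sev1", "team": "payments", "env": "staging"}))
```

An alert whose team has no branch falls back to the severity-level receiver, so an unlabelled sev1 still pages someone instead of disappearing.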

Managing routing changes

Routing config is code. Change it by pull request, with review and CI checks: amtool check-config for Alertmanager, or terraform validate for Datadog monitors managed in Terraform. Keep a runbook for routing changes during incidents, because mid-incident edits are a known cause of missed escalations. Snapshot the live config nightly to S3 so it can be diffed against the last known good version.
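For the snapshot comparison, a minimal sketch using only the standard library is shown here. It assumes both configs are already available as text (fetching the snapshot from S3 is left out), and the routing_diff function and the config fragment are illustrative only.

```python
import difflib


def routing_diff(last_known_good: str, live: str) -> list:
    """Return unified-diff lines between two routing configs; empty if identical."""
    return list(difflib.unified_diff(
        last_known_good.splitlines(),
        live.splitlines(),
        fromfile="last-known-good",
        tofile="live",
        lineterm="",
    ))


# Illustrative fragment, not a complete Alertmanager configuration.
GOOD = """route:
  receiver: catch-all
  routes:
    - matchers: ['severity="sev1"']
      receiver: pagerduty
"""

# Simulate a mid-incident edit that quietly moved sev1 away from the pager.
LIVE = GOOD.replace("receiver: pagerduty", "receiver: slack")

for line in routing_diff(GOOD, LIVE):
    print(line)
```

An empty diff means the live config still matches the snapshot; anything else is worth a look before the next page fires.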

Keep the tree shallow

Three rules keep routing maintainable. Keep the tree to three levels at most, because deeper trees are hard to reason about; skip data-driven routing if the team owns fewer than 30 alerts, because static config is fine at that size; don’t route on customer-specific labels unless contractually required, because per-customer pagers are the path to madness.
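The depth limit is easy to enforce mechanically. The sketch below assumes the routing config has already been parsed into nested dictionaries with child routes under a "routes" key (the shape Alertmanager's YAML takes once loaded); the function names and the sample config are hypothetical.

```python
MAX_LEVELS = 3


def route_depth(route: dict) -> int:
    """Count the nested route levels below this node (0 for a node with no children)."""
    children = route.get("routes") or []
    if not children:
        return 0
    return 1 + max(route_depth(child) for child in children)


def within_depth_limit(root: dict, max_levels: int = MAX_LEVELS) -> bool:
    """True when the tree below the root has at most max_levels layers."""
    return route_depth(root) <= max_levels


# Severity -> team -> env: exactly three levels below the root.
CONFIG = {
    "receiver": "catch-all",
    "routes": [
        {
            "matchers": ['severity="sev1"'],
            "receiver": "pagerduty",
            "routes": [
                {
                    "matchers": ['team="payments"'],
                    "routes": [
                        {"matchers": ['env="production"'], "receiver": "payments-oncall"},
                    ],
                },
            ],
        },
        {"matchers": ['severity="sev3"'], "receiver": "email-queue"},
    ],
}

print(route_depth(CONFIG), within_depth_limit(CONFIG))
```

Wired into CI, a check like this turns the three-level rule from a convention into something a pull request cannot silently break.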