Alert Routing by Data
Alert routing driven by the alert payload, not fixed receiver lists.
Routing by alert content
Static routing sends alerts to fixed receivers. Data-driven routing inspects the alert payload and routes by labels: env, region, service, severity, team.
Alertmanager's tree of routes plus matchers covers most cases. Datadog's monitor tags do the same.
The reason to bother: one rule definition can serve many teams without manual receiver duplication.
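A minimal sketch in Alertmanager config syntax. The receiver names and label values here are placeholders, and the receivers are declared without integrations to keep it short:

  route:
    receiver: default-queue        # fallback when no matcher hits
    routes:
      # One rule, driven entirely by label data on the alert payload.
      - matchers:
          - severity="critical"
          - env="production"
        receiver: prod-oncall

  receivers:
    - name: default-queue
    - name: prod-oncall

Any alert carrying severity=critical and env=production reaches prod-oncall; everything else falls back to default-queue. Adding a team is a new matcher block, not a new copy of the rule.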
Labels to route on
Required: severity, service, env. Optional: region, cluster, team, customer_tier.
Avoid routing on volatile labels like instance or pod_name. They create unmanageable receiver explosions.
Validate label names as well as values. A typo like environment=staging instead of env=staging silently misroutes pages.
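One way to surface that failure mode: in Alertmanager, a matcher with an empty value matches alerts where the label is absent, so a triage route can catch anything arriving without env set. The label-debug receiver is a placeholder for a low-noise triage channel:

  route:
    receiver: default-queue
    routes:
      # env="" matches alerts with no env label, so typos like
      # environment=staging land here instead of silently taking the default.
      - matchers:
          - env=""
        receiver: label-debug

  receivers:
    - name: default-queue
    - name: label-debug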
Routing tree pattern
Top: split on severity. Sev1 to PagerDuty, sev2 to a Slack channel, sev3 to an email queue.
Middle: split on team. Each team has its own PagerDuty service or Slack channel.
Leaf: split on env. Production routes to on-call, non-prod routes to a daytime channel. A config sketch follows this list.
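The full tree as Alertmanager config, with placeholder receiver names. Only one team (payments) is shown; each additional team is one more block at the middle level:

  route:
    receiver: email-queue                # sev3 and anything unmatched
    routes:
      # Top level: severity decides the delivery mechanism.
      - matchers:
          - severity="sev1"
        receiver: pagerduty-catchall     # sev1 with no team match
        routes:
          # Middle level: each team gets its own block.
          - matchers:
              - team="payments"
            receiver: payments-oncall    # production pages the on-call
            routes:
              # Leaf level: non-prod drops to a daytime channel.
              - matchers:
                  - env!="production"
                receiver: payments-daytime
      - matchers:
          - severity="sev2"
        receiver: slack-alerts

  receivers:
    - name: email-queue
    - name: pagerduty-catchall
    - name: slack-alerts
    - name: payments-oncall
    - name: payments-daytime

Alertmanager walks the tree depth-first and uses the deepest matching node's receiver. A sev1 production alert from payments matches no leaf route, so it falls back to payments-oncall at the team level, which is exactly the intended behavior.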
Managing routing changes
Treat the routing config like code. Pull request, review, and CI validation with amtool check-config for Alertmanager, or terraform validate for Datadog monitors managed in Terraform.
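A sketch of the CI step, assuming GitHub Actions and a config file at alertmanager.yml in the repo root; any CI system that can run amtool works the same way:

  # .github/workflows/routing-ci.yml
  name: validate-routing
  on: [pull_request]
  jobs:
    check:
      runs-on: ubuntu-latest
      steps:
        - uses: actions/checkout@v4
        - name: Validate Alertmanager routing config
          run: |
            go install github.com/prometheus/alertmanager/cmd/amtool@latest
            "$(go env GOPATH)/bin/amtool" check-config alertmanager.yml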
Have a runbook for routing changes during incidents. Mid-incident routing edits are a known cause of missed escalations.
Snapshot the live config nightly to S3. When something breaks, you can diff against the last known good.
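One possible shape for the snapshot job, assuming the Alertmanager v2 API is reachable at AM_URL, AWS credentials are already configured on the runner, and the bucket name is a placeholder. The /api/v2/status endpoint returns the running config under config.original:

  # .github/workflows/routing-snapshot.yml
  name: snapshot-routing
  on:
    schedule:
      - cron: "0 3 * * *"        # nightly
  jobs:
    snapshot:
      runs-on: ubuntu-latest
      steps:
        - name: Pull live config and upload to S3
          run: |
            curl -fsS "$AM_URL/api/v2/status" \
              | jq -r '.config.original' > live-config.yml
            aws s3 cp live-config.yml \
              "s3://routing-backups/alertmanager/$(date +%F).yml"
          env:
            AM_URL: ${{ secrets.AM_URL }}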
Keep the tree shallow
Three levels max. Beyond that the tree becomes unmaintainable.
Skip data-driven routing if the team owns under 30 alerts. Static config is fine.
Don't route on customer-specific labels unless contractually required. Per-customer paging is the path to madness.