Alert Grouping Policy

Grouping related alerts into a single notification reduces the number of pages the on-call engineer receives.

What a grouping policy specifies

A grouping policy specifies which labels collapse alerts into one notification, how long to wait before the first notification, how long to wait between follow-ups, and the escalation path. Grouping by cluster, service, and severity is typical; grouping by pod_name defeats the purpose because each pod gets its own notification.
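A minimal sketch of how label-based grouping collapses alerts, assuming alerts are plain dictionaries of labels. The function name group_alerts and the sample label values are illustrative, not part of any particular alerting tool.

```python
from collections import defaultdict


def group_alerts(alerts, group_by):
    """Collapse alerts that share the same values for the group_by labels."""
    groups = defaultdict(list)
    for alert in alerts:
        # A missing label becomes an empty string so every key has the same shape.
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


# Hypothetical alerts: three pods of one service failing in one cluster.
alerts = [
    {
        "alertname": "HighErrorRate",
        "cluster": "prod-eu",
        "service": "checkout",
        "severity": "critical",
        "pod_name": f"checkout-{i}",
    }
    for i in range(3)
]

by_service = group_alerts(alerts, ["cluster", "service", "severity"])
by_pod = group_alerts(alerts, ["cluster", "service", "severity", "pod_name"])

print(len(by_service), "notification(s) when grouped by cluster, service, severity")
print(len(by_pod), "notification(s) once pod_name is added to the key")
```

Adding pod_name to the key gives every pod a unique key, so the three alerts produce three notifications instead of one.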

Sensible defaults

The following defaults work for most teams: group_by: [alertname, cluster, service], group_wait: 30s, group_interval: 5m, and repeat_interval: 4h. Severity-aware splits use a shorter group_wait for critical alerts (10s) than for warnings (60s); receiver-aware splits send one event per group to PagerDuty and the full alert list to Slack.
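The defaults and the severity-aware split can be modelled as data plus one lookup. This is a sketch under assumptions: GroupingPolicy, DEFAULT, SEVERITY_GROUP_WAIT_S, and policy_for are hypothetical names, and the severity strings "critical" and "warning" are assumed label values. It covers only the timing values, not receiver routing.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class GroupingPolicy:
    group_by: tuple
    group_wait_s: int       # wait before the first notification for a new group
    group_interval_s: int   # wait between follow-ups for an existing group
    repeat_interval_s: int  # wait before repeating an unchanged notification


# Defaults from the text: 30s, 5m, 4h.
DEFAULT = GroupingPolicy(
    group_by=("alertname", "cluster", "service"),
    group_wait_s=30,
    group_interval_s=5 * 60,
    repeat_interval_s=4 * 60 * 60,
)

# Severity-aware overrides for group_wait only.
SEVERITY_GROUP_WAIT_S = {"critical": 10, "warning": 60}


def policy_for(severity):
    """Return the default policy with group_wait adjusted for the severity."""
    wait = SEVERITY_GROUP_WAIT_S.get(severity, DEFAULT.group_wait_s)
    return replace(DEFAULT, group_wait_s=wait)


for severity in ("critical", "warning", "info"):
    print(severity, policy_for(severity).group_wait_s)
```

Keeping the overrides in a small table means a severity without an entry falls back to the 30s default rather than failing.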

When to group tighter

Tighter grouping makes sense for alert storms, cluster-wide outages, and network partitions. During a storm, group all CrashLoopBackOff alerts in the same namespace within a 1-minute window. During a cluster-wide outage, group everything per cluster with a longer wait so that symptom alerts attach to the parent alert. During a network partition, include partition labels in group_by so each side of the partition shows up as its own group.
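One way to picture the storm case is a per-namespace time window. The sketch below assumes each CrashLoopBackOff alert arrives as a (timestamp in seconds, namespace, pod) tuple and that a batch stays open for 60 seconds after its first alert. The function name and sample data are hypothetical, and real alerting tools implement their own timing.

```python
def batch_by_namespace(alerts, window_s=60):
    """Batch alerts per namespace; a batch accepts alerts until window_s after its first one."""
    open_batches = {}  # namespace -> the batch currently accepting alerts
    batches = []
    for ts, namespace, pod in sorted(alerts):
        current = open_batches.get(namespace)
        if current is None or ts - current["start"] > window_s:
            # No open batch, or the window has closed: start a new batch.
            current = {"namespace": namespace, "start": ts, "pods": []}
            open_batches[namespace] = current
            batches.append(current)
        current["pods"].append(pod)
    return batches


# Hypothetical storm: four pods in "payments", one in "search".
storm = [
    (0, "payments", "api-1"),
    (20, "payments", "api-2"),
    (50, "search", "indexer-1"),
    (59, "payments", "api-3"),
    (130, "payments", "api-4"),
]

batches = batch_by_namespace(storm)
for batch in batches:
    print(batch["namespace"], batch["start"], batch["pods"])
```

The first three payments pods land in one batch; the pod that crashes at 130 seconds falls outside the window and opens a second batch.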

When to group looser

Looser grouping protects multi-service incidents, cross-team alerts, and sev1 paging. Don’t merge unrelated services just because they share a cluster label. Each team needs its own page. Sev1 alerts should never be grouped across services, because each affected service deserves its own page.
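These rules can be checked mechanically before a policy change is merged. The sketch below is a hypothetical lint function; it assumes alerts carry "team" and "service" labels and that severity is written as "sev1", none of which is prescribed by the text above.

```python
def lint_group_by(severity, group_by):
    """Return a list of problems with a group_by list; an empty list means it passes."""
    labels = set(group_by)
    problems = []
    if "team" not in labels:
        problems.append("missing 'team': alerts from different teams would share a page")
    if "service" not in labels:
        if severity == "sev1":
            problems.append("missing 'service': sev1 alerts would be grouped across services")
        elif "cluster" in labels:
            problems.append("missing 'service': unrelated services in one cluster would merge")
    return problems


too_tight = lint_group_by("sev1", ["alertname", "cluster"])
acceptable = lint_group_by("sev1", ["alertname", "cluster", "team", "service"])

for problem in too_tight:
    print("FAIL:", problem)
print("acceptable policy problems:", acceptable)
```

A check like this can run in the same review pipeline as the policy file, so an over-broad group_by is caught before it reaches production.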

Codify and review

The grouping policy lives in version control and gets reviewed like code. Run a monthly audit of grouping outcomes (grouped, ungrouped, should-have-been-grouped). Avoid changing grouping mid-incident: the temptation to silence noise mid-storm leads to bad config that survives forever.
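The monthly audit can start as a simple tally. The sketch below assumes each notification record has an alert_count and an incident_id assigned during review. Both field names are hypothetical, and the rule that a single-alert notification sharing an incident with another notification counts as should-have-been-grouped is an illustrative heuristic, not a definition from the policy.

```python
from collections import Counter


def audit(notifications):
    """Tally notifications into grouped, ungrouped, and should-have-been-grouped."""
    per_incident = Counter(n["incident_id"] for n in notifications)
    outcomes = Counter()
    for n in notifications:
        if n["alert_count"] > 1:
            outcomes["grouped"] += 1
        elif per_incident[n["incident_id"]] > 1:
            # A lone alert whose incident produced other notifications too.
            outcomes["should-have-been-grouped"] += 1
        else:
            outcomes["ungrouped"] += 1
    return outcomes


# Hypothetical month of notifications.
month = [
    {"incident_id": "INC-1", "alert_count": 12},
    {"incident_id": "INC-2", "alert_count": 1},
    {"incident_id": "INC-2", "alert_count": 1},
    {"incident_id": "INC-3", "alert_count": 1},
]

outcomes = audit(month)
for name, count in sorted(outcomes.items()):
    print(name, count)
```

A rising should-have-been-grouped count suggests the group_by labels are too specific; the fix belongs in a reviewed change after the incident, not during it.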