Alert Grouping Policy
Group related alerts so that one incident produces one notification instead of many. This reduces page count.
What a grouping policy specifies
A grouping policy specifies which labels collapse alerts into one notification, how long to wait before the first notification, how long to wait between follow-ups, and the escalation path. Grouping by cluster, service, and severity is typical; grouping by pod_name defeats the purpose because each pod gets its own notification.
- Group-by labels. Typical: cluster, service, severity. Atypical: instance, pod_name. Grouping by pod_name defeats the purpose because each pod gets its own notification.
- group_wait and group_interval. group_wait is how long to buffer a new group before sending its first notification; group_interval is how long to wait before notifying about alerts added to a group that has already been notified.
- Escalation path. Which receiver gets the grouped notification. Group too aggressively and a real second incident hides inside the first.
- Per-policy documented intent. The rationale for each grouping is committed alongside the Alertmanager config, so it is available during later investigation.
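As a rough sketch, the elements above could map onto an Alertmanager route like this. The receiver name oncall-default and the wording of the comments are illustrative assumptions, not taken from any existing config.

```yaml
# Sketch only: the receiver name is hypothetical.
route:
  # Documented intent: one notification per cluster, service, and severity.
  # pod_name and instance are deliberately left out so that a single
  # incident does not fan out into one notification per pod or host.
  group_by: [cluster, service, severity]
  group_wait: 30s        # buffer a new group before its first notification
  group_interval: 5m     # wait before notifying about alerts added to the group
  receiver: oncall-default   # where the grouped notification is sent

receivers:
  - name: oncall-default     # notifier settings omitted in this sketch
```

Keeping the rationale as comments next to group_by means the reason for a grouping choice travels with the config through review and history.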
Sensible defaults
The defaults work for most teams: group_by: [alertname, cluster, service], group_wait: 30s, group_interval: 5m, repeat_interval: 4h. Severity-aware splits use a shorter group_wait for critical alerts (10s) than for warnings (60s); receiver-aware splits send one event per group to PagerDuty and the full alert list to Slack.
- Default config. group_by: [alertname, cluster, service]; group_wait: 30s; group_interval: 5m; repeat_interval: 4h.
- Severity-aware splits. Critical alerts use a shorter group_wait (10s) than warnings (60s).
- Receiver-aware splits. PagerDuty receives one event per group; Slack receives the full list.
- Per-receiver tuning. Each receiver is tuned to its noise tolerance, so each channel gets the level of detail it can absorb.
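The defaults and splits above might be expressed as follows. This is a minimal sketch: the receiver names, the Slack channel, the webhook file path, the placeholder routing key, and the severity label values are all assumptions for illustration.

```yaml
route:
  receiver: chat-only
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    # Severity-aware split: critical alerts wait less before the first notification.
    - matchers:
        - 'severity="critical"'
      receiver: page-and-chat
      group_wait: 10s
    - matchers:
        - 'severity="warning"'
      receiver: chat-only
      group_wait: 60s

receivers:
  # Receiver-aware split: Alertmanager notifies once per group, so PagerDuty
  # gets a single event while the Slack message lists every alert in the group.
  - name: page-and-chat
    pagerduty_configs:
      - routing_key: REPLACE_WITH_ROUTING_KEY   # placeholder, not a real key
    slack_configs:
      - api_url_file: /etc/alertmanager/slack-webhook   # hypothetical path
        channel: '#alerts'                               # hypothetical channel
        text: |-
          {{ range .Alerts }}- {{ .Labels.alertname }} ({{ .Labels.service }})
          {{ end }}
  - name: chat-only
    slack_configs:
      - api_url_file: /etc/alertmanager/slack-webhook
        channel: '#alerts'
```

Child routes inherit group_by, group_interval, and repeat_interval from the parent, so each split only overrides the setting it needs to change.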
When to group tighter
Tighter grouping makes sense for storms, cluster-wide outages, and network partitions. Group all CrashLoopBackOff alerts in the same namespace within 1 minute; group everything per cluster with a longer wait during cluster-wide outages so symptoms attach to the parent; use partition labels in group_by so each side of a partition shows up cleanly.
- Storms during deploys. Group all CrashLoopBackOff alerts in the same namespace within 1 minute.
- Cluster-wide outages. Group everything per cluster with a longer wait so symptoms attach to the parent.
- Network partition events. Use partition labels in group_by so each side of the partition shows up cleanly.
- Per-storm policy match. Each storm class has a documented grouping rule, so noise is reduced automatically when that class of storm occurs.
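One way to sketch the three tighter-grouping cases as child routes is shown below. The alert names, the scope and partition labels, and the 2m wait are hypothetical; substitute whatever your alert rules actually emit.

```yaml
route:
  receiver: default
  group_by: [alertname, cluster, service]
  routes:
    # Deploy storms: crash-loop alerts in one namespace collapse into one
    # notification after a 1 minute wait.
    - matchers:
        - 'alertname="KubePodCrashLooping"'   # assumed alert name
      group_by: [cluster, namespace]
      group_wait: 1m
    # Cluster-wide outages: group per cluster and wait longer so symptom
    # alerts land in the same notification as the parent.
    - matchers:
        - 'scope="cluster"'                   # hypothetical label
      group_by: [cluster]
      group_wait: 2m                          # assumed value for "longer wait"
    # Network partitions: one group per side of the partition.
    - matchers:
        - 'alertname="NetworkPartition"'      # hypothetical alert name
      group_by: [cluster, partition]          # partition is a hypothetical label

receivers:
  - name: default   # notifier settings omitted in this sketch
```

Sibling routes are evaluated in order and the first match wins, so the most specific storm class should come first.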
When to group looser
Looser grouping protects multi-service incidents, cross-team alerts, and sev1 paging. Don’t merge unrelated services just because they share a cluster label; each team needs its own page; sev1 should never be grouped across services because each affected service deserves its own page.
- Multi-service incidents. Don’t merge unrelated services just because they share a cluster label.
- Cross-team alerts. Each team needs its own page; don’t collapse two on-call rotations into one.
- Sev1 paging tier. Should never be grouped across services; one page per affected service.
- Per-tier ungrouping rule. The tiers that must not collapse are documented, so a grouping change cannot silently merge pages that need to stay separate.
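A sketch of the looser-grouping rules, assuming alerts carry a team label and that the top tier is labelled severity="sev1". The team names and receiver names are hypothetical.

```yaml
route:
  receiver: default
  group_by: [alertname, cluster, service]
  routes:
    # Sev1 tier: service stays in group_by, so each affected service
    # produces its own page. Do not drop it for this tier.
    - matchers:
        - 'severity="sev1"'
      group_by: [cluster, service]
      routes:
        # Cross-team alerts: each on-call rotation has its own receiver.
        - matchers:
            - 'team="payments"'
          receiver: payments-page
        - matchers:
            - 'team="search"'
          receiver: search-page

receivers:
  - name: default         # notifier settings omitted in this sketch
  - name: payments-page
  - name: search-page
```

Because service remains in group_by at every level, two unrelated services never merge just because they share a cluster label.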
Codify and review
The grouping policy lives in version control and gets reviewed like code. Audit grouping outcomes monthly (grouped, ungrouped, should-have-been-grouped), and avoid changing grouping mid-incident: the temptation to silence noise mid-storm leads to bad config that survives forever.
- Version-controlled config. Grouping rules in git; review changes like code.
- Monthly audit. Alerts grouped, alerts ungrouped, alerts that should have been grouped but weren’t.
- Don’t change mid-incident. The temptation to silence noise mid-storm leads to bad config that survives forever.
- Per-quarter grouping review. Grouping outcomes are reviewed against the actual incident pattern each quarter, and the policy is tuned accordingly.
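Reviewing grouping rules like code can include an automated syntax check on every change. The workflow below is a sketch that assumes the repository is hosted on GitHub, uses GitHub Actions, and keeps the config at alertmanager/alertmanager.yml; none of that comes from the policy itself. It runs amtool check-config from the prom/alertmanager image.

```yaml
# Hypothetical CI workflow: the file path and the trigger are assumptions.
name: alertmanager-config
on:
  pull_request:
    paths:
      - 'alertmanager/**'
jobs:
  check-config:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Validate Alertmanager config
        run: |
          docker run --rm \
            -v "$PWD/alertmanager:/cfg:ro" \
            --entrypoint /bin/amtool \
            prom/alertmanager:latest \
            check-config /cfg/alertmanager.yml
```

A syntax check does not tell you whether alerts land in the right group. amtool config routes test can additionally show which receiver a given label set resolves to, which is useful evidence in a review.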