Alerts Practical · Samson Tanimawo, PhD · Apr 22, 2026

Alert Grouping Policy

Group related alerts into a single notification. One incident should page once, not once per symptom.

What a grouping policy specifies

Which labels collapse alerts into one notification. Typical: cluster, service, severity. Atypical: instance, pod_name. Grouping by pod_name defeats the purpose: nearly every alert lands in its own group, so nothing actually collapses.

How long to wait before sending the first notification for a new group (group_wait) and how long between follow-ups for the same group (group_interval).

The escalation path. Group too aggressively and a real second incident hides inside the first.
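One mitigation is a child route that peels a known distinct failure out of the broad group. A minimal Alertmanager routing sketch, assuming hypothetical receiver and alert names:

```yaml
route:
  receiver: team-pager                   # placeholder receiver
  group_by: [alertname, cluster, service]
  group_wait: 30s
  group_interval: 5m
  routes:
    # A distinct failure gets its own child route, so a second
    # incident pages separately instead of hiding inside the
    # first group's notification.
    - matchers:
        - alertname = "DatabaseDown"     # hypothetical alert name
      receiver: dba-pager                # placeholder receiver
      group_by: [cluster]
```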

Sensible defaults

group_by: [alertname, cluster, service]. group_wait: 30s. group_interval: 5m. repeat_interval: 4h.
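The same defaults as an Alertmanager route block:

```yaml
route:
  receiver: default                        # placeholder receiver
  group_by: [alertname, cluster, service]
  group_wait: 30s       # wait before the first notification for a new group
  group_interval: 5m    # wait between follow-ups for the same group
  repeat_interval: 4h   # re-send while the group keeps firing
```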

Severity-aware splits: critical alerts use shorter group_wait (10s) than warnings (60s).
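In Alertmanager terms, a severity split is a pair of child routes that override the parent's group_wait. A sketch with placeholder receiver names:

```yaml
# children of the top-level route
routes:
  - matchers:
      - severity = "critical"
    receiver: pagerduty          # placeholder
    group_wait: 10s              # page fast on criticals
  - matchers:
      - severity = "warning"
    receiver: slack-warnings     # placeholder
    group_wait: 60s              # let warnings batch
```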

Receiver-aware splits: PagerDuty receives one event per group; Slack receives the full list.
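A sketch of that split, with placeholder receiver and channel names. continue: true lets the same alerts match both routes, and the Slack text template iterates the whole group (the PagerDuty receiver definition is omitted for brevity):

```yaml
route:
  routes:
    - matchers:
        - severity = "critical"
      receiver: pagerduty-events   # placeholder: one event per group
      continue: true               # keep matching so the Slack route also fires
    - matchers:
        - severity = "critical"
      receiver: slack-oncall       # placeholder

receivers:
  - name: slack-oncall
    slack_configs:
      - channel: '#oncall'         # placeholder channel
        text: >-
          {{ range .Alerts }}{{ .Labels.alertname }} on {{ .Labels.service }}
          {{ end }}
```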

When to group tighter

Storms during deploys. Group all CrashLoopBackOff alerts in the same namespace within 1 minute.
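As a child route; the alert name depends on your rules (kube-prometheus, for example, emits KubePodCrashLooping):

```yaml
# child route under the top-level route
- matchers:
    - alertname = "KubePodCrashLooping"   # use whatever your crash-loop rule emits
  receiver: team-pager                    # placeholder
  group_by: [cluster, namespace]          # one group per namespace
  group_wait: 1m                          # collect the whole deploy storm, page once
```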

Cluster-wide outages. Group everything per cluster with a longer wait so symptoms attach to the parent.
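A sketch with an illustrative wait. Route order matters: Alertmanager takes the first matching route, so a broad catch-all like this belongs below more specific ones:

```yaml
# child route under the top-level route; keep it below more specific routes
- matchers:
    - cluster =~ ".+"          # anything carrying a cluster label
  receiver: infra-pager        # placeholder
  group_by: [cluster]          # the whole cluster rides one notification
  group_wait: 2m               # illustrative: longer wait so symptoms attach
```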

Network partition events. Use partition labels in group_by so each side of the partition shows up cleanly.
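A sketch assuming your alerting rules attach a hypothetical partition_side label:

```yaml
# child route under the top-level route
- matchers:
    - partition_side =~ ".+"            # hypothetical label set by your rules
  receiver: network-pager               # placeholder
  group_by: [cluster, partition_side]   # each side of the partition groups cleanly
```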

When to group looser

Multi-service incidents. Don't merge unrelated services just because they share a cluster label.

Cross-team alerts. Each team needs its own page; don't collapse two on-call rotations into one.

The Sev1 paging tier should never group across services. One page per affected service.
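Putting those three rules together, a sketch with placeholder team labels and receivers:

```yaml
# per-team child routes under the top-level route
- matchers:
    - team = "payments"              # placeholder team label
  receiver: payments-pager           # placeholder: this team's rotation
  group_by: [service, alertname]     # service in group_by: unrelated services never merge
  routes:
    - matchers:
        - severity = "sev1"          # adjust to your severity scheme
      group_by: [service]            # one page per affected service
      group_wait: 10s
- matchers:
    - team = "checkout"
  receiver: checkout-pager
  group_by: [service, alertname]
```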

Codify and review

Put the grouping config in version control. Review changes like code.
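Validation can ride the same review flow. A hypothetical GitHub Actions job that runs amtool check-config (amtool ships in the prom/alertmanager image) on every pull request:

```yaml
# .github/workflows/alertmanager-check.yml   (hypothetical workflow)
name: validate-alertmanager-config
on:
  pull_request:
    paths: ["alertmanager/**"]
jobs:
  check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Lint the routing config
        run: >
          docker run --rm --entrypoint amtool
          -v "$PWD/alertmanager:/cfg" prom/alertmanager
          check-config /cfg/alertmanager.yml
```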

Audit grouping outcomes monthly: alerts grouped, alerts ungrouped, alerts that should have been grouped but weren't.

Avoid changing grouping during incidents. The temptation to silence noise mid-storm leads to bad config that survives forever.