Alert Acknowledgement Pattern
Acknowledging an alert tells the system you're on it.
Ack is a contract, not a button
Acknowledging an alert tells the system a human owns the response. It silences re-pages, starts the response timer, and pauses escalation. Treat an ack as a commitment to be working the issue within 5 minutes: if you ack and walk away, the system stays quiet while customers suffer. A page that is still unacknowledged after the first re-notify (typically 5 minutes) escalates to the next responder, which is the safety net for missed pages.
- Ack signals human ownership. Silences re-pages; starts response timer; pauses escalation.
- Commitment to work within 5 minutes. Ack-and-walk-away keeps system quiet while customers suffer.
- Re-notify and escalation. Unacked after 5 minutes escalates; the safety net for missed pages.
- Per-page contract. Each ack is a separate commitment; holding to it builds a culture of accountability.
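The contract above can be sketched as a small decision function. This is a minimal illustration, not any paging vendor's API: `Page`, `next_action`, and `RENOTIFY_AFTER` are hypothetical names, and the 5-minute value is the typical re-notify interval mentioned in the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Assumption: 5 minutes, the "typical" first re-notify interval from the text.
RENOTIFY_AFTER = timedelta(minutes=5)


@dataclass
class Page:
    """Hypothetical record of a single page sent to a responder."""
    sent_at: datetime
    acked_at: Optional[datetime] = None


def next_action(page: Page, now: datetime) -> str:
    """Return what the paging system should do with this page right now."""
    if page.acked_at is not None:
        # An ack silences re-pages and pauses escalation: a human owns it.
        return "hold"
    if now - page.sent_at >= RENOTIFY_AFTER:
        # Still unacked after the re-notify interval: hand to the next responder.
        return "escalate"
    return "wait"
```

Note that an acked page returns "hold" indefinitely in this sketch, which is exactly why an ack has to mean someone is actually working the issue.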
Ack timing targets
Time-to-ack deserves an SLO: under 5 minutes for Sev1 and under 15 minutes for Sev2. Track it per incident and per rotation. If 95th-percentile time-to-ack stays above target for a quarter, the rotation is understaffed or the paging tool is unreliable. PagerDuty, Opsgenie, and Incident.io all expose ack-time data, so pull it into a quarterly on-call health review.
- Sev1: under 5 minutes. The headline target.
- Sev2: under 15 minutes. Less urgent but still time-bounded.
- 95th percentile target. Above target for a quarter means understaffed or unreliable paging tool.
- Quarterly on-call review. Ack-time data from PagerDuty, Opsgenie, Incident.io.
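One way to check a quarter of ack times against these targets is sketched below. It assumes the ack times have already been exported from the paging tool as a list of durations. The function names and the nearest-rank percentile method are choices made for this illustration, not something the tools prescribe.

```python
import math
from datetime import timedelta

# Targets from the text; the dictionary keys are an assumed naming convention.
ACK_SLO = {"sev1": timedelta(minutes=5), "sev2": timedelta(minutes=15)}


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    if not values:
        raise ValueError("need at least one ack time")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def ack_slo_breached(severity, ack_times):
    """True if the 95th-percentile time-to-ack exceeds the target for this severity."""
    return p95(ack_times) > ACK_SLO[severity]
```

Run it once per severity and per rotation for the quarter; a breach is the prompt to look at staffing or paging reliability, not a verdict on either.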
When the acker gets stuck
Acked but stalled is its own anti-pattern: the page is silenced but the incident is not advancing. Build a re-page on stalled ack, so that if the alert is still firing 15 minutes after ack and no incident has been opened, the system pages again. Encourage early escalation as well, because the on-call who escalates at 10 minutes saves more time than the one who solos for 45.
- Acked-and-stalled. Page silenced; incident not advancing; the silent failure mode.
- Re-page on stalled ack. 15 minutes after ack with no incident opened: page again.
- Early escalation encouraged. 10-minute escalation beats 45-minute solo attempt.
- Per-incident progress check. Flag every acked-and-stalled alert so the response gets corrected while the incident is still live.
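The stalled-ack rule described above reduces to a single predicate. The sketch below assumes the alerting system can report whether the alert is still firing and whether an incident has been opened. `AlertState` and `should_repage` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# From the text: re-page if nothing has happened 15 minutes after the ack.
STALL_WINDOW = timedelta(minutes=15)


@dataclass
class AlertState:
    """Hypothetical snapshot of an alert as seen by the paging system."""
    acked_at: Optional[datetime]
    still_firing: bool
    incident_opened: bool


def should_repage(alert: AlertState, now: datetime) -> bool:
    """True when an acked alert has stalled and the responder should be paged again."""
    if alert.acked_at is None:
        # Unacked pages are covered by the normal re-notify and escalation path.
        return False
    stalled = alert.still_firing and not alert.incident_opened
    return stalled and now - alert.acked_at >= STALL_WINDOW
```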
Auto-acknowledgement traps
Auto-acknowledgement is dangerous. Auto-resolving alerts on metric recovery is fine, but auto-acknowledging on integration-bot pings is not, because the system thinks a human is responding when nobody is. Never let a chatbot or workflow ack a sev1; that is for humans only. If you must auto-ack, log it loudly and require human follow-up within 5 minutes.
- Auto-resolve on recovery: OK. Metric recovery means alert no longer applies.
- Auto-ack on bot ping: dangerous. System thinks human responding; nobody is.
- Never auto-ack sev1. Humans only; the absolute rule.
- If you must, log loudly and require follow-up. 5-minute human follow-up window.
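These rules can be enforced at the point where an ack is accepted. The following is a minimal sketch under stated assumptions: the caller already knows whether the actor is a human, severities are labelled `"sev1"`, `"sev2"`, and so on, and `handle_ack` is a hypothetical hook rather than part of any real paging tool.

```python
import logging
from datetime import datetime, timedelta

logger = logging.getLogger("ack_policy")

# From the text: a human must follow up within 5 minutes of any auto-ack.
FOLLOW_UP_WINDOW = timedelta(minutes=5)


def handle_ack(severity: str, actor_is_human: bool, now: datetime) -> dict:
    """Decide whether to accept an ack and whether a human follow-up is owed."""
    if actor_is_human:
        return {"accepted": True, "human_follow_up_due": None}
    if severity == "sev1":
        # Absolute rule: bots and workflows never ack a sev1.
        logger.warning("Rejected automated ack on sev1 alert")
        return {"accepted": False, "human_follow_up_due": None}
    # Automated ack on a lower severity: allow it, but log loudly and set a deadline.
    due = now + FOLLOW_UP_WINDOW
    logger.warning("AUTO-ACK on %s alert; human follow-up due by %s", severity, due.isoformat())
    return {"accepted": True, "human_follow_up_due": due}
```

If the follow-up deadline passes without a human ack, the natural next step is to treat the alert as unacknowledged again and let normal escalation take over.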
How to fix a broken ack culture
Fixing ack culture is concrete work. Audit a month of incidents, tagging each as acked-and-resolved, acked-and-escalated, acked-and-stalled, or never-acked. If “acked-and-stalled” is more than 10% of incidents, your rotation is treating ack as a snooze button. Coach the rotation that ack means working it now, re-page on stalled ack, and track and publish ack-to-resolution time per engineer per quarter.
- Month-long audit. Tag each incident: acked-and-resolved, acked-and-escalated, acked-and-stalled, never-acked.
- 10% stalled threshold. Above means rotation treats ack as snooze button.
- Coach: ack means working it now. Re-page on stalled ack; the cultural reset.
- Per-engineer ack-to-resolution. Tracked and published quarterly; supports accountability.
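Once the month of incidents has been tagged, the 10% check is simple arithmetic. The sketch below assumes each incident carries exactly one of the four tags from the audit. `stalled_share` and `ack_is_snooze_button` are illustrative names.

```python
from collections import Counter

# From the text: more than 10% acked-and-stalled means ack is being used as a snooze button.
STALLED_THRESHOLD = 0.10


def stalled_share(tags):
    """Fraction of audited incidents tagged acked-and-stalled."""
    counts = Counter(tags)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return counts["acked-and-stalled"] / total


def ack_is_snooze_button(tags):
    """True when the stalled share exceeds the 10% threshold."""
    return stalled_share(tags) > STALLED_THRESHOLD
```

For example, 2 stalled incidents out of 10 audited gives a share of 0.2, which is over the threshold and a signal to start coaching.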