Ack vs Resolve Discipline
Acknowledging stops paging; resolving closes the alert. The two are not interchangeable.
The distinction
Ack and resolve are two states with two very different meanings. Treating them as the same thing breaks the metrics built on top of them and breaks the escalation logic that protects on-call.
- Ack. Tells the paging system a human is engaged. Stops escalation. Says nothing about whether the problem is fixed.
- Resolve. Tells the system the underlying problem is fixed. Closes the incident. Should never fire while the symptom is still active.
- Tooling parity. PagerDuty, Opsgenie, and incident.io all expose both states. The semantics matter for escalation, MTTA reporting, and post-incident analysis.
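- Cost of confusion. When responders ack and never resolve, incidents stay open and MTTR drifts upward. When responders resolve just to silence a page, the incident closes while the symptom is still live and nothing is left to escalate. Every dashboard built on top inherits the corruption.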
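The two states can be pictured as a small state machine. The sketch below is illustrative only: `State`, `Incident`, and the `symptom_active` flag are hypothetical names, not the data model of PagerDuty, Opsgenie, or incident.io.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional


class State(Enum):
    TRIGGERED = "triggered"        # paging and escalating
    ACKNOWLEDGED = "acknowledged"  # a human is engaged; escalation stopped
    RESOLVED = "resolved"          # underlying problem fixed; incident closed


@dataclass
class Incident:
    """Hypothetical incident record, for illustration only."""
    triggered_at: datetime
    state: State = State.TRIGGERED
    acked_at: Optional[datetime] = None
    resolved_at: Optional[datetime] = None

    @property
    def escalating(self) -> bool:
        # Only an un-acked, un-resolved incident keeps escalating.
        return self.state is State.TRIGGERED

    def ack(self, now: datetime) -> None:
        # Ack records engagement; it says nothing about the fix.
        if self.state is State.TRIGGERED:
            self.state = State.ACKNOWLEDGED
            self.acked_at = now

    def resolve(self, now: datetime, symptom_active: bool) -> None:
        # Resolve must never fire while the symptom is still active.
        if symptom_active:
            raise ValueError("symptom still active; ack instead of resolving")
        self.state = State.RESOLVED
        self.resolved_at = now
```

Keeping `acked_at` and `resolved_at` as separate timestamps is what lets MTTA and MTTR be computed independently later.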
Ack discipline
Ack is a contract with the rotation: “I see this; do not wake anyone else.” The contract only works if the team treats it as a hard cutoff and follows up with channel context.
- 5-minute window. Ack within 5 minutes of being paged. After that the secondary on-call is paged automatically; the rotation depends on it.
- Ack does not mean fixed. It means “I see this, I am working on it, no need to page anyone else.” State this explicitly in the on-call onboarding doc.
- Channel post within 10 minutes. Ack is not triage. The on-call posts what they are seeing in the incident channel within 10 minutes so others can help.
- Hand-off rules. If the on-call cannot stay engaged, they re-page rather than silently leaving the ack in place. The ack is a commitment, not a shrug.
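The two time limits above can be expressed as simple checks. This is a minimal sketch, assuming both windows are measured from the moment of the page (the text does not say whether the 10 minutes runs from the page or from the ack); the function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

ACK_WINDOW = timedelta(minutes=5)       # ack deadline from the rules above
CHANNEL_WINDOW = timedelta(minutes=10)  # channel-post deadline (assumed to run from the page)


def should_page_secondary(paged_at: datetime,
                          acked_at: Optional[datetime],
                          now: datetime) -> bool:
    """True once the ack window has elapsed with no ack on record."""
    if acked_at is not None:
        return False  # a human is engaged; stop escalating
    return now - paged_at >= ACK_WINDOW


def channel_post_overdue(paged_at: datetime,
                         posted_at: Optional[datetime],
                         now: datetime) -> bool:
    """True if nothing has been posted in the incident channel in time."""
    if posted_at is not None:
        return False
    return now - paged_at >= CHANNEL_WINDOW
```

A hand-off fits the same model: re-paging clears the ack, so `acked_at` goes back to `None` and the escalation clock applies to the next responder.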
Resolve discipline
Resolve closes the incident. If the symptom returns five minutes later, MTTR is understated, the refire opens a second incident, and the postmortem starts from a timeline that hides the real duration.
- Confirm before resolving. The signal must have returned to normal for at least 5 minutes, or the root cause must be confirmed mitigated. Resolving on hope is a regression generator.
- Auto-resolve off for high severity. A flapping signal that auto-resolves at the 4-minute mark masks an ongoing incident. Sev-1 and Sev-2 stay manual.
- Refire within 30 minutes. If the alert refires inside half an hour, treat it as the same incident in the post-incident review. Splitting the timeline hides the real duration.
- Resolve note. A one-line resolve comment captures what fixed it. The note is what future on-call reads when the same alert fires again.
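The resolve rules above can be sketched as a gate plus a timeline merge. This is an illustration under stated assumptions: the severity labels `"sev1"` and `"sev2"`, the `healthy_since` timestamp, and both function names are hypothetical, and "within 30 minutes" is read as strictly less than 30 minutes after the previous resolve.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

STABLE_FOR = timedelta(minutes=5)      # signal must stay normal this long
REFIRE_WINDOW = timedelta(minutes=30)  # refires inside this count as one incident
MANUAL_ONLY = {"sev1", "sev2"}         # assumed labels for Sev-1 and Sev-2


def may_resolve(severity: str,
                healthy_since: Optional[datetime],
                now: datetime,
                automatic: bool,
                mitigation_confirmed: bool = False) -> bool:
    """Decide whether a resolve is allowed under the rules above."""
    if automatic and severity in MANUAL_ONLY:
        return False  # high severity never auto-resolves
    if mitigation_confirmed:
        return True   # root cause confirmed mitigated
    return healthy_since is not None and now - healthy_since >= STABLE_FOR


def merge_refires(incidents: List[Tuple[datetime, datetime]]
                  ) -> List[Tuple[datetime, datetime]]:
    """Fold (triggered_at, resolved_at) pairs for one alert into review timelines."""
    merged: List[Tuple[datetime, datetime]] = []
    for start, end in sorted(incidents):
        if merged and start - merged[-1][1] < REFIRE_WINDOW:
            # Refire soon after the previous resolve: extend that incident.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Merging before computing durations is what keeps a flapping alert from showing up as several short incidents in the review.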
What to measure
The four metrics below cover the ack/resolve loop. Track them per-service, not globally; one noisy service otherwise drags every headline.
- MTTA. Time from page to ack. Report the mean alongside the p95, and target a p95 under 5 minutes to match the escalation window. A p95 above that means the rotation is too thin or has coverage gaps.
- MTTR. Time from page to resolve. Count a refire within 30 minutes as part of the same incident so the number reflects the real duration.
- Ack-to-resolve gap. Median time engaged on the incident. Useful for spotting alerts that are easy to ack but hard to fix.
- Refire rate. Share of incidents that refire within 30 minutes of resolve. A high rate signals premature resolves and feeds back into resolve discipline.
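The four metrics can be computed per service from exported incident records. The sketch below assumes each record is a dict with hypothetical keys `service`, `paged_at`, `acked_at`, `resolved_at`, and `refired_at` (or `None`), and that every record was both acked and resolved; the percentile uses the nearest-rank method, which is one choice among several.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

REFIRE_WINDOW = timedelta(minutes=30)


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def service_metrics(incidents):
    """Return {service: metrics} with all durations in seconds."""
    by_service = defaultdict(list)
    for inc in incidents:
        by_service[inc["service"]].append(inc)

    report = {}
    for service, rows in by_service.items():
        tta = [(r["acked_at"] - r["paged_at"]).total_seconds() for r in rows]
        ttr = [(r["resolved_at"] - r["paged_at"]).total_seconds() for r in rows]
        gap = [(r["resolved_at"] - r["acked_at"]).total_seconds() for r in rows]
        refires = sum(
            1 for r in rows
            if r["refired_at"] is not None
            and r["refired_at"] - r["resolved_at"] < REFIRE_WINDOW
        )
        report[service] = {
            "mtta_s": sum(tta) / len(tta),
            "tta_p95_s": p95(tta),
            "mttr_s": sum(ttr) / len(ttr),
            "ack_to_resolve_median_s": median(gap),
            "refire_rate": refires / len(rows),
        }
    return report
```

Grouping by service before aggregating is the point: a single noisy service then shows up in its own row instead of skewing a global number.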
How to apply
The discipline is small but the rollout pays back fast. Most teams convert in two weeks once the first metric review surfaces the gap.
- Quarterly audit. Review the last quarter of incidents. Any incident with more than one resolve is flapping, and its alert needs tuning.
- Confirmation window. Add a 1-minute confirmation delay before resolve becomes final. Native support for this differs between paging tools, so confirm the mechanism in the vendor documentation before relying on it.
- On-call onboarding. Train new on-call to ack first, post in channel second, debug third. The order matters more than the speed.
- Dashboard parity. Surface MTTA, MTTR, ack-to-resolve, and refire on the same panel. Every metric review references one screen, not four.
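The quarterly audit can start from a flat export of incident log events. A minimal sketch, assuming the export yields `(incident_id, event_type)` pairs and that resolve events are labelled `"resolve"`; both the shape and the label are assumptions, not a vendor format.

```python
from collections import Counter


def flapping_incidents(events):
    """Return sorted ids of incidents that were resolved more than once."""
    resolves = Counter(
        incident_id for incident_id, event_type in events
        if event_type == "resolve"
    )
    return sorted(iid for iid, count in resolves.items() if count > 1)
```

The returned ids are the candidates for alert tuning; the list says nothing about why each one flapped, so the review still has to read the timeline.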