Incident Comms Templates
Five templates, detection, investigation, mitigation, resolution, post-mortem. Stop drafting customer-facing text mid-incident.
Detection, first 15 minutes
Acknowledge fast, commit to nothing, set the next-update time. Customers tolerate "we're investigating" far better than silence.
Status page (Investigating):
"We're investigating reports of [symptom, e.g., elevated error rates on checkout]. Some users may experience [user-facing impact]. Our team is engaged and we'll share an update by [time + 30 min]."
Internal Slack post in #incidents:
"INC-<ID> declared Sev-<N> at <HH:MM UTC>. Symptom: <one line>. IC: @<name>. Comms: @<name>. Bridge: <link>. Channel: #inc-sev2-<ID>."
Investigation, actively debugging
Once you've confirmed the impact, name it specifically. Vague is worse than specific-and-wrong; you can correct specific.
Status page (Identified, if you have a hypothesis):
"We've identified [specific component, e.g., elevated database latency in the us-east-1 region] as the source of the issue. Engineers are applying a mitigation now. Next update by [time + 30 min]."
Status page (still Investigating, no theory yet):
"Our team is still investigating the cause of [symptom]. The impact is limited to [specific surface, e.g., new user signups; existing user sessions are unaffected]. Next update by [time + 30 min]."
Mitigation, fix is rolling
Don't claim resolution at the moment of mitigation. The fix is in flight; you'll know it worked when error rates settle.
Status page (Identified → in progress):
"A mitigation is rolling out across all regions. Most users should see [symptom] subsiding within the next [N] minutes. We'll confirm full resolution once metrics return to baseline. Next update by [time + 20 min]."
Internal Slack:
"Mitigation deployed at <HH:MM UTC>, <what shipped>. Watching error rate; expecting baseline within <N> min. Will not declare resolved until 30 minutes of clean signal."
Resolution, confirmed clean
Wait the full window before posting. Premature resolution announcements are how you earn a "Identified → Resolved → Investigating" status-page sequence that looks awful.
Status page (Resolved):
"This incident has been resolved. Between [start time] and [end time], users may have experienced [impact summary]. Service has been operating normally for [N] minutes. We'll publish a detailed post-mortem within [5 business days]. Thanks for your patience."
Customer email (for Sev-1 only):
"On [date], between [start] and [end] UTC, [product/feature] was [impact]. We've identified [root-cause summary in plain language] and shipped a fix. We're following up with a public post-mortem at [link] within [N] business days. If you have questions or were materially affected, reply to this email and our team will respond personally."
Post-mortem note
Posted within 48h for Sev-1, 5 business days for Sev-2. Blameless, specific, action-oriented.
Public post-mortem header:
"Incident: [one-sentence summary]. Duration: [HH:MM]–[HH:MM] UTC, [date]. Impact: [user-facing impact, with numbers]. Root cause: [specific technical cause]. Detection: [how we noticed; time-to-detect]. Mitigation: [what we did; time-to-mitigate]. Action items: [N items, each with owner and due date]."
Internal close-out:
"INC-<ID> closed at <time>. Total duration: <HH:MM>. TTD: <mins>. TTM: <mins>. TTR: <mins>. Post-mortem doc: <link>. Action items tracked in <board>."
Comms rules
The patterns that keep status pages credible.
- Update every 30 minutes minimum, even if the update is "still investigating." Silence reads as panic
- Always commit to a next-update time. "Next update by 14:30 UTC", not "we'll update soon"
- Lead with impact, not cause. Customers care what's broken for them, not which service is unhealthy
- No engineering jargon on the status page. "Database connection pool exhausted" → "users may see slow page loads"
- One Comms Lead. Two people drafting from different windows is how you contradict yourself in public
- Never say "fixed" before the metric confirms it. "A mitigation is rolling out" → "Service is recovering" → "Resolved (after 30 min stable)"
- Don't blame third parties on the status page. Even if AWS is the cause, your customer's contract is with you. Note the dependency in the post-mortem, not the realtime update