Incident Comms Templates

Five templates, detection, investigation, mitigation, resolution, post-mortem. Stop drafting customer-facing text mid-incident.

Detection, first 15 minutes

Acknowledge fast, commit to nothing, set the next-update time. Customers tolerate "we're investigating" far better than silence.

Status page (Investigating):

"We're investigating reports of [symptom, e.g., elevated error rates on checkout]. Some users may experience [user-facing impact]. Our team is engaged and we'll share an update by [time + 30 min]."

Internal Slack post in #incidents:

"INC-<ID> declared Sev-<N> at <HH:MM UTC>. Symptom: <one line>. IC: @<name>. Comms: @<name>. Bridge: <link>. Channel: #inc-sev2-<ID>."

Investigation, actively debugging

Once you've confirmed the impact, name it specifically. Vague is worse than specific-and-wrong; you can correct specific.

Status page (Identified, if you have a hypothesis):

"We've identified [specific component, e.g., elevated database latency in the us-east-1 region] as the source of the issue. Engineers are applying a mitigation now. Next update by [time + 30 min]."

Status page (still Investigating, no theory yet):

"Our team is still investigating the cause of [symptom]. The impact is limited to [specific surface, e.g., new user signups; existing user sessions are unaffected]. Next update by [time + 30 min]."

Mitigation, fix is rolling

Don't claim resolution at the moment of mitigation. The fix is in flight; you'll know it worked when error rates settle.

Status page (Identified → in progress):

"A mitigation is rolling out across all regions. Most users should see [symptom] subsiding within the next [N] minutes. We'll confirm full resolution once metrics return to baseline. Next update by [time + 20 min]."

Internal Slack:

"Mitigation deployed at <HH:MM UTC>, <what shipped>. Watching error rate; expecting baseline within <N> min. Will not declare resolved until 30 minutes of clean signal."

Resolution, confirmed clean

Wait the full window before posting. Premature resolution announcements are how you earn a "Identified → Resolved → Investigating" status-page sequence that looks awful.

Status page (Resolved):

"This incident has been resolved. Between [start time] and [end time], users may have experienced [impact summary]. Service has been operating normally for [N] minutes. We'll publish a detailed post-mortem within [5 business days]. Thanks for your patience."

Customer email (for Sev-1 only):

"On [date], between [start] and [end] UTC, [product/feature] was [impact]. We've identified [root-cause summary in plain language] and shipped a fix. We're following up with a public post-mortem at [link] within [N] business days. If you have questions or were materially affected, reply to this email and our team will respond personally."

Post-mortem note

Posted within 48h for Sev-1, 5 business days for Sev-2. Blameless, specific, action-oriented.

Public post-mortem header:

"Incident: [one-sentence summary]. Duration: [HH:MM]–[HH:MM] UTC, [date]. Impact: [user-facing impact, with numbers]. Root cause: [specific technical cause]. Detection: [how we noticed; time-to-detect]. Mitigation: [what we did; time-to-mitigate]. Action items: [N items, each with owner and due date]."

Internal close-out:

"INC-<ID> closed at <time>. Total duration: <HH:MM>. TTD: <mins>. TTM: <mins>. TTR: <mins>. Post-mortem doc: <link>. Action items tracked in <board>."

Comms rules

The patterns that keep status pages credible.

Update every 30 minutes minimum, even if the update is "still investigating." Silence reads as panic
Always commit to a next-update time. "Next update by 14:30 UTC", not "we'll update soon"
Lead with impact, not cause. Customers care what's broken for them, not which service is unhealthy
No engineering jargon on the status page. "Database connection pool exhausted" → "users may see slow page loads"
One Comms Lead. Two people drafting from different windows is how you contradict yourself in public
Never say "fixed" before the metric confirms it. "A mitigation is rolling out" → "Service is recovering" → "Resolved (after 30 min stable)"
Don't blame third parties on the status page. Even if AWS is the cause, your customer's contract is with you. Note the dependency in the post-mortem, not the realtime update