Customer-Facing Incident Comms Templates

The hardest part of incident comms is writing them while the incident is still on fire. Pre-written templates with a few fill-in slots make the writing roughly 10x faster.

Why templates win

Writing customer comms during an active incident is a context-switch from debugging that nobody has bandwidth for. Templates with fill-in slots reduce a 10-minute writing task to 60 seconds. The 9 minutes saved is engineering time on the actual fix.

The other reason templates win: they enforce consistency across incidents and across the team. Without a template, the first incident's comm says "we are looking into it," the second's says "we're aware of issues," and the third's says "investigation underway." Customers comparing notes hear different voices and lose trust. Templated comms sound like one company.

Templates are also where the team's communication discipline lives. The ban on "sorry for the inconvenience," the rule that resolution comms must include cause, and the rule that progress comms must include a next-update timestamp all live in the template, not in someone's head. New on-callers absorb the discipline by using the template, not by reading a 20-page comms guide.

Template 1, Acknowledgement

"We are investigating reports of [SYMPTOM] affecting [SCOPE]. Customers may experience [USER-VISIBLE EFFECT]. We will provide an update within [INTERVAL]."

Three descriptive slots plus an update interval. Send within 15 minutes for SEV1, within 30 for SEV2.

Each slot has rules. SYMPTOM is the technical name customers can search for ("checkout timeout", "delayed email delivery", "missing dashboard data"). SCOPE is the affected segment in customer terms ("paying customers in EU", "all users on the web app", "users with API integrations"). USER-VISIBLE EFFECT translates the symptom into impact, phrased so that it completes "Customers may experience..." ("failed payments", "a delay before today's data appears"). The discipline is to write each slot from the customer's perspective, not from the team's.

The interval matters. "Within 30 minutes" is a promise; missing it costs trust. If the team isn't sure when the next update will be, write "within 15 minutes" rather than "as soon as we know more". The second formulation has no bound, and customers experience it as silence.
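The slot rules above lend themselves to a small check at render time. The following is a minimal Python sketch, assuming the template is kept as a standard-library string.Template; the function name render_acknowledgement and the list of unbounded phrases are hypothetical and only illustrate the idea of rejecting an empty slot or an interval with no bound.

```python
from string import Template

# Template 1 from above, with the bracketed slots renamed to $identifiers.
ACKNOWLEDGEMENT = Template(
    "We are investigating reports of $symptom affecting $scope. "
    "Customers may experience $effect. "
    "We will provide an update within $interval."
)

# Phrases that promise an update without bounding it (illustrative, not exhaustive).
UNBOUNDED_INTERVALS = ("as soon as", "when we know more", "shortly", "soon")


def render_acknowledgement(symptom: str, scope: str, effect: str, interval: str) -> str:
    """Fill the acknowledgement template, rejecting empty or unbounded slots."""
    slots = {"symptom": symptom, "scope": scope, "effect": effect, "interval": interval}
    for name, value in slots.items():
        if not value.strip():
            raise ValueError(f"slot {name!r} is empty")
    lowered = interval.lower()
    if any(phrase in lowered for phrase in UNBOUNDED_INTERVALS):
        raise ValueError("interval must be a concrete bound, e.g. '30 minutes'")
    return ACKNOWLEDGEMENT.substitute(slots)
```

Calling render_acknowledgement("checkout timeout", "paying customers in EU", "failed payments", "30 minutes") returns the finished comm, while passing "as soon as we know more" as the interval raises a ValueError before anything is posted.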

Template 2, Progress

"We have identified [LIKELY CAUSE] and are [CURRENT MITIGATION ACTION]. The estimated time to resolution is [WINDOW]. Next update at [TIME]."

Send every 30 minutes during active investigation. The WINDOW slot is critical: even "we don't know yet" is better than silence. Customers fill silence with their worst assumptions; an honest "we don't know yet" beats most fabricated comfort.

LIKELY CAUSE is the first slot people get wrong. Engineers write it as a technical statement ("we identified a regression in the v2.4 deploy that caused N+1 queries against the user-preferences table"). Customers read this and tune out. Rewrite it as user-impact ("a recent change introduced a slowdown that's making login take longer than expected"). Same fact; speaks to the audience.

CURRENT MITIGATION ACTION is what you're doing right now, not what you might do later. "Rolling back the deploy" beats "evaluating options." "Failing over to the secondary region" beats "looking at the database." Specific actions communicate that the team is acting, not deliberating.
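One way to keep the progress comm honest is to make the unknown-window case explicit and to compute the next-update time rather than type it by hand. The sketch below assumes timestamps are kept in UTC and that "not yet known" is an acceptable wording for the WINDOW slot; the function name render_progress and its parameters are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def render_progress(
    likely_cause: str,
    mitigation: str,
    window: Optional[str],
    now: datetime,
    interval_minutes: int = 30,
) -> str:
    """Fill Template 2; an unknown window is stated honestly, never omitted."""
    # Assumed wording for the honest "we don't know yet" case.
    window_text = window if window else "not yet known"
    # The next-update promise is derived from the cadence, not typed by hand.
    next_update = now + timedelta(minutes=interval_minutes)
    return (
        f"We have identified {likely_cause} and are {mitigation}. "
        f"The estimated time to resolution is {window_text}. "
        f"Next update at {next_update:%H:%M} UTC."
    )
```

For example, calling it at 14:50 UTC with window=None produces a comm ending in "The estimated time to resolution is not yet known. Next update at 15:20 UTC."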

Template 3, Resolution

"This incident has been resolved as of [TIME]. The cause was [BRIEF CAUSE]. We have [WHAT YOU DID]. A full postmortem will be published [DATE]."

Send within 30 minutes of resolution. The 30-minute deadline matters because customers experiencing the incident are watching your status page; the gap between resolution and the all-clear is when they're filing tickets and writing internal status reports themselves. Closing the loop fast saves them work.

BRIEF CAUSE is two sentences max. The full postmortem can be long; the resolution comm cannot. "An expired certificate caused authentication failures for some users. The certificate was renewed and traffic returned to normal." Anyone reading the comm walks away knowing what happened and that you understand it.

The postmortem-publication date is a commitment. Pick a realistic date, typically 5-10 business days out for substantial incidents, and meet it. Enterprise procurement teams in particular treat postmortem rigour as a buying signal; missing or delayed postmortems show up in their renewal scoring.
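Counting business days by hand during an incident is error-prone, so the [DATE] slot can be computed. This is a minimal sketch that skips weekends only; public holidays are ignored, and the function name add_business_days is an assumption made for illustration.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`, skipping weekends.

    Public holidays are not considered; a real calendar would need them.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current
```

An incident resolved on Friday 1 March 2024 with a 5-business-day commitment gives add_business_days(date(2024, 3, 1), 5), which is Friday 8 March 2024.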

Tone rules

Active voice. Past or present tense, never speculative ("we may have"). Avoid hedge words. Customers want to know that you know what's happening; even saying "we don't yet know" is more confident than "it appears that perhaps something might be."

Pronouns matter. Write "we", not "the team" or "engineering"; those distance the customer from the actor. Write "our", not "the platform's", for the same reason. The reader is interacting with your company, not your platform; the comms should reflect that.

Numbers when you have them. "Affecting roughly 12% of paying customers" reads better than "affecting some customers." Even an order-of-magnitude estimate ("less than 5%") beats vagueness. The number tells the customer you've measured the impact, which signals that you understand it.

Four phrases to never use

"Sorry for the inconvenience": it minimises. Customers experiencing a real outage hear this as dismissive. Replace with the specific cost: "we know this disrupted your morning's work" or "we know this prevented some checkouts."

"Some users": vague; specify. Replace with "users in [region]" or "users on [tier]" or "approximately N% of customers." Vague scope makes everyone wonder if they're affected; specific scope lets unaffected users get back to work.

"Brief outage": let the customer decide if it was brief. A 20-minute outage during a customer's peak business hour is not brief from their perspective, even if your engineering team experienced it as quick. Use the actual duration: "an outage lasting 47 minutes."

"Should be fixed": it either is or isn't. Hedging undermines the resolution comm's job, which is to declare the all-clear. Replace with "is resolved as of HH:MM" or "monitoring continues; we will update if anything changes."

Cadence through a long incident

First update within 15 minutes (SEV1) or 30 (SEV2). Then every 30 minutes during active investigation. Every hour during stable monitoring. Final resolution update within 30 minutes of all-clear. Skipping a scheduled update is the moment trust drops.

The cadence is also a forcing function on the team. If the incident commander (IC) owes customers an update every 30 minutes, the team is forced to consolidate progress every 30 minutes. A bridge with no comms cadence drifts into 90 minutes of unstructured debugging; a bridge tied to comms cadence stays organised.

Sometimes the cadence outruns the work and an update falls due with nothing new to report. The right move is to publish the scheduled update anyway: "We are still investigating. The current theory remains [X]. We expect to know more within 30 minutes." Customers prefer "no news" to silence; team members get a rhythm marker.
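The cadence rules in this section reduce to a small lookup table. The following Python sketch is one way to compute when each update is due; the phase labels ("investigating", "monitoring") and all function names are assumptions made for illustration, not part of any status-page tool.

```python
from datetime import datetime, timedelta

# Minutes allowed before the first update, by severity (from the cadence above).
FIRST_UPDATE_MINUTES = {"SEV1": 15, "SEV2": 30}
# Minutes between updates, keyed by assumed phase labels.
PHASE_INTERVAL_MINUTES = {"investigating": 30, "monitoring": 60}
# Minutes allowed between the all-clear and the resolution update.
RESOLUTION_MINUTES = 30


def first_update_due(severity: str, detected_at: datetime) -> datetime:
    """Deadline for the acknowledgement update."""
    return detected_at + timedelta(minutes=FIRST_UPDATE_MINUTES[severity])


def next_update_due(phase: str, last_update_at: datetime) -> datetime:
    """Deadline for the next progress update in the given phase."""
    return last_update_at + timedelta(minutes=PHASE_INTERVAL_MINUTES[phase])


def resolution_update_due(resolved_at: datetime) -> datetime:
    """Deadline for the final resolution update."""
    return resolved_at + timedelta(minutes=RESOLUTION_MINUTES)


def is_overdue(due: datetime, now: datetime) -> bool:
    """True once `now` is past the deadline."""
    return now > due
```

A bot in the incident channel could call is_overdue on a timer and nudge the incident commander, which gives the team the rhythm marker even when there is no news to report.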

What to do this week

Three moves. (1) Stand up a templates document in your incident-channel topic: three templates, fill-in slots highlighted. The on-call should be able to copy, fill the slots, and post in 60 seconds. (2) Audit your last 5 incident comms for the four banned phrases ("sorry for the inconvenience", "some users", "brief", "should"). Most teams find at least 2-3 hits; that's tomorrow's training material. (3) Set up a status-page integration that auto-creates the acknowledgement comm from the templated form. That removes the "I forgot to write the comms" failure mode entirely.
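The audit in step (2) can be partly automated. Below is a minimal sketch of a phrase linter using Python's standard re module; the patterns are one reading of the banned list and the advice strings paraphrase the replacements suggested earlier. The names BANNED and lint_comm are hypothetical, and a human should still review each hit since a word like "brief" can appear in legitimate contexts.

```python
import re

# Banned phrases paired with the replacement advice from the tone rules.
BANNED = {
    r"sorry for (the|any) inconvenience": "name the specific cost to the customer",
    r"\bsome users\b": "specify region, tier, or approximate percentage",
    r"\bbrief\b": "state the actual duration",
    r"\bshould be fixed\b": "declare it resolved, or say monitoring continues",
}


def lint_comm(text: str) -> list[tuple[str, str]]:
    """Return (matched phrase, advice) pairs for each banned phrase in `text`."""
    hits = []
    for pattern, advice in BANNED.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.group(0), advice))
    return hits
```

Running lint_comm over each of the last five incident comms and collecting the hits gives a concrete list to bring to the training session. The same function could run as a pre-post check on new drafts.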