Scheduled Incident Comms
Templated five-step status updates that fire during long-running incidents, so a customer comms manager isn't hand-updating your status page at 3am while engineers fix the actual outage.
Why scheduling exists
The pattern we kept seeing during long incidents: engineers go heads-down on the fix, customer comms goes heads-down on the status page, and somewhere around the 90-minute mark the status page goes silent because the comms manager is in the engineering bridge instead of writing an update. Customers notice. The "no news for 47 minutes" pattern is the one that breaks customer trust during a recoverable outage.
Scheduled incident comms solves it by pre-committing to a cadence. When an incident is declared, Nova schedules five updates against the incident timeline: first acknowledgement, investigation, mitigation in progress, resolved, and post-incident summary. Each one is a templated draft auto-populated with the latest known state from the incident channel and the agent ledger. Comms reviews and ships, or skips and writes their own.
The five-step template
The cadence is opinionated because the alternative, letting every team author their own, produces inconsistency that's worse than over-prescription. The five steps:

1. Acknowledged: sent within 5 minutes of declaration.
2. Investigating: sent at the 20-minute mark with what we know and don't know.
3. Mitigating: sent when an action plan is in motion.
4. Resolved: sent when monitoring confirms recovery.
5. Summary: sent within 24 hours with a one-paragraph post-mortem teaser.
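For concreteness, here's roughly how that cadence could be represented as data. This is a minimal sketch, not Nova's actual schema; `CadenceStep` and its field names are illustrative:

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

@dataclass(frozen=True)
class CadenceStep:
    name: str
    offset: Optional[timedelta]               # timer from declaration; None = event-driven
    trigger: Optional[str] = None             # state change that fires an event-driven step
    repeat_every: Optional[timedelta] = None  # re-fires at this interval while still open

DEFAULT_CADENCE = [
    CadenceStep("acknowledged",  timedelta(minutes=5)),
    CadenceStep("investigating", timedelta(minutes=20)),
    CadenceStep("mitigating",    timedelta(minutes=45), repeat_every=timedelta(minutes=45)),
    CadenceStep("resolved",      None, trigger="resolved"),
    CadenceStep("summary",       timedelta(hours=24)),
]
```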
Each step has a default body Nova generates from the incident metadata: services affected, customer impact estimate, current ETA, and any actions the agents have taken. The defaults are conservative and reviewable. We don't auto-publish without sign-off; the schedule fires a draft, the draft sits in the comms inbox, and a human approves before it goes out.
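A sketch of that draft lifecycle, assuming the incident object exposes `services_affected`, `impact_estimate`, `current_eta`, and `agent_actions`, with `render` standing in for whatever template engine is in play; the names are hypothetical throughout:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class Draft:
    step: str
    body: str
    created_at: datetime
    status: str = "pending_review"   # never "published" without a human click

def build_draft(step_name, incident, render):
    """Render one step's draft from the latest known incident state."""
    body = render(step_name, {
        "services": incident.services_affected,
        "customer_impact": incident.impact_estimate,
        "eta": incident.current_eta,
        "agent_actions": incident.agent_actions,
    })
    # The schedule only ever produces a draft; it lands in the comms
    # inbox, and publishing is a separate human-initiated action.
    return Draft(step=step_name, body=body,
                 created_at=datetime.now(timezone.utc))
```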
Authoring the cadence
Most teams use the default cadence verbatim. Teams that deviate fall into two camps: heavily regulated services (finance, healthcare) want more steps with stricter language and longer review windows, while internal-only services want fewer steps, looser language, and no public status-page push. Both are configurable via a single template editor.
The template language has three primitives: variables ({{eta}}, {{services}}, {{agent_actions}}), conditionals ({% if customer_impact %}), and includes (so you can pull in a standard "what we're doing" block). Most edits are five-minute customisations, not from-scratch authoring.
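The snippets above read as Jinja-style syntax, so here's a minimal sketch using the jinja2 library to exercise all three primitives; the template names, bodies, and variable values are made up for illustration:

```python
from jinja2 import DictLoader, Environment

# Hypothetical templates showing a variable, a conditional, and an
# include; real ones live in the template editor.
templates = {
    "what_were_doing": "Our team is actively working on restoring service.",
    "investigating": (
        "We are investigating elevated errors affecting {{ services }}. "
        "{% if customer_impact %}Some customers may see failed requests. {% endif %}"
        "{% include 'what_were_doing' %} Recent actions: {{ agent_actions }}."
    ),
}

env = Environment(loader=DictLoader(templates))
print(env.get_template("investigating").render(
    services="checkout",
    customer_impact=True,
    agent_actions="standby promoted; replica lag being monitored",
))
```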
The schedule itself is configurable too. The default cadence is roughly 5min / 20min / 45min / on-resolution / 24h, with the third update repeating every 45 minutes if the incident is still open. Teams that want hourly customer updates during long incidents bump the repeat interval; teams that want only the start and end keep just steps 1 and 4.
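Expanded for one concrete incident, the default timing might look like the sketch below. This is a hypothetical helper, not Nova's scheduler API; the repeat loop is the part that keeps the third update firing while the incident stays open:

```python
from datetime import datetime, timedelta

def expand_cadence(declared_at, resolved_at,
                   repeat_offset=timedelta(minutes=45),
                   repeat_every=timedelta(minutes=45)):
    """Concrete fire times for one incident under the default cadence."""
    fires = [
        ("acknowledged",  declared_at + timedelta(minutes=5)),
        ("investigating", declared_at + timedelta(minutes=20)),
        ("summary",       declared_at + timedelta(hours=24)),  # deadline; usually sent earlier
    ]
    t = declared_at + repeat_offset
    while t < resolved_at:                    # incident still open: keep updating
        fires.append(("mitigating", t))
        t += repeat_every
    fires.append(("resolved", resolved_at))   # event-driven, not on a timer
    return sorted(fires, key=lambda f: f[1])

declared = datetime(2025, 1, 1, 2, 15)
resolved = datetime(2025, 1, 1, 3, 48)
for name, at in expand_cadence(declared, resolved):
    print(f"{at:%m-%d %H:%M}  {name}")
```

For a 02:15-to-03:48 incident this yields mitigating updates at 03:00 and 03:45, matching the every-45-minutes repeat.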
Human-in-the-loop
None of this is autopilot. The comms drafts go to a review queue tagged with severity and ETA-to-publish; the on-call comms person sees them in a single-page inbox with diff view against the previous update. One click to publish, one click to edit, one click to skip with a reason logged.
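In code, those three actions are small. A sketch of the inbox semantics (it works against the `Draft` sketch above, or any object with `step`, `body`, and `status`), with `log` standing in for the audit trail:

```python
from datetime import datetime, timezone

def review(draft, action, editor=None, reason=None, log=print):
    """One click per outcome; skips always record a reason."""
    if action == "publish":
        draft.status = "published"
    elif action == "edit":
        draft.body = editor(draft.body)   # human edits the body, then it ships
        draft.status = "published"
    elif action == "skip":
        assert reason, "skips require a logged reason"
        draft.status = "skipped"
    log(f"{datetime.now(timezone.utc):%H:%M} {draft.step} -> {draft.status}"
        + (f" ({reason})" if reason else ""))
```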
The review pattern matters because LLM-generated comms are good 90% of the time and dangerously wrong the other 10%. The wrong cases are usually overconfidence: claiming a fix is in place when one agent is still trying things, or stating an ETA the data doesn't support. Human review catches these in the 30 seconds it takes to read a draft; without review, the same wrong claim ships and erodes customer trust.
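Those two failure modes are cheap to lint for before a human ever sees the draft. A heuristic sketch, warn-only by design; the phrase list and flags are illustrative:

```python
OVERCONFIDENT = ("fix is in place", "fully resolved", "root cause identified")

def flag_overconfidence(body, open_agent_actions, eta_supported):
    """Warn-only lint for the two failure modes; a hint for the reviewer, not a gate."""
    warnings = []
    lowered = body.lower()
    if open_agent_actions and any(p in lowered for p in OVERCONFIDENT):
        warnings.append("claims a fix while agent actions are still open")
    if "eta" in lowered and not eta_supported:
        warnings.append("states an ETA the incident data doesn't support")
    return warnings
```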
A real example
Here's how it played out during a recent beta-tenant incident. Database failover at 02:14 UTC; declared at 02:15. The step-1 acknowledgement went out at 02:18 ("we're aware of elevated error rates affecting checkout"). Engineers worked the bridge, agents ran diagnostics, and the incident status flipped to "investigating" automatically. The step-2 investigating update went out at 02:35 with the agent's correlation summary ("primary database failover; standby promoted; replica lag elevated").
The step-3 mitigation update went out at 03:14; the on-call comms manager accepted the draft as-is. Resolution came at 03:48; step 4 published at 03:52 after a one-line manual edit. The 24-hour summary went out at 02:00 the next day with the post-mortem link. Total comms-manager time across the whole incident: under 10 minutes, almost all of it 30-second approvals rather than 3am writing.
That's the specific outcome we're aiming for. The status page reflects reality, customers see updates on a predictable cadence, and the comms team is well-rested enough to do good work the next day. Long-running incident response is a marathon; pacing the comms is a load-bearing part of finishing it.