How to Write a Postmortem

Most postmortems take a week because nobody starts. A timer, a template, and four prompts will get you a publishable doc within a day of the incident closing.

Start within 24 hours

The single biggest determinant of a useful postmortem is when you start writing. Within 24 hours of resolution, the timeline is fresh, the Slack thread is intact, and the engineers who fought the fire still remember what they actually saw on screen. Wait a week and half of it is reconstructed from memory.

This window is the whole game. Write the postmortem from a week-old memory and you’ll spend about 45 minutes reconstructing the timeline alone, a job that takes about 5 minutes while the thread is fresh. Same incident, same content; only the cost changed.

Schedule a slot the day after the incident. Block it on your calendar. Open the template. Set a timer.

The template

Six sections, in order. Don’t move them around; they’re in this order because each one builds on the last:

Summary, 3 sentences. What broke, who was affected, how long.
Timeline, bullet list with timestamps. Copy from Slack.
What happened, technical narrative, 2-4 paragraphs. Include the specific config value, the specific commit, the specific exception. Vague postmortems are useless postmortems.
What we did, the steps the responders took, in order, with what worked and what didn’t.
Action items, 3-7 items, each with an owner and a date. Fewer is better; ones that ship are better than ones that don’t.
Lessons, 2-4 bullet points. The things future-you should remember.

That’s the whole template. No background section, no “impact analysis”, no executive summary. The summary is the executive summary. Add structure only when an actual reader complains it’s missing.
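One way to make "open the template" a one-second step is to generate the empty doc from a script. The sketch below is a minimal illustration, not a prescribed tool: the function name `render_skeleton`, the title line format, and the bracketed hint style are all assumptions; only the six section names and their order come from the template above.

```python
from datetime import date

# The six sections, in the order given by the template.
# The hint text paraphrases the guidance for each section.
SECTIONS = [
    ("Summary", "3 sentences: what broke, who was affected, how long"),
    ("Timeline", "bullet list with timestamps, copied from Slack"),
    ("What happened", "2-4 paragraphs naming the config value, commit, exception"),
    ("What we did", "responder steps in order, what worked and what didn't"),
    ("Action items", "3-7 items, each with an owner and a date"),
    ("Lessons", "2-4 bullet points"),
]


def render_skeleton(title: str, incident_date: date) -> str:
    """Return an empty plain-text postmortem with the sections in template order."""
    lines = [f"Postmortem: {title} ({incident_date.isoformat()})", ""]
    for name, hint in SECTIONS:
        # Heading, blank line, bracketed hint to overwrite, blank line.
        lines += [name, "", f"[{hint}]", ""]
    return "\n".join(lines)
```

Calling `render_skeleton("Checkout 503s", date(2024, 1, 15))` (a made-up incident) returns text you can paste into a new doc; the bracketed hints are meant to be overwritten, not kept.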

The four prompts

If you stare at a blank doc, you’ll never start. These four prompts get the content out of your head in about 8 minutes total:

What were customers seeing at the worst minute? Three concrete things, not abstractions. “Checkouts returned 503”, not “degraded service”.
What single thing, if it had been different, would have made this not happen? This forces you past the surface symptom and usually turns up a config, a deploy, or a missing alarm.
What did we do during the incident that didn’t help? The honest version of this is the most valuable section of the postmortem. Most teams skip it; that’s why they keep doing the wrong thing in the next incident.
What do we want to be different in three months? This becomes the action items list. Concrete, datable, ownable.

Answer each prompt in 2-3 sentences. That’s your raw material. The structure of the doc just rearranges these answers.

Slack thread to draft

The incident channel is your timeline. Don’t reconstruct it; copy it. Open the channel, scroll to the first “something is wrong” message, copy through to “all clear”. Paste into the timeline section as raw text first, then trim.

Cut anything that isn’t a state change or a decision. “Looking now” gets cut. “Found the bad commit at abc1234” stays. “Rolling back” stays. “Did anyone ever figure out what that other thing was?” gets cut. Aim for 8-15 timeline entries for a one-hour incident.
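If the pasted thread is long, a rough first pass can be automated before you trim by hand. The sketch below assumes the messages have already been exported as (timestamp, text) pairs; the keyword list is a hypothetical heuristic for "state change or decision", and the timestamps in the sample are invented. Expect to adjust the patterns to how your own team talks during incidents.

```python
import re

# Hypothetical heuristic: phrases that tend to mark a state change or a decision.
KEEP_PATTERNS = [
    r"\bfound\b",
    r"\brolling back\b",
    r"\brolled back\b",
    r"\breverted\b",
    r"\bdeployed\b",
    r"\ball clear\b",
    r"\b5\d\d\b",  # HTTP 5xx status codes such as 503
]
KEEP_RE = re.compile("|".join(KEEP_PATTERNS), re.IGNORECASE)


def trim_timeline(messages: list[tuple[str, str]]) -> list[str]:
    """Keep only messages whose text matches one of the keep patterns."""
    return [f"{ts} {text}" for ts, text in messages if KEEP_RE.search(text)]


# Invented sample thread using the example messages from the text above.
SAMPLE = [
    ("14:02", "Checkouts returning 503"),
    ("14:03", "Looking now"),
    ("14:10", "Found the bad commit at abc1234"),
    ("14:12", "Rolling back"),
    ("14:20", "Did anyone ever figure out what that other thing was?"),
    ("14:25", "All clear"),
]

for entry in trim_timeline(SAMPLE):
    print(entry)
```

On the sample, “Looking now” and the stray question are dropped and the other four entries survive. A keyword filter will miss things, so treat its output as a starting point and still read the raw thread once.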

This is the part that takes 5 minutes if you do it the day after and 45 minutes if you do it a week later.

The 5-minute peer review

Before publishing, hand the draft to one other engineer who was on the bridge. Ask them three questions: is the technical detail right, is the timeline right, and is anyone named in a way that feels like blame? Five minutes is enough; if they need longer, the postmortem isn’t ready.

Don’t pass it through three rounds of comms-team review before publishing. The audience for an internal postmortem is engineers who weren’t there; comms-style language hides the technical content they need to learn from. If you have customer-facing comms to do separately, that’s a separate document with a different audience.

Publishing

Publish wherever your team already reads: a team wiki, a shared doc folder, an internal blog. Drop a link in the team channel and the engineering-wide channel. The point is that other people learn from your incident; if it’s buried in a personal Google Doc, it might as well not exist.

If you can publish externally (status-page-level summary, customer-facing blog), do it within a week. Public postmortems tend to compound trust over time, which can show up as shorter sales cycles and fewer customer complaints. Not every incident merits one. The big ones do.

Last move: schedule the action items. Each one needs an owner and a date in the team’s tracker before the postmortem is “done”. Action items that don’t make it to the tracker don’t exist. Before you close the calendar block you set aside for writing, also schedule a 30-day review of the action items; the half-life of unscheduled action items is about 11 days.
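The "owner and date" rule is easy to check mechanically before you call the doc done. The sketch below is one possible shape for that check, not an integration with any real tracker: the `ActionItem` class, the `unscheduled` and `review_date` helpers, and the sample items are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str] = None  # who is responsible
    due: Optional[date] = None   # when it should ship


def unscheduled(items: list[ActionItem]) -> list[str]:
    """Return the titles of items that are missing an owner or a date."""
    return [item.title for item in items if not item.owner or item.due is None]


def review_date(published: date) -> date:
    """Return the date of the 30-day action-item review."""
    return published + timedelta(days=30)


# Invented sample data: one complete item, one with no owner or date.
items = [
    ActionItem("Add alarm on checkout 5xx rate", owner="dana", due=date(2024, 2, 1)),
    ActionItem("Document rollback steps"),
]

print(unscheduled(items))
print(review_date(date(2024, 1, 16)))
```

If `unscheduled` returns anything, the postmortem is not done yet. The value from `review_date` is what goes on the calendar.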