Multi-Team Postmortem Coordination
An incident that crossed four teams will produce four postmortems unless someone owns the meta-doc. The owner-of-record pattern stops the fragmentation and gets you one document the whole org can learn from.
The four-postmortem problem
Real outages cross teams. A checkout outage involves the checkout team, the payments team, the platform team, and maybe the database team. Each team writes their own postmortem. Each one says “the upstream system was unhealthy” or “the downstream service rejected our requests”. Nobody describes the actual incident; everyone describes their narrow slice.
The reader of all four documents has to assemble the truth themselves. They don’t. The org learns nothing systemic; each team learns its narrow lesson; the next cross-team incident has the same shape.
The fix is simple in concept and politically tricky in practice: one postmortem per incident, with one owner, even if four teams contributed. The trick is making this work without trampling on each team’s ownership of their own systems.
Owner-of-record
For each cross-team incident, name an owner-of-record (OoR). The OoR is one specific person from whichever team had the most engineers on the bridge during the incident. They own the meta-postmortem document. Their job:
- Write the unified timeline (pulling from each team’s incident channel).
- Coordinate the technical narrative section across teams.
- Aggregate the action items into one prioritised list.
- Schedule and run the joint review meeting.
The OoR is not the technical lead; they’re the editor. They don’t need to understand every team’s system in depth. They need to ask each team “what happened on your side and why” and write down the answer.
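The selection rule and the unified timeline can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `TimelineEvent` type, the roster format, the names, and the timestamps are all made up for the example, and it assumes every team logs events in UTC. The text gives no tie-break when two teams had the same number of engineers on the bridge, so the sketch returns every tied team and leaves the choice to a human.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime  # timezone-aware; assumes every team logs in UTC
    team: str
    note: str


def candidate_oor_teams(bridge_roster):
    """Return the team(s) with the most engineers on the bridge.

    bridge_roster maps engineer name -> team name (hypothetical format).
    A tie returns several teams; picking among them is a human decision.
    """
    counts = Counter(bridge_roster.values())
    if not counts:
        return []
    top = max(counts.values())
    return sorted(team for team, n in counts.items() if n == top)


def unified_timeline(per_team_events):
    """Flatten each team's events into one chronological list."""
    all_events = [e for events in per_team_events.values() for e in events]
    # sorted() is stable, so events with equal timestamps keep input order.
    return sorted(all_events, key=lambda e: e.at)


def utc(hour, minute):
    # Made-up incident date, used only for the example data below.
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


roster = {"ana": "payments", "bo": "payments", "cy": "checkout", "di": "platform"}
events = {
    "checkout": [TimelineEvent(utc(10, 4), "checkout", "error rate alert fires")],
    "payments": [
        TimelineEvent(utc(10, 9), "payments", "rollback started"),
        TimelineEvent(utc(10, 1), "payments", "deploy finished"),
    ],
}

oor_teams = candidate_oor_teams(roster)
timeline = unified_timeline(events)
for event in timeline:
    print(event.at.isoformat(), event.team, event.note)
```

The merged list is only a starting point for the OoR. Gaps and contradictions between the per-team channels still need a human to ask "what happened on your side and why".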
The OoR role is a real workload, usually 6-10 hours over a week for a SEV-1 incident. The team they’re on needs to acknowledge that and not penalise them for the lost development time. The first time you do this it feels like overhead; by the third cross-team incident, the team realises one good shared postmortem is worth ten team-specific ones.
Scoping the meta-doc
The meta-doc replaces the team-specific postmortems for the cross-team aspects. It does not replace each team’s internal docs about its own systems. The split:
- Meta-doc: timeline across all teams; technical narrative of how the incident propagated; cross-team action items; lessons that affect more than one team.
- Team-internal: deep dive on the team’s specific system; team-specific action items; technical context only that team needs.
Most teams find that 70% of their content goes into the meta-doc and 30% stays internal. That’s the right ratio. If a team produces a 12-page internal postmortem alongside the meta-doc, they’re duplicating effort and probably contradicting themselves.
Rule of thumb: anything another team needs to know to avoid the same incident goes in the meta-doc. Anything that’s purely internal to the team’s implementation goes in the team doc.
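As a rough illustration of that rule of thumb, the routing decision reduces to one question per finding. The `Finding` type, its field names, and the sample findings are hypothetical. The share is counted per finding rather than per page, which is an assumption about how a team might check itself against the 70/30 split.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    team: str
    summary: str
    # Would another team need this to avoid the same incident?
    needed_by_other_teams: bool


def destination(finding):
    """Apply the rule of thumb: shared lessons go in the meta-doc."""
    if finding.needed_by_other_teams:
        return "meta-doc"
    return f"{finding.team} internal doc"


def meta_doc_share(findings):
    """Fraction of findings routed to the meta-doc (0.0 for an empty list)."""
    if not findings:
        return 0.0
    shared = sum(1 for f in findings if f.needed_by_other_teams)
    return shared / len(findings)


# Made-up findings for one team, for illustration only.
findings = [
    Finding("payments", "retries amplified load on the database", True),
    Finding("payments", "timeout budget differs from the caller's", True),
    Finding("payments", "internal cache key layout was confusing", False),
]

for f in findings:
    print(destination(f), "<-", f.summary)
print(f"share in meta-doc: {meta_doc_share(findings):.0%}")
```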
Joint review meeting
One joint review meeting per cross-team incident, ~60 minutes, scheduled within 5 business days of the incident closing. Attendance:
- The OoR (running the meeting).
- One representative per affected team. This should be the engineer who was on the bridge, not their manager.
- One observer from each team that wasn’t directly involved but is in a position to learn from the incident.
The meeting structure: 10 minutes for the timeline walkthrough, 30 minutes in total for the per-team narratives (each team presents its part, 5-7 minutes each), and 20 minutes for action items.
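The scheduling arithmetic is easy to check with a short sketch. The function names and the example team list are assumptions made for illustration. The deadline helper counts Monday to Friday only and ignores public holidays, which a real calendar would need to handle.

```python
from datetime import date, timedelta


def review_deadline(closed_on, business_days=5):
    """Latest date for the joint review, counting Monday-Friday only.

    Assumption: public holidays are ignored.
    """
    day = closed_on
    remaining = business_days
    while remaining > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return day


def build_agenda(teams, timeline_min=10, narrative_min=30, actions_min=20):
    """Split the narrative block evenly across the presenting teams."""
    if not teams:
        raise ValueError("at least one team must present")
    slot = narrative_min / len(teams)
    agenda = [("timeline walkthrough", timeline_min)]
    agenda += [(f"{team} narrative", slot) for team in teams]
    agenda.append(("action items", actions_min))
    return agenda


# Hypothetical incident closed on Friday 2024-03-01 with five teams involved.
deadline = review_deadline(date(2024, 3, 1))
agenda = build_agenda(["checkout", "payments", "platform", "database", "search"])

print("hold the review by", deadline.isoformat())
for item, minutes in agenda:
    print(f"{minutes:>5.1f} min  {item}")
```

With five teams each narrative slot is 6 minutes. Four teams get 7.5 minutes each, and more than six teams pushes the slot below 5 minutes, at which point the narrative block or the attendee list needs adjusting.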
The single most important rule: each team presents their own part. The OoR does not present another team’s narrative. The team representative speaks for their team’s actions. This is what keeps the postmortem honest; nobody has to defend a story written about them by an outsider.
Cross-team action items
Cross-team action items are the hardest. Three patterns we’ve seen work:
- Boundary-contract items. The fix is a written contract between two teams about what one will do and what the other can rely on. Specific. Tested. Signed (figuratively) by the leads of both teams.
- Shared-runbook items. The fix is a runbook that lives at the boundary and that either team’s on-call can run. Both teams maintain it. Quarterly review.
- Eliminate-the-coupling items. The fix is to make the incident class impossible by removing the coupling that caused it. These are the biggest items but produce the best long-term outcomes. Track them with longer SLAs (90+ days) but don’t let them die.
What doesn’t work: action items that say “teams should communicate better”. That’s a wish, not a fix. Specific contracts, specific runbooks, specific architectural changes.
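One way to enforce the three patterns is to make the tracker require a kind for every cross-team item. The sketch below is a hypothetical data model, not an existing tool: the `ActionItem` fields, the example titles, and the 30-day deadlines are invented. Treating a coupling-removal item with a deadline under 90 days as a problem is one interpretation of the "90+ days" guidance.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    BOUNDARY_CONTRACT = "boundary-contract"
    SHARED_RUNBOOK = "shared-runbook"
    ELIMINATE_COUPLING = "eliminate-the-coupling"


@dataclass(frozen=True)
class ActionItem:
    title: str
    kind: Kind
    teams: tuple   # teams that sign the contract or maintain the runbook
    sla_days: int  # deadline in days, chosen by the reviewers


def problems(item):
    """Return a list of reasons the item is not yet well-formed."""
    issues = []
    if item.kind in (Kind.BOUNDARY_CONTRACT, Kind.SHARED_RUNBOOK):
        # Contracts and shared runbooks only make sense with two sides.
        if len(set(item.teams)) < 2:
            issues.append("needs at least two teams")
    if item.kind is Kind.ELIMINATE_COUPLING and item.sla_days < 90:
        issues.append("coupling removals are tracked with 90+ day SLAs")
    return issues


# Made-up items for illustration.
good = ActionItem(
    "Document what checkout may assume when payments is saturated",
    Kind.BOUNDARY_CONTRACT,
    ("checkout", "payments"),
    30,
)
bad = ActionItem(
    "Runbook for draining the payment queue",
    Kind.SHARED_RUNBOOK,
    ("payments",),
    30,
)

for item in (good, bad):
    print(item.title, "->", problems(item) or "ok")
```

Because every item must declare a `Kind`, an entry like “teams should communicate better” has nowhere to go until someone turns it into a contract, a runbook, or an architectural change.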
When external vendors are involved
Vendor-side outages add a complication: the vendor writes their own postmortem on their own timeline, usually 1-3 weeks after the incident. The internal meta-doc shouldn’t wait for it.
The pattern: write the internal meta-doc within a week, with the vendor section marked “pending vendor RCA”. Action items for the vendor side don’t go in the team trackers (they’re not your team’s work) but go in a vendor-management tracker that the partnerships or platform team owns.
When the vendor RCA arrives, append it to the meta-doc as an addendum. Don’t rewrite the doc; the original was correct based on what was known at the time, and the vendor’s view should be readable as a separate perspective.
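The append-only handling of the vendor RCA, and the routing of vendor-side action items, can be sketched as follows. The `MetaDoc` type, the tracker names as strings, and the sample text are hypothetical. The point of the sketch is only that attaching the RCA never touches the original body.

```python
from dataclasses import dataclass, field


@dataclass
class MetaDoc:
    body: str
    vendor_section: str = "pending vendor RCA"
    addenda: list = field(default_factory=list)

    def attach_vendor_rca(self, rca_text):
        """Append the vendor's RCA as an addendum; the body is left as written."""
        self.addenda.append(rca_text)


def tracker_for(side):
    """Route an action item by who has to do the work."""
    if side == "vendor":
        return "vendor-management tracker"
    return "team tracker"


# Made-up content for illustration.
doc = MetaDoc(body="Internal narrative written within a week of the incident.")
original_body = doc.body

doc.attach_vendor_rca("Vendor RCA, received later: root cause on the vendor side.")

print(doc.body)
print(f"addenda: {len(doc.addenda)}")
print(tracker_for("vendor"), "/", tracker_for("internal"))
```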
The lesson, written into the platform team’s onboarding doc: “Cross-team incidents are not four teams’ problems. They’re one organisation’s problem with four perspectives. Write one document.”