Coordinating an Incident Across Five Teams
Single-team incidents are hard. Multi-team incidents are a different game: coordination overhead, not the technical work, becomes the bottleneck. These are the patterns that scale.
A different shape of problem
A single-team incident is mostly technical: find the bug, fix it. A five-team incident is mostly coordination: who owns what, who's blocked on whom, how decisions get made when each team has its own context. Engineering muscle does not scale linearly here.
The structural difference. Five teams can do five times the parallel investigation if coordinated. Without coordination, they still do five times the work, but with overlap, missed signals, and conflicting decisions; net throughput is often LOWER than one team would have managed. The bottleneck shifts from "engineering capacity" to "coordination capacity."
The instinctive response, "we need more engineers!", usually makes the problem worse. Adding a sixth team without solving the coordination problem multiplies the chaos. The right response is to invest in coordination first, then expand if the coordination model has slack.
Single IC, no exceptions
One Incident Commander across all teams. Not one per team. The IC is the single point of decision-making and the single point of customer comms. Without the singular IC, every team makes locally optimal decisions that conflict globally.
The reason. Each team has its own view of the system; each team's "obvious right move" reflects only its local context. Without a coordinating IC, Team A rolls back a deploy that Team B was depending on for a workaround. Team C activates its DR procedure not knowing Team D was about to do the same. Each individual decision is reasonable; the aggregate is incoherent.
The IC selection. The IC is whoever's on the IC rotation that week, or, if a rotation isn't established, whoever the page reaches first. The IC does NOT need to know the most about any single affected system. They need to be good at coordination. Domain expertise lives at the per-team driver level, not at the IC level.
Per-team driver
Each team has one named "driver" who reports up to the IC. The driver speaks for the team. Their job is to translate the team's investigation into a single sentence the IC can use to coordinate.
The driver protects the team's focus. Without a driver, every engineer on every team is trying to participate in the IC-level conversation; with a driver, only one person per team interfaces with the IC, and the rest of the team focuses on technical work. Five teams produce five conversations with the IC, not 25.
The driver's job is mostly translation. The team's investigation produces detailed technical state; the driver compresses it to "we know X, we don't know Y, we are trying Z" for the IC. The compression preserves the actionable information and drops the technical detail the IC doesn't need.
Cross-team status interval
Every 10 minutes, each driver gives a 30-second status to the IC. Format: "we know X, we don't know Y, we are trying Z." The IC consolidates and publishes one cross-team status to customers and stakeholders.
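A minimal sketch of what the driver status and the IC's consolidation step could look like if you tooled it. The DriverStatus fields and the consolidate() helper are illustrative assumptions, not a prescribed format.

```python
from dataclasses import dataclass

@dataclass
class DriverStatus:
    team: str
    know: str        # "we know X"
    dont_know: str   # "we don't know Y"
    trying: str      # "we are trying Z"

def consolidate(statuses: list[DriverStatus]) -> str:
    """Fold the per-driver updates into the single cross-team status the IC publishes."""
    lines = [
        f"[{s.team}] know: {s.know} | don't know: {s.dont_know} | trying: {s.trying}"
        for s in statuses
    ]
    return "Cross-team status:\n" + "\n".join(lines)

# One cycle with two of the five drivers reporting (details are made up for illustration)
print(consolidate([
    DriverStatus("payments", "checkout errors started 14:02",
                 "whether the cause is cache or DB", "rolling back the 14:00 deploy"),
    DriverStatus("identity", "login latency is normal",
                 "why token refresh spiked", "adding capacity to the refresh pool"),
]))
```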
The 10-minute cadence is deliberately faster than the cadence a single-team incident usually runs on. The reason: with five teams, the IC needs to consolidate every 10 minutes to prevent decision divergence. A 30-minute cadence with five teams produces an IC who is hopelessly behind on what each team knows.
The 30-second-per-driver constraint matters. Five teams × 30 seconds = 2.5 minutes of status; the IC has 7.5 minutes per cycle to synthesise, decide, and communicate. Drivers who exceed 30 seconds eat into the IC's processing time; the IC's quality of decision degrades. The constraint forces drivers to compress.
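The arithmetic, spelled out; these numbers simply mirror the text above and aren't a real tool:

```python
CYCLE_MINUTES = 10
TEAMS = 5
SECONDS_PER_DRIVER = 30

status_minutes = TEAMS * SECONDS_PER_DRIVER / 60   # 5 x 30s = 2.5 minutes of driver status
ic_minutes = CYCLE_MINUTES - status_minutes        # 7.5 minutes to synthesise, decide, communicate

print(f"status: {status_minutes} min, IC budget per cycle: {ic_minutes} min")
```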
Don't run parallel bridges
The temptation is for each team to run their own bridge "for efficiency." It feels efficient and is the failure mode. Decisions get made on team bridges that contradict each other. Customer comms diverge. Two teams roll back the same change at the same time. One bridge, one source of truth.
The appeal of parallel bridges is real: each team gets to discuss in its own context without interrupting other teams. But the cost is decision divergence, which produces incidents-within-the-incident, and that cost is greater than the context-switching cost.
The hybrid that works. Drivers join the central bridge; their teams continue to discuss internally on their team's normal channels. The driver brings the team's questions to the central bridge and brings the central decisions back to the team. The driver is the bridge between the central coordination and the local context.
Cross-team customer comms
Customer comms during a multi-team incident are particularly fragile. Each team is tempted to publish their own comm because they have specific context. The result: customers receive 3-5 conflicting updates, each technically correct but confusing in aggregate.
The rule. Only the central IC publishes customer comms. Drivers feed the IC with team-specific context; the IC synthesises into a single update. The single update may be longer than usual to cover multiple affected systems; that's fine. One voice is what customers can parse.
The exception is internal status: engineering team channels can carry multiple updates from multiple drivers. External (customer-facing) comms remain singular. The distinction prevents customer-facing chaos while letting the engineering teams communicate at full bandwidth internally.
When to declare a multi-team incident
When two or more teams are paged for the same incident, escalate to the multi-team format immediately. Don't wait. The first 30 minutes of disorganised parallel work costs an hour of wall-clock time downstream.
The signals that confirm multi-team scope. Customer impact spans multiple products or services. The cause is unknown and could plausibly be in any of several systems. A team's investigation reveals a dependency on another team's recent change. Each is a signal that the team-singular response will produce divergence; spin up the multi-team format immediately.
The cost of escalating early is small (one extra Slack message, possibly one extra IC). The cost of escalating late is large (30+ minutes of disorganised work, conflicting customer comms, worse postmortems because the timeline is messy). Escalate early; downgrade if the situation turns out to be single-team after all.
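If it helps to make the escalation signals above explicit, here is one way to reify them as a checklist; the signal names are illustrative and should be adapted to your own runbook:

```python
# Escalation signals from this section, reified as a checklist.
ESCALATION_SIGNALS = {
    "multiple_teams_paged",      # two or more teams paged for the same incident
    "multi_product_impact",      # customer impact spans multiple products or services
    "cause_plausibly_anywhere",  # cause unknown, could be in any of several systems
    "cross_team_dependency",     # investigation reveals a dependency on another team's recent change
}

def should_go_multi_team(observed: set[str]) -> bool:
    """Any one signal is enough: escalate early, downgrade later if it turns out single-team."""
    return bool(observed & ESCALATION_SIGNALS)

print(should_go_multi_team({"multiple_teams_paged"}))  # True -> spin up the multi-team format
```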
Post-incident across teams
The postmortem is also more complex when multiple teams are involved. Single-team postmortems have a clear owner; multi-team postmortems can fall through the cracks if no team takes responsibility.
The rule. The IC's team owns the postmortem document. Each affected team contributes a section. The contributing-factors discussion is a joint meeting with all team drivers. The action items are distributed across teams with cross-team commitments tracked in the master incident document.
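One way the distributed action items and cross-team commitments might be represented in the master incident document; the field names and the example entries are assumptions for illustration, not a standard schema:

```python
from dataclasses import dataclass, field

@dataclass
class ActionItem:
    description: str
    owning_team: str
    depends_on_teams: list[str] = field(default_factory=list)  # cross-team commitments
    due: str = ""

# Entries like these would live in the master incident document (examples are fictional).
actions = [
    ActionItem("Alert on cross-service retry storms", "platform",
               depends_on_teams=["payments"]),
    ActionItem("Document the rollback dependency between deploy trains", "payments",
               depends_on_teams=["identity"], due="next sprint"),
]

# The cross-team commitments are the ones to review in the joint drivers' meeting.
cross_team = [a for a in actions if a.depends_on_teams]
print(f"{len(cross_team)} cross-team commitments to track")
```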
The risk in multi-team postmortems. Each team's section over-emphasises that team's contribution and under-emphasises the cross-team coordination failures. The IC has to push back on this: the most important findings in multi-team postmortems are usually about coordination, not about any single team's technical mistake.
What to do this week
Three moves. (1) Document your multi-team escalation criteria: when does a single-team incident become a multi-team one? Most teams haven't written this down; the absence is what causes 30+ minutes of "is this a multi-team thing?" debate during the actual incident. (2) Identify your IC bench specifically for multi-team incidents. The IC who ran the team-singular incident may not be the right one for a multi-team scenario; think through the bench depth in advance. (3) Run a quarterly tabletop with multi-team scope: five teams, fictional incident, the IC and drivers practise the cadence. Costs an hour; produces muscle memory.