By Samson Tanimawo, PhD. Published Dec 9, 2025.

Swarming vs Incident-Command: Two Models, Two Stages

Some teams swarm: everyone joins, everyone debugs, and chaos works. Others use formal incident-command: one IC, named roles, a script. The right answer depends on team size and incident maturity.

Two shapes

Swarming: everybody available joins the call, debugs in parallel, and the loudest accurate engineer drives. Incident command: one Incident Commander runs the call; named roles report to them; the team executes a checklist. Both can resolve incidents; they fail differently.

The swarm shape. Cheap to set up (everyone joins), high coordination overhead during the incident, low postmortem rigour. Works when team size is small, incidents are familiar, and the team has shared context. Fails when scale increases or incidents are novel.

The IC shape. Higher upfront investment (training ICs, learning the protocol), lower coordination overhead during incidents, higher postmortem rigour. Works at any scale. The cost is cultural — the team has to commit to the protocol even when it feels like overhead.

Swarming

Works when the team is small (under 15 engineers), the incident space is familiar, and people know each other's strengths. Cheap. Fast for small fires. Falls apart when the team grows or the incident is one nobody has seen.

The swarm's strengths. Everyone's expertise is available immediately. No setup or handoff cost; engineers just start working in parallel. Postmortem is "we all remember what happened", which works for small teams. The shared experience builds team cohesion.

The swarm's failure modes. Senior engineer dominates and others defer. Two engineers debug the same thing without realising. Customer comms gets forgotten because nobody owns it. Decisions get made by the loudest person on the call regardless of whether they're right.

Incident command

Works at any team size, scales to multi-team incidents, and produces better postmortems because someone was responsible for taking notes. Costs: time-to-train, willingness to follow a script under pressure, and the discomfort of moving from "everyone helps" to "do what the IC says."

The IC's strengths. Clear ownership of decisions. Customer comms always get sent (it is someone's job). Postmortem timeline is already in the channel because it was scribed during the incident. Multi-team incidents are tractable because the IC is the integration point.

The IC's costs. Engineers hate doing IC the first 3-5 times; the role feels awkward and counter-productive. Senior engineers especially struggle because they want to debug, not coordinate. The cultural transition takes 6-12 months for most teams.

Team size matters

Swarming starts to break at around 15 engineers and is broken by 30. The maths: in a swarm of 30, the channel moves too fast to follow, so it is cheaper for each engineer to ask "what is happening?" than to read back through it. Someone ends up answering those questions, the channel becomes a status board, and the IC role emerges naturally even if you did not declare it.

The threshold's mechanism. Below 15, every engineer can hold the team's state in their head. Above 15, no individual can; coordination overhead exceeds the benefit of parallel work. The team that doesn't introduce IC at this point ends up with chaotic incidents.
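One rough way to see why the overhead outruns the benefit is to count the possible person-to-person paths in a group of n people, which is n(n-1)/2. This is a common back-of-envelope measure rather than the calculation behind the thresholds above, and `pairwise_paths` is a hypothetical helper name used only for illustration.

```python
def pairwise_paths(n: int) -> int:
    """Number of distinct person-to-person paths in a group of n people."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


# Compare a small swarm with the two team sizes named in the text.
for size in (5, 15, 30):
    print(f"{size} engineers -> {pairwise_paths(size)} possible pairwise paths")
```

Doubling the team from 15 to 30 roughly quadruples the number of paths (105 to 435), which is one way to read "no individual can hold the team's state in their head".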

The signal that you have outgrown swarming. If your team is past 15 engineers and incidents feel chaotic ("five people debugging the same thing", "nobody told the customer", "we don't know who's in charge"), the team has outgrown swarming. Time to formalise.

The four roles

Incident Commander: makes decisions, owns timeline, does not debug.
Communications Lead: customer comms, internal updates.
Operations Lead: drives the actual remediation.
Scribe: timeline, decisions, who did what when.

The roles can collapse on small teams. IC + Comms + Scribe is one role on a 5-person team (the same person does coordination, comms, and writing). Operations Lead is separate (driving the technical work). The four roles are a maximum; collapse based on team size.

The discipline of separation. Even when one person fills multiple roles, they do them sequentially with awareness ("now I'm scribing", "now I'm sending the customer update"). The discipline is what prevents one role from being silently dropped.
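The collapsing rule can be sketched as a small assignment function. This is a minimal sketch under stated assumptions: the first responder in the list coordinates, the second drives remediation, and when more people are available the Communications Lead is split off before the Scribe. That ordering, the function name `assign_roles`, and the responder names are all hypothetical; the text only says that the roles collapse with team size and that Operations Lead stays separate.

```python
ROLES = ("Incident Commander", "Communications Lead", "Operations Lead", "Scribe")


def assign_roles(responders: list[str]) -> dict[str, str]:
    """Map each of the four roles to a responder, collapsing roles on small teams.

    Assumption: responders[0] coordinates and responders[1] drives remediation.
    """
    if len(responders) < 2:
        raise ValueError("need one person coordinating and one remediating")

    coordinator, operator = responders[0], responders[1]
    # Smallest case: IC + Comms + Scribe are one person, Operations Lead is separate.
    assignment = {
        "Incident Commander": coordinator,
        "Communications Lead": coordinator,
        "Operations Lead": operator,
        "Scribe": coordinator,
    }
    # Assumed order of un-collapsing: comms first, then scribe.
    if len(responders) >= 3:
        assignment["Communications Lead"] = responders[2]
    if len(responders) >= 4:
        assignment["Scribe"] = responders[3]

    # Guard against a role being silently dropped.
    assert set(assignment) == set(ROLES)
    return assignment


print(assign_roles(["ana", "ben"]))
print(assign_roles(["ana", "ben", "chi", "dev"]))
```

Returning all four keys every time, even when one person holds three of them, mirrors the discipline of separation: the role still exists and still has a named owner.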

Staged transition

Most teams swarm at first. They formalise IC roles when the second multi-team incident happens. They train rotating ICs when their fourth all-hands fire shows the same engineer always ending up as IC by default. Each stage is a forcing function; do not skip them.

Stage 1 (5-15 engineers): swarm. Cheap, works, builds team. Don't introduce IC prematurely; the overhead exceeds the benefit at this scale.

Stage 2 (15-30 engineers): introduce IC role for SEV1 only. Document the four-role protocol; train senior engineers as ICs. Smaller incidents continue to swarm.

Stage 3 (30+ engineers): IC for SEV1 and SEV2. Rotating IC pool of 5+ trained ICs. Quarterly tabletop exercises to keep the muscle trained.

Stage 4 (multi-team): IC for any cross-team incident regardless of severity. Per-team drivers reporting to a single IC. Tested cross-team escalation paths.
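The four stages can be written down as a small policy check, which is handy for a runbook or a paging bot. This is a sketch under assumptions, not a definitive rule set: the stage ranges in the text overlap at 15 and 30, so the code assigns exactly 15 to stage 2 and exactly 30 to stage 3; severity is an integer where 1 means SEV1; and stage 4 is assumed to keep the stage 3 rule for single-team incidents. The names `stage_for` and `needs_ic` are hypothetical.

```python
def stage_for(engineers: int, multi_team: bool = False) -> int:
    """Return the stage (1-4) for a team, using the thresholds from the text."""
    if multi_team:
        return 4
    if engineers >= 30:  # assumption: "30+" includes exactly 30
        return 3
    if engineers >= 15:  # assumption: exactly 15 counts as stage 2
        return 2
    return 1


def needs_ic(stage: int, severity: int, cross_team: bool = False) -> bool:
    """Decide whether an incident should run under incident command."""
    if stage >= 4 and cross_team:
        return True            # any cross-team incident, regardless of severity
    if stage >= 3:
        return severity <= 2   # SEV1 and SEV2
    if stage == 2:
        return severity == 1   # SEV1 only; smaller incidents still swarm
    return False               # stage 1: swarm


print(stage_for(22), needs_ic(stage_for(22), severity=1))
print(stage_for(22), needs_ic(stage_for(22), severity=2))
```

Keeping the stage and the per-incident decision as separate functions makes it easy to change one threshold when the team moves up a stage without touching the severity rules.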

Common antipatterns

Premature formalisation. 8-engineer team adopts the four-role protocol. Each incident has more roles than responders. The protocol becomes theatre. Wait for the team to outgrow swarming before formalising.

The "designated IC who's always the same person." Senior engineer takes IC every time. They burn out. They become a bottleneck. The IC role must rotate; if it doesn't, the team has a senior-engineer dependency.
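A rotation can be enforced, and the same-person pattern detected, from nothing more than a list of who took IC on past incidents. The sketch below assumes that history is available as a plain list of names; `next_ic`, `dominant_ic`, the 50% threshold, and the example names are hypothetical choices for illustration.

```python
from collections import Counter
from typing import Optional


def next_ic(pool: list[str], history: list[str]) -> str:
    """Pick the trained IC with the fewest past incidents; ties go to pool order."""
    if not pool:
        raise ValueError("the IC pool is empty")
    counts = Counter(name for name in history if name in pool)
    return min(pool, key=lambda name: counts[name])


def dominant_ic(history: list[str], threshold: float = 0.5) -> Optional[str]:
    """Return whoever took IC in more than `threshold` of incidents, if anyone."""
    if not history:
        return None
    name, count = Counter(history).most_common(1)[0]
    return name if count / len(history) > threshold else None


pool = ["ana", "ben", "chi"]
history = ["ana", "ana", "ben", "ana"]
print("next IC:", next_ic(pool, history))
print("dominant IC:", dominant_ic(history))
```

A non-empty result from `dominant_ic` is the senior-engineer dependency described above, made visible before it turns into burnout.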

The IC who debugs. Senior IC can't resist the technical work; abandons the IC role mid-incident. Bridge falls apart. Either commit to IC or hand it off; can't do both.

Skipping the protocol on "small" incidents. "This is just a small thing, no need for IC." The small thing turns out to be bigger; nobody's in charge; chaos. If the incident reaches SEV2+, use the protocol.

What to do this week

Three moves. (1) Honest assessment: which stage is your team at? Most teams underestimate their stage by one (think they're at stage 1 when they're at stage 2). (2) Identify your next investment: is it documenting the protocol (stage 1→2), training rotating ICs (stage 2→3), or testing multi-team coordination (stage 3→4)? (3) Schedule the next quarterly tabletop. The exercise is the forcing function for IC training.