Why "Five Whys" Fails (And What to Use Instead)
Five whys feels rigorous, but it frequently produces a single root cause that is more story than truth. The technique was designed for assembly lines; modern systems break in too many directions for a single chain to hold.
Where five whys came from
The technique comes from the Toyota assembly line. A worker stops the line; you ask "why?" five times until you find the bolt that was wrong on Tuesday. The technique works there because the failure is mechanical, the chain is short, and the answer is genuinely a single cause.
The technique's appeal. It's simple, structured, and feels rigorous. Engineers love techniques that produce a clean output. Five whys produces a clean chain ("the bolt was wrong because the inspector skipped step 3 because the checklist was unclear because the manager rewrote it because..."). Each link looks valid in isolation.
The cultural baggage. Five whys arrived with the broader Toyota Production System (kaizen, lean) and benefits from that prestige. Companies adopt five whys because they've heard it works at Toyota; they don't always check whether their failure mode resembles a Toyota assembly line.
Why the linear shape fails on modern systems
Software incidents typically have multiple contributing factors: a bad deploy, an undertested edge case, a missing alert, a relatively new on-call engineer, a runbook that was out of date. None of those alone caused the incident; together they did. Forcing a linear chain picks one and discards the others.
The combinatorial issue. Modern distributed systems have failure modes that depend on combinations of conditions. The "why" of a Tuesday afternoon outage might be: traffic spike + cache miss + dependency timeout + retry storm + saturated thread pool. That is five separate factors, each insufficient alone, and none of them is a "root cause" because the cause is the combination.
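One way to see why no single factor qualifies as the root cause is a toy model in which the outage requires all five conditions at once. The sketch below is a deliberate simplification, not a description of any real system: the condition names are taken from the example above, and the rule that every condition must be present is an assumption made for illustration.

```python
# Toy model: condition names come from the example above; the rule that the
# outage needs ALL of them at once is an illustrative assumption.
CONDITIONS = frozenset({
    "traffic_spike",
    "cache_miss",
    "dependency_timeout",
    "retry_storm",
    "saturated_thread_pool",
})


def outage_occurs(active):
    """Return True only when every condition is present simultaneously."""
    return CONDITIONS <= frozenset(active)


def sufficient_alone():
    """Conditions that would cause the outage with nothing else present."""
    return sorted(c for c in CONDITIONS if outage_occurs({c}))


def necessary():
    """Conditions whose removal alone would have prevented the outage."""
    return sorted(c for c in CONDITIONS if not outage_occurs(CONDITIONS - {c}))


print(sufficient_alone())  # no condition is sufficient on its own
print(len(necessary()))    # every condition is necessary in this model
```

In this model every condition is necessary and none is sufficient, so a five whys chain that ends on any one of them names an arbitrary member of the set.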
The narrative bias. Five whys produces a story. Stories are satisfying; they let the team feel like they understand the incident. The story is usually wrong (or, more precisely, dramatically incomplete). The team that stops at the story feels closure but misses the actual lesson.
The contributing-factors model
List every factor that, if it had been different, would have made the incident shorter, smaller, or non-existent. Do not rank them. Do not pick a winner. Three to seven factors is typical. Each one becomes a candidate for an action item, and the team decides which ones to fix.
The reframing's value. Each factor is a candidate for improvement. Five contributing factors yield up to five action items. Each action item makes the system more resilient, even if not all five are pursued. Compare that with the single root-cause-and-fix approach of five whys: one action item, one improvement.
The discipline of "do not rank." Engineers want to identify "the" cause. The desire is real; the right answer is to resist it. Ranking implies one factor mattered most; in distributed-system incidents, the ranking is usually unprovable and arbitrary. Treat factors as a set, not a list.
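The model can be sketched as a small data structure. This is a minimal illustration, not a prescribed schema: the names Counterfactual, Factor, and contributing_factors are hypothetical, and the sample candidates and their classifications are made up for the example. The result is a frozenset rather than a list, so the code carries no ranking.

```python
from dataclasses import dataclass
from enum import Enum


class Counterfactual(Enum):
    """What would have changed had this factor been different."""
    SHORTER = "shorter"
    SMALLER = "smaller"
    PREVENTED = "non-existent"
    NO_CHANGE = "no change"


@dataclass(frozen=True)
class Factor:
    description: str
    counterfactual: Counterfactual


def contributing_factors(candidates):
    """Keep candidates that pass the counterfactual test, as an unordered set."""
    return frozenset(
        f for f in candidates
        if f.counterfactual is not Counterfactual.NO_CHANGE
    )


# Illustrative candidates; the classifications are assumptions.
candidates = [
    Factor("bad deploy", Counterfactual.PREVENTED),
    Factor("missing alert", Counterfactual.SHORTER),
    Factor("unrelated grievance about build times", Counterfactual.NO_CHANGE),
]
factors = contributing_factors(candidates)
print(len(factors))  # the unrelated grievance is filtered out
```

Because a frozenset has no order, any ranking would have to be added deliberately, which is the friction the "do not rank" discipline asks for.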
Three to seven causes is normal
If a postmortem identifies one cause, the team has not looked hard enough. If it identifies fifteen, the team is grieving, not analysing. The honest range is three to seven. That is not "we cannot decide"; that is "the system actually has three to seven joints that bent."
The lower bound. Single-cause postmortems are usually missing context. The incident happened in a system; the system has many surfaces; the failure traversed several of them. A postmortem that identifies just one cause has stopped at the most obvious one.
The upper bound. Fifteen-cause postmortems are usually emotional rather than analytical. The team is processing the incident's pain by listing every grievance about the system. Some of those grievances are real; many are unrelated. The incident commander's (IC's) job is to keep the list focused on factors that actually contributed.
Honest language
"A contributing factor was X" is a more defensible claim than "the root cause was X." It commits to the existence of X as a problem worth fixing without committing to the (usually false) claim that X alone caused the incident. The postmortem reads as more thoughtful and ages better on a later reread.
The aging benefit. Six months later, a postmortem that claimed "the root cause was X" reads as overconfident if X turns out to be one of several factors. A postmortem that listed "contributing factors included X, Y, Z" ages better: even if some factors were wrong, the framing was honest.
The phrase library. "Contributed to the incident's duration" (factor extended the incident). "Made detection harder" (factor delayed response). "Worsened the customer impact" (factor increased severity). Each phrase locates the factor's role precisely without claiming it was the cause.
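If the postmortem template is generated or linted by tooling, the phrase library could be encoded so that every factor is written with an explicit role. The sketch below is one possible shape; the names Role and describe are hypothetical, and only the three phrases themselves come from the text above.

```python
from enum import Enum


class Role(Enum):
    """Phrases from the library above, keyed by the role a factor played."""
    EXTENDED_DURATION = "contributed to the incident's duration"
    DELAYED_DETECTION = "made detection harder"
    INCREASED_SEVERITY = "worsened the customer impact"


def describe(factor, role):
    """Build a sentence that states the factor's role without calling it the cause."""
    if not factor:
        raise ValueError("factor description must not be empty")
    return f"{factor[0].upper()}{factor[1:]} {role.value}."


print(describe("an out-of-date runbook", Role.EXTENDED_DURATION))
# An out-of-date runbook contributed to the incident's duration.
```

Restricting the roles to an enum means a writer cannot reach for "caused" without changing the code, which keeps the language honest by default.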
A worked example
Incident: customer-facing search failed for 47 minutes.
Five whys version: search failed because the cache returned bad data because a backfill job wrote bad data because the schema wasn't validated because... (truncated narrative ends with "we need better validation").
Contributing factors version: bad backfill data; cache didn't have validation; alerting didn't catch the empty-search-result rate; runbook for cache reset didn't exist; on-call hadn't seen this pattern before.
Five action items vs one.
The worked-example test. After writing the postmortem one way, write it the other way. Which version produces more useful action items? The contributing-factors version almost always wins.
Common antipatterns
Five whys with five factors disguised as a chain. An engineer writes "because A, because B, because C..." but A, B, and C aren't actually causally linked. The format is five whys; the content is contributing factors. Just call it contributing factors.
The "the root cause was human error" finale. Five whys terminates at a person ("because Sara forgot to check"). Always reframe to system: "Sara forgot to check because the checklist didn't include this item." The system layer is where fixes live.
The 15-cause postmortem. Team feels every grievance must be listed. Most of them aren't really contributing factors. The IC's discipline is to ask "would changing this have reduced the incident's impact?" Factors that don't pass the test don't go in the list.
Contributing factors with no action items. Listed as observations, never owned. The list is just venting unless each factor (or a subset) becomes an action item with an owner.
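A simple check can flag a factor list that has drifted into venting. The sketch below is an assumption about how a team might track this, not an established tool: ActionItem, ListedFactor, unowned, and is_just_venting are hypothetical names, the factor descriptions are from the worked example, and the action summaries and the owner are placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    summary: str
    owner: Optional[str] = None  # None means nobody has picked it up


@dataclass
class ListedFactor:
    description: str
    action: Optional[ActionItem] = None  # None means observation only


def unowned(factors):
    """Descriptions of factors lacking an action item with a named owner."""
    return [
        f.description
        for f in factors
        if f.action is None or not f.action.owner
    ]


def is_just_venting(factors):
    """True when no listed factor has an owned action item."""
    return len(unowned(factors)) == len(factors)


# Factors from the worked example; summaries and owner are placeholders.
factors = [
    ListedFactor("bad backfill data",
                 ActionItem("validate backfill output", owner="data-team")),
    ListedFactor("cache didn't have validation",
                 ActionItem("add validation on cache writes")),
    ListedFactor("runbook for cache reset didn't exist"),
]

print(unowned(factors))
print(is_just_venting(factors))
```

Leaving some factors unowned is acceptable as long as the team decided that on purpose; the check only makes the gap visible.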
What to do this week
Three moves. (1) Read your last 3 postmortems. How many used "root cause" language vs "contributing factors"? Most teams find "root cause" dominates. (2) For your next retro, ban "root cause" from the discussion. Force the team to use "contributing factors" instead. The shift in vocabulary changes the analysis. (3) Update your postmortem template to require listing 3-7 contributing factors as a section.
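Moves (1) and (3) are easy to automate. The following is a rough sketch that assumes postmortems are available as plain text; the function names and the sample sentence are hypothetical, and the regular expressions only catch the most common spellings.

```python
import re

# Match "root cause", "root-cause", and plurals, case-insensitively.
ROOT_CAUSE = re.compile(r"\broot[\s-]+causes?\b", re.IGNORECASE)
CONTRIBUTING = re.compile(r"\bcontributing[\s-]+factors?\b", re.IGNORECASE)


def vocabulary_counts(text):
    """Count how often each framing appears in a postmortem's text."""
    return {
        "root cause": len(ROOT_CAUSE.findall(text)),
        "contributing factor": len(CONTRIBUTING.findall(text)),
    }


def factor_section_ok(factors, low=3, high=7):
    """Template check: the factors section lists between low and high items."""
    return low <= len(factors) <= high


sample = (
    "The root cause was a bad deploy. "
    "Root-cause analysis listed no contributing factors."
)
print(vocabulary_counts(sample))
# {'root cause': 2, 'contributing factor': 1}
```

Running vocabulary_counts over the last three postmortems gives a quick baseline before the vocabulary ban in move (2); factor_section_ok could run as a lint step on the updated template.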