Debugging an Incident That Won't Resolve

By Samson Tanimawo, PhD · August 11, 2026

Two hours in, every theory has been ruled out, the symptom persists, and the team is tired. The debugging moves that work when nothing has worked.

The 90-minute wall

By minute 90, the team has formed three theories, ruled them all out, and is starting to repeat itself. The energy in the bridge drops. Engineers start flipping between Stack Overflow and the same dashboard they have refreshed 40 times. The stuck-incident wall is real.

The wall has a shape. Engineers' attention narrows: they keep looking at the same metrics, asking the same questions, considering the same theories. Cognitive load is high; ability to take in new information is low. Each new fact gets interpreted in terms of the existing theories rather than as evidence against them.

The fix is to break the narrowing. Each of the four moves below is a deliberate reset that forces the team to take in new information or to re-form theories from scratch. Without one of these resets, the team grinds for another 30-60 minutes on theories that have already failed, until someone breaks the pattern by accident.

Four moves to make when stuck

Each one is a deliberate reset. They're listed in increasing order of friction; try the earlier moves before the later ones.

Move 1: Rebuild the timeline from scratch

Get a fresh whiteboard or doc. Write down every event with a timestamp: alerts, deploys, traffic spikes, dependency outages. Many stuck incidents break open when the team realises something happened 15 minutes BEFORE the symptom that nobody had connected.

The mechanism. The team's mental timeline at minute 90 is shaped by the order in which they investigated, not the order events happened. A migration that ran at minute -15 (before the incident) but was discovered at minute 60 lives in the team's memory at "minute 60-ish." Forcing a chronological rebuild surfaces these reordered events; the migration that really started at minute -15 is then connected to the symptoms that started at minute 0.

The protocol. The IC asks each engineer to write down every event they remember, with a timestamp, then consolidates onto a fresh whiteboard. New facts emerge in the consolidation: "wait, the migration started before the incident?" The team's mental model updates, and the next 15 minutes of investigation are aimed at the right thing.
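
If the event lists land in a shared doc rather than on a whiteboard, the consolidation step is easy to script. Below is a minimal sketch in Python, assuming each engineer contributes (timestamp, description) pairs and the incident start time is known; every name and value is illustrative:

    from datetime import datetime

    # Hypothetical inputs: the incident start and one merged list of
    # every event the engineers remembered, in order of discovery.
    INCIDENT_START = datetime(2026, 8, 11, 14, 0)
    events = [
        (datetime(2026, 8, 11, 14, 2), "checkout latency alert fired"),
        (datetime(2026, 8, 11, 13, 45), "orders-db migration started"),
        (datetime(2026, 8, 11, 15, 0), "migration discovered on the bridge"),
    ]

    # Rebuild chronologically -- the order events happened, not the
    # order the team investigated them -- and flag pre-incident events.
    for ts, desc in sorted(events):
        offset = (ts - INCIDENT_START).total_seconds() / 60
        flag = "  <-- BEFORE incident start" if offset < 0 else ""
        print(f"minute {offset:+5.0f}  {desc}{flag}")

The migration at minute -15 sorts to the top, which is exactly the reordering the whiteboard exercise is designed to surface.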

Move 2: Yesterday's changes

Not "what changed in the last hour", that's been chased. What changed in the last 24-48 hours. Migrations, vendor updates, third-party releases, certificate rotations. Slow-cooking changes often catch fire later.

The pattern. A large share of stuck-incident root causes turn out to be changes from 12-72 hours before the symptom. Vendor patches roll out with a delay; database migrations take time to manifest under load; certificate rotations break clients that hadn't refreshed; feature flags ramp up gradually. By minute 90, the team has dismissed "recent changes" because they think they checked. They checked changes from the last hour; the change from yesterday is still in scope.

The protocol. Pull the deployment log, the vendor maintenance announcements, the cert-rotation log, and the feature-flag config history, all for the last 48 hours. Cross-reference against the symptom. Many stuck incidents have a smoking gun in this 48-hour window; the team just hadn't looked at the right log yet.
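
The cross-reference is mechanical enough to script ahead of time. A minimal sketch, assuming each change source can be dumped as (timestamp, source, description) records; the sources, the 48-hour window, and every value are illustrative:

    from datetime import datetime, timedelta

    SYMPTOM_START = datetime(2026, 8, 11, 14, 0)
    WINDOW = timedelta(hours=48)

    # Hypothetical dumps from each change source.
    changes = [
        (datetime(2026, 8, 10, 3, 0), "vendor", "payment-gateway maintenance window"),
        (datetime(2026, 8, 10, 22, 30), "certs", "rotated api.example.com TLS cert"),
        (datetime(2026, 8, 11, 9, 0), "flags", "new-cache-path ramped 10% -> 50%"),
        (datetime(2026, 8, 8, 12, 0), "deploys", "orders-service v2.31"),  # outside window
    ]

    # Keep everything in the 48 hours before the symptom so the bridge
    # reviews each change explicitly instead of assuming it was checked.
    for ts, source, desc in sorted(changes):
        if SYMPTOM_START - WINDOW <= ts <= SYMPTOM_START:
            hours_before = (SYMPTOM_START - ts).total_seconds() / 3600
            print(f"{hours_before:5.1f}h before symptom  [{source}]  {desc}")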

Move 3: Page someone outside the team

Pull in the network engineer, the database engineer, or the platform engineer the team hasn't talked to in months. They will ask basic questions the bridge stopped asking 60 minutes ago. Often the answer is in those questions.

The benefit isn't expertise. The outside engineer doesn't know your system better than you do. The benefit is FRESH PERSPECTIVE. They haven't built up the assumptions the in-team people have. They ask "wait, you're sure this isn't a DNS issue?" and the team realises they assumed DNS was fine 80 minutes ago without actually checking. The naive question reveals the missed verification.

The protocol. The IC pages a senior engineer from a different team, briefs them in 90 seconds (symptom + theories ruled out + current direction), and explicitly asks: "what would you check that we haven't?" The outside engineer's first 5 questions are usually the most valuable; capture them, run them down.
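
The brief stays inside 90 seconds when it's a fill-in template rather than a freeform recap. One possible wording, mirroring the symptom / ruled-out / current-direction structure above; every concrete value is invented:

    # Hypothetical page-message template for the incoming outside engineer.
    BRIEF = """\
    SYMPTOM (customer terms): {symptom}
    RULED OUT so far: {ruled_out}
    CURRENT direction: {direction}
    ASK: what would you check that we haven't?"""

    print(BRIEF.format(
        symptom="checkout p99 latency 8x baseline since 14:00 UTC",
        ruled_out="bad deploy (rolled back, no change); DB CPU (flat); CDN (healthy)",
        direction="suspecting connection-pool exhaustion in the payments service",
    ))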

Move 4: Accept it's two incidents

The original symptom may have resolved at minute 45, and the team is now chasing a different symptom without noticing the switch. Stop. Re-state the current symptom in customer terms. Often it's not what the bridge has been chasing for an hour.

The pattern. Incident A starts; team starts investigating. At minute 45, incident A resolves on its own. Three minutes later, incident B starts (maybe related, maybe coincidental). The team continues investigating, not noticing that the symptom has shifted. By minute 90, they're chasing a theory that fits incident A's symptoms but not incident B's; the theories don't match the data because the data has changed.

The protocol. The IC asks: "what is the current customer-visible symptom RIGHT NOW?" Not 30 minutes ago, not at incident start. Right now. Check the dashboards as they exist this minute. If the answer is different from the symptom the bridge has been chasing, the team is in a new incident and should re-form theories from scratch.
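
Writing both symptom statements down in the same structure makes the comparison explicit. A minimal sketch; the fields and values are invented for illustration:

    # Hypothetical symptom snapshots, both in customer-visible terms.
    at_start = {"surface": "checkout", "effect": "p99 latency 8x baseline", "scope": "all regions"}
    right_now = {"surface": "checkout", "effect": "5xx on payment submit", "scope": "eu-west only"}

    drifted = sorted(k for k in at_start if at_start[k] != right_now[k])
    if drifted:
        print(f"symptom shifted on {drifted}: possibly a second incident --")
        print("re-form theories against the RIGHT NOW symptom, not the original.")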

Escalating vs pausing

The instinct after 90 minutes is to escalate ("get the VP of engineering on the call"). Often the right move is to pause the bridge for 15 minutes, let the team eat, and resume. A tired team escalates noise; a rested team escalates signal.

The pause protocol. The IC says: "We're going to take 15 minutes. Everyone gets food, walks outside, comes back at HH:MM. We'll regroup with fresh eyes." The pause feels expensive (every minute is customer-impacting), but the team's effectiveness after the pause is dramatically higher than the pre-pause grind.

The exception: SEV1 incidents where customer impact is severe. Don't pause those; push through. But for SEV2s and slow-burning SEV1s, the pause is usually the right move past minute 90.

What stuck incidents teach

Stuck incidents are always system feedback. The system is telling you something about your monitoring (you missed a leading signal), your runbooks (you didn't have one for this case), or your architecture (the failure mode wasn't anticipated). The 90-minute wall is the system saying "you don't understand me as well as you thought."

The postmortem from a stuck incident is therefore unusually high-leverage. It captures real system-knowledge gaps. Treat the contributing-factors section with extra rigor; the gaps it surfaces probably affect other systems too. The action items are usually broader than for a typical incident: instrument the missing signals, write the missing runbook, add observability to the failure mode.

Teams that take stuck-incident postmortems seriously have fewer stuck incidents over time. The work compounds: each stuck incident reveals a system-knowledge gap, fixing the gap prevents the next stuck incident in that area. After a year of disciplined post-stuck-incident work, the team's "stuck-incident rate" drops dramatically.

What to do this week

Three moves. (1) Pin the four-moves checklist in your incident channel (a copy-paste version is sketched below). The next time you're past minute 90, the IC reaches for the checklist instead of improvising. (2) Audit your last 3 stuck incidents (over 90 minutes each). Which of the four moves would have helped earlier? Use that to train the IC bench. (3) Schedule a quarterly "stuck incident retrospective": a meta-review of incidents that hit the wall, looking for patterns in what causes the wall in YOUR system specifically.
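
One way to word the pinned checklist, drawn from the four moves above; adjust the thresholds to your own system:

    STUCK PAST MINUTE 90? Work the list in order:
    1. Rebuild the timeline from scratch: fresh whiteboard, every event, chronological.
    2. Audit the last 48 hours of changes: deploys, vendors, certs, flags.
    3. Page someone outside the team and ask: "what would you check that we haven't?"
    4. Re-state the current customer-visible symptom RIGHT NOW; confirm it's still the same incident.
    Still stuck? Consider a 15-minute pause before escalating.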