Incident Response · Intermediate · By Samson Tanimawo, PhD · Published Aug 7, 2026 · 5 min read

30-Day Incident Follow-Up: Did the Fix Actually Hold?

The postmortem is week one. The 30-day follow-up is what tells you if the fix worked. Most teams skip it; the same incidents recur; nobody connects the dots.

Week one is not the end

The week-one postmortem captures what happened. The 30-day follow-up captures whether the fixes held. They are different artifacts, and skipping the second one means the team is flying blind on whether the work actually worked.

The asymmetry is the problem. Postmortems are well-attended (everyone shows up to the meeting where the incident gets discussed). 30-day follow-ups are poorly attended (everyone has moved on). But the 30-day check is where the LEARNING actually happens; the postmortem captures intent, the 30-day check verifies execution.

Teams that take the 30-day check seriously have markedly fewer recurring incidents. The discipline isn't glamorous: it's a 20-minute meeting where someone reads through last month's action items and asks whether they shipped. The work is small; the impact compounds across years.

Four questions

  1. Did the action items ship? Cite PRs.
  2. Did the same alert fire again in the last 30 days?
  3. Did a similar incident occur (different symptom, same cause)?
  4. Are the metrics around the fix improving, flat, or worse?

Four questions, 20 minutes, attended by the original IC and one outside reviewer. The outside reviewer matters: they catch the IC's blind spots, ask the questions the IC is too close to ask, and check the team's view of progress against an external perspective.

Data to pull before the meeting

Pages from the affected service for the last 30 days, grouped by alert. Customer escalations mentioning the symptom. Performance metrics covering the period. The original incident's bridge transcript or summary doc.

The data prep takes ~30 minutes if the data is well-instrumented and a half-day if it isn't. Teams that skip the data prep end up having opinion-based discussions about whether things improved; teams that bring data have evidence-based discussions. The latter produces dramatically better decisions.

Specifically. For pages: a count of pages per day for the affected service, with a chart. The chart often reveals patterns the team hadn't noticed (a weekly spike, a daily peak). For customer escalations: support-ticket count mentioning the symptom keywords. For performance metrics: the SLI that the incident affected, plotted over the 30 days. For the original incident: the timeline and action items so the meeting can compare intent to outcome.
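The pages-per-day count is the easiest of these to automate. A minimal sketch, assuming your paging tool can export pages as (timestamp, alert name) pairs; the field names and export format here are illustrative, not any specific tool's API:

```python
from collections import Counter
from datetime import datetime

# Hypothetical export: one (ISO timestamp, alert name) pair per page fired.
pages = [
    ("2026-07-01T03:12:00", "checkout-latency-high"),
    ("2026-07-01T14:40:00", "checkout-latency-high"),
    ("2026-07-02T03:15:00", "checkout-latency-high"),
    ("2026-07-02T09:01:00", "db-pool-exhausted"),
]

def pages_per_day(pages):
    """Count pages per (day, alert) so weekly spikes and daily peaks show up."""
    counts = Counter()
    for ts, alert in pages:
        day = datetime.fromisoformat(ts).date().isoformat()
        counts[(day, alert)] += 1
    return counts

# A crude text chart: one '#' per page, grouped by day and alert.
for (day, alert), n in sorted(pages_per_day(pages).items()):
    print(f"{day}  {alert:<24} {'#' * n}")
```

Even this text chart is often enough to surface the "3 a.m. every night" pattern nobody had noticed; swap the print loop for a real plotting library once the grouping is right.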

How to spot recurrence

Recurrences rarely look identical. They show up as: the same alert firing on a different service, a customer escalation with similar wording, a metric trending in the wrong direction. Look for SHAPE not exact match.

The pattern recognition is what makes the 30-day check valuable. A team that only flags exact-match recurrences misses 80% of true recurrences (which are variations on the original cause). A team trained to look for shape catches them earlier and can intervene before the second full incident.

Specific shapes to watch for. (1) Same root cause, different symptom: the database connection-pool issue that caused incident 1 might cause incident 2 in a completely different service that uses the same pool pattern. (2) Same symptom, different time window: an incident that fired on a weekly batch job might fire again on a monthly batch job that nobody connected. (3) Adjacent system showing same trend: the metric that wasn't quite over the SLO during incident 1 is now consistently approaching it.

When action items didn't ship

About 30-50% of action items don't ship within 30 days. That's worth knowing. The 30-day check is where the team confronts this honestly.

The conversation. For each unshipped item, ask: was the item the wrong action, or the wrong size, or owned by the wrong person, or blocked by something the team didn't see? Each answer leads to a different follow-up. Wrong action: drop it. Wrong size: re-slice. Wrong owner: reassign. Blocked: escalate.

The trap. The team agrees in the 30-day check that the unshipped items will ship "soon" and then nothing happens. The fix: every unshipped item gets a new commitment date OR a deletion. No ambiguous "we'll get to it." The forced commitment is what produces the next round of shipping.
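The triage above is mechanical enough to encode. A sketch of the decision table, with hypothetical field names; the point is that every unshipped item maps to exactly one follow-up, and anything else fails loudly instead of drifting:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionItem:
    title: str
    shipped: bool
    # One of: "wrong_action", "wrong_size", "wrong_owner", "blocked"
    reason_unshipped: Optional[str] = None

FOLLOW_UPS = {
    "wrong_action": "drop",
    "wrong_size": "re-slice and set new commitment date",
    "wrong_owner": "reassign and set new commitment date",
    "blocked": "escalate blocker and set new commitment date",
}

def triage(item: ActionItem) -> str:
    """Every item ends in a shipped archive, a deletion, or a new date."""
    if item.shipped:
        return "archive"
    # KeyError here is deliberate: an unshipped item with no diagnosed
    # reason is the ambiguous "we'll get to it" the check exists to prevent.
    return FOLLOW_UPS[item.reason_unshipped]
```

Whether you run this as code or as a column in a spreadsheet matters less than the invariant: no unshipped item leaves the meeting without a disposition.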

Metrics improving, flat, or worse

The metric question is often the one teams resist most. It's possible that despite the action items shipping, the underlying metric (page volume, MTTR, error rate) hasn't actually moved. That's important to know.

The honest framing. Action items shipping is INPUT. Metrics improving is OUTCOME. Inputs without outcomes mean the team did the work and the work didn't matter. The 30-day check surfaces this. It's uncomfortable to discover; it's also actionable, because it tells the team that the next iteration of work needs to be different.

The metrics to watch. The SLI that the incident degraded, is it back to or above the pre-incident baseline? The page volume for the affected alert, is it lower than before? The "near-miss" count (pages that almost-but-didn't trigger an incident), has it changed? Each is a different dimension of "did this work."
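The improving/flat/worse call can be made consistent with a simple baseline comparison. A minimal sketch; the 5% tolerance band is an assumption, and should be set to the metric's normal noise:

```python
def classify_metric(baseline: float, current: float, tolerance: float = 0.05) -> str:
    """Classify a metric against its pre-incident baseline.

    Assumes higher is better (e.g. an availability SLI). For page volume
    or error rate, where lower is better, flip the two comparisons.
    """
    if current >= baseline * (1 + tolerance):
        return "improving"
    if current <= baseline * (1 - tolerance):
        return "worse"
    return "flat"
```

The tolerance band is what keeps the meeting honest in both directions: it stops the team from claiming a 1% wiggle as a win, and from panicking over ordinary noise.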

Rolling learnings into the quarter

If the 30-day check finds the fix held, archive the incident with confidence. If it didn't, escalate the learning into the team's next quarter: the original action items were not enough. The 30-day check is the only mechanism that converts incidents into structural improvements at the team scale.

The escalation. If three consecutive 30-day checks show the same kind of incident isn't being prevented by point-fix action items, the team needs a structural project. Examples: rewrite the alerting layer for the service, rebuild the runbook system, restructure the on-call rotation. The 30-day pattern is what justifies the structural project to leadership.

The opposite case. If the team's last six 30-day checks all show "yes, the fix held," the team is in good shape and can focus on bigger investments. The 30-day check produces this signal too: confidence that the discipline is working.

Making the cadence stick

The 30-day check is easy to skip. There's no urgent meeting reminder, no customer-facing deadline, no immediate consequence to skipping. Most teams that try to adopt it skip it after 3-6 months unless they wire it into a recurring system.

The wiring. Schedule the 30-day check at the time the postmortem is published. Put it on the calendars of the IC and the reviewer. Make it part of the postmortem template: the postmortem isn't truly closed until the 30-day check has happened. The wiring is what makes the cadence durable.
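Pre-filling the template field at publish time is a one-liner worth automating. A sketch, assuming the check lands exactly 30 days after publication (the offset and the field text are both conventions to pick, not fixed rules):

```python
from datetime import date, timedelta

def schedule_check(published: date, offset_days: int = 30) -> str:
    """Produce the pre-filled template field at postmortem publish time."""
    check = published + timedelta(days=offset_days)
    return f"30-day check date: {check.isoformat()}"

print(schedule_check(date(2026, 8, 7)))  # 30-day check date: 2026-09-06
```

Hook this into whatever publishes the postmortem (a template script, a CI step, a bot) so the date exists before anyone has a chance to forget it.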

The other wiring. The EM tracks 30-day-check completion as a team metric. Not just "did we do it?" but "what did we learn from each one?" The metric forces the team to actually engage with the check, not just attend.

What to do this week

Three moves. (1) For your most recent significant incident, schedule the 30-day check now if it's not already on the calendar. (2) Update your postmortem template to include "30-day check date: [pre-filled]", a single field that makes the check unmissable. (3) For your last 5 incidents from 30+ days ago, do the check retroactively. Even retrospective checks reveal patterns; the team learns more from 5 retrospective checks than from 5 fresh postmortems.