SLO After Major Incidents
Major incidents shift the baseline. Adjust the SLO to match.
Review
Most incident retrospectives focus on what went wrong technically. They miss a separate question that is just as important: did the SLO and the burn-rate alerts catch the incident before it became customer-visible? If yes, the SLO practice is working. If no, the SLO practice has a gap that needs closing. Reviewing SLO performance after every major incident is the discipline that keeps the practice honest.
What the post-incident SLO review covers:
- For each major incident, ask whether the SLO predicted it: Did the burn rate spike before the on-call was paged? Did the SLO dashboard show the issue before customers reported it? If the SLO led the incident detection, the practice worked. If the SLO trailed, the practice missed.
- Often the SLO did not lead: Many incidents are surfaced by customer reports or by infrastructure-level alerts (out-of-memory, exhausted disk, stuck deploy) before the SLO would have detected them. This is information about both the SLO and the alerting layer; both might need adjustment.
- Map the timeline: When did the underlying issue start? When did the metric pipeline see it? When did the SLO calculation reflect it? When did the alert fire? When did the on-call respond? Each gap is a place where the practice could be tightened.
- Compare against customer-visible duration: The customer-visible duration of the incident is the ground truth. The SLO's job is to detect the customer-visible portion, not the underlying technical issue. If the SLO registered the incident only after customers had been affected for an hour, the SLO lags too much.
- Document the gap: Whatever the gap is, write it up. "The SLO did not catch this incident because the SLI excluded a class of failures." Or: "The burn-rate alert did not fire because the threshold was too loose." The documented gap is the input to the next adjustment.
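The timeline questions above can be turned into a small calculation. The following is a minimal sketch, not a prescribed tool: the `IncidentTimeline` class, its field names, and the timestamps are all hypothetical and chosen only for illustration. It computes the delay between each stage, checks whether the alert fired before the first customer report, and measures how long customers were affected before the alert.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentTimeline:
    """Key timestamps for one incident. All field names are hypothetical."""
    issue_started: datetime            # underlying issue begins
    customer_impact_started: datetime  # customers first affected
    pipeline_saw: datetime             # metric pipeline first sees it
    slo_reflected: datetime            # SLO calculation reflects it
    alert_fired: datetime              # burn-rate alert fires
    oncall_responded: datetime         # on-call responds
    first_customer_report: datetime    # first customer report arrives


def stage_gaps(t: IncidentTimeline) -> dict[str, timedelta]:
    """Delay between each consecutive stage of detection and response."""
    return {
        "issue -> pipeline": t.pipeline_saw - t.issue_started,
        "pipeline -> slo": t.slo_reflected - t.pipeline_saw,
        "slo -> alert": t.alert_fired - t.slo_reflected,
        "alert -> response": t.oncall_responded - t.alert_fired,
    }


def slo_led_detection(t: IncidentTimeline) -> bool:
    """True if the alert fired before the first customer report."""
    return t.alert_fired < t.first_customer_report


def undetected_customer_impact(t: IncidentTimeline) -> timedelta:
    """How long customers were affected before the alert fired."""
    return max(t.alert_fired - t.customer_impact_started, timedelta(0))


# Illustrative timestamps only, not data from a real incident.
start = datetime(2024, 1, 1, 12, 0)
timeline = IncidentTimeline(
    issue_started=start,
    customer_impact_started=start + timedelta(minutes=2),
    pipeline_saw=start + timedelta(minutes=5),
    slo_reflected=start + timedelta(minutes=10),
    alert_fired=start + timedelta(minutes=25),
    oncall_responded=start + timedelta(minutes=30),
    first_customer_report=start + timedelta(minutes=15),
)

gaps = stage_gaps(timeline)
widest_gap = max(gaps, key=gaps.get)  # the stage most worth tightening

for stage, delay in gaps.items():
    print(f"{stage}: {delay}")
print("SLO led detection:", slo_led_detection(timeline))
print("Undetected customer impact:", undetected_customer_impact(timeline))
```

In this made-up timeline the widest gap sits between the SLO calculation and the alert, which would point the review at the alert rule rather than at the metric pipeline.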
The post-incident SLO review is unfashionable but high-leverage. It is the practice that catches the SLO practice's own blind spots.
Adjust
The review produces specific adjustments. Maybe the SLO target was too loose. Maybe the SLI definition was incomplete. Maybe the burn-rate alert was too forgiving. Each adjustment is a deliberate change informed by the incident's evidence.
- Maybe the SLO target was too loose: If the incident produced 30 minutes of customer impact and the SLO budget had 60 minutes remaining at the start, the SLO did not constrain the incident at all. Tightening the target would have made the incident a budget event, which would have triggered the team's reliability-investment response.
- Maybe the SLI definition was incomplete: The incident affected customers in a way the SLI did not capture. A latency-only SLO missed an availability incident; an availability-only SLO missed a slow-but-up incident. The fix is adding the missing dimension to the SLI definition.
- Maybe the burn-rate alert was too forgiving: The alert threshold was a 14x burn rate sustained for 1 hour; the incident burned at 8x for 4 hours. The alert never fired even though customers were affected. Lowering the multiplier would have caught this; shortening the window alone would not, because the burn rate never reached 14x. The trade-off is more frequent alerts.
- Maybe the metric pipeline lagged: The SLI calculation took 5 minutes to update after each request batch. The incident lasted only 8 minutes, so the metric saw the issue for just 3 minutes, leaving the alert almost no time to fire. The fix is reducing the metric pipeline latency.
- Avoid over-correcting: One incident might warrant tightening; many incidents warrant a structural change. A single bad incident does not always mean the SLO is wrong. The team has to distinguish noise from signal.
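The burn-rate case above (a 14x threshold over 1 hour against an incident burning at 8x for 4 hours) can be replayed against candidate alert rules before changing anything. This is a rough sketch under stated assumptions: burn rate is taken as the observed error rate divided by the error rate the target allows, the hourly series is invented to match the example, and the 6x rules and the 30-day budget window are hypothetical alternatives rather than recommendations.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return error_rate / (1.0 - slo_target)


def alert_fires(hourly_burn: list[float], threshold: float, window_hours: int) -> bool:
    """True if the mean burn rate over any rolling window reaches the threshold."""
    for end in range(window_hours, len(hourly_burn) + 1):
        window = hourly_burn[end - window_hours:end]
        if sum(window) / window_hours >= threshold:
            return True
    return False


# Hypothetical hourly burn rates: quiet, then 8x for 4 hours, then quiet.
hourly_burn = [0.0, 0.0, 8.0, 8.0, 8.0, 8.0, 0.0, 0.0]

# The original rule: 14x sustained over 1 hour.
fires_14x_1h = alert_fires(hourly_burn, threshold=14.0, window_hours=1)

# Hypothetical alternatives with a lower multiplier.
fires_6x_1h = alert_fires(hourly_burn, threshold=6.0, window_hours=1)
fires_6x_6h = alert_fires(hourly_burn, threshold=6.0, window_hours=6)

# Share of an assumed 30-day (720-hour) budget spent by 8x burn for 4 hours.
budget_spent = 8.0 * 4 / (30 * 24)

print("14x over 1h fires:", fires_14x_1h)
print("6x over 1h fires:", fires_6x_1h)
print("6x over 6h fires:", fires_6x_6h)
print(f"Budget spent: {budget_spent:.1%}")
```

With this series, only the 6x rule over 1 hour fires. The 6x rule over 6 hours does not, because two quiet hours dilute the 4-hour burn to a mean of about 5.3x, so replaying real incident data against each candidate rule is worth doing before adopting it.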
The adjustment is concrete. Each post-incident review produces zero, one, or two specific changes. Over many incidents, the cumulative changes produce an SLO practice that has been tuned by reality rather than designed by intuition.
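As one illustration of how concrete these checks can be, the target and pipeline-lag cases from this section reduce to short arithmetic. The sketch below assumes a 30-day SLO window and uses hypothetical targets (99.9% and 99.95%); only the 30-minute impact, the 8-minute incident, and the 5-minute lag come from the examples in this section.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed minutes of failure for a target over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def exhausts_budget(impact_minutes: float, budget_minutes: float) -> bool:
    """True if the incident alone uses up the given budget."""
    return impact_minutes >= budget_minutes


def visible_minutes(incident_minutes: float, pipeline_lag_minutes: float) -> float:
    """How long the metric could see the incident while it was still ongoing."""
    return max(incident_minutes - pipeline_lag_minutes, 0.0)


impact = 30.0  # minutes of customer impact

# Hypothetical targets over an assumed 30-day window.
loose_budget = error_budget_minutes(0.999)    # about 43.2 minutes
tight_budget = error_budget_minutes(0.9995)   # about 21.6 minutes

print("Exhausts 99.9% budget:", exhausts_budget(impact, loose_budget))
print("Exhausts 99.95% budget:", exhausts_budget(impact, tight_budget))

# An 8-minute incident behind a 5-minute pipeline lag.
print("Minutes visible to the metric:", visible_minutes(8.0, 5.0))
```

Under these assumptions the 30-minute incident fits inside the looser budget but exhausts the tighter one, which is the sense in which a tighter target would have turned it into a budget event.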
Learn
The compounding return on post-incident SLO review is real. Each incident teaches the practice something. Over years of incidents, the SLO model improves until it consistently catches the issues customers experience.
- Each incident teaches the SLO model: The SLI definitions get tighter. The thresholds get more accurate. The dimensions covered get broader. Each adjustment closes one specific gap; over time the closed gaps add up.
- Compound improvement over years: An SLO practice that has been tuned by 50 incidents over two years is qualitatively different from one that has not been adjusted since launch. The mature practice catches issues the immature one would have missed.
- Cross-team learning: The lessons from one team's incidents inform other teams' SLO practices. "Service A's SLO missed an upstream dependency degradation that Service B's caught" is information that helps Service A's team. The cross-team review is part of the practice.
- Better targets over time: The SLO targets themselves drift toward correctness. A target that was originally aspirational becomes realistic; a target that was originally too loose gets tightened. The targets settle at levels the architecture actually supports.
- Higher operational maturity: A team with a mature SLO practice runs incidents differently. The SLO is the source of truth; the responses are mechanical; the postmortems produce specific gaps to close. The whole reliability culture is healthier.
SLO review after every incident is the discipline that produces a reliability practice that actually works rather than one that looks good on paper. Nova AI Ops surfaces the SLO performance during each incident's window, generates the post-incident review template with the relevant data pre-filled, and tracks the SLO adjustments over time so the practice's improvement trajectory is visible.