The Quarterly Incident Prevention Sprint
Once a quarter, dedicate engineering time to prevention. Here is the sprint format.
Scope
An incident prevention sprint dedicates engineering time to prevention work that pays back across quarters. The sprint is structured, its scope follows the incident patterns, and its deliverables are concrete rather than aspirational. The scope has four parts:
- Top five incident causes from the last quarter. The team reviews the quarter's postmortems and identifies the five most common contributing factors. The sprint's scope is data-driven rather than opinion-driven.
- Engineering work to prevent recurrence. Per cause, the engineering response is identified: better monitoring, automation, refactoring, infrastructure improvement. The output is a concrete change rather than a writeup.
- Per-cause action items with effort estimates. Each cause yields specific action items, and each item carries an effort estimate. Sprint capacity is checked against the total estimate so the plan stays realistic.
- Documented rationale. Why each cause was selected, which action item addresses it, what success looks like. The reasoning survives the sprint and informs the next one.
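The scoping steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the postmortem schema (a dict with a `factors` list), the `ActionItem` fields, and the `focus_factor` discount for meetings and interrupts are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ActionItem:
    cause: str            # contributing factor this item addresses
    description: str      # the concrete change to ship
    estimate_days: float  # effort estimate in engineer-days


def top_causes(postmortems, n=5):
    """Return the n most common contributing factors as (factor, count) pairs.

    Each postmortem is assumed to be a dict with a 'factors' list.
    A factor is counted once per postmortem, even if it is listed twice.
    Ties are broken alphabetically so the ranking is deterministic.
    """
    counts = Counter()
    for pm in postmortems:
        counts.update(set(pm["factors"]))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:n]


def fits_capacity(items, engineers, sprint_days=10, focus_factor=0.8):
    """Compare total estimated effort with available engineer-days.

    sprint_days=10 models a two-week sprint; focus_factor is an assumed
    discount for time lost to meetings and interrupts.
    Returns (fits, needed_days, available_days).
    """
    needed = sum(item.estimate_days for item in items)
    available = engineers * sprint_days * focus_factor
    return needed <= available, needed, available
```

Calling `top_causes(postmortems)` on the quarter's postmortems gives the ranked list that sets the scope, and `fits_capacity(items, engineers=4)` shows whether the proposed action items fit before the sprint starts. If they do not fit, the lowest-ranked cause is the natural candidate to defer.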
Commit
Commitment matters. Without protected time, prevention work slides under feature pressure, so the sprint has to be defended by leadership rather than left to individual contributors (ICs).
- Reserve two weeks. Block two weeks of dedicated engineering time, and do not let feature work consume the calendar in week one.
- Defend against feature pressure. Feature pressure always exists. Leadership commitment is what makes the protection real: the sprint sits on the roadmap as a fixed item, not something to renegotiate each quarter.
- Without protection, prevention slides. Unprotected prevention work always loses to features. Accept this, build explicit protection, and stop relying on engineer goodwill.
- Documented commitment. Put the commitment in writing. Future sprints can then reference the precedent, so the practice compounds rather than restarting from zero each quarter.
Track
The sprint's effectiveness is tracked. Recurrence of the addressed causes is the headline metric; without measurement, the sprint becomes a feel-good event rather than an engineering practice.
- Recurrence after the sprint. Measure whether the addressed causes recur in subsequent quarters. If they do, the work was insufficient and the response gets re-scoped.
- Recurrence means the work did not take. Treat recurrence as a signal, not a failure. The team catches it and plans further work rather than pretending the original sprint solved the problem.
- Adjust. Subsequent sprints adjust based on data: causes that recurred get more attention, new causes get added, the priority order evolves with the incident pattern.
- Track over time and document. Cumulative sprint effects become visible across quarters as the trend in the team's incident rate. Documented tracking justifies continued investment to leadership.
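The recurrence metric described above can be computed from the same incident data. The sketch below assumes incidents are grouped by quarter label and that each incident is a dict with a `factors` list; both the schema and the function names are hypothetical.

```python
def recurrence_by_cause(addressed, incidents_by_quarter):
    """Count post-sprint incidents per addressed cause, per quarter.

    addressed: cause names the sprint targeted.
    incidents_by_quarter: dict mapping a quarter label to a list of
        incidents, each assumed to be a dict with a 'factors' list.
    Returns {cause: {quarter: incident_count}}.
    """
    report = {}
    for cause in addressed:
        report[cause] = {
            quarter: sum(1 for inc in incidents if cause in inc["factors"])
            for quarter, incidents in incidents_by_quarter.items()
        }
    return report


def needs_rescoping(report):
    """Return the causes that recurred in at least one later quarter."""
    return sorted(
        cause for cause, per_quarter in report.items()
        if any(per_quarter.values())
    )
```

The causes returned by `needs_rescoping` are the ones to carry into the next sprint's scope. Keeping each quarter's report alongside the sprint's written rationale gives the record over time that the tracking step calls for.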
The incident prevention sprint is an engineering investment that produces ongoing reliability improvement. Nova AI Ops integrates with incident data, surfaces patterns, and supports the team's prevention discipline.