SLO Incident Correlation
Incidents and SLO breaches.
Tracking SLO impact per incident
Tying every incident to SLO budget consumption turns abstract reliability work into a concrete number. Budget consumed, services affected, and duration recorded per incident produce a comparable artefact across the year.
- SLO impact recorded per incident. Capture the triplet of budget consumed, services affected, and duration during the postmortem; the impact becomes a number rather than an impression.
- Postmortem budget table. "Consumed X percent of monthly budget" line in every postmortem; comparable across incidents and quarters.
- Quarterly aggregate by cause class. Cause-class consumption view per quarter surfaces patterns individual incidents hide.
- Documented impact per incident. Named budget-consumption number per incident; supports honest reporting rather than vibes.
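The per-incident number can be derived with simple arithmetic. The following is a minimal sketch, assuming a time-based availability SLO over a 30-day window; the names `IncidentImpact`, `budget_minutes`, `record_impact`, and the `failure_fraction` parameter are hypothetical, as is the sample incident.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class IncidentImpact:
    """The per-incident triplet: budget consumed, services affected, duration."""
    incident_id: str
    services_affected: Tuple[str, ...]
    duration_minutes: float
    budget_consumed_pct: float


def budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed bad minutes in the window (99.9% over 30 days is about 43.2)."""
    return (1.0 - slo_target) * window_days * 24 * 60


def record_impact(incident_id: str, services: Sequence[str],
                  duration_minutes: float, slo_target: float,
                  failure_fraction: float = 1.0,
                  window_days: int = 30) -> IncidentImpact:
    # Assumption: bad minutes = duration scaled by the share of traffic that failed.
    bad_minutes = duration_minutes * failure_fraction
    pct = 100.0 * bad_minutes / budget_minutes(slo_target, window_days)
    return IncidentImpact(incident_id, tuple(services), duration_minutes, round(pct, 1))


# Hypothetical incident: a 21.6-minute full outage against a 99.9% monthly SLO.
impact = record_impact("INC-101", ["checkout", "payments"], 21.6, slo_target=0.999)
print(f"{impact.incident_id}: consumed {impact.budget_consumed_pct} "
      f"percent of monthly budget")
```

A request-based SLO would replace the minutes arithmetic with failed requests over allowed failed requests; the recorded triplet stays the same.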
Aggregating to find patterns
Aggregation surfaces the patterns individual incidents hide. Top causes, top contributing services, and time-of-day concentration each tell a different story about where investment should land.
- Top causes per quarter. Deploy-related, dependency-failure, configuration, capacity breakdown; the dominant cause class drives priority.
- Top contributing services. Per-service contribution; the same service repeatedly burning budget points to architectural fragility, not bad luck.
- Time-of-day patterns. Deploy-window, peak-traffic, off-hours concentration; the timing shows where staffing or process changes would bend the curve.
- Named pattern owner per quarter. Responsible analyst per cycle; "we never actually looked across incidents" is the failure mode without an owner.
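The three aggregation views above can be sketched as one grouping function applied with different keys. This is an illustrative sketch only: the record fields, the sample incidents, and the window boundaries in `time_window` are all assumptions, not values from any real system.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical postmortem records; budget_pct is percent of monthly budget consumed.
incidents = [
    {"id": "INC-101", "cause": "deploy", "service": "checkout",
     "started_at": datetime(2024, 1, 9, 10, 30), "budget_pct": 30.0},
    {"id": "INC-102", "cause": "dependency-failure", "service": "search",
     "started_at": datetime(2024, 1, 20, 19, 5), "budget_pct": 12.5},
    {"id": "INC-103", "cause": "deploy", "service": "checkout",
     "started_at": datetime(2024, 2, 6, 11, 15), "budget_pct": 20.0},
    {"id": "INC-104", "cause": "configuration", "service": "auth",
     "started_at": datetime(2024, 3, 2, 3, 40), "budget_pct": 5.0},
]


def time_window(started_at: datetime) -> str:
    # Assumed boundaries; replace with the org's real deploy and traffic hours.
    hour = started_at.hour
    if 10 <= hour < 12:
        return "deploy-window"
    if 18 <= hour < 22:
        return "peak-traffic"
    if hour < 8 or hour >= 22:
        return "off-hours"
    return "other"


def total_by(records, key):
    """Sum budget consumption per group, largest consumer first."""
    totals = defaultdict(float)
    for rec in records:
        totals[key(rec)] += rec["budget_pct"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


by_cause = total_by(incidents, lambda r: r["cause"])
by_service = total_by(incidents, lambda r: r["service"])
by_window = total_by(incidents, lambda r: time_window(r["started_at"]))
```

Each record here carries a single primary service; incidents spanning several services would need an explicit attribution rule before grouping.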
Investment decisions from patterns
Patterns drive investment decisions: the top cause class directs engineering work, the top service triggers architectural review, and time-of-day concentration informs staffing.
- Top cause class drives engineering. 60 percent deploy-related means deploy reliability is the quarter's priority; the data picks the work.
- Top service drives architectural review. Repeated incidents in the same service warrant structural change, not yet another tactical fix.
- Staffing follows time-of-day concentration. Off-hours burn pattern may indicate the need for follow-the-sun coverage or schedule changes.
- Documented driver per decision. Named pattern-to-investment linkage in writing; supports honest prioritisation when the next quarter argues differently.
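The "60 percent deploy-related" figure is a share of consumed budget, and the written linkage can be generated from the same numbers. This is a small sketch under assumed inputs; `cause_shares`, `driver_note`, and the quarterly totals are hypothetical.

```python
def cause_shares(consumed_by_cause):
    """Return (cause, percent of consumed budget) pairs, largest share first."""
    total = sum(consumed_by_cause.values())
    if total == 0:
        return []
    shares = [(cause, round(100.0 * value / total, 1))
              for cause, value in consumed_by_cause.items()]
    return sorted(shares, key=lambda pair: pair[1], reverse=True)


def driver_note(quarter, consumed_by_cause):
    """One line for the decision record naming the pattern behind the priority."""
    shares = cause_shares(consumed_by_cause)
    if not shares:
        return f"{quarter}: no budget-consuming incidents recorded"
    cause, pct = shares[0]
    return f"{quarter}: top cause class is {cause} at {pct} percent of consumed budget"


# Hypothetical quarterly totals, in budget minutes consumed per cause class.
q1 = {"deploy": 36.0, "dependency-failure": 12.0, "configuration": 6.0, "capacity": 6.0}
note = driver_note("2024-Q1", q1)
```

Keeping the note next to the decision gives the following quarter's review something concrete to check against.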
The correlation dashboard
The dashboard makes the patterns visible. Burn-down by cause, per-service contribution, and recent budget-impacting incidents render the analysis as a single view.
- Quarterly SLO budget burn-down. Stacked-by-cause-class view; trend visibility per quarter without manual aggregation.
- Per-service contribution table. Sortable per-service view; drill-down to specific incidents supports investigation.
- Recent budget-impacting incidents. Live recent-incident list; quick reference for ongoing context during reviews.
- Named owner per dashboard. Responsible reliability lead per org; without an owner, dashboards go stale and mislead rather than inform.
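The data behind a stacked burn-down can be sketched as per-period consumption split by cause class, plus the budget remaining after each period. This is a minimal sketch, not a dashboard implementation: the `burn_down` function, the record fields, and the sample figures (a 129.6-minute quarterly budget, roughly 99.9% over 90 days) are assumptions.

```python
from collections import defaultdict


def burn_down(incidents, quarter_budget_minutes, periods):
    """Build one row per period: stacked consumption by cause and remaining budget."""
    stacked = {period: defaultdict(float) for period in periods}
    for inc in incidents:
        # An incident outside the listed periods raises KeyError rather than vanishing.
        stacked[inc["period"]][inc["cause"]] += inc["bad_minutes"]

    rows = []
    remaining = quarter_budget_minutes
    for period in periods:
        remaining -= sum(stacked[period].values())
        rows.append({
            "period": period,
            "by_cause": dict(stacked[period]),          # one stack segment per cause
            "remaining_minutes": round(remaining, 1),   # the burn-down line
        })
    return rows


# Hypothetical quarter of budget-impacting incidents.
sample = [
    {"period": "2024-01", "cause": "deploy", "bad_minutes": 13.0},
    {"period": "2024-01", "cause": "dependency-failure", "bad_minutes": 5.5},
    {"period": "2024-02", "cause": "deploy", "bad_minutes": 8.5},
    {"period": "2024-03", "cause": "configuration", "bad_minutes": 2.0},
]
rows = burn_down(sample, 129.6, ["2024-01", "2024-02", "2024-03"])
for row in rows:
    print(row["period"], row["by_cause"], row["remaining_minutes"])
```

The same rows can feed whichever charting tool the dashboard uses; a negative `remaining_minutes` would mean the quarter's budget is exhausted.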
Review cadence
Reviews run at three cadences: monthly for trend-spotting, quarterly for investment decisions, and annual for reliability strategy. Each cadence answers a different question.
- Monthly SLO review. Incident correlation included in the monthly cycle; trends surface while there is still time to act on them.
- Quarterly engineering review. Investment-decision review per quarter; engineering hours follow the data rather than the loudest voice.
- Annual reliability strategy. Multi-quarter pattern review feeds multi-year roadmap; one quarter is noise, four quarters is signal.
- Documented output per cadence. Decisions or actions named at every review; "we reviewed but didn't decide" is the failure mode.