Burn Rate Formula
Burn rate = observed error rate / acceptable error rate, measured over a time window.
Calculate
Burn rate is the metric that translates SLO error budget consumption into a number you can alert on. The formula is straightforward; the discipline is in applying it consistently across services and reading the result correctly. A team that understands burn rate has a better intuition for SLO health than a team staring at percentage numbers.
What the calculation actually is:
- Per service: actual error rate / acceptable error rate.: The burn rate is the ratio of how fast errors are accumulating to how fast they would accumulate at the SLO threshold. If your SLO allows 0.1% errors and you are seeing 1.0% errors, your burn rate is 10. You are burning budget 10 times faster than the budget can sustain.
- Burn rate greater than 1 means the budget runs out early.: Any burn rate above 1.0 means you are consuming budget faster than you can afford long-term. A sustained burn rate of 1.5 for the entire month would consume 1.5x the available budget. The number maps directly to how dire the situation is: at a sustained burn rate of B, the budget lasts the SLO period divided by B.
- Window matters.: The burn rate is calculated over a specific time window (5 minutes, 1 hour, 6 hours, 1 day). The same incident produces different burn rates at different windows. Short windows catch spikes; long windows catch drift.
- Acceptable error rate is fixed.: The denominator (acceptable error rate) is 1 minus the SLO target, expressed as a fraction of requests: a 99.9% SLO allows 0.1% of requests to fail. It does not change with the window. Only the numerator (observed error rate) changes.
- Calculations are per-service.: Each service has its own SLO and its own burn rate. The aggregation across services is not a single burn rate; it is a list of per-service burn rates. The list shows where the budget is being spent.
The formula is the foundation. Every alert and every dashboard about burn rate builds on this calculation.
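The calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `burn_rate`, the service names, and the request counts are all hypothetical, and it assumes the SLO is expressed as a success ratio such as 0.999 and that error and request counts are already available for the chosen window.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Ratio of the observed error rate to the error rate the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if requests == 0:
        return 0.0  # no traffic in the window, so no budget was spent
    observed = errors / requests   # numerator: changes with every window
    allowed = 1.0 - slo_target     # denominator: fixed by the SLO
    return observed / allowed


# Hypothetical counts for one window: (errors, requests, SLO target).
window_counts = {
    "checkout": (100, 10_000, 0.999),  # 1.0% errors against a 0.1% allowance
    "search": (5, 10_000, 0.999),      # 0.05% errors against a 0.1% allowance
}

# Burn rates stay per-service; they are not collapsed into one number.
rates = {
    name: burn_rate(errors, requests, slo)
    for name, (errors, requests, slo) in window_counts.items()
}

for name, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: burn rate {rate:.1f}")
```

Sorting by burn rate puts the service spending budget fastest at the top, which matches the "list of per-service burn rates" view rather than a single aggregate.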
Alert
The simplest burn rate alert pages when the burn rate exceeds a threshold for a sustained window. The threshold and window are tuned together; the goal is to catch incidents fast without firing on every momentary blip.
- Burn rate over 14x for 1 hour: page.: A burn rate of 14 sustained for an hour means budget is being consumed at 14 times the sustainable rate. At that pace, a 30-day budget is exhausted in about 2 days (30 / 14 ≈ 2.1). The incident is real; the page is justified.
- Catches sustained incidents fast.: The 14x threshold and 1-hour window are tuned together. A mild momentary spike does not lift the hourly burn rate above 14; a sustained issue does. The signal-to-noise ratio is favorable.
- SRE workbook standard.: The specific 14.4x threshold for a 1-hour window comes from the SRE workbook math: sustained for 1 hour, it consumes 2% of a 30-day budget (14.4 × 1 hour / 720 hours = 0.02). The number is not arbitrary.
- Faster window for criticals.: Higher-tier SLOs can use a 5-minute window with a higher threshold (around 36x, for example). The faster window catches incidents earlier at the cost of a thinner signal-to-noise margin.
- Slower window for trends.: Slower windows (6 hours, 1 day) with lower thresholds catch sustained drift that would not trigger the fast window. The combination of fast and slow alerts produces full coverage.
The single-window burn rate alert is a starting point. Most production teams move to multi-window quickly because the precision improves significantly.
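The threshold arithmetic behind these numbers can be written out as a short sketch. The helper names `burn_rate_threshold` and `days_to_exhaustion` are hypothetical, and the 30-day SLO period is an assumption taken from the examples above.

```python
def burn_rate_threshold(
    budget_fraction: float, window_hours: float, period_days: float = 30.0
) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    period_hours = period_days * 24.0
    return budget_fraction * period_hours / window_hours


def days_to_exhaustion(rate: float, period_days: float = 30.0) -> float:
    """Days until the whole budget is gone if `rate` is sustained."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return period_days / rate


# 2% of a 30-day budget spent within 1 hour gives the familiar ~14.4x threshold.
page_threshold = burn_rate_threshold(budget_fraction=0.02, window_hours=1.0)
print(f"page threshold: {page_threshold:.1f}x")
print(f"budget lasts {days_to_exhaustion(page_threshold):.1f} days at that rate")
```

Working from a budget fraction and a window, rather than memorizing 14.4, makes it easier to derive thresholds for other windows or SLO periods.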
Multi-window
Multi-window burn rate alerting requires the burn rate to exceed thresholds at multiple windows simultaneously. The combination is more precise than any single window; it catches both spikes and drift with fewer false positives.
- Different windows, different thresholds.: A typical configuration: 14x at 1 hour AND 14x at 5 minutes triggers a page. The 5-minute window confirms the problem is happening right now; the 1-hour window confirms it has burned enough budget to matter. Both must agree before the alert fires.
- Catches both spikes and drift.: The 1-hour window keeps a brief, mild blip from paging, because the blip has to burn a meaningful share of budget before the hourly rate crosses the threshold. The 5-minute window drops back below the threshold within minutes of recovery, so the alert stops firing instead of lingering until the hour clears. Slower drift is covered by a second pair of longer windows with a lower threshold.
- Page tier and ticket tier.: Different threshold combinations route to different actions. Critical thresholds page; medium thresholds open a ticket without paging. The tier mapping makes the alert load sustainable.
- Tested before deployment.: Before turning on multi-window alerts, the team back-tests them against historical incidents. Did the alert fire when it should have? Did it stay silent when it should have? The back-test produces confidence in the configuration.
- Calibrated over time.: The thresholds are not static. As the service matures and the SLO definition tightens, the thresholds adjust. The calibration is part of the SLO review cadence.
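The multi-window rule and the page/ticket tiers can be sketched as a small policy table plus one decision function. Everything named here (`WindowPair`, `POLICY`, `decide`) is hypothetical, the burn rates are assumed to be precomputed per window, and the ticket-tier values are illustrative rather than recommended.

```python
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


@dataclass(frozen=True)
class WindowPair:
    long_window: str   # confirms enough budget has been burned
    short_window: str  # confirms the problem is still happening
    threshold: float   # burn rate both windows must exceed
    action: str        # "page" or "ticket"


# Illustrative policy, ordered from most to least urgent.
POLICY = (
    WindowPair("1h", "5m", 14.4, "page"),
    WindowPair("3d", "6h", 1.0, "ticket"),
)


def decide(
    burn_rates: Mapping[str, float], policy: Sequence[WindowPair] = POLICY
) -> Optional[str]:
    """Return the first action whose long AND short windows both exceed the threshold."""
    for pair in policy:
        long_hot = burn_rates[pair.long_window] > pair.threshold
        short_hot = burn_rates[pair.short_window] > pair.threshold
        if long_hot and short_hot:
            return pair.action
    return None


# Hypothetical snapshots of per-window burn rates for one service.
ongoing_incident = {"5m": 20.0, "1h": 16.0, "6h": 4.0, "3d": 0.5}
recovered_spike = {"5m": 0.2, "1h": 16.0, "6h": 0.8, "3d": 0.3}
slow_drift = {"5m": 2.0, "1h": 2.0, "6h": 1.8, "3d": 1.3}

for label, snapshot in [
    ("ongoing incident", ongoing_incident),
    ("recovered spike", recovered_spike),
    ("slow drift", slow_drift),
]:
    print(label, "->", decide(snapshot))
```

The same `decide` function can be replayed over historical burn-rate snapshots as a simple back-test before the policy is turned on.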
Burn rate is the lingua franca of SLO operations; teams that internalize the formula and the multi-window pattern produce alerts that catch real incidents and stay quiet otherwise. Nova AI Ops integrates with SLO platforms, calculates burn rate per service, and tunes the multi-window alerts so the team gets the right page at the right time.