Burn Rate vs SLO Burn-Down
Two related but distinct concepts.
Burn rate
Two related concepts often get conflated in SLO discussions: burn rate and total burn-down. They measure different things, alert on different conditions, and inform different decisions. Treating them as the same metric loses signal; tracking both separately produces a much clearer picture of what the SLO is doing.
What burn rate actually is:
- Speed of consumption.: Burn rate is how fast the error budget is being consumed, expressed as a multiple of the rate that would spend exactly the whole budget over the SLO window. A 1x burn rate means the budget runs out exactly as the window ends; anything below 1x leaves budget unspent. A 14x burn rate means consumption is 14 times faster than sustainable, which would exhaust a 30-day budget in just over two days.
- Per moment in time.: Burn rate is a near-instantaneous measurement, computed over a short trailing window such as 1 hour or 1 day. It tells you what is happening now, not what has happened cumulatively. A spike in burn rate is the leading indicator that the system has shifted.
- Independent of total budget remaining.: The same burn rate has different consequences depending on how much budget is left: 6x against a nearly full 30-day budget leaves days to react, while 6x with 10% remaining exhausts the budget in about 12 hours. But the rate itself is the same. Burn rate measures velocity, not position.
- Maps to SRE math.: Google's Site Reliability Workbook recommends burn rate alerts at specific thresholds for a 30-day SLO: 14.4x over 1 hour, 6x over 6 hours, and 1x over 3 days. Each threshold corresponds to a fixed share of the budget consumed within that window (2%, 5%, and 10% respectively), which translates into a "you will exhaust your budget by X if this continues" projection. The math is the foundation.
- Drives short-term incident response.: A high burn rate fires alerts. The on-call investigates. The system stabilizes. The burn rate drops. This is the operational signal that an incident is happening or recovering, in real time.
Burn rate is the operational metric. It is what the on-call watches; it is what alerts fire on; it is what the deploy gate measures during canary.
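The arithmetic behind the bullets above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the request-count inputs, and the 30-day default window are assumptions made for the example.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total_events == 0:
        return 0.0  # no traffic in the lookback window, so nothing was burned
    allowed_error_rate = 1.0 - slo_target
    observed_error_rate = bad_events / total_events
    return observed_error_rate / allowed_error_rate


def budget_fraction_consumed(rate: float, alert_window_hours: float,
                             slo_window_days: float = 30.0) -> float:
    """Share of the full window's budget spent if `rate` holds for the alert window."""
    return rate * alert_window_hours / (slo_window_days * 24.0)


def days_to_exhaustion(rate: float, slo_window_days: float = 30.0) -> float:
    """Days until a full budget is gone if `rate` is sustained."""
    if rate <= 0:
        return float("inf")
    return slo_window_days / rate


# Hypothetical numbers: 120 failed requests out of 100,000 in the last hour,
# measured against a 99.9% availability SLO.
rate = burn_rate(bad_events=120, total_events=100_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget spent by 1h at 14.4x: {budget_fraction_consumed(14.4, 1):.1%}")
print(f"days to exhaustion at 14.4x: {days_to_exhaustion(14.4):.2f}")
```

The second function shows why a threshold is paired with a window: a rate only becomes a budget statement once it is multiplied by how long it has been sustained.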
Burn-down
Total burn-down is different. It is the cumulative consumption of the error budget across the SLO window. It tells you what fraction of the budget has been spent and how much remains. The burn-down chart is what stakeholders look at to see whether the SLO will be met for the period.
- Cumulative consumption.: Burn-down adds up every bad minute (or failed request) across the window. A 30-day window starts with 100% of the budget; each minute of failures consumes some fraction; the running total is the burn-down. The chart slopes downward as time progresses, and the steepness of the slope is the average burn rate.
- Across the SLO window.: Burn-down is computed over the full SLO window (typically 30 days). It captures the integral of failures over time. A 4-hour outage shows up as a step down in the chart; a sustained 1.5x burn rate shows up as a steeper slope.
- Predicts whether SLO will be met.: Comparing the budget remaining with the time remaining, you can project whether the SLO will be met for the rest of the window. In a 30-day window, a team at 30% remaining with 20 days to go is at risk, because an even pace would leave about 67%; the same team at 50% remaining with 10 days to go is fine, because an even pace would leave only about 33%.
- Drives medium-term planning.: Burn-down feeds the deploy/freeze decision. The error budget policy fires when burn-down crosses a threshold. The reliability sprint priority follows from how much budget is being consumed and where.
- Visible to stakeholders.: The burn-down chart is what executives, customers, and product teams see. It tells the story over the period rather than the moment. The aggregate is more useful at executive altitude than the instantaneous rate.
Burn-down is the strategic metric. It is what informs investment decisions, customer comms, and SLO-policy enforcement.
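Burn-down as the running integral of burn rate can be sketched as follows. The sketch assumes one average burn rate per elapsed day and roughly even traffic across days; `burn_down` and `projected_remaining` are hypothetical helper names, and the sample rates are made up for illustration.

```python
def burn_down(daily_burn_rates: list[float], window_days: int = 30) -> list[float]:
    """Remaining budget fraction after each elapsed day.

    A day spent at burn rate r consumes r / window_days of the budget,
    so the series is 1.0 minus the running sum of those daily shares.
    """
    remaining = 1.0
    series = []
    for r in daily_burn_rates:
        remaining -= r / window_days
        series.append(remaining)
    return series


def projected_remaining(daily_burn_rates: list[float], window_days: int = 30) -> float:
    """Budget left at window end if the average pace so far continues.

    A negative result means the budget would be overspent.
    """
    elapsed = len(daily_burn_rates)
    if elapsed == 0:
        return 1.0
    average_rate = sum(daily_burn_rates) / elapsed
    spent_so_far = average_rate * elapsed / window_days
    spent_later = average_rate * (window_days - elapsed) / window_days
    return 1.0 - spent_so_far - spent_later


# Hypothetical history: ten quiet days at 0.8x, then four days at 2.0x.
history = [0.8] * 10 + [2.0] * 4
print(f"remaining after day 14: {burn_down(history)[-1]:.1%}")
print(f"projected at window end: {projected_remaining(history):.1%}")
```

Extrapolating the average pace is the simplest possible projection; a team might reasonably weight recent days more heavily, which this sketch does not attempt.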
Alerting
The right alerting strategy uses both metrics. Burn rate alerts catch the sudden incidents; burn-down alerts catch the sustained drift. Each catches what the other misses.
- Rate alerts catch sudden issues.: A 14x burn rate alert can fire within minutes of a severe incident starting. The on-call gets paged; the response begins. The alert is responsive to short-term events; it does not wait for the cumulative impact to be obvious.
- Burn alerts catch sustained issues.: A consistent 1.5x burn rate over two weeks stays far below fast-burn paging thresholds such as 14.4x or 6x, yet it consumes 70% of a 30-day budget in those 14 days, where an even pace would have consumed about 47%. The team is on track to exhaust the budget around day 20 and miss the SLO; the burn-down threshold catches this even when no single incident does.
- Layer the alerts.: Both alert types are wired up. Rate alerts page on-call for live response; burn-down alerts open tickets for medium-term investigation. The two layers do not duplicate each other; they catch different patterns.
- Different audiences for different alerts.: Rate alerts go to on-call. Burn-down alerts go to the service team's primary owner and the engineering manager. The escalation path differs because the timeframe and the action differ.
- Combined dashboard.: The SLO dashboard shows both metrics. Burn rate as a real-time number; burn-down as a chart trending toward exhaustion. The team can read both at a glance and understand both the live state and the trajectory.
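One way to picture the layering is a single evaluation step that checks both signals and routes them differently. The sketch below is illustrative only: the paging thresholds are the fast-burn pairs mentioned earlier, while the 10-point burn-down margin, the `evaluate` function, and the action strings are assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateRule:
    window_hours: int
    threshold: float


# Fast-burn rules that page on-call.
PAGE_RULES = (RateRule(1, 14.4), RateRule(6, 6.0))

# Hypothetical policy: open a ticket when budget spent runs at least
# 10 percentage points ahead of the share of the window that has elapsed.
BURN_DOWN_TICKET_MARGIN = 0.10


def evaluate(burn_rates_by_window: dict[int, float],
             budget_consumed: float,
             window_elapsed: float) -> list[str]:
    """Return the actions triggered by the current rate and burn-down state."""
    actions = []
    for rule in PAGE_RULES:
        observed = burn_rates_by_window.get(rule.window_hours, 0.0)
        if observed >= rule.threshold:
            actions.append(f"page: {rule.threshold}x over {rule.window_hours}h")
    if budget_consumed - window_elapsed >= BURN_DOWN_TICKET_MARGIN:
        actions.append("ticket: burn-down ahead of pace")
    return actions


# Sustained drift: 1.5x for 14 of 30 days. No page, but a ticket.
print(evaluate({1: 1.5, 6: 1.5}, budget_consumed=0.70, window_elapsed=14 / 30))

# Sudden incident early in a healthy window: a page, but no ticket.
print(evaluate({1: 20.0, 6: 4.0}, budget_consumed=0.10, window_elapsed=0.30))
```

Keeping the two checks in separate branches with separate destinations mirrors the point above: neither condition implies the other, so neither alert is redundant.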
Burn rate and burn-down are complementary, not interchangeable. Nova AI Ops computes both per service, alerts on each independently with appropriately tuned thresholds, and surfaces them together on the dashboard so the team has the right signal for both live response and medium-term planning.