Multi-Region SLO Rollup
Aggregate regional SLOs into a global SLO.
Math
The naive way to compute a global SLO from regional ones is to take the average. It is also wrong, in a way that almost guarantees you will report a number that does not match user experience. The right way is a traffic-weighted aggregate, and the difference between the two can be the difference between a healthy report and a hidden customer-impacting outage.
Why straight averaging breaks:
- Regions have unequal traffic. Say US-East serves 60% of your requests, EU-West 25%, and AP-South 15%. A regional outage in AP-South affects 15% of users. A regional outage in US-East affects 60%. Averaging the regional SLOs treats both as equally bad. Your customers do not.
- Weight by traffic, not by region count. The global SLO numerator is the sum of successful requests across all regions. The denominator is the sum of all requests. That ratio is the only honest aggregate. Region count does not enter the math.
- Time of day matters. Traffic shifts across regions over the course of a day. The weights at 3 AM UTC are different from the weights at 9 PM UTC. Your aggregate calculation must use the actual per-minute traffic distribution, not a static weight.
- The rolling window stays consistent. The SLO window (28 days or a calendar month) stays the same; what changes is each region's contribution within that window. Recompute the weighted aggregate for each report rather than reusing cached weights.
The traffic-weighted aggregate is the only number that matches what customers actually experienced. It is also the only number that survives audit when a stakeholder asks "show me the math behind that number."
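To make the difference concrete, here is a minimal Python sketch comparing the two calculations. The request counts are invented for illustration and follow the 60/25/15 traffic split described above; `naive_average` and `traffic_weighted` are hypothetical helper names, not part of any library.

```python
# Hypothetical (good, total) request counts per region for one SLO window.
# All numbers are invented for illustration; they follow a 60/25/15 traffic split.
ap_south_outage = {
    "us-east": (59_940_000, 60_000_000),   # 99.9%
    "eu-west": (24_975_000, 25_000_000),   # 99.9%
    "ap-south": (14_250_000, 15_000_000),  # 95.0%
}
us_east_outage = {
    "us-east": (57_000_000, 60_000_000),   # 95.0%
    "eu-west": (24_975_000, 25_000_000),   # 99.9%
    "ap-south": (14_985_000, 15_000_000),  # 99.9%
}


def naive_average(counts):
    """Unweighted mean of the regional success ratios (the approach to avoid)."""
    ratios = [good / total for good, total in counts.values()]
    return sum(ratios) / len(ratios)


def traffic_weighted(counts):
    """Sum of successful requests over sum of all requests, across regions."""
    good = sum(g for g, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return good / total


for name, counts in [("AP-South outage", ap_south_outage),
                     ("US-East outage", us_east_outage)]:
    print(f"{name}: naive={naive_average(counts):.4%} "
          f"weighted={traffic_weighted(counts):.4%}")
```

With these made-up counts, the naive average is the same in both scenarios (roughly 98.3%), while the traffic-weighted figure is roughly 99.2% for the AP-South outage and roughly 97.0% for the US-East outage. Because the inputs are raw request counts summed over the window, shifts in traffic by time of day are reflected automatically, with no static weights to maintain.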
Display
Once the math is right, the dashboard has to reflect both the global picture and the regional decomposition. A dashboard that shows only the global number hides regional issues. A dashboard that shows only regional numbers makes it impossible to answer "are we hitting our overall commitment?"
- Global tile up top. The traffic-weighted global SLO is the headline. This is the number that gets quoted to leadership, goes on the customer-facing status page, and drives the deploy/freeze decision.
- Per-region tiles below. Each region has its own tile showing its SLO performance, its traffic share, and its remaining budget. A viewer can scan the row and see "everything is green" or "AP-South is below target."
- Drill-down on every tile. Click a regional tile to see the per-service breakdown within that region. Click a service to see the per-endpoint breakdown. The dashboard supports investigation, not just summary.
- Side-by-side time series. Charting the global SLO and the per-region SLOs on the same time axis lets you see when they diverge. Most regional incidents show up as a regional tile dipping while the global tile barely moves, which is exactly the case the global number alone hides.
- Show traffic share, not just availability. If AP-South is at 90% availability but carries only 5% of traffic, the global error rate rises by just 0.5 percentage points. Customers in AP-South still experience 10% errors. The dashboard must show both numbers so the team can decide whether the regional issue warrants a global response.
The display layer is what turns the rollup math from a single number into a tool the team can actually act on.
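As a rough sketch of the data behind each regional tile, the Python below derives availability, traffic share, contribution to the global error rate, and remaining budget from raw counts. `RegionTile`, `build_tiles`, the 99.9% target, and the request counts are all assumptions made for illustration; the example assumes every region has non-zero traffic and a target below 100%.

```python
from dataclasses import dataclass


@dataclass
class RegionTile:
    region: str
    availability: float               # good / total within the region
    traffic_share: float              # region total / global total
    global_error_contribution: float  # region failures / global total
    budget_remaining: float           # fraction of the region's error budget left


def build_tiles(counts, slo_target):
    """Build one tile per region from hypothetical (good, total) counts."""
    global_total = sum(total for _, total in counts.values())
    tiles = []
    for region, (good, total) in counts.items():
        failed = total - good
        allowed_failures = (1 - slo_target) * total
        tiles.append(RegionTile(
            region=region,
            availability=good / total,
            traffic_share=total / global_total,
            global_error_contribution=failed / global_total,
            # Goes negative once the region has overspent its budget.
            budget_remaining=1 - failed / allowed_failures,
        ))
    return tiles


# Invented counts: AP-South at 90% availability carrying 5% of traffic.
# "other" is a stand-in for all remaining regions combined.
example = {"ap-south": (4_500, 5_000), "other": (95_000, 95_000)}
for tile in build_tiles(example, slo_target=0.999):
    print(tile)
```

For the AP-South tile this yields an availability of 0.9, a traffic share of 0.05, and a global error contribution of 0.005, which are the two views (regional pain and global impact) the dashboard needs to show side by side.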
Alert
Alerting only on a global rollup is exactly how you sleep through a regional outage. The alert thresholds have to be region-aware so that a single region's failure pages someone, even when the traffic-weighted global metric stays green.
- Per-region SLO alerts. Each region's SLO has its own burn-rate alert. AP-South going from healthy to burning fires its own page even if the global aggregate barely moves. The on-call gets the regional context immediately.
- Global aggregate alerts on full-fleet drift. The global SLO alerts on slow drift across the whole fleet, the kind of issue no single region triggers. Both layers fire independently. They catch different failure modes.
- Don't drown in global noise. A single region recovering from an incident contributes a flat band of failed requests to the global metric. If your global threshold is too tight, you alert on the tail of every regional incident. Tune the global alert to fire only when multiple regions are degraded simultaneously, or when a single region's burn is large enough to threaten the global budget.
- Route by region. Regional alerts route to the regional on-call (or to a global on-call with regional context). The page should tell the responder "AP-South is degrading," not "the global SLO is at 99.4% across all regions," because the latter takes more effort to interpret.
- Hold the rollup honest. A region that sits consistently at the bottom of the regional SLO range is exposing a structural issue that consumes part of your global budget every month. The rollup alert should also fire on that sustained consumption, not just on incidents.
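The two alert layers can be sketched as follows, using the common definition of burn rate as the observed error rate divided by the error rate the SLO allows. The function names, the 99.9% target, the paging threshold of 10, and the one-hour counts are all assumptions chosen for illustration, not recommended values.

```python
def burn_rate(good, total, slo_target):
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic in the window, so nothing is burning
    error_rate = (total - good) / total
    return error_rate / (1 - slo_target)


def evaluate_alerts(window_counts, slo_target, page_threshold):
    """Check every region and the global aggregate independently."""
    alerts = []
    for region, (good, total) in window_counts.items():
        if burn_rate(good, total, slo_target) >= page_threshold:
            alerts.append(f"region:{region}")
    global_good = sum(g for g, _ in window_counts.values())
    global_total = sum(t for _, t in window_counts.values())
    if burn_rate(global_good, global_total, slo_target) >= page_threshold:
        alerts.append("global")
    return alerts


# Invented (good, total) counts for the last hour: AP-South is failing 2%
# of its requests while the other regions are healthy.
last_hour = {
    "us-east": (59_990, 60_000),
    "eu-west": (24_995, 25_000),
    "ap-south": (14_700, 15_000),
}
print(evaluate_alerts(last_hour, slo_target=0.999, page_threshold=10))
```

In this example AP-South burns at about 20 times the allowed rate and pages on its own, while the global burn rate is only about 3 and stays below the threshold. That is the case where a global-only alert would have stayed silent.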
Multi-region SLO rollups done right give you both the executive headline and the operator's working surface. Nova AI Ops computes traffic-weighted global SLOs across regions, surfaces per-region drilldown on the same dashboard, and routes alerts to the right on-call by region so a regional issue does not need a global page to wake the right person.