SLO by Service Tier
SLO targets scale with business impact: Tier 0 services carry the strictest.
Tier 0
Not every service deserves the same SLO target. The most expensive failure mode in reliability practice is treating a 50-service fleet as if it were one undifferentiated thing, then setting a single SLO target for all of them. The result is overinvestment in services nobody cares about and underinvestment in the ones that drive revenue. Tiering services by criticality fixes this.
What Tier 0 means in a tiered SLO model:
- Customer-critical and revenue-path: Tier 0 comprises the services whose failure stops the business. Payment processing, authentication, order placement, the core data plane that customer-facing UIs depend on. If these are down, customers cannot transact.
- 99.9% or higher availability: The minimum target. 99.9% allows roughly 43 minutes of downtime per month, which is the upper bound of tolerable customer impact for revenue-critical services. Higher targets (99.95%, 99.99%) apply where the business case justifies the cost.
- Latency p99 under 500 ms: Customer-facing performance targets at Tier 0 are tight because users perceive latency as a quality signal. Slow responses on a payment screen feel broken even when the service is technically working.
- Multi-region failover required: Tier 0 cannot depend on a single region. The SLO target requires architecture that survives the loss of an entire region without breaching availability. This is why Tier 0 is expensive: the redundancy is not optional.
- 24/7 oncall, paged on burn rate: The on-call rotation is staffed continuously. Burn-rate alerts page within seconds of detection, regardless of time of day. The expected acknowledgment time is measured in minutes, not hours.
- Highest deployment gates: Canary deploys, soak windows, automated rollback, manual approval for high-risk changes. Every protective mechanism is enabled by default. Tier 0 services move slower deliberately.
Tier 0 services are where reliability investment concentrates. Most organizations have only a handful: maybe 5 to 15 services in a 100-service fleet. The discipline of identifying which they are is the foundation of the tier model.
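The downtime figures quoted for each tier follow directly from the availability target. A minimal sketch of the arithmetic, assuming a 30-day month (the function name is illustrative):

```python
def downtime_budget_minutes(availability: float, days: int = 30) -> float:
    """Monthly downtime allowed by an availability target, in minutes."""
    return (1 - availability) * days * 24 * 60

# The three tier targets used in this document:
for tier, target in [("Tier 0", 0.999), ("Tier 1", 0.995), ("Tier 2", 0.99)]:
    print(f"{tier}: {target:.2%} -> {downtime_budget_minutes(target):.1f} min/month")
# Tier 0: 99.90% -> 43.2 min/month
# Tier 1: 99.50% -> 216.0 min/month (3.6 h)
# Tier 2: 99.00% -> 432.0 min/month (7.2 h)
```

This is where the "43 minutes" and "7.2 hours" figures come from; a stricter 99.99% target shrinks the budget to about 4.3 minutes per month, which is why the business case has to justify it.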
Tier 1
Tier 1 is the next layer down: services that affect customers but where a brief failure does not stop the business. Search, recommendations, dashboards, profile pages, secondary features. The SLO target reflects that the impact of failure is real but recoverable.
- Customer-affecting but recoverable: When Tier 1 fails, customers notice but the business does not stop. They see degraded search results, missing recommendations, slow dashboards. The experience is worse, but they can still complete their core transactions.
- 99.5% availability: Allows roughly 3.6 hours of downtime per month. The looser target reflects that the cost of a brief outage is operational annoyance rather than lost revenue. Investment level matches.
- Latency p99 under 1 second: Looser than Tier 0. Users tolerate longer responses in features that are off the critical path. The bar is "not visibly broken," not "instant."
- Single-region acceptable, multi-AZ required: Tier 1 services run multi-AZ for resilience but do not require multi-region. Loss of a region is a degraded mode, not an outage. The architecture is cheaper than Tier 0's.
- Business-hours oncall with after-hours fallback: The on-call rotation responds during business hours. After hours, alerts route to a less aggressive page with an acknowledgment target of 30 to 60 minutes. The cost of staffing is materially lower than Tier 0.
- Standard deployment gates: Canary or rolling deploys with automated rollback. Less paranoid than Tier 0 but still gated. Deploys can move faster because the consequences of a regression are smaller.
Tier 1 services make up the bulk of most engineering fleets. The standard deployment patterns and standard reliability investments apply here, optimized for routine operation rather than for the worst-case scenario.
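Both tiers page on error-budget burn rate: the ratio of the observed error rate to the rate the SLO budget allows. A minimal sketch of a multi-window burn-rate check, assuming a fast-burn threshold of 14.4 (an illustrative value; real thresholds depend on the paging policy):

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being consumed (1.0 = exactly on budget)."""
    budget_ratio = 1 - slo_target
    return error_ratio / budget_ratio

def should_page(err_1h: float, err_5m: float, slo_target: float,
                threshold: float = 14.4) -> bool:
    # Both the long and short windows must burn fast, so a brief spike
    # that has already recovered does not page anyone at 3 a.m.
    return (burn_rate(err_1h, slo_target) >= threshold and
            burn_rate(err_5m, slo_target) >= threshold)

# A sustained 2% error rate against a 99.9% target is a 20x burn: page.
print(should_page(err_1h=0.02, err_5m=0.02, slo_target=0.999))  # True
```

The tier difference is in what happens after the check fires: Tier 0 routes to an immediate 24/7 page, Tier 1 routes to the slower after-hours path.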
Tier 2
Tier 2 is the bottom layer: services that exist for internal users or non-critical purposes. Internal admin tools, batch reporting, dev environments, internal APIs that other engineering teams use but customers do not see directly.
- Internal users only: The audience is engineering, support, finance, or operations. Customers do not interact with these services directly. A failure inconveniences employees but does not stop revenue.
- 99% availability: Allows roughly 7.2 hours of downtime per month. Generous by external standards, appropriate for the impact level. Most internal tools genuinely do not justify higher targets.
- Latency p99 under 2 seconds: Loose latency budget. Internal users tolerate slower responses for non-critical work. The target is "not painfully slow," not "fast."
- Single-AZ acceptable: The redundancy investment is not justified by the impact level. Tier 2 services can run in a single AZ; the architecture cost stays low.
- Business-hours support: The on-call rotation handles Tier 2 only during business hours. After-hours failures sit in a queue until the next morning. Engineers know to use Tier 1 fallbacks if a Tier 2 tool is down.
- Looser deployment: Rolling deploys without canary, no manual approval, no soak window. Deploys move at the team's natural cadence because the cost of a regression is small.
Tier 2 services are where engineering velocity matters more than reliability investment. Treating them like Tier 0 wastes engineering capacity; treating them like nothing produces support escalations from internal users. The middle ground is acknowledging them as Tier 2 explicitly. Nova AI Ops tracks tier classification per service, applies tier-appropriate SLO targets, and surfaces the tier-mismatch cases (a Tier 2 service that has become customer-facing without anyone updating its tier) before the architecture and the operational posture diverge from reality.
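The value of the tier model comes from encoding it as data rather than tribal knowledge, so targets and gates can be applied mechanically per service. A minimal sketch of what a tier-policy table could look like; the class and field names are hypothetical, and the numbers are the ones from the tier descriptions above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TierPolicy:
    availability: float   # monthly availability target
    latency_p99_ms: int   # p99 latency budget in milliseconds
    multi_region: bool    # multi-region failover required?
    paged_24x7: bool      # continuous paging, vs business-hours response

TIERS = {
    0: TierPolicy(availability=0.999, latency_p99_ms=500,
                  multi_region=True, paged_24x7=True),
    1: TierPolicy(availability=0.995, latency_p99_ms=1000,
                  multi_region=False, paged_24x7=False),  # after-hours fallback page
    2: TierPolicy(availability=0.99, latency_p99_ms=2000,
                  multi_region=False, paged_24x7=False),  # next-morning queue
}

def policy_for(tier: int) -> TierPolicy:
    """Look up the SLO policy a service inherits from its tier."""
    return TIERS[tier]
```

With the policy expressed this way, a tier-mismatch check reduces to comparing a service's observed traffic profile and architecture against the policy its recorded tier implies.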