SLO Impact on Architecture
Tight SLOs drive architectural choices.
SLO targets do not just measure a system; they shape it. The architecture you can build at 99% availability is fundamentally different from the one you have to build at 99.99%. Each added "9" forces design choices that ripple through every service. Understanding which architectural patterns the SLO target requires is the difference between committing to a number and being able to deliver it.
Redundancy
What tight SLOs force in the architecture:
- Tight SLOs require multi-AZ or multi-region: 99.9% availability allows about 43 minutes of monthly downtime, which a single AZ failure consumes in one event. To hit 99.9% sustainably, the architecture must survive an AZ loss without breaching budget. To hit 99.99%, it must survive a region loss. The redundancy follows from the math.
- Cost is real: Multi-AZ doubles infrastructure cost for the redundant tier. Multi-region typically triples it because of replication, cross-region traffic, and the operational overhead of synchronizing state. The cost is justified for revenue-critical services and unjustified for most others.
- Failover testing is required: Redundancy that has not been tested is theatrical. The architecture must include regular failover drills (game days) that exercise the redundant path. A multi-AZ setup that has never had its primary AZ taken down is a multi-AZ setup you cannot trust.
- Stateful systems are the hard part: Stateless services replicate trivially across regions. Databases, queues, and caches do not. The SLO target shapes the choice of storage system: a 99.99% target essentially mandates a multi-region-aware database (Spanner, CockroachDB, or a carefully managed primary/replica setup with failover automation). The storage choice is the most expensive decision the SLO drives.
- Network architecture follows: Tight SLOs require redundant DNS, redundant load balancers, and redundant ingress paths. Each layer has to survive the failure of one of its instances. The network architecture becomes more complex, but the math demands it.
The first architectural lesson of tight SLOs is that they cost more than the engineering team initially thinks. The redundancy is non-optional and the cost is real.
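The downtime arithmetic behind these targets is simple enough to sketch. A minimal example (the helper name is illustrative, assuming a 30-day month) that converts an availability target into a monthly error budget:

```python
def monthly_downtime_budget(availability: float, days: int = 30) -> float:
    """Return the allowed downtime in minutes per month at a given availability target."""
    minutes_in_month = days * 24 * 60  # 43,200 minutes in a 30-day month
    return minutes_in_month * (1 - availability)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} -> {monthly_downtime_budget(target):.1f} min/month")
```

At 99.9% the budget is about 43 minutes, which is why one AZ outage can consume it; at 99.99% it shrinks to roughly 4 minutes, which is why region-level redundancy stops being optional.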
Caching
The second architectural lever the SLO drives is caching. A read-through cache turns a hard dependency into a soft one. The SLO can be defended even when an upstream is unreliable, as long as cache hit rates are high enough.
- Reduces dependency on a slow upstream: If 95% of reads hit the cache, only 5% reach the upstream, so the effective dependency on the upstream is one twentieth of what the raw architecture suggests. A 99% upstream behind a 99.9% cache contributes about 99.85% to your composite SLO, much better than 99% raw.
- Decouples failure modes: When the upstream fails, the cache continues serving for as long as its TTL allows. The team has minutes (or hours, depending on TTL) to fix the upstream before users notice. This buffer is what makes recovery possible without breaching the SLO.
- Multi-tier caching: Application-level cache (in-process), service-level cache (Redis, Memcached), edge cache (CDN). Each tier catches a fraction of the misses from the previous one, so the composite hit rate keeps backend load small even at high traffic.
- Cache freshness vs. availability tradeoff: Cached data may be stale. The SLO has to express tolerance for staleness explicitly: "95% of responses use data less than 60 seconds old." Without this dimension in the SLO, the cache lets you hit availability while silently degrading correctness.
- The cache itself becomes a dependency: The cache layer is now on the critical path, so its own SLO has to be tighter than that of the service it protects. Most teams underinvest in cache reliability because it does not feel like infrastructure; tight SLOs make this investment unavoidable.
Caching is one of the highest-leverage architectural responses to tight SLO targets. It buys availability that would otherwise require much more expensive infrastructure investment.
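The composite math above can be checked directly. A minimal sketch (the function name is illustrative, and it assumes independent failures of the cache and the upstream):

```python
def effective_availability(cache_avail: float, hit_rate: float,
                           upstream_avail: float) -> float:
    """Availability seen by clients when a cache absorbs a fraction of reads.

    A request succeeds if the cache layer is up AND either the read is a hit
    or the upstream successfully serves the miss.
    """
    return cache_avail * (hit_rate + (1 - hit_rate) * upstream_avail)

# 99% upstream behind a 99.9% cache with a 95% hit rate:
print(effective_availability(0.999, 0.95, 0.99))  # ~0.99850, i.e. about 99.85%
```

Raising the hit rate is often cheaper than raising upstream availability: at a 99% hit rate the same upstream contributes about 99.89%, approaching the cache's own ceiling.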
Graceful degradation
The third architectural pattern is graceful degradation: partial functionality that is better than full failure. When something inevitably breaks, the service continues to serve users at a reduced level instead of returning errors.
- Partial functionality beats full failure: A search service that returns results without personalization when the personalization backend is down is delivering 90% of the user-perceived value at 0% of the personalization availability cost. A service that fails completely because one optional dependency is unavailable is wasteful.
- Architectural, not just operational: Graceful degradation has to be designed in. Each request path needs an explicit fallback for each non-essential dependency. The fallback can be a default response, a stale cached value, a simpler computation, or omitting the affected feature entirely. Each fallback is a deliberate design choice.
- Modes named in the design: The service has documented operating modes: full, read-only, cached-only, static, down. Each mode is triggered by specific conditions and produces specific behavior. The modes are part of the architecture, not an afterthought.
- Modes exposed in the response: When the service is in a degraded mode, the response includes a header indicating it. Clients can adapt; users can be shown a banner; the operations team can see the mode in dashboards. Silent degradation is worse than visible degradation.
- SLO budget per mode: Some teams compute SLO compliance separately for each mode. Full mode has the strictest target; degraded modes have looser targets reflecting reduced functionality. The budget math then reflects the actual user experience rather than collapsing everything into one number.
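Named modes and mode-exposing responses can be sketched as follows. This is a minimal illustration, not a production pattern: the trigger conditions and the `X-Serving-Mode` header name are assumptions, while the mode names come from the list above.

```python
from enum import Enum

class Mode(Enum):
    FULL = "full"
    READ_ONLY = "read-only"
    CACHED_ONLY = "cached-only"
    STATIC = "static"
    DOWN = "down"

def select_mode(db_writable: bool, db_readable: bool, cache_up: bool) -> Mode:
    """Map dependency health to a documented operating mode (illustrative triggers)."""
    if db_writable and db_readable:
        return Mode.FULL
    if db_readable:
        return Mode.READ_ONLY
    if cache_up:
        return Mode.CACHED_ONLY
    return Mode.STATIC

def respond(payload: dict, mode: Mode) -> dict:
    """Attach the current mode to every response so degradation is visible."""
    headers = {"X-Serving-Mode": mode.value}  # hypothetical header name
    return {"headers": headers, "body": payload}

resp = respond({"results": []},
               select_mode(db_writable=False, db_readable=True, cache_up=True))
print(resp["headers"]["X-Serving-Mode"])  # read-only
```

Because the mode is an explicit enum rather than scattered if-statements, dashboards and per-mode SLO budgets can key off the same value the client sees.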
The architecture that emerges from tight SLO targets is more resilient, more expensive, and more deliberate than the architecture that emerges without them. Nova AI Ops tracks the SLO contribution of each architectural component, surfaces the cases where the architecture cannot support the committed target, and helps engineering leadership make the build-or-buy decisions that flow from the reliability commitment.