Availability vs Correctness Trade-off
Sometimes correctness matters more than availability.
Correctness
One of the deepest reliability decisions a service makes is the trade-off between availability and correctness. Sometimes serving a wrong answer is worse than serving no answer; sometimes the reverse. Most teams handle this trade-off implicitly; mature teams handle it explicitly, with documented decisions per service and clear behaviors when the trade-off is forced.
What "correctness over availability" means:
- Wrong answer is worse than no answer: For some workloads, returning incorrect data is worse than returning an error. A customer who receives a wrong financial transaction confirmation is worse off than one who sees "service temporarily unavailable, retry in a moment." The wrong answer leads to bad downstream decisions; the error leads to a retry.
- Financial systems prioritize correctness: Payment processing, account balances, and transaction confirmations each have to be right, or the financial relationship breaks. Returning approximate or stale balances is worse than refusing to serve until the authoritative system is reachable.
- Refuse to serve rather than serve wrong: The service explicitly fails closed when it cannot guarantee correctness. The user gets a clear error and retries. They may experience a brief outage, but they do not get an incorrect answer that they then act on.
- Trade availability budget for correctness budget: The SLO acknowledges this trade. The service may pair a 99% availability SLO with a 99.99% correctness SLO. The team prefers being unavailable to being incorrect, and the SLO encodes that preference.
- Document the choice: The service's documentation explicitly states the correctness preference. Customers integrating with the service know what to expect when issues arise, and their integration patterns reflect the choice.
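The fail-closed behavior described in the list above can be sketched as follows. This is a minimal illustration, not a reference implementation: `Ledger`, `get_balance`, and both exception types are hypothetical names introduced here, and the in-memory dictionary stands in for whatever authoritative store a real payment service would use.

```python
from dataclasses import dataclass


class AuthoritativeSourceUnavailable(Exception):
    """Raised when the system of record cannot be reached."""


class ServiceUnavailable(Exception):
    """Surfaced to the caller as a retryable error (for example, HTTP 503)."""


@dataclass
class Ledger:
    """Hypothetical stand-in for the authoritative balance store."""
    balances: dict
    reachable: bool = True

    def read_balance(self, account_id: str) -> int:
        if not self.reachable:
            raise AuthoritativeSourceUnavailable(account_id)
        return self.balances[account_id]


def get_balance(ledger: Ledger, account_id: str) -> int:
    """Fail closed: never substitute a cached or estimated balance."""
    try:
        return ledger.read_balance(account_id)
    except AuthoritativeSourceUnavailable as exc:
        # Refuse to serve rather than guess; the caller is expected to retry.
        raise ServiceUnavailable(
            "balance temporarily unavailable, retry in a moment"
        ) from exc
```

The point of the sketch is what is absent: there is no cache lookup and no default value on the failure path, so the only possible outcomes are the authoritative answer or an explicit error.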
Correctness-over-availability is the right choice for services where wrong answers compound into worse outcomes than brief unavailability.
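To make the paired SLOs concrete, the arithmetic below turns the 99% availability and 99.99% correctness targets mentioned above into request budgets. The function name `error_budget` and the window of one million requests are assumptions chosen for illustration; `Decimal` is used only to avoid floating-point rounding surprises.

```python
from decimal import Decimal


def error_budget(slo_percent: str, total_requests: int) -> int:
    """Number of requests allowed to miss the SLO, rounded down."""
    allowed_fraction = (Decimal(100) - Decimal(slo_percent)) / Decimal(100)
    return int(allowed_fraction * total_requests)


# Assumed measurement window: 1,000,000 requests.
WINDOW = 1_000_000

unavailable_budget = error_budget("99", WINDOW)     # requests that may fail
incorrect_budget = error_budget("99.99", WINDOW)    # requests that may be wrong

print(unavailable_budget)  # 10000
print(incorrect_budget)    # 100
```

Under these assumptions the service may refuse a hundred times more requests than it may answer incorrectly, which is the preference the SLO is meant to encode.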
Availability
The reverse trade-off is also valid for the right workloads. Some services prefer to serve approximate or slightly-stale data rather than fail. The user who sees yesterday's recommendations is better off than the user who sees an error page; the user who gets a cached search result is better off than the user who gets nothing.
- Is an approximate answer worse than no answer? Sometimes: The framing depends on the workload. For some workloads an approximation is fine; for others it counts as a wrong answer. The team has to work out which case applies.
- Read-only mode under pressure: When the write path is broken but the read path still works against cached data, serve reads from cache. The data may be stale and the customer experience is partial, but it is functional. The team accepts staleness in exchange for availability.
- Cached results when fresh data is unavailable: A search service whose backend is down can serve cached search results. The results are stale, but they are still useful. The customer's choice is "stale results or no results," and many prefer stale.
- Default values when computation fails: A recommendation service that cannot compute personalized recommendations can fall back to popular items. The recommendations are not personalized, but they are still useful.
- Document the staleness contract: When availability is preferred over correctness, the service documents what "fresh" means and what "stale" means. Customers integrating with the service know that responses may be up to N seconds stale, and they integrate accordingly.
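A cached fallback with an explicit staleness contract might look like the sketch below. Everything named here is hypothetical: `CachedSearch`, `SearchResponse`, `BackendDown`, and the 300-second limit standing in for the "N seconds" above. The clock is injectable only so the behavior can be exercised without waiting.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Assumed staleness contract: cached responses may be at most 300 seconds old.
MAX_STALENESS_SECONDS = 300.0


class BackendDown(Exception):
    """The search backend could not be reached."""


@dataclass
class SearchResponse:
    results: list
    stale: bool          # True when served from cache
    age_seconds: float   # age of the data, so callers can decide what to do


class CachedSearch:
    def __init__(self, backend: Callable[[str], list],
                 clock: Callable[[], float] = time.monotonic):
        self._backend = backend
        self._clock = clock
        self._cache = {}  # query -> (results, time stored)

    def search(self, query: str) -> SearchResponse:
        try:
            results = self._backend(query)
        except BackendDown:
            return self._serve_stale(query)
        self._cache[query] = (results, self._clock())
        return SearchResponse(results, stale=False, age_seconds=0.0)

    def _serve_stale(self, query: str) -> SearchResponse:
        if query not in self._cache:
            raise BackendDown(f"no cached results for {query!r}")
        results, stored_at = self._cache[query]
        age = self._clock() - stored_at
        if age > MAX_STALENESS_SECONDS:
            # Outside the documented contract: prefer an error to very old data.
            raise BackendDown(f"cached results for {query!r} are too old")
        return SearchResponse(results, stale=True, age_seconds=age)
```

Marking the response as stale and reporting its age keeps the contract visible to callers instead of hiding the degradation.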
Availability-over-correctness is the right choice for services where partial functionality is more useful than full failure.
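The default-value fallback for recommendations can be sketched in a few lines. The names `recommend`, `PersonalizationError`, and `POPULAR_ITEMS` are illustrative assumptions, as is the decision to return a flag alongside the items.

```python
from typing import Callable, List, Tuple

# Hypothetical non-personalized fallback list.
POPULAR_ITEMS = ["bestseller-1", "bestseller-2", "bestseller-3"]


class PersonalizationError(Exception):
    """The personalized model could not produce a result."""


def recommend(user_id: str,
              personalize: Callable[[str], List[str]]) -> Tuple[List[str], bool]:
    """Return (items, personalized); fall back to popular items on failure."""
    try:
        return personalize(user_id), True
    except PersonalizationError:
        # Degrade rather than fail; the flag lets callers and metrics see it.
        return list(POPULAR_ITEMS), False
```

Returning the flag is a design choice in this sketch: it lets dashboards count how often the degraded path is taken, so the fallback does not silently become the normal case.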
Decide
The decision between availability and correctness is per service, not per company. Different services have different trade-offs because their workloads have different characteristics. The discipline is making the choice deliberately rather than letting it emerge from how the code happens to be written.
- Per-service decision: Each service decides for itself. Payment service: correctness. Search service: availability. Recommendation service: availability. Customer support knowledge base: probably availability. The choices are documented, and new services make the choice deliberately at design time.
- Match the user expectation: The user's expectation of the service informs the choice. Users of payment services expect the answer to be right, so brief unavailability is acceptable. Users of search services expect a quick response, so brief staleness is acceptable.
- Document the choice: The choice goes in the service's documentation. Operations teams reference it during incidents, product teams reference it during roadmapping, and customer success references it during customer conversations.
- Encode in the architecture: The choice shapes the architecture. Correctness-preferring services have circuit breakers that fail closed; availability-preferring services have caches and fallbacks. The code matches the choice.
- Reconsider periodically: The choice may evolve. As the service matures or the business shifts, the trade-off may change. An annual review catches the cases where the original assumption no longer holds.
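One way to keep the per-service choice explicit in code is a small policy registry that failure handling consults. The sketch below is an assumption-laden illustration: the registry, the function names, and the review dates are invented here, while the service names mirror the examples in the list above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Callable


class Preference(Enum):
    CORRECTNESS = "correctness"    # fail closed
    AVAILABILITY = "availability"  # serve degraded


@dataclass(frozen=True)
class TradeoffPolicy:
    preference: Preference
    last_reviewed: date


# Illustrative registry; the dates are made up for the example.
POLICIES = {
    "payments": TradeoffPolicy(Preference.CORRECTNESS, date(2024, 1, 15)),
    "search": TradeoffPolicy(Preference.AVAILABILITY, date(2024, 9, 1)),
    "recommendations": TradeoffPolicy(Preference.AVAILABILITY, date(2024, 9, 1)),
    "support-kb": TradeoffPolicy(Preference.AVAILABILITY, date(2024, 9, 1)),
}


class ServiceUnavailable(Exception):
    """Explicit, retryable refusal to serve."""


def on_dependency_failure(service: str, fallback: Callable[[], object]) -> object:
    """Apply the documented trade-off when a dependency is down."""
    policy = POLICIES[service]
    if policy.preference is Preference.CORRECTNESS:
        raise ServiceUnavailable(f"{service} fails closed")
    return fallback()


def review_overdue(service: str, today: date) -> bool:
    """True when the choice has not been revisited within a year."""
    return today - POLICIES[service].last_reviewed > timedelta(days=365)
```

For example, `on_dependency_failure("search", lambda: ["cached result"])` returns the fallback value, while the same call for `"payments"` raises `ServiceUnavailable` without ever invoking the fallback.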
Availability versus correctness is one of those design decisions that compound across years of operation. Nova AI Ops surfaces both metrics per service, supports SLOs that combine both dimensions, and tracks the architectural patterns that match each service's chosen trade-off.