Uptime Percentages: What They Actually Mean
99% vs 99.9% vs 99.99% vs 99.999%.
99%
Reliability targets get expressed as percentages. The percentages sound similar to each other (99%, 99.9%, 99.99%) but the operational reality of each is dramatically different. Translating the percentages into concrete time budgets is the discipline that makes them comprehensible. A 99% SLO is much less rigorous than a 99.9% SLO; the difference is an order of magnitude in what the team can absorb.
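As a quick illustration of that translation, here is a minimal Python sketch. The constant names are the example's own, and it assumes a 365-day year and a 30-day month for the monthly figures:

```python
# Convert an availability target into a concrete downtime budget.
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600
MINUTES_PER_MONTH = 30 * 24 * 60    # 43,200

def downtime_budget(slo_percent: float) -> tuple[float, float]:
    """Return (minutes per year, minutes per month) the SLO allows."""
    error_fraction = 1 - slo_percent / 100
    return error_fraction * MINUTES_PER_YEAR, error_fraction * MINUTES_PER_MONTH

for slo in (99.0, 99.9, 99.99, 99.999):
    yearly, monthly = downtime_budget(slo)
    print(f"{slo:7}% -> {yearly / 60:6.2f} h/year, {monthly:6.2f} min/month")

# Output (approximately):
# 99.0%   -> 87.60 h/year, 432.00 min/month   (~3.65 days, ~7.2 h/month)
# 99.9%   ->  8.76 h/year,  43.20 min/month
# 99.99%  ->  0.88 h/year,   4.32 min/month   (~52.6 min/year)
# 99.999% ->  0.09 h/year,   0.43 min/month   (~5.3 min/year, ~26 s/month)
```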
What 99% actually allows:
- About 3.65 days of downtime per year: 1% of 365 days is 3.65 days. The team can be down for roughly three and a half days across the year and still hit 99%. This is generous; even routine maintenance windows fit comfortably.
- About 7.2 hours per month: Roughly a working day per month. Most internal services and many small-business SaaS products operate at this level. The reliability target is real but generous.
- Loose by enterprise standards: Enterprise customers expect more than 99%. A B2B vendor publishing a 99% SLA will lose enterprise procurement decisions to competitors offering 99.9% or better. The target signals a particular customer segment.
- Achievable on a single region and single AZ: 99% can be hit without redundant architecture. A single AZ with sensible operational practices typically runs above 99% without special investment. The architectural cost of 99% is essentially the architectural cost of running anything.
- Right for internal tools, dev environments, and free tiers: Services where the cost of failure is small and the customer base does not require a tight commitment. 99% is the right target for things that just need to work most of the time.
99% is the floor of meaningful SLO commitment. Below it, the team is essentially making no commitment; at it, the commitment is real but generous.
99.9%
The middle tier of common SLO targets. 99.9% is what most production B2B SaaS commits to. It requires meaningful operational investment but does not require multi-region architecture. It is the sweet spot where the engineering cost is manageable and the customer commitment is taken seriously.
- About 8.76 hours of downtime per year: 0.1% of 365 days is roughly nine hours. The team has a working day plus a little extra across the year to absorb incidents. This is meaningful but not luxurious.
- About 43 minutes per month: Less than an hour. A single AZ failure can consume the entire month's budget. The team has to operate carefully to avoid a breach.
- Standard for production B2B SaaS: Most production B2B SaaS publishes a 99.9% SLA. Customers in this segment expect this level of commitment; vendors offering less lose deals; vendors offering more get to charge premium prices.
- Requires multi-AZ deployment: The math demands redundancy across availability zones. A single-AZ architecture cannot reliably hit 99.9% because a single AZ outage exhausts the monthly budget. Multi-AZ is the architectural floor; the redundancy math is sketched just after this list.
- Requires real on-call discipline: Incident response under 30 minutes during business hours; under 60 minutes after hours. The on-call rotation is staffed; alerts are tuned; runbooks are tested. The operational practice is real.
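A rough way to see why redundancy is the floor is the independence model below. It is a sketch, not a guarantee: it assumes AZ failures are independent and failover is instant, neither of which fully holds in practice, and the 99.5% per-AZ figure is purely illustrative.

```python
# Service is modeled as up whenever at least one AZ is up:
# availability = 1 - (probability that every AZ is down at once)

def composite_availability(per_zone: float, zones: int) -> float:
    return 1 - (1 - per_zone) ** zones

PER_AZ = 0.995  # illustrative per-AZ availability (~3.6 h down per month)

for n in (1, 2, 3):
    print(f"{n} AZ(s): {composite_availability(PER_AZ, n):.5%}")

# 1 AZ(s): 99.50000%   -- misses 99.9% on its own
# 2 AZ(s): 99.99750%   -- clears 99.9% under the idealized model
# 3 AZ(s): 99.99999%
```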
99.9% is the standard. The architectural and operational investment to hit it is meaningful but not extreme. Most teams that have decided to take reliability seriously land here.
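To make the monthly arithmetic concrete, the sketch below counts how many full outages of a given length fit inside a 99.9% month, using the 30- and 60-minute response figures above. The 30-day month and the assumption of total (rather than partial) outage are simplifications of the example.

```python
MONTH_MINUTES = 30 * 24 * 60  # 43,200

def incidents_absorbed(slo_percent: float, outage_minutes: float) -> float:
    """How many full outages of the given length the monthly budget covers."""
    budget = (1 - slo_percent / 100) * MONTH_MINUTES
    return budget / outage_minutes

print(incidents_absorbed(99.9, 30))  # ~1.4: one 30-minute incident uses most of the month
print(incidents_absorbed(99.9, 60))  # ~0.7: one 60-minute incident breaches the month outright
```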
99.99%
The tight tier. 99.99% is what payment processors, identity providers, and revenue-critical infrastructure commit to. The investment to hit it is materially larger than 99.9%. Many teams that aspire to 99.99% cannot actually deliver it; the gap between aspiration and capability is where chronic SLO misses live.
- About 52 minutes of downtime per year: 0.01% of 365 days. Less than an hour across the entire year. Almost no margin for any kind of significant incident.
- About 4 minutes per month: A single network blip can consume the whole month's budget. The team operates with much less room than at 99.9%; one bad deploy ends the month.
- Requires multi-region architecture: Single-AZ cannot hit 99.99%. Even multi-AZ struggles. Multi-region is essentially required because regional failures (which happen) cannot fit within the budget. The architectural investment is substantial.
- Requires automated detection and rollback: Mean time to detect plus mean time to recover has to stay under roughly four minutes. No human can detect, decide, and roll back in that window. Automation is required; the team relies on the system rather than on its own response. A sketch of the kind of burn-rate check that drives an automated rollback follows this list.
- Right for payment processors, identity infrastructure, and revenue-critical APIs: Services where any minute of downtime has real revenue or trust consequences. The cost of 99.99% is justified by the cost of failing it.
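As referenced in the automated-rollback item above, here is a hedged sketch of the kind of check such a pipeline might run shortly after a deploy. The threshold, window, and function name are illustrative assumptions, not a reference implementation.

```python
SLO = 0.9999                # 99.99%
ERROR_BUDGET = 1 - SLO      # fraction of requests allowed to fail

def should_rollback(bad_requests: int, total_requests: int,
                    max_burn_rate: float = 10.0) -> bool:
    """Trigger a rollback when the post-deploy error rate burns budget too fast.

    burn rate = observed error rate / error budget. A sustained burn rate of
    10x would exhaust a month's budget in about three days; measured over a
    minutes-long window right after a deploy, it signals a failing release.
    """
    if total_requests == 0:
        return False
    error_rate = bad_requests / total_requests
    return error_rate / ERROR_BUDGET > max_burn_rate

# Two minutes after a deploy: 60 failures in 20,000 requests (0.3% errors, 30x burn).
print(should_rollback(60, 20_000))  # True -> roll back automatically
```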
The downtime budgets get smaller; the engineering investment grows by orders of magnitude. Each additional "9" is roughly 10x the operational cost of the previous one. Nova AI Ops tracks per-service SLO targets, computes the error budget in concrete time units, and surfaces the cases where the team's architecture is mismatched against the committed target so the conversation about realistic reliability is informed by data.
99.999%
About 5 minutes of downtime per year; about 26 seconds per month. Extreme: at this level no human-in-the-loop response fits inside the budget, detection and recovery have to be fully automated at every layer, and very few services can justify the cost of getting there.