SLO Targets by Service Stage
A service's SLO posture should match its maturity: development, beta, and general availability each call for a different level of reliability commitment.
Development
The most common mistake teams make with new services is publishing an aggressive SLO target before there is enough data to know what is achievable. The service has not soaked under real load, the dependencies are still in flux, and the architecture has bugs nobody has found yet. Setting a 99.9% target on a service that has been live for three weeks is a commitment based on hope. The right answer is staged SLOs that mature with the service.
What development-stage reliability looks like:
- No SLO yet: A service still in development is iterating on its core behavior. The endpoint that exists today may be replaced next sprint. The data path may be redesigned. Setting an SLO target here is committing to behavior that has not stabilized.
- Iterating freely: Engineering can deploy any change at any time without the deploy-time gates that production-grade SLOs require. Speed of iteration is the goal; reliability comes later. The team optimizes for learning, not stability.
- No customer promise: The service is not customer-facing in any committable sense. If it has any users, they are internal engineers who can reach the dev team directly when something is wrong. There is no SLA, no status page, no public commitment.
- Instrumentation is in place: Even without an SLO, the metric pipeline is wired up from day one. Latency histograms, error counters, success rates. The data is being collected so that when the service moves to beta, the baseline analysis is already done.
- Reliability work is exploratory: Adding tests, building runbooks, hardening dependencies. Not because there is a contract to defend but because the service will need them once it has users. Doing this work in development is much cheaper than retrofitting in production.
Development-stage services do not lack reliability discipline. They lack reliability commitments. The discipline (instrumentation, testing, runbooks) is in place; the contract (SLO target) is not.
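As a sketch of what that day-one instrumentation buys, the toy collector below records latency samples and error counts and derives the baseline numbers a beta-stage soft SLO would later be set from. All names, bucket boundaries, and the simulated traffic are illustrative assumptions, not tied to any particular metrics library:

```python
# Minimal sketch of development-stage instrumentation: collect the data
# before committing to any target. Names and buckets are illustrative.
import bisect
import random

class DevStageMetrics:
    """Collects latency samples and error counts so a baseline exists before beta."""
    BUCKETS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]  # seconds

    def __init__(self):
        self.bucket_counts = [0] * (len(self.BUCKETS) + 1)  # last bucket = +Inf
        self.samples = []   # raw samples are fine at dev-stage traffic volume
        self.requests = 0
        self.errors = 0

    def observe(self, latency_s, ok=True):
        self.requests += 1
        if not ok:
            self.errors += 1
        self.samples.append(latency_s)
        self.bucket_counts[bisect.bisect_left(self.BUCKETS, latency_s)] += 1

    def baseline(self):
        """The numbers a beta soft SLO would eventually be derived from."""
        ordered = sorted(self.samples)
        p99 = ordered[int(0.99 * (len(ordered) - 1))]
        success_rate = 1 - self.errors / self.requests
        return {"p99_latency_s": p99, "success_rate": success_rate}

# Simulated dev traffic: lognormal latencies, ~0.5% error rate.
metrics = DevStageMetrics()
random.seed(7)
for _ in range(10_000):
    metrics.observe(random.lognormvariate(-3.5, 0.6), ok=random.random() > 0.005)
print(metrics.baseline())
```

In a real service the same role is played by a standard metrics library exporting to your monitoring stack; the point is only that the pipeline exists before the SLO does.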
Beta
Beta is the stage where the service has real users (internal beta-testers, early customers, design partners) but the team has not yet committed to maintaining the service at production-grade reliability. The right framing is a soft SLO: a target the team aims for but does not enforce contractually.
- Soft SLO: Pick an internal target based on what the service can plausibly hit during beta. 99% is a reasonable starting point. The target is published internally and tracked on the dashboard, but it does not appear on any customer-facing SLA page.
- Tracked but not enforced: If the service misses 99% in a given month, the team treats that as information, not as a contractual breach. Reliability work gets prioritized based on the gap, but customers do not get service credits for misses.
- Building data: Beta is where the long-tail data accumulates: weekend traffic patterns, rare dependency failure modes, edge-case bugs. That data is what eventually justifies the GA SLO target. Skip beta and you set GA targets blind.
- Expectations set with customers: Beta customers know the service is beta. The terms include "we do not commit to specific reliability targets during the beta period." The expectation is set in writing so the relationship is honest from the start.
- Burn-rate alerts active: Even though the SLO is not contractual, burn-rate alerts are wired up. The team responds to incidents during beta the same way it will in GA. The practice is in place even when the obligation is not.
Beta is the dress rehearsal. The team learns to operate the service, the data accumulates, the SLO becomes defensible. By the time the service ships GA, the team knows what they are committing to.
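The burn-rate alerting described above is typically a multi-window check: page only when both a short and a long window are consuming error budget well above budget pace, so the alert is both sustained and still happening. A minimal sketch for a 99% soft SLO follows; the 14.4x threshold comes from the common fast-burn pattern, and the window sizes and numbers are assumptions to tune, not prescriptions:

```python
# Sketch of a multi-window burn-rate check for a soft 99% beta SLO.
# Thresholds and windows are illustrative; tune them for your service.

SLO_TARGET = 0.99
ERROR_BUDGET = 1 - SLO_TARGET  # 1% of requests may fail

def burn_rate(errors, requests):
    """How many times faster than 'budget pace' errors are arriving."""
    if requests == 0:
        return 0.0
    return (errors / requests) / ERROR_BUDGET

def should_page(short_window, long_window, threshold=14.4):
    """Fire only when both windows burn hot: the long window proves the burn
    is sustained, the short window proves it is still happening."""
    return (burn_rate(*short_window) >= threshold
            and burn_rate(*long_window) >= threshold)

# 5-minute window: 30 errors / 180 requests; 1-hour window: 300 / 2000
print(should_page((30, 180), (300, 2000)))
```

Wiring this up during beta means the team has already practiced responding to the exact alerts that will become contractual at GA.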
GA
General availability is when the service ships with a full reliability commitment: a published SLO, an SLA if applicable, an error budget policy, on-call coverage, status page presence. The discipline that was developed in beta is now load-bearing. The SLO becomes a contract.
- Full SLO with policy: The published target is a commitment. The error budget policy specifies what happens on a burn (deploy freeze, reliability sprint). The on-call rotation is staffed for the SLO's response-time requirements. The runbook is tested. The whole apparatus is real.
- Customer commitment: The SLA goes in the public docs. Customers reference it in procurement. The team's reputation is now staked on hitting it.
- Reliability investment continues: Going GA is not the end of the work. The reliability practice continues quarterly: gap analysis, investment prioritization, target reassessment. GA is the steady state, not the finish line.
- Status page integration: Incidents that affect the SLO get posted on the public status page within minutes. Quarterly performance is reported publicly. The transparency layer is part of the GA commitment.
- Re-baseline at scale milestones: Each major scale milestone (10x users, 10x traffic, multi-region expansion) prompts a re-baselining of the SLO. The target that worked at smaller scale may need adjustment at larger scale; the discipline of recalibrating is what keeps the commitment honest.
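An error budget policy like the one above reduces to a small calculation: how much of the window's budget remains, and what action that level triggers. The sketch below uses illustrative numbers; the 25% slowdown line and the specific actions are assumptions, since real policies are team-specific:

```python
# Sketch of a GA error-budget policy check over a rolling window.
# Thresholds and actions are illustrative, not a standard.

def budget_remaining(good_events, total_events, slo_target=0.999):
    """Fraction of the window's error budget still unspent (can go negative)."""
    budget = (1 - slo_target) * total_events  # allowed bad events this window
    bad = total_events - good_events
    return (budget - bad) / budget

def policy_action(remaining):
    if remaining < 0:
        return "deploy freeze: budget exhausted, reliability work only"
    if remaining < 0.25:
        return "slow down: feature deploys need reliability review"
    return "normal operations"

# 30-day window at a 99.9% target: 10M requests -> 10,000 allowed failures.
# 8,000 failures observed leaves 20% of the budget.
print(policy_action(budget_remaining(9_992_000, 10_000_000)))
```

Making the policy executable rather than aspirational is what turns the error budget from a dashboard number into the deploy-time gate the GA commitment requires.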
SLO targets that mature with the service stage produce commitments the team can keep. SLO targets set on day one of a new service produce broken promises by quarter-end. Nova AI Ops tracks per-service stage (development, beta, GA), suggests an appropriate SLO posture for each, and surfaces the readiness signals (data accumulation, incident rate stability, on-call response capability) that indicate a service is ready to advance to the next stage.