SLO Baseline Data Quality
Bad baseline data = wrong target.
Issues
Setting an SLO target on bad baseline data is one of the most expensive mistakes in reliability engineering. The team picks a number based on what the past 30 days looked like; the past 30 days were instrumented incorrectly; the SLO is now either trivially met or impossibly hard; and nobody finds out for a quarter. The fix is to audit the data before you set the target, not after the dashboard is wrong.
The recurring data quality issues that poison baselines:
- Missing metrics: The instrumentation that should have captured a failure mode silently does not. Health checks that only probe one path. Latency that excludes the worst tail because the histogram bucket cap is too low. Error counters that filter by status code in a way that hides certain failures. Each of these makes the baseline look better than reality.
- Outliers without explanation: An anomaly in the baseline data that the team cannot explain (a 30-minute window of impossibly low traffic, an hour of zero errors, a sudden drop in latency with no root cause). Outliers either reflect real but unexplained events that need investigating, or they reflect instrumentation gaps. Either way, you fold them into the baseline at face value at your peril.
- Sampling bias: Aggregating across regions, customer tiers, or traffic types in a way that makes the average look fine while no single segment matches it. A service that is 99.99% available for paying customers and 95% available for free users can average out to roughly 99% in the baseline, and an SLO set at 99% will make neither group happy (a per-segment check is sketched after this list).
- Survivorship bias: Only requests that completed get counted, on both sides of the ratio. Requests that timed out before reaching the metric collector show up in neither the numerator nor the denominator, which makes the success rate look artificially high.
- Time-of-day skew: The baseline pulls 30 days that include weekends, holidays, and a quiet maintenance window. The SLO target gets set on this softer mix. When the real-world traffic mix shifts back toward peak load, the SLO becomes harder to meet.
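To make the sampling bias check concrete, here is a minimal sketch of a per-segment breakdown, assuming request records are available as simple dicts; the field names and the half-point divergence threshold are illustrative, not a standard.

```python
# Hypothetical per-segment availability check. The record shape
# ({"segment": ..., "ok": ...}) and the 0.5-point tolerance are assumptions.
from collections import defaultdict

def segment_success_rates(requests):
    """requests: list of dicts like {"segment": "paid", "ok": True}."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [successes, total]
    for r in requests:
        totals[r["segment"]][0] += 1 if r["ok"] else 0
        totals[r["segment"]][1] += 1
    return {seg: ok / total for seg, (ok, total) in totals.items()}

def flag_sampling_bias(requests, tolerance=0.005):
    """Return segments whose success rate diverges from the blended average."""
    rates = segment_success_rates(requests)
    overall = sum(1 for r in requests if r["ok"]) / len(requests)
    return {seg: rate for seg, rate in rates.items()
            if abs(rate - overall) > tolerance}

# The 99.99%-paid / 95%-free mix from above: the blend lands near 99%,
# but neither segment actually behaves that way.
requests = ([{"segment": "paid", "ok": True}] * 9999
            + [{"segment": "paid", "ok": False}]
            + [{"segment": "free", "ok": True}] * 2375
            + [{"segment": "free", "ok": False}] * 125)
print(flag_sampling_bias(requests))  # {'paid': 0.9999, 'free': 0.95}
```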
The audit is a one-week investment that prevents three months of dashboard arguments. It is also the cheapest thing the team will do all quarter.
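Part of that audit can be mechanical. A pass like the sketch below surfaces the unexplained low-traffic windows from the outliers bullet, so somebody has to account for them before the target is set; the window size and the 20%-of-median floor are assumptions to adjust for your traffic shape.

```python
# Hypothetical gap detector for the baseline period. The window size and
# the 20%-of-median floor are assumptions, not a standard.
from statistics import median

def suspicious_windows(counts, floor_fraction=0.2):
    """counts: ordered (window_start, request_count) pairs for the baseline period."""
    typical = median(c for _, c in counts)
    return [(start, c) for start, c in counts if c < floor_fraction * typical]

counts = [("2024-05-01T00:00", 1200), ("2024-05-01T00:05", 1180),
          ("2024-05-01T00:10", 35),   # collector outage, or a real traffic dip?
          ("2024-05-01T00:15", 1210)]
print(suspicious_windows(counts))  # [('2024-05-01T00:10', 35)]
```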
Validate
Once you have assembled the candidate baseline, validate it against independent sources before committing to a target. Cross-checking is cheap insurance against the case where one source is systematically wrong.
- Cross-check with synthetic probes: If your error rate from internal metrics says 0.05%, an external synthetic probe hitting the same endpoints from outside the firewall should agree within sampling tolerance. A divergence is a signal that one of them is wrong, and it is usually the internal metric.
- Cross-check with logs: Count failed requests from the load balancer access log and compare to the failure count in the metric. They should match. When they do not, you have found a failure class the metric is missing (a sketch of this check follows the list).
- Cross-check with the customer signal: Support tickets, automated status page detectors, customer-reported incidents. If your baseline says the service was 99.99% available over the past 30 days but the customer support team logged five incidents in that period, your baseline is missing something real.
- Cross-check across regions: If the service runs in multiple regions and each region's baseline tells a different story, the target needs to be set per region, not as a global average. Otherwise you guarantee one region is overcommitted and another is undercommitted.
- Walk the data with the team: Before locking the SLO, the team that owns the service should be able to look at the baseline plot and say "yes, that looks right" with a straight face. If they hesitate, dig more.
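As one concrete version of the access-log cross-check, the sketch below counts 5xx responses straight from load balancer log lines and compares the result to whatever failure count the metric pipeline reports. The status-code field position assumes Common Log Format and the 1% agreement tolerance is a guess; swap in whatever your load balancer actually emits.

```python
# Hypothetical cross-check of load balancer logs against the metric pipeline.
# The status-code position assumes Common Log Format; the tolerance is a guess.
def log_failure_count(log_lines, status_field=8):
    """Count 5xx responses by splitting each access-log line on whitespace."""
    failures = 0
    for line in log_lines:
        fields = line.split()
        if len(fields) > status_field and fields[status_field].startswith("5"):
            failures += 1
    return failures

def cross_check(log_lines, metric_failures, tolerance=0.01):
    """Compare the two failure counts and flag anything beyond the tolerance."""
    from_logs = log_failure_count(log_lines)
    reference = max(from_logs, metric_failures, 1)
    drift = abs(from_logs - metric_failures) / reference
    if drift > tolerance:
        return f"MISMATCH: logs={from_logs}, metric={metric_failures} ({drift:.0%} apart)"
    return f"OK: logs={from_logs}, metric={metric_failures}"

lines = ['10.0.0.1 - - [01/May/2024:00:00:01 +0000] "GET /api HTTP/1.1" 200 512',
         '10.0.0.2 - - [01/May/2024:00:00:02 +0000] "GET /api HTTP/1.1" 503 87']
print(cross_check(lines, metric_failures=0))  # MISMATCH: logs=1, metric=0 (100% apart)
```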
Validation builds confidence not just that the baseline is right but that the team understands what it is committing to. That confidence is the foundation under any SLO that is going to survive contact with reality.
Compound
Data quality compounds in both directions. Good data feeds good SLOs which feed informed decisions which feed better instrumentation. Bad data feeds wishful SLOs which feed surprises which feed the temptation to ignore the dashboard altogether.
- Trustworthy SLOs require trustworthy data: The team will respect the SLO only as long as the data behind it lines up with their experience. The first time a customer reports an outage that the SLO dashboard missed, trust drops, and the dashboard becomes harder to defend in every future conversation.
- Investment in instrumentation pays asymmetrically: A bug found in the metric pipeline saves the team from arguing about a wrong number for years. Every hour spent improving how the SLI is collected is worth more than ten hours of post-hoc explanation.
- The team that can produce a clean SLI on demand has higher operational maturity: The instrumentation discipline that makes baseline audits easy is the same discipline that makes incident retros easy, that makes capacity planning easy, that makes regulator audits easy. It is a force multiplier, not a one-shot investment.
- Bad data hides regressions: A baseline that systematically over-counts successes will keep over-counting them after a regression lands. The team will not see the dip until customers complain, by which time the damage is done.
Treat baseline data quality as a prerequisite to setting an SLO, not as a stage you skip past to get to the dashboard. Nova AI Ops audits SLI pipelines for the common quality issues (missing metrics, sampling bias, time-of-day skew, survivorship bias), cross-checks against synthetic probes and access logs, and flags the cases where the baseline looks suspicious before you commit to a target the data cannot back up.