SLOs on ML Services
ML adds a quality dimension to SLOs.
Dimensions
Standard SLO frameworks were built for stateless request-response services where success is mostly about availability and latency. Machine learning services break that frame. A model that responds in 100 ms with the wrong answer is not "available" in any meaningful sense; the inference completed but the result is incorrect. The fix is multi-dimensional SLOs that include quality alongside the conventional dimensions.
What ML SLO dimensions actually cover:
- Latency: The time from inference request to response. Same as any service, with the wrinkle that ML inference latency depends heavily on input characteristics (longer prompts, larger images, more retrieval candidates). The latency SLO has to account for this variance, often with separate targets per input class.
- Errors: 5xx responses, model load failures, OOM crashes (GPU OOM in particular). Standard service availability metrics apply. ML services often have more error modes than typical services because the GPU layer and the model load mechanism add their own failure surface.
- Quality (model accuracy): The dimension unique to ML: whether the model's outputs are correct, useful, or relevant according to the task definition. For a classifier, this is precision and recall. For a recommender, top-K accuracy. For a generative model, eval metrics that capture coherence, factuality, and policy compliance. Quality is not optional; it is part of "did the service do its job."
- All three together: An ML service that has 99.9% availability, 100 ms p99 latency, and 75% accuracy is a service that is technically up and producing wrong answers a quarter of the time. The composite SLO has to require all three dimensions to be acceptable for a request to count as successful (see the sketch after this list).
- Per-cohort quality: Quality often varies by user cohort, language, region, or query type. Aggregate quality might look fine while specific cohorts (rare languages, edge-case queries) are at much lower quality. The dimension breakdown by cohort surfaces the cases the aggregate hides.
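As a concrete illustration (not any particular framework's API), here is a minimal sketch of the per-request success check, assuming hypothetical input classes, record fields, and latency targets:

```python
from dataclasses import dataclass

# Hypothetical per-input-class latency targets (illustrative numbers).
LATENCY_TARGETS_MS = {"short_prompt": 100.0, "long_prompt": 400.0}

@dataclass
class InferenceRecord:
    input_class: str   # assumed to be a key of LATENCY_TARGETS_MS
    latency_ms: float
    errored: bool      # 5xx, model load failure, GPU OOM, ...
    quality_ok: bool   # output passed the task's quality check

def composite_sli(records: list[InferenceRecord]) -> float:
    """Fraction of requests good on latency AND errors AND quality."""
    if not records:
        return 1.0
    good = sum(
        1
        for r in records
        if r.latency_ms <= LATENCY_TARGETS_MS[r.input_class]
        and not r.errored
        and r.quality_ok
    )
    return good / len(records)
```

A service that is fast and available but only 75% accurate scores roughly 0.75 on this SLI, which is exactly the honesty the composite is meant to enforce.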
The dimensional model for ML SLOs is what turns "the model is up" into a meaningful claim about service quality. Without quality in the SLO, the team's reliability story is dangerously incomplete.
Track quality
The hard part of ML SLOs is operationalizing the quality dimension. Latency and errors come from the metric pipeline; quality requires labeled data, eval pipelines, and ongoing measurement against a moving production distribution. The discipline is more involved but the techniques are well-established.
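As a sketch of the moving parts, the following assumes a hypothetical labeled holdout set, an exact-match correctness check, and an illustrative `model_predict` call; none of these names come from a specific tool:

```python
from collections import defaultdict

def run_daily_eval(labeled_set, model_predict):
    """Score a held-out labeled set, overall and per segment.

    labeled_set: iterable of (input, expected_label, segment) tuples.
    model_predict: the inference call. Both are placeholders for
    whatever the team's eval pipeline actually provides.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for example, expected, segment in labeled_set:
        prediction = model_predict(example)
        for key in ("overall", segment):
            totals[key] += 1
            correct[key] += int(prediction == expected)
    # The per-segment numbers surface cohorts the aggregate hides.
    return {key: correct[key] / totals[key] for key in totals}
```

The techniques break down as follows: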
- Eval metrics in production: Run quality evals against production traffic continuously. A held-out labeled set evaluated daily. A subset of production queries replayed against the model and compared to ground truth. Online metrics derived from user behavior (click-through, time-on-result, follow-up queries). Each is a partial signal; together they build a quality dashboard.
- Drift detection: The production input distribution shifts over time. New user cohorts, new query patterns, new edge cases. The model's quality on the shifted distribution may differ from its quality on the eval set. Drift detection compares production input statistics to the training distribution and flags significant shift (see the sketch after this list).
- Per-segment evaluation: Quality evaluation runs separately per important segment (language, region, account tier, query type). The aggregate is one number; the per-segment view surfaces where quality is lower than the headline. Many ML quality issues are concentrated in cohorts the aggregate hides.
- Human-in-the-loop verification: Some quality dimensions require human judgment (factuality of generated text, relevance of recommendations to ambiguous queries). A small ongoing labeling effort produces the ground truth that the eval metrics calibrate against. The investment is real but small compared to building the model.
- Alert on quality regression: When measured quality drops below its threshold (or drift exceeds its limit), an alert fires. The alert routes to the team that owns the model, not just the SRE on call. ML quality regressions have a different remediation path (retrain, fall back to the previous model, gate by cohort) than infrastructure issues.
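One common way to quantify drift on a numeric input feature is the Population Stability Index (PSI); the sketch below is a minimal version, and the 0.2 alert threshold is a widely used rule of thumb rather than a universal standard:

```python
import numpy as np

def psi(train_values: np.ndarray, prod_values: np.ndarray,
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between the training distribution
    and the production distribution of one input feature."""
    # Bin edges come from the training distribution; production values
    # outside that range are clipped into the outermost bins.
    edges = np.histogram_bin_edges(train_values, bins=bins)
    prod_values = np.clip(prod_values, edges[0], edges[-1])

    def proportions(values: np.ndarray) -> np.ndarray:
        counts, _ = np.histogram(values, bins=edges)
        smoothed = counts.astype(float) + eps  # avoid log(0) on empty bins
        return smoothed / smoothed.sum()

    p_train = proportions(train_values)
    p_prod = proportions(prod_values)
    return float(np.sum((p_prod - p_train) * np.log(p_prod / p_train)))

# Rule-of-thumb interpretation: < 0.1 stable, 0.1-0.2 moderate shift,
# >= 0.2 significant shift worth an alert.
```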
Quality tracking turns ML SLOs from theory into practice. Without it, the quality dimension is asserted but not measured, and the team finds out about regressions from customer complaints rather than from the dashboard.
Compound
Multi-dimensional ML SLOs require more instrumentation than single-dimensional service SLOs. The investment is justified because the workload's failure modes are themselves multi-dimensional. The compound view is what makes the SLO useful.
- Multi-dim SLO for ML services: The composite SLO requires all dimensions to be within target for a request to count as successful. Latency above target, an error returned, or quality below threshold all count as failures. The error budget burns from any of the three (see the budget sketch after this list). This is the property that makes the SLO match the user experience.
- Match the workload: Generic SLO frameworks designed for HTTP APIs do not capture ML-specific failure modes. The team has to extend the framework to include quality. The investment is small (a quality SLI alongside the existing ones) but the result is dramatically more honest.
- Per-tier targets: Tier 0 ML services (production fraud detection, content recommendations on the homepage) need tight quality SLOs because users see the impact directly. Tier 2 ML services (internal classifiers used in batch jobs) can have looser quality SLOs. The tiering matches the consequence of quality failure.
- Quality SLO drives retraining cadence: When the quality SLO is at risk, the response often involves retraining the model. The SLO becomes the operational signal that drives the ML team's training pipeline. Retraining without an SLO signal is calendar-driven; with one, it is data-driven.
- Honest with customers: ML services that publish quality SLOs alongside availability SLOs are rare, and the transparency earns trust. The customer who is paying for an inference service wants to know how often the service produces useful outputs, not just whether the API is up. Publishing the quality dimension differentiates the vendor.
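As an illustration of budget accounting that burns from any dimension and routes by failure type, here is a minimal sketch; the target, window size, and routing labels are assumptions, and each bad request is attributed to a single dominant dimension for simplicity:

```python
SLO_TARGET = 0.999            # illustrative composite objective
WINDOW_REQUESTS = 1_000_000   # illustrative rolling-window volume

def budget_remaining(bad_latency: int, bad_errors: int,
                     bad_quality: int) -> float:
    """Fraction of the window's error budget still unspent.

    Each count is the number of requests whose failure was attributed
    to that dimension; any of the three burns the same budget.
    """
    allowed_bad = (1 - SLO_TARGET) * WINDOW_REQUESTS
    burned = bad_latency + bad_errors + bad_quality
    return 1.0 - burned / allowed_bad

def route_page(bad_latency: int, bad_errors: int, bad_quality: int):
    """Route a budget-exhaustion page by the dominant failure dimension."""
    if budget_remaining(bad_latency, bad_errors, bad_quality) > 0.0:
        return None
    if bad_quality >= max(bad_latency, bad_errors):
        # Quality remediation path: retrain, roll back, or gate by cohort.
        return "ml-team"
    return "sre-oncall"
```

The routing split mirrors the remediation point above: a budget burned by quality calls for a model-side response, not an infrastructure one.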
Multi-dimensional ML SLOs are the discipline that makes ML systems operationally honest. Nova AI Ops integrates with eval pipelines, tracks quality alongside latency and errors per ML service, and surfaces the per-segment quality breakdown so the team can see where the model is actually working and where it is not.