SLOs for Async and Batch Workloads
Most SLO advice assumes request-response traffic. Async and batch workloads need different SLI shapes; the patterns are well known but rarely written down.
Why request-SLOs do not fit
Async services process messages eventually; batch jobs run on a schedule. "Request success rate" does not describe either; the question for async and batch is "did the work get done in time?"
- Async services. Messages processed eventually; latency means queue lag, not request RTT; success means processed at all.
- Batch jobs. Run on schedule; a "successful request" is meaningless; success means completed within the window.
- Request rate is the wrong shape. A batch job that handles one "request" a day and reports a 100% success rate tells you nothing about whether it finished on time or produced complete output.
- The right shape. Freshness, completeness, timeliness, correctness; four SLI shapes that fit async and batch reality.
Four async/batch SLI shapes
- 1. Freshness: the newest data is no older than X.
- 2. Completeness: at least N% of expected items were processed.
- 3. Timeliness: jobs completed within their scheduled window.
- 4. Correctness: outputs match expected values, within a stated tolerance.
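As a minimal sketch, each of the four shapes can be written as a small function. The function names, the parameters, and the choice to count missing outputs as incorrect are assumptions made for illustration, not a standard API.

```python
from datetime import datetime, timedelta, timezone


def freshness_ok(last_updated: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the newest data is not older than max_age."""
    return now - last_updated <= max_age


def completeness(processed: int, expected: int) -> float:
    """Completeness: fraction of expected items that were processed."""
    if expected == 0:
        return 1.0  # nothing was expected, so nothing is missing
    return min(processed, expected) / expected


def timeliness_ok(finished_at: datetime, deadline: datetime) -> bool:
    """Timeliness: the job finished at or before the end of its window."""
    return finished_at <= deadline


def correctness(outputs: list[float], expected: list[float], tolerance: float) -> float:
    """Correctness: fraction of expected values matched within tolerance.

    Assumption: outputs and expected are aligned by position, and an
    expected value with no corresponding output counts as incorrect.
    """
    if not expected:
        return 1.0
    good = sum(1 for out, exp in zip(outputs, expected) if abs(out - exp) <= tolerance)
    return good / len(expected)
```

Freshness and timeliness are yes/no checks per measurement, while completeness and correctness are ratios; the SLO target is then applied to how often the check passes or to the ratio itself.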
Examples per shape
Each SLI shape maps cleanly to a workload pattern. The mapping makes SLO definition mechanical; pick the workload, pick the matching shape.
- Stream processor. Freshness SLO: "data lag < 30s for 99% of the month"; matches user expectation of "live."
- Nightly ETL. Timeliness SLO: "completes by 6am 99.5% of nights"; matches downstream consumer dependency.
- ML inference batch. Completeness plus correctness: "99.9% of expected predictions written, 99% within tolerance."
- Event consumer. Freshness plus completeness: "lag < 5 minutes AND every event processed within 1 hour."
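To make targets such as "99% of the month" or "99.5% of nights" concrete, the sketch below computes attainment from raw measurements and compares it to the target. The sample data is invented, and the helper names are hypothetical; it also assumes finish times are recorded as local time of day on the morning of the run.

```python
from datetime import time


def attainment(good: int, total: int) -> float:
    """Fraction of samples (or job runs) that met the SLI."""
    return good / total if total else 1.0


def lag_attainment(lag_seconds: list[float], threshold_s: float) -> float:
    """Share of lag samples strictly below the threshold (freshness SLI)."""
    good = sum(1 for lag in lag_seconds if lag < threshold_s)
    return attainment(good, len(lag_seconds))


def on_time_attainment(finish_times: list[time], deadline: time) -> float:
    """Share of nightly runs finished at or before the deadline (timeliness SLI)."""
    good = sum(1 for t in finish_times if t <= deadline)
    return attainment(good, len(finish_times))


def meets_slo(attained: float, target: float) -> bool:
    """True when measured attainment reaches the SLO target."""
    return attained >= target


# Invented data: 1,000 lag samples, 8 of them above the 30 s threshold.
lags = [12.0] * 992 + [45.0] * 8
stream_ok = meets_slo(lag_attainment(lags, 30.0), 0.99)

# Invented data: 200 nights, 2 of which finished after 06:00.
finishes = [time(5, 10)] * 198 + [time(6, 40)] * 2
etl_ok = meets_slo(on_time_attainment(finishes, time(6, 0)), 0.995)
```

With this data the stream processor meets its freshness target (992 of 1,000 samples are good), while the nightly ETL misses its timeliness target (198 of 200 nights is 99%, below 99.5%).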
Combining shapes per service
Most async services need two or three SLIs at once. Latency alone misses user-visible failures such as dropped or incorrect output; combining freshness, completeness, and correctness covers them.
- Two or three SLIs is normal. One SLI rarely captures async reliability; combine the shapes the workload actually has.
- Define each SLI in code. The metric emission lives with the workload; avoids drift between definition and measurement.
- Emit consistently. Same names, same labels, same intervals; supports cross-service comparison and reuse.
- Document the rationale. Per-SLI: what user behaviour it models; why this threshold; what business outcome it protects.
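One way to keep definition, emission, and rationale together is a small definition object that lives next to the workload. The sketch below uses only the standard library; the class, the field names, the `sli_<shape>_<unit>` naming scheme, the JSON-line output, and the example service and target are all assumptions for illustration, not the format of any particular metrics system.

```python
import json
from dataclasses import dataclass

VALID_SHAPES = {"freshness", "completeness", "timeliness", "correctness"}


@dataclass(frozen=True)
class SLIDefinition:
    service: str
    shape: str              # one of VALID_SHAPES
    unit: str               # e.g. "seconds" or "ratio"
    threshold: float
    higher_is_better: bool  # False for lag, True for completeness ratios
    target: float           # SLO target, e.g. 0.99
    rationale: str          # what user behaviour or business outcome it protects

    def __post_init__(self) -> None:
        if self.shape not in VALID_SHAPES:
            raise ValueError(f"unknown SLI shape: {self.shape!r}")

    @property
    def metric_name(self) -> str:
        # Same naming scheme for every service, so dashboards can be reused.
        return f"sli_{self.shape}_{self.unit}"

    def is_good(self, value: float) -> bool:
        if self.higher_is_better:
            return value >= self.threshold
        return value < self.threshold


def emit(sli: SLIDefinition, value: float) -> str:
    """Render one measurement as a JSON line with a fixed label set."""
    record = {
        "metric": sli.metric_name,
        "labels": {"service": sli.service, "shape": sli.shape},
        "value": value,
        "good": sli.is_good(value),
    }
    return json.dumps(record, sort_keys=True)


# Hypothetical definition for an event consumer's lag SLI.
event_lag = SLIDefinition(
    service="event-consumer", shape="freshness", unit="seconds",
    threshold=300.0, higher_is_better=False, target=0.99,
    rationale="Consumers treat data older than 5 minutes as stale.",
)
line = emit(event_lag, 42.0)
```

Because the threshold and rationale sit in the same object that produces the measurement, changing one without the other shows up in code review rather than as drift between a wiki page and a dashboard.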
Antipatterns
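- Request-only SLOs for batch. Wrong metric; request success rate says nothing about whether the job finished within its window.
- One SLI for everything. Misses real failures; a pipeline can be fresh and still drop or corrupt items.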
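- SLO matched to job runtime. A target derived from how long the job takes today, rather than from when consumers need the output, tightens as jobs grow.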
What to do this week
Three moves. (1) Pick the matching SLI shapes for your most impactful async or batch service. (2) Measure adherence for 30 days. (3) If the gap between target and measured adherence persists, rewrite the SLO or the policy behind it.