SLOs for Async and Batch Workloads

Most SLO advice assumes request-response services. Async and batch workloads need different SLI shapes; the patterns are well known but rarely written down.

Why request-SLOs do not fit

Async services process messages eventually; batch jobs run on a schedule. "Request success rate" does not describe either; the question for async and batch is "did the work get done in time?"

Async services. Messages processed eventually; latency means queue lag, not request RTT; success means processed at all.
Batch jobs. Run on schedule; a "successful request" is meaningless; success means completed within the window.
Request rate is the wrong shape. A batch job that makes one request a day and reports a 100% success rate says nothing about whether it finished on time or processed all of its input.
The right shape. Freshness, completeness, timeliness, correctness; four SLI shapes that fit async and batch reality.

Four async/batch SLI shapes

1. Freshness: data is no older than X.
2. Completeness: N% of expected items are processed.
3. Timeliness: jobs complete within their scheduled window.
4. Correctness: outputs match expected values.
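
As a rough illustration, the four shapes can each be written as a small function. This is a minimal sketch, not a reference implementation: the function names, the argument types, and the choice to treat an empty expectation as fully met are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone


def freshness_ok(last_update: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the newest data is no older than max_age."""
    return now - last_update <= max_age


def completeness(processed: int, expected: int) -> float:
    """Completeness: fraction of expected items that were processed."""
    if expected == 0:
        return 1.0  # assumption: nothing expected means nothing is missing
    return min(processed, expected) / expected


def timeliness_ok(finished_at: datetime, deadline: datetime) -> bool:
    """Timeliness: the job finished at or before its deadline."""
    return finished_at <= deadline


def correctness(outputs: list[float], expected: list[float], tolerance: float) -> float:
    """Correctness: fraction of expected values matched within tolerance.

    Outputs that are missing entirely count as incorrect, because the
    denominator is the number of expected values.
    """
    if not expected:
        return 1.0
    good = sum(1 for out, exp in zip(outputs, expected) if abs(out - exp) <= tolerance)
    return good / len(expected)


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    print(freshness_ok(now - timedelta(seconds=20), now, timedelta(seconds=30)))
    print(completeness(processed=999, expected=1000))
```

Freshness and timeliness are per-observation booleans, while completeness and correctness are ratios; an SLO then states how often the boolean must hold or how high the ratio must be.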

Examples per shape

Each SLI shape maps cleanly to a workload pattern. The mapping makes SLO definition mechanical; pick the workload, pick the matching shape.

Stream processor. Freshness SLO: "data lag < 30s for 99% of the month"; matches user expectation of "live."
Nightly ETL. Timeliness SLO: "completes by 6am 99.5% of nights"; matches downstream consumer dependency.
ML inference batch. Completeness plus correctness: "99.9% of expected predictions written, 99% within tolerance."
Event consumer. Freshness plus completeness: "lag < 5 minutes AND every event processed within 1 hour."
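
Targets such as "99% of the month" or "99.5% of nights" are attainment ratios over a window. The sketch below shows one way to compute them for the stream-processor and nightly-ETL examples. It assumes lag is sampled at a fixed interval and that each run is recorded as a (deadline, finished_at) pair; both representations are hypothetical, as are the function names.

```python
from datetime import datetime
from typing import Optional


def attainment(good: int, total: int) -> float:
    """Fraction of good observations; an empty window counts as met (assumption)."""
    return good / total if total else 1.0


def freshness_attainment(lag_samples_s: list[float], max_lag_s: float) -> float:
    """Share of lag samples strictly under the threshold, e.g. one sample per minute."""
    good = sum(1 for lag in lag_samples_s if lag < max_lag_s)
    return attainment(good, len(lag_samples_s))


def timeliness_attainment(runs: list[tuple[datetime, Optional[datetime]]]) -> float:
    """Share of runs that finished by their deadline.

    Each run is (deadline, finished_at); finished_at is None when the run
    never completed, which counts as a miss.
    """
    good = sum(
        1 for deadline, finished in runs
        if finished is not None and finished <= deadline
    )
    return attainment(good, len(runs))


def slo_met(measured: float, target: float) -> bool:
    """Compare measured attainment against the SLO target (both as fractions)."""
    return measured >= target


if __name__ == "__main__":
    lags = [4.0, 12.5, 31.0, 8.2]  # seconds; made-up sample values
    print(slo_met(freshness_attainment(lags, max_lag_s=30.0), target=0.99))
```

Counting a run that never finished as a miss matters for batch work: a job that silently fails to run would otherwise leave no data point and no violation.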

Combining shapes per service

Most async services need two or three SLI shapes at once. A latency SLI alone misses the failures users actually see; combining freshness, completeness, and correctness covers the workload shape.

Two or three SLIs is normal. One SLI rarely captures async reliability; combine the shapes the workload actually has.
Define each SLI in code. The metric emission lives with the workload; avoids drift between definition and measurement.
Emit consistently. Same names, same labels, same intervals; supports cross-service comparison and reuse.
Document the rationale. Per-SLI: what user behaviour it models; why this threshold; what business outcome it protects.
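
One way to keep definition, emission, and rationale together is a small definition object that lives next to the workload. The sketch below is an assumption-laden illustration: `SLIDefinition`, `emit`, the field names, and the default label keys are all invented for this example and do not correspond to any particular metrics library.

```python
from dataclasses import dataclass

VALID_SHAPES = {"freshness", "completeness", "timeliness", "correctness"}


@dataclass(frozen=True)
class SLIDefinition:
    name: str        # metric name, reused across services
    shape: str       # one of VALID_SHAPES
    target: float    # SLO target as a fraction, e.g. 0.995
    threshold: str   # human-readable good-event condition
    rationale: str   # user behaviour modelled and outcome protected
    labels: tuple = ("service", "pipeline")  # hypothetical shared label keys

    def __post_init__(self) -> None:
        if self.shape not in VALID_SHAPES:
            raise ValueError(f"{self.name}: unknown SLI shape {self.shape!r}")
        if not 0.0 < self.target <= 1.0:
            raise ValueError(f"{self.name}: target must be in (0, 1]")
        if not self.rationale.strip():
            raise ValueError(f"{self.name}: rationale is required")


def emit(sli: SLIDefinition, good: int, total: int, **label_values: str) -> dict:
    """Build one measurement record; reject labels that drift from the definition."""
    if set(label_values) != set(sli.labels):
        raise ValueError(
            f"{sli.name}: expected labels {sorted(sli.labels)}, got {sorted(label_values)}"
        )
    return {
        "name": sli.name,
        "shape": sli.shape,
        "good": good,
        "total": total,
        "labels": dict(label_values),
    }


# Example definition for the nightly ETL case; the wording is illustrative.
NIGHTLY_ETL_ON_TIME = SLIDefinition(
    name="job_on_time",
    shape="timeliness",
    target=0.995,
    threshold="run completes by 06:00",
    rationale="Downstream consumers depend on the output being ready by 06:00.",
)

record = emit(NIGHTLY_ETL_ON_TIME, good=1, total=1, service="warehouse", pipeline="nightly_etl")
```

Making the rationale a required field and rejecting unexpected labels are design choices for the sketch; they turn "document the rationale" and "emit consistently" into checks that fail loudly instead of conventions that erode.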

Antipatterns

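Request-only SLOs for batch. Wrong metric; request success says nothing about whether the job completed within its window.
One SLI for everything. Misses real failures; a single SLI rarely captures freshness, completeness, and correctness together.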
SLO matched to job runtime. Tightens as jobs grow; a target set from today's runtime loses its headroom as the job takes longer.
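
The last antipattern can be made concrete with a headroom calculation. The sketch below assumes runtime grows at a steady compound rate per month, which is a simplification chosen for the example; `months_until_breach` and its parameters are hypothetical names.

```python
import math


def months_until_breach(runtime_min: float, window_min: float, monthly_growth: float) -> float:
    """Months until runtime exceeds the window.

    Assumes runtime compounds at monthly_growth (0.05 means 5% per month).
    Returns 0.0 if there is no headroom today and math.inf if runtime is not growing.
    """
    if runtime_min >= window_min:
        return 0.0
    if monthly_growth <= 0:
        return math.inf
    return math.log(window_min / runtime_min) / math.log(1 + monthly_growth)


if __name__ == "__main__":
    # Window set equal to today's runtime: no headroom at all.
    print(months_until_breach(runtime_min=120, window_min=120, monthly_growth=0.05))
    # Window set from the consumer's deadline instead: headroom is measurable.
    print(months_until_breach(runtime_min=120, window_min=240, monthly_growth=0.05))
```

Deriving the window from what the downstream consumer needs, not from how long the job takes today, is what gives the calculation a meaningful answer.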

What to do this week

Three moves. (1) Pick the SLI shapes that match your most impactful async or batch service and define SLOs for them. (2) Measure adherence for 30 days. (3) If the gap between target and measurement persists, rewrite the policy or the SLO.