SLOs for Async and Batch Workloads
Most SLO advice assumes request-response traffic. Async and batch workloads need different SLI shapes; the patterns are well known but rarely written down.
Why request-SLOs do not fit
Async: messages are processed eventually, not while a caller waits. Batch: jobs run on a schedule.
‘Request success rate’ doesn’t apply to either; the question is ‘did the work get done in time?’
Four async/batch SLI shapes
1. Freshness: data is no older than X.
2. Completeness: at least N% of expected items are processed.
3. Timeliness: jobs complete within their scheduled window.
4. Correctness: outputs match expected results.
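The four shapes can be sketched as small functions. This is a minimal illustration, not a standard API: the function names, parameters, and the choice to treat "nothing expected" as fully complete are assumptions made for the example.

```python
from datetime import datetime, timedelta
from typing import Sequence


def freshness_ok(last_event_time: datetime, now: datetime, max_age: timedelta) -> bool:
    """Freshness: the newest processed data is no older than max_age."""
    return now - last_event_time <= max_age


def completeness_ratio(processed: int, expected: int) -> float:
    """Completeness: fraction of expected items that were processed."""
    if expected == 0:
        return 1.0  # assumption: nothing was expected, so nothing is missing
    return processed / expected


def timeliness_ok(finished_at: datetime, deadline: datetime) -> bool:
    """Timeliness: the job finished at or before the end of its window."""
    return finished_at <= deadline


def correctness_ratio(outputs: Sequence, expected: Sequence) -> float:
    """Correctness: fraction of expected outputs matched by the actual outputs."""
    if not expected:
        return 1.0  # assumption: no reference data means nothing to contradict
    matches = sum(1 for got, want in zip(outputs, expected) if got == want)
    return matches / len(expected)
```

Freshness and timeliness are yes/no checks per observation, while completeness and correctness are ratios that get compared against a target such as the N% above.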
Examples per shape
Stream processor: freshness SLO (‘data lag < 30s’).
Nightly ETL: timeliness SLO (‘completes by 6am’).
ML inference batch: completeness and correctness SLOs.
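As a rough sketch of the first two examples, the checks below use the 30-second lag threshold and the 6am deadline from the text. The function names, the sample data shape, and the use of naive local datetimes are assumptions for illustration.

```python
from datetime import date, datetime, time
from typing import Sequence

LAG_THRESHOLD_S = 30.0     # from the 'data lag < 30s' example
ETL_DEADLINE = time(6, 0)  # from the 'completes by 6am' example


def lag_good_fraction(lag_samples_s: Sequence[float]) -> float:
    """Fraction of lag samples under the threshold (a freshness SLI)."""
    if not lag_samples_s:
        return 0.0  # assumption: no samples means no evidence of freshness
    good = sum(1 for lag in lag_samples_s if lag < LAG_THRESHOLD_S)
    return good / len(lag_samples_s)


def etl_on_time(run_date: date, finished_at: datetime) -> bool:
    """True if the run for run_date finished by 06:00 on that date.

    Assumes finished_at is a naive datetime in the same local time zone
    as the deadline.
    """
    return finished_at <= datetime.combine(run_date, ETL_DEADLINE)
```

For example, `lag_good_fraction([5.0, 12.0, 45.0, 29.9])` counts three of four samples as good, and a run that finishes at 06:01 fails `etl_on_time` even though it succeeded.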
Combining shapes per service
Most async services need two or three SLI shapes. Latency alone misses the failures users actually see, such as stale, missing, or wrong data.
Document each SLI’s definition in code, and emit it the same way from every component that reports it.
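One way to keep SLI definitions in code is a small registry that every emitter imports. The sketch below is hypothetical throughout: the `SLI` class, the metric names, the definitions, and the targets are invented for the ML inference batch example and do not come from any particular library.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Sequence


class Shape(Enum):
    FRESHNESS = "freshness"
    COMPLETENESS = "completeness"
    TIMELINESS = "timeliness"
    CORRECTNESS = "correctness"


@dataclass(frozen=True)
class SLI:
    name: str        # metric name used by every component that emits this SLI
    shape: Shape
    definition: str  # human-readable definition, kept next to the code
    target: float    # minimum acceptable ratio, between 0 and 1


# Hypothetical SLI set for an ML inference batch service.
ML_BATCH_SLIS = [
    SLI("batch_completeness", Shape.COMPLETENESS,
        "items scored / items expected in the input manifest", 0.995),
    SLI("batch_correctness", Shape.CORRECTNESS,
        "sampled outputs matching a reference / outputs sampled", 0.99),
]


def breaches(slis: Sequence[SLI], measured: Dict[str, float]) -> List[str]:
    """Names of SLIs whose measured ratio is below target or was not emitted."""
    return [s.name for s in slis if measured.get(s.name, 0.0) < s.target]
```

Treating a missing measurement as a breach is a deliberate choice here: an SLI that stops being emitted should not look healthy.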
Antipatterns
- Request-only SLOs for batch. Wrong metric.
- One SLI for everything. Misses real failures.
- SLO matched to current job runtime. The margin shrinks as jobs grow; set the target from when consumers need the output.
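A small sketch of how the last antipattern might be watched for: track how much of the scheduled window each run leaves unused, and flag the job when that headroom gets thin. The function names and the 20% default threshold are assumptions chosen for illustration.

```python
from datetime import timedelta
from typing import Sequence


def headroom_fraction(window: timedelta, runtime: timedelta) -> float:
    """Share of the window left unused by a run (negative if it overran)."""
    return (window - runtime) / window


def low_headroom(window: timedelta, runtimes: Sequence[timedelta],
                 min_headroom: float = 0.2) -> bool:
    """True if the most recent run left less than min_headroom of the window."""
    if not runtimes:
        return False  # assumption: no runs yet means nothing to flag
    return headroom_fraction(window, runtimes[-1]) < min_headroom
```

With a four-hour window, a job whose runtime has grown from 2h to 3.5h has dropped from 50% headroom to 12.5%, which is a signal to optimize the job or renegotiate the window before the SLO is missed.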
What to do this week
Three moves. (1) Apply these SLI shapes to your highest-impact async or batch service. (2) Measure adherence for 30 days. (3) Rewrite the policy or the SLO if the gap between target and measured adherence is durable.
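Step (2) can be as simple as recording one pass/fail result per day and comparing the ratio to the target. This is a minimal sketch; the function names and the daily pass/fail representation are assumptions, and what counts as a "durable" gap is left to the team.

```python
from typing import Sequence


def adherence(daily_met: Sequence[bool]) -> float:
    """Fraction of days in the window on which the SLO was met."""
    if not daily_met:
        raise ValueError("need at least one day of data")
    return sum(1 for met in daily_met if met) / len(daily_met)


def gap(daily_met: Sequence[bool], target: float) -> float:
    """Shortfall against the target; positive means the SLO is being missed."""
    return target - adherence(daily_met)
```

For a 30-day window with 27 good days, adherence is 27/30, which falls short of a 95% target and would prompt the review in step (3) if it persists.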