By Samson Tanimawo, PhD. Published Aug 4, 2025.

SLOs for Batching Systems

Batches: per-job and per-day SLOs.

Per-job

SLOs designed for request-response services translate badly to batch systems. A nightly ETL job that finishes at 4 AM most days and at 7 AM occasionally is not "available" or "unavailable" in the request sense. It is on time or late, complete or partial, correct or wrong. The right framing is a per-job SLO that encodes those dimensions explicitly.

A useful per-job SLO states, for each job, what on time, complete, and correct mean in measurable terms.

The per-job SLO is the contract individual jobs make with their downstream consumers. It is the layer at which most batch SRE work happens.
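As a rough illustration of a per-job SLO that encodes timeliness and completeness separately, here is a minimal Python sketch. The names `JobRun`, `PerJobSLO`, and `evaluate`, the 5 AM deadline, and the 99% completeness floor are all assumptions made for the example, not part of any particular scheduler or tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, Optional


@dataclass
class JobRun:
    """One execution of a batch job (all field names are hypothetical)."""
    job: str
    finished_at: Optional[datetime]  # None means the run never finished
    rows_written: int
    rows_expected: int


@dataclass
class PerJobSLO:
    """Illustrative per-job targets: a deadline and a completeness floor."""
    deadline: datetime
    min_completeness: float  # fraction of expected rows, e.g. 0.99


def evaluate(run: JobRun, slo: PerJobSLO) -> Dict[str, bool]:
    """Return one verdict per dimension instead of a single up/down flag."""
    finished = run.finished_at is not None
    on_time = finished and run.finished_at <= slo.deadline
    if run.rows_expected > 0:
        completeness = run.rows_written / run.rows_expected
    else:
        completeness = 0.0
    complete = finished and completeness >= slo.min_completeness
    return {"on_time": on_time, "complete": complete}


# Assumed targets: done by 5 AM with at least 99% of expected rows.
slo = PerJobSLO(deadline=datetime(2025, 8, 4, 5, 0), min_completeness=0.99)

# The nightly ETL job on an occasional bad day: all rows written, but at 7 AM.
late_run = JobRun("nightly_etl", datetime(2025, 8, 4, 7, 0), 1000, 1000)
print(evaluate(late_run, slo))  # {'on_time': False, 'complete': True}
```

Keeping the verdicts separate lets a late but complete run be reported differently from an on-time but partial one, which is the distinction a request-style availability number hides.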

Per-day

Per-job SLOs miss the case where nearly every individual job ran fine but the system as a whole did not produce what it was supposed to. A daily rollup that depends on 20 hourly partitions, where 19 ran fine and one is missing, has a 95% per-job success rate (19 of 20) and a 0% per-day success rate, because the day's rollup cannot be completed. The per-day SLO catches the gap.
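The arithmetic in the 20-partition example can be sketched in a few lines of Python. The partition labels and function names below are hypothetical, and the sketch assumes a day counts as met only when every expected partition is present.

```python
from typing import Set


def per_job_success_rate(succeeded: Set[str], expected: Set[str]) -> float:
    """Fraction of expected partitions whose job succeeded."""
    return len(succeeded & expected) / len(expected)


def per_day_met(succeeded: Set[str], expected: Set[str]) -> bool:
    """The day is met only if no expected partition is missing."""
    return expected <= succeeded


def missing_partitions(succeeded: Set[str], expected: Set[str]) -> Set[str]:
    """Partitions the rollup needs but does not have."""
    return expected - succeeded


# 20 expected partitions with hypothetical labels "00".."19"; one is missing.
expected = {f"{i:02d}" for i in range(20)}
succeeded = expected - {"13"}

print(per_job_success_rate(succeeded, expected))  # 0.95
print(per_day_met(succeeded, expected))           # False
print(missing_partitions(succeeded, expected))    # {'13'}
```

The per-day check is a set comparison rather than an average, which is why a single missing partition takes the day from met to not met.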

The per-day SLO is the contract the batch system makes with the business. It is the number that should appear on the executive reliability dashboard, not the per-job rate.

Output quality

The third dimension is the one batch teams skip most often: the output's correctness, not just its existence. A job that ran on time and produced complete output but emitted invalid data is a worse failure than one that did not run at all, because consumers will trust and use the bad data without realizing it is wrong.
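One way to make output quality measurable is a gate that validates rows before the output is published. The sketch below is only an illustration: the row schema (`user_id`, `amount`), the two checks, and the 1% invalid-row threshold are assumptions, and real pipelines would choose checks that fit their own data.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[Row], bool]


def invalid_fraction(rows: List[Row], checks: List[Check]) -> float:
    """Fraction of rows failing at least one check.

    Empty output is treated as fully invalid (an assumption of this sketch).
    """
    if not rows:
        return 1.0
    bad = sum(1 for row in rows if not all(check(row) for check in checks))
    return bad / len(rows)


def quality_ok(rows: List[Row], checks: List[Check], max_invalid: float) -> bool:
    """True when the invalid fraction is within the allowed threshold."""
    return invalid_fraction(rows, checks) <= max_invalid


# Hypothetical checks for a hypothetical schema.
def has_user_id(row: Row) -> bool:
    return row.get("user_id") is not None


def non_negative_amount(row: Row) -> bool:
    amount = row.get("amount")
    return isinstance(amount, (int, float)) and amount >= 0


rows = [
    {"user_id": 1, "amount": 10.0},
    {"user_id": 2, "amount": -5.0},  # fails non_negative_amount
    {"user_id": 3, "amount": 7.5},
]
checks = [has_user_id, non_negative_amount]

# One of three rows is invalid, so a 1% threshold blocks publication.
print(quality_ok(rows, checks, max_invalid=0.01))  # False
```

Running the gate before publishing means a job that ran on time with bad output fails visibly instead of handing consumers data they will trust.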

Per-job timing, per-day coverage, and per-output quality are the three dimensions of a batch SLO that actually reflects whether the system is doing its job. Nova AI Ops tracks all three per pipeline, computes the SLO compliance per dimension, and pages on coverage gaps and output anomalies before downstream consumers build dashboards on top of bad data.