SLOs for Batching Systems
Batch pipelines need per-job, per-day, and output-quality SLOs.
Per-job
SLOs designed for request-response services translate badly to batch systems. A nightly ETL job that finishes at 4 AM most days and at 7 AM occasionally is not "available" or "unavailable" in the request sense. It is on time or late, complete or partial, correct or wrong. The right framing is a per-job SLO that encodes those dimensions explicitly.
What a useful per-job SLO looks like:
- 95% of jobs complete on time: "On time" means before a published deadline (the service-level deadline, SLD), measured per job ID. A nightly job with a 6 AM deadline either finished by 5:59 AM or it did not. Aggregating across 30 days gives the success rate; a minimal computation sketch follows this list.
- Reliability before optimization: A pipeline that runs in 45 minutes most nights and 4 hours occasionally is less useful than one that always runs in 90 minutes. Reduce variance first, optimize the median second. Downstream systems depend on predictability, not on best-case speed.
- Treat partial completion as failure: If 90% of the data lands but 10% is missing, the job did not succeed. Partial outputs that look complete are how downstream consumers silently learn to distrust the pipeline. Either everything completed or the job failed.
- Per-job retry budget: The SLO measures end-to-end success, including retries. A job that needed three retries to succeed inside the deadline is still on time, but it has consumed retry budget, which limits how many simultaneous failures the system can absorb.
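A minimal sketch of the per-job on-time computation under these definitions. The JobRun record, its field names, and the 30-day evaluation window are illustrative assumptions, not a real scheduler API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class JobRun:
    job_id: str                      # one record per scheduled run, e.g. a nightly ETL instance
    deadline: datetime               # the published service-level deadline (SLD) for this run
    finished_at: Optional[datetime]  # None if the run never completed
    complete: bool                   # True only if all expected output landed (partial = failure)

def on_time_rate(runs: list[JobRun]) -> float:
    """Fraction of runs that finished completely before their deadline.

    Retries are deliberately invisible here: the SLO measures end-to-end
    success per job ID, however many attempts that took.
    """
    if not runs:
        return 1.0
    ok = sum(
        1
        for r in runs
        if r.complete and r.finished_at is not None and r.finished_at <= r.deadline
    )
    return ok / len(runs)

# Evaluated over a rolling 30-day window against the 95% target:
# compliant = on_time_rate(last_30_days) >= 0.95
```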
The per-job SLO is the contract individual jobs make with their downstream consumers. It is the layer at which most batch SRE work happens.
Per-day
Per-job SLOs miss the case where every individual job ran fine but the system as a whole did not produce what it was supposed to. A daily rollup that depends on 20 hourly partitions, where 19 succeeded and one is missing, has a 95% per-job success rate and a 0% per-day success rate. The per-day SLO catches the gap.
- All required batches present per day: The denominator is the count of jobs that were supposed to run today (per the schedule). The numerator is the count that produced acceptable output. The ratio is the per-day success metric, and anything less than 100% is a coverage gap; a coverage sketch follows this list.
- Coverage failures escalate fast: Missing one of 20 hourly batches is often the start of a cascading failure (the next hour cannot proceed without the previous one). Per-day SLO breaches need to fire louder than per-job ones because the impact compounds.
- Distinguish from the per-job rate: A 95% per-job success rate sounds healthy, but if those failures cluster on critical jobs, the per-day picture can be much worse. Track both numbers; do not let the per-job rate hide a coverage problem.
- Cross-system per-day rollup: When multiple pipelines feed a daily reporting system, the per-day metric is computed across all of them. If pipeline A succeeded but pipeline B did not produce its required input, the day failed end to end even though A's per-day rate is 100%.
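A sketch of the per-day coverage calculation, assuming the schedule and today's acceptable outputs can both be expressed as sets of job IDs; the names and example partitions are hypothetical.

```python
def per_day_coverage(scheduled: set[str], acceptable: set[str]) -> float:
    """Coverage for one day: acceptable outputs / jobs that were supposed to run.

    Anything below 1.0 is a coverage gap, even if the fleet-wide per-job
    success rate still looks healthy.
    """
    if not scheduled:
        return 1.0
    return len(scheduled & acceptable) / len(scheduled)

# 20 hourly partitions scheduled, one missing: the per-job view says 95%,
# but the daily rollup that needs all 20 cannot run, so the day fails.
scheduled = {f"hourly-{h:02d}" for h in range(20)}
acceptable = scheduled - {"hourly-17"}
assert per_day_coverage(scheduled, acceptable) == 0.95  # coverage gap -> day failed
```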
The per-day SLO is the contract the batch system makes with the business. It is the number that should appear on the executive reliability dashboard, not the per-job rate.
Output quality
The third dimension is the one batch teams skip most often: the output's correctness, not just its existence. A job that ran on time and produced complete output but emitted invalid data is a worse failure than one that did not run at all, because consumers will trust and use the bad data without realizing it is wrong.
- Schema validation: Every output gets validated against its declared schema before it is published: missing columns, wrong types, null where non-null is expected. Schema-level invariants catch the obvious failures cheaply.
- Sample correctness checks: Spot-check a small sample of output rows against expected invariants (partition sums match the expected total, every customer has at least one row, no row crosses a logical boundary it should not). Sampling catches semantic bugs that schema validation misses.
- Diff against yesterday: Compare today's output to yesterday's at the aggregate level: row count within tolerance, key counts within tolerance, no impossible distribution shifts. A pipeline that suddenly produces 10x or 0.1x the expected volume needs to be flagged before downstream systems consume it.
- Quarantine on validation failure: When a check fails, the output is held back from publication, the on-call is paged, and downstream consumers keep using the previous successful output. This is the difference between a data quality SLO that protects consumers and one that just tells them after the fact what went wrong; a validation-gate sketch follows this list.
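A sketch of a pre-publication gate that combines the schema check, the diff against yesterday, and the quarantine decision. The column-set schema, the 50% volume tolerance, and the ValidationResult shape are illustrative assumptions standing in for whatever validation framework the pipeline actually uses.

```python
from dataclasses import dataclass, field

@dataclass
class ValidationResult:
    publish: bool                      # False means quarantine: keep serving yesterday's output
    reasons: list[str] = field(default_factory=list)

def validate_output(
    rows: list[dict],
    required_columns: set[str],
    yesterday_row_count: int,
    volume_tolerance: float = 0.5,     # flag day-over-day swings beyond +/-50% (illustrative)
) -> ValidationResult:
    reasons: list[str] = []

    # Schema-level invariants: required columns present and non-null.
    # Fail fast on the first violation; the point is a cheap, obvious check.
    for i, row in enumerate(rows):
        missing = required_columns - row.keys()
        if missing:
            reasons.append(f"row {i}: missing columns {sorted(missing)}")
            break
        nulls = [c for c in required_columns if row[c] is None]
        if nulls:
            reasons.append(f"row {i}: null in non-null columns {nulls}")
            break

    # Diff against yesterday: impossible volume shifts are held back.
    if yesterday_row_count > 0:
        ratio = len(rows) / yesterday_row_count
        if not (1 - volume_tolerance <= ratio <= 1 + volume_tolerance):
            reasons.append(f"row count {len(rows)} is {ratio:.2f}x yesterday's")

    # Quarantine on any failure: do not publish, page the on-call, and leave
    # downstream consumers on the previous good output.
    return ValidationResult(publish=not reasons, reasons=reasons)
```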
Per-job timing, per-day coverage, and per-output quality are the three dimensions of a batch SLO that actually reflects whether the system is doing its job. Nova AI Ops tracks all three per pipeline, computes the SLO compliance per dimension, and pages on coverage gaps and output anomalies before downstream consumers build dashboards on top of bad data.
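One way the three dimensions could be rolled up into a per-pipeline view; the report shape, targets, and paging policy below are assumptions for illustration, not a description of Nova AI Ops internals.

```python
from dataclasses import dataclass

@dataclass
class PipelineSLOReport:
    on_time_rate: float   # per-job dimension over the rolling window (target: >= 0.95)
    coverage: float       # per-day dimension for today (target: 1.0)
    quality_ok: bool      # did today's output pass the validation gate?

    def page_worthy(self) -> bool:
        # Coverage gaps and output anomalies page immediately; a dip in the
        # on-time rate is reviewed asynchronously against its error budget.
        return self.coverage < 1.0 or not self.quality_ok
```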