SLOs on Batch Jobs
Batch jobs need their own SLOs: duration, frequency, and output quality.
Duration
Most SLO frameworks were designed around request-response services where every call is independent and the question is "did this single call succeed?" Batch jobs do not fit that shape. A nightly ETL is one big operation that either finishes on time or does not. Latency is not an interesting metric; total duration is. The right SLO for a batch job is a duration target measured per run.
What duration SLOs look like:
- 95% of runs complete within X minutes.: Pick X based on the consumer's tolerance. A daily report consumed at 9 AM tolerates a job that finishes by 6 AM but not one that finishes at 10 AM. The target is "complete by the published deadline" expressed as a percentage over a rolling window.
- Predictable beats fast.: A job that runs in 45 minutes most days and 4 hours occasionally is harder to depend on than one that always runs in 90 minutes. Reduce variance first, optimize the median second. Downstream consumers care about predictability more than they care about best-case speed.
- Per-job ID.: Each scheduled job has its own duration SLO based on its own consumer expectations. The nightly billing rollup has a different deadline than the hourly cache refresh. Targets are per job, not per pipeline.
- Window-based, not per-run.: The SLO is computed over 30 days, not on each individual run. A job that occasionally misses its deadline but mostly hits it is acceptable; the SLO measures the percentage of runs that hit. A job that always hits but with a shrinking margin is borderline; the trajectory matters.
- Variance as a leading indicator.: The variance of run durations is itself a signal. A job whose median is stable but whose tail is growing is heading toward an SLO breach. Tracking variance alongside the median catches drift before it crosses the deadline; the sketch after this list computes both.
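The arithmetic behind these targets is small enough to sketch. The following is a minimal illustration, assuming run records that carry a start time, a duration, and a success flag; the field names and the 30-day window are illustrative, not tied to any particular scheduler:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median, quantiles

@dataclass
class RunRecord:
    started_at: datetime
    duration_minutes: float
    succeeded: bool

def duration_slo(runs: list[RunRecord], deadline_minutes: float,
                 window_days: int = 30,
                 now: datetime | None = None) -> dict:
    """Rolling-window duration SLO: the share of runs that finished within the deadline.

    Runs that failed before completing count as misses -- they did not finish in time.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=window_days)
    windowed = [r for r in runs if r.started_at >= cutoff]
    if not windowed:
        return {"compliance": None, "median_minutes": None, "p95_minutes": None}

    hits = sum(1 for r in windowed
               if r.succeeded and r.duration_minutes <= deadline_minutes)
    durations = sorted(r.duration_minutes for r in windowed if r.succeeded)
    return {
        "compliance": hits / len(windowed),
        "median_minutes": median(durations) if durations else None,
        # p95 as the tail signal: a widening gap between median and p95 is the
        # variance drift described in the last bullet above.
        "p95_minutes": quantiles(durations, n=20)[-1] if len(durations) >= 2 else None,
    }
```

A widening gap between the median and the p95 here is exactly the drift the last bullet describes: compliance can still be green while the tail creeps toward the deadline.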
Duration SLOs reframe the conversation about batch jobs from "did it work?" to "did it work in time?" The latter is what consumers actually care about and what the SLO should commit to.
Frequency
Duration SLOs miss the failure mode where the job did not run at all. A scheduled job that simply did not fire on Tuesday produces no data, no errors, no signal that anything is wrong. The frequency SLO catches this.
- Job runs as scheduled.: The denominator is the number of times the job was supposed to run in the window; the numerator is the number of times it actually ran. The ratio is the frequency SLO. A job that is supposed to run every hour and ran 168 times last week (24 × 7) was 100% frequency-reliable. The sketch after this list shows the arithmetic.
- No skipped runs.: Even runs that started but failed before completion count as failures for both duration and frequency. The most common batch failure is the run that did not happen because the scheduler was misconfigured, the upstream dependency was down, or the credentials had expired silently.
- Reliability of the schedule itself.: Frequency SLOs catch issues with the orchestration layer (Airflow, dbt, custom cron) that duration SLOs miss. A scheduler with a 95% trigger rate is one that loses 5% of expected work, which is a much larger problem than a 5% tail latency.
- Per-job and aggregate.: Track frequency per job for direct accountability. Aggregate across jobs in a pipeline to get the pipeline's overall reliability. Both views matter; both surface different failure patterns.
- Alert on consecutive misses.: A single missed run is sometimes recoverable; multiple missed runs in a row is a structural problem. Alerts fire on the consecutive-miss pattern, not on individual misses, because individual misses are noisy and consecutive misses are reliably bad news. A minimal version of that check is sketched at the end of this section.
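One way to compute the ratio, as a hedged sketch: expand the schedule into the slots that should have fired and count how many have a matching run. A fixed interval stands in for a real cron expression here, and the slot-matching convention is an assumption rather than any orchestrator's API:

```python
from datetime import datetime, timedelta

def expected_fire_times(start: datetime, end: datetime,
                        interval: timedelta) -> list[datetime]:
    """Expand a fixed-interval schedule into the slots that should have fired.

    A real orchestrator (Airflow, cron) would expand the actual cron expression;
    a fixed interval keeps the sketch self-contained.
    """
    slots, t = [], start
    while t < end:
        slots.append(t)
        t += interval
    return slots

def frequency_slo(scheduled: list[datetime], actual: set[datetime]) -> float:
    """Frequency SLO: fraction of scheduled slots that produced a run.

    `actual` holds the scheduled slot each observed run was serving,
    so a late-but-present run still counts toward frequency.
    """
    if not scheduled:
        return 1.0
    return sum(1 for slot in scheduled if slot in actual) / len(scheduled)

# Example: an hourly job over one week -> 168 expected slots, 160 observed runs.
week_start = datetime(2024, 1, 1)
slots = expected_fire_times(week_start, week_start + timedelta(days=7),
                            timedelta(hours=1))
observed = set(slots[:160])                      # pretend the last 8 runs never fired
print(f"{frequency_slo(slots, observed):.1%}")   # 95.2%
```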
Frequency is the second axis of batch reliability. Without it, jobs that quietly stop running stay invisible until a downstream consumer notices their data is two weeks stale.
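The consecutive-miss rule falls out of the same slot data. A minimal check, with the threshold of three as an illustrative choice rather than a recommendation:

```python
from datetime import datetime

def consecutive_misses(scheduled: list[datetime], actual: set[datetime],
                       threshold: int = 3) -> bool:
    """Return True when `threshold` or more scheduled slots in a row had no run.

    Single misses are noisy; a run of misses means the schedule itself is broken.
    """
    streak = 0
    for slot in sorted(scheduled):
        streak = streak + 1 if slot not in actual else 0
        if streak >= threshold:
            return True
    return False
```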
Output
The third axis is output validation. A job that ran on time and hit its frequency target but produced wrong output is a worse failure than one that did not run, because consumers will trust and use the bad data without realizing it is wrong.
- Output validated against schema.: Every output gets validated against its declared schema before being published. Missing columns, wrong types, null where non-null is expected. Schema-level invariants catch the obvious failures cheaply at the publish step.
- Sample correctness.: Spot-check a small sample of output rows against expected invariants: sums over partitions match the expected totals, every customer has at least one row, no row crosses a logical boundary it should not. Sampling catches semantic bugs that schema validation alone misses.
- Data quality as part of the SLO.: The output validation rate is part of the batch SLO. A job that ran on time, ran on schedule, but produced invalid output failed by SLO definition. The composite metric reflects the full chain.
- Diff against yesterday.: Compare today's output to yesterday's at the aggregate level. Row count within tolerance. Key counts within tolerance. No impossible distribution shifts. A pipeline that suddenly produces 10x or 0.1x the expected volume needs to be flagged before downstream systems consume it.
- Quarantine on validation failure.: When output validation fails, the data is held back from publication, the on-call is paged, and downstream consumers continue using the previous successful output. The quarantine is what protects consumers from acting on bad data; the sketch after this list shows the gate.
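A publish gate that bundles the schema check, the aggregate diff, and the quarantine decision can be quite small. The sketch below assumes rows arrive as plain dicts and leaves paging and publication as placeholders; the function names and thresholds are illustrative:

```python
from typing import Any

def schema_errors(rows: list[dict[str, Any]], required: dict[str, type]) -> list[str]:
    """Schema-level invariants: required columns present, non-null, correct type."""
    errors = []
    for i, row in enumerate(rows):
        for col, col_type in required.items():
            if col not in row or row[col] is None:
                errors.append(f"row {i}: missing or null '{col}'")
            elif not isinstance(row[col], col_type):
                errors.append(f"row {i}: '{col}' is not {col_type.__name__}")
    return errors

def volume_drift(today: int, yesterday: int,
                 low: float = 0.5, high: float = 2.0) -> bool:
    """Aggregate diff against yesterday: flag when today's row count falls
    outside [low, high] x yesterday's (catches the 10x / 0.1x case)."""
    if yesterday == 0:
        return today != 0
    return not (low <= today / yesterday <= high)

def publish_or_quarantine(rows: list[dict[str, Any]], yesterday_count: int,
                          required: dict[str, type]) -> str:
    """Gate the publish step: bad output is held back, and consumers keep
    reading the previous successful run's data."""
    if schema_errors(rows, required) or volume_drift(len(rows), yesterday_count):
        # In a real pipeline this is where the on-call gets paged with the
        # validation errors; here the decision itself is the whole point.
        return "quarantined"
    return "published"

# Example: a rollup with two required columns and yesterday's count of 1000.
result = publish_or_quarantine(
    [{"customer_id": "c-1", "amount_cents": 1250}],
    yesterday_count=1000,
    required={"customer_id": str, "amount_cents": int},
)
print(result)  # "quarantined" -- 1 row vs 1000 yesterday trips the volume check
```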
Duration, frequency, and output quality are the three dimensions of a batch SLO that actually reflects whether the system is doing its job. Nova AI Ops tracks all three per pipeline, computes the SLO compliance per dimension, and pages on coverage gaps and output anomalies before downstream consumers build dashboards on top of bad data.