SLO & Reliability Practical
By Samson Tanimawo, PhD
Published Jan 16, 2026
4 min read

SLOs on Batch Jobs

Batch jobs need duration SLOs.

Duration

Most SLO frameworks were designed around request-response services, where every call is independent and the question is "did this single call succeed?" Batch jobs do not fit that shape. A nightly ETL is one big operation that either finishes on time or does not. Per-request latency is not the interesting metric; total run duration is. The right SLO for a batch job is a duration target measured per run.

What duration SLOs look like:

Duration SLOs reframe the conversation about batch jobs from "did it work?" to "did it work in time?" The latter is what consumers actually care about and what the SLO should commit to.
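As a rough illustration of "did it work in time?", the sketch below scores each run against a duration target and reports the fraction of compliant runs. The `Run` record, the two-hour target, and the sample runs are all hypothetical, chosen only to show the shape of the calculation.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Run:
    """One completed batch run (hypothetical record shape)."""
    run_id: str
    duration: timedelta
    succeeded: bool


def duration_compliance(runs: list[Run], target: timedelta) -> float:
    """Fraction of runs that succeeded AND finished within the target."""
    if not runs:
        # No runs at all is a frequency problem; report 0.0 here by convention.
        return 0.0
    good = sum(1 for r in runs if r.succeeded and r.duration <= target)
    return good / len(runs)


# Assumed target for illustration: each nightly run finishes within 2 hours.
runs = [
    Run("mon", timedelta(hours=1, minutes=40), True),
    Run("tue", timedelta(hours=2, minutes=25), True),   # finished, but late
    Run("wed", timedelta(hours=1, minutes=55), True),
    Run("thu", timedelta(hours=1, minutes=10), False),  # fast, but failed
]
compliance = duration_compliance(runs, target=timedelta(hours=2))
print(f"duration SLO compliance: {compliance:.0%}")
```

A run counts as good only if it both succeeded and met the target, so a late success and a fast failure are treated the same way: as a miss.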

Frequency

Duration SLOs miss the failure mode where the job did not run at all. A scheduled job that simply did not fire on Tuesday produces no data, no errors, and no signal that anything is wrong. A frequency SLO, which commits to the job actually running on each scheduled occasion, catches this.

Frequency is the second axis of batch reliability. Without it, jobs that quietly stop running stay invisible until a downstream consumer notices their data is two weeks stale.
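One way to make a missing run visible is to compare the schedule against the runs that actually started. The sketch below is a minimal version of that idea; the nightly 02:00 schedule, the 30-minute grace window, and the function name `missed_runs` are assumptions for illustration, not part of any particular scheduler.

```python
from datetime import datetime, timedelta


def missed_runs(expected: list[datetime], actual_starts: list[datetime],
                grace: timedelta) -> list[datetime]:
    """Return scheduled times with no run starting within [t, t + grace]."""
    missed = []
    for t in expected:
        if not any(t <= s <= t + grace for s in actual_starts):
            missed.append(t)
    return missed


# Hypothetical nightly schedule: 02:00 each day, Monday through Sunday.
start = datetime(2026, 1, 5, 2, 0)  # a Monday
expected = [start + timedelta(days=d) for d in range(7)]

# Every run started 3 minutes after its slot, except Tuesday's, which never fired.
actual = [t + timedelta(minutes=3) for i, t in enumerate(expected) if i != 1]

missed = missed_runs(expected, actual, grace=timedelta(minutes=30))
frequency_compliance = 1 - len(missed) / len(expected)
print(f"missed: {missed}, frequency compliance: {frequency_compliance:.1%}")
```

The key design point is that the check is driven by the expected schedule rather than by job logs, so a run that never started still shows up as a miss.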

Output

The third axis is output validation. A job that ran on time and hit its frequency target but produced wrong output is a worse failure than one that did not run, because consumers will trust and use the bad data without realizing it is wrong.
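Output validation can start with a few cheap checks run before the data is published. The sketch below assumes three hypothetical checks (non-empty output, no null keys, row count within a tolerance of the recent average); real pipelines would choose checks that match their own data.

```python
def validate_output(row_count: int, null_key_count: int,
                    recent_row_counts: list[int],
                    tolerance: float = 0.5) -> list[str]:
    """Return a list of human-readable problems; empty means the output passed."""
    problems = []
    if row_count == 0:
        problems.append("output is empty")
    if null_key_count > 0:
        problems.append(f"{null_key_count} rows have a null key")
    if recent_row_counts:
        baseline = sum(recent_row_counts) / len(recent_row_counts)
        # Flag a row count that swings too far from the recent average.
        if baseline > 0 and abs(row_count - baseline) / baseline > tolerance:
            problems.append(
                f"row count {row_count} deviates more than {tolerance:.0%} "
                f"from recent average {baseline:.0f}"
            )
    return problems


# Illustrative history of row counts from the last three runs.
history = [10_200, 9_950, 10_100]
ok = validate_output(10_050, 0, history)
bad = validate_output(3_000, 12, history)
print("ok run problems:", ok)
print("bad run problems:", bad)
```

A run whose problem list is non-empty would count against the output SLO even if it finished on time and on schedule.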

Duration, frequency, and output quality are the three dimensions of a batch SLO that actually reflects whether the system is doing its job. Nova AI Ops tracks all three per pipeline, computes the SLO compliance per dimension, and pages on coverage gaps and output anomalies before downstream consumers build dashboards on top of bad data.