Pipeline Freshness as an SLO
Data freshness is a contract.
Define
Data pipelines fail differently from request-response services. They rarely return errors. They go quiet, they run slow, they produce stale outputs, and downstream consumers keep reading from yesterday's data without realizing anything is wrong. The fix is to treat freshness as a first-class SLO: a published, measured, alertable contract about how recent the data is allowed to be.
What a freshness SLO looks like in practice:
- Specific window: "95% of hourly partitions arrive within 30 minutes of the partition's logical end time." Not "data is generally fresh." The numerator is partitions on time, the denominator is partitions due, and the threshold is in minutes (see the sketch after this list). Every term is measurable and every term means something to a downstream consumer.
- Per dataset, not per platform: Customer event streams, billing rollups, ML feature stores, search indices, derived analytics tables. Each one has its own freshness contract because each one has its own consumers and its own tolerable lag. A platform-wide freshness SLO is too coarse to be useful.
- Tied to consumer impact: The window comes from "how stale can this be before it hurts the user," not "how fast can the pipeline plausibly run." A search index where 2-hour staleness is invisible can have a 4-hour SLO. A real-time fraud signal cannot.
- Documented in the catalog: Every dataset's freshness SLO is published in the data catalog next to its schema. Consumers see it before they integrate. This is the difference between freshness as a discipline and freshness as an oral tradition.
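To make the terms concrete, here is a minimal sketch of the SLO as data plus the compliance calculation: partitions on time over partitions due, measured against the window. The FreshnessSLO and Partition names are illustrative, not a specific catalog or monitoring API.

```python
# Sketch only: a per-dataset freshness SLO and its compliance calculation.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class FreshnessSLO:
    dataset: str
    max_lag: timedelta   # e.g. 30 minutes after the partition's logical end time
    target: float        # e.g. 0.95 -> 95% of due partitions arrive on time

@dataclass
class Partition:
    logical_end: datetime            # the "for" timestamp the partition covers
    committed_at: datetime | None    # None if it never arrived

def compliance(slo: FreshnessSLO, due: list[Partition]) -> float:
    """Fraction of due partitions that arrived within the SLO window."""
    on_time = sum(
        1 for p in due
        if p.committed_at is not None
        and p.committed_at - p.logical_end <= slo.max_lag
    )
    return on_time / len(due) if due else 1.0

# "95% of hourly partitions arrive within 30 minutes of the partition's
# logical end time" -- dataset name is hypothetical.
billing_slo = FreshnessSLO("billing_rollups_hourly", timedelta(minutes=30), 0.95)
```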
The act of writing the SLO is half the value. It forces the conversation about what staleness actually means for each dataset, which is a conversation most teams have never had explicitly.
Monitor
Once the freshness SLO is published, the pipeline needs continuous instrumentation against it. The biggest difference from a service SLO is that the signal is lag, not error rate, and lag has to be measured against the dataset's logical clock, not wall-clock alone.
- Per-pipeline lag: Track the gap between the dataset's logical end time (the partition's "for" timestamp) and the moment that partition was actually written and committed (see the sketch after this list). This is the only number that matters. Wall-clock duration of the job is a proxy at best.
- Continuous, not just on success: The metric updates whenever a partition is expected, whether or not the pipeline has produced it. A pipeline that has not run for 4 hours but is "supposed to" run hourly is silently 4 hours stale, and the freshness metric must reflect that even when no job is failing.
- Per-stage breakdown: When a pipeline goes slow, the lag should attribute to a stage (extract, transform, load, validate, publish). This is what makes the metric actionable. "Pipeline X is stale" is an alert. "Pipeline X is stale because the validate stage took 3x normal" is a fix.
- Cross-pipeline rollup: Publish a freshness dashboard that lists every dataset, its current lag, its SLO, and whether it is inside or outside that SLO. This is the data team's equivalent of a service status page. Consumers come here when they suspect something is stale and the dashboard answers in one glance.
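A minimal sketch of the two lag numbers, assuming an hourly dataset and a generic metrics gauge: per-partition commit lag for the SLO, and a continuous staleness gauge that keeps growing even when no job runs. The emit() helper is a placeholder, not any particular metrics library.

```python
# Sketch only: two freshness signals per dataset.
#   commit_lag - per partition, how long after its logical end time it landed
#   staleness  - right now, how far behind the logical clock the dataset is;
#                this grows with wall clock even when no job runs at all
from datetime import datetime, timedelta, timezone

def commit_lag(logical_end: datetime, committed_at: datetime) -> timedelta:
    """Lag of a single partition against the dataset's logical clock."""
    return committed_at - logical_end

def staleness(latest_committed_logical_end: datetime,
              now: datetime | None = None) -> timedelta:
    """Distance between wall clock and the newest committed partition.
    A pipeline that 'should' run hourly but has been quiet for 4 hours
    shows roughly 4-5 hours here, even though no job reported a failure."""
    now = now or datetime.now(timezone.utc)
    return now - latest_committed_logical_end

def emit(dataset: str, metric: str, value: timedelta) -> None:
    # Placeholder for whatever gauge/metrics backend is in use.
    print(f"{dataset} {metric} {value.total_seconds():.0f}s")

# Example: newest committed partition covers up to 09:00 UTC; at 13:10 UTC
# the staleness gauge reads ~4h10m. Dataset name is hypothetical.
emit("customer_events_hourly", "staleness_seconds",
     staleness(datetime(2024, 6, 1, 9, 0, tzinfo=timezone.utc),
               now=datetime(2024, 6, 1, 13, 10, tzinfo=timezone.utc)))
```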
Continuous lag tracking is what turns freshness from "we'll find out tomorrow when the report looks weird" into a known, monitored property of the pipeline.
Alert
The alerting layer on freshness is harder to get right than it looks. The two failure modes both bite: alerts that fire on every short delay (noise that gets ignored) and alerts that only fire on hard failure (silence while data drifts staler over weeks).
- Page on sustained breach, not transient jitter: If a partition is 31 minutes late once and then catches up, that is normal jitter. If three partitions in a row are over the window, that is a pattern and worth waking someone for. Alert thresholds should require a sustained breach measured in partitions, not a single late one (see the sketch after this list).
- Alert on slow drift, not just failure: The most dangerous freshness failure is not a pipeline that broke. It is one that has been getting 5% slower every week for two months. Wire a long-window comparator (this week's median lag vs. the last 4 weeks) so a drift toward the SLO boundary is caught before it crosses the threshold.
- Differentiate "missed" from "late": A partition that arrives 45 minutes late is one signal. A partition that did not arrive at all is a different one with different urgency. Both burn freshness budget, but the response is different. Categorize them in the alert metadata.
- Notify the consumers, not just the producers: When a freshness SLO breaches, the team that owns the pipeline needs to know, and so do the downstream teams that read from it. Half of the value of the SLO is letting consumers make informed decisions when their inputs are stale.
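A sketch of the three alert conditions above, with illustrative thresholds rather than recommendations: a sustained breach over consecutive partitions, a slow-drift comparator against prior weeks, and a missed-vs-late classification for the alert metadata.

```python
# Sketch only: alert conditions over per-partition lag measurements.
from datetime import timedelta
from statistics import median

def sustained_breach(recent_lags: list[timedelta],
                     max_lag: timedelta,
                     consecutive: int = 3) -> bool:
    """Page only if the last `consecutive` partitions were all over the window."""
    if len(recent_lags) < consecutive:
        return False
    return all(lag > max_lag for lag in recent_lags[-consecutive:])

def slow_drift(this_week: list[timedelta],
               prior_weeks: list[timedelta],
               ratio: float = 1.25) -> bool:
    """Warn if this week's median lag is well above the prior weeks' median,
    i.e. the pipeline is drifting toward the SLO boundary without breaching yet."""
    if not this_week or not prior_weeks:
        return False
    this = median(lag.total_seconds() for lag in this_week)
    base = median(lag.total_seconds() for lag in prior_weeks)
    return base > 0 and this > ratio * base

def partition_status(committed: bool, lag: timedelta, max_lag: timedelta) -> str:
    """Alert metadata: 'missed' (never arrived) vs 'late' (arrived over window)."""
    if not committed:
        return "missed"
    return "late" if lag > max_lag else "on_time"
```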
Freshness alerts done right catch drift before it becomes an incident, and let consumers route around stale data instead of building reports on top of it. Nova AI Ops watches partition lag per dataset, computes the freshness SLO and burn rate continuously, and pages on sustained breach or slow drift before the staleness shows up in someone's quarterly board deck.