SLOs on Data Pipelines
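Pipelines need different SLOs than APIs: request latency and error rate say little about whether data arrived on time, in full, and correct.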
Three pipeline SLO dimensions
Data-pipeline SLOs need three dimensions, not one. Freshness captures lag; completeness captures dropped records; correctness captures wrong data. Each dimension catches a different failure mode and demands a different mechanism.
- Freshness. Data-age metric per pipeline. Critical for downstream consumers that depend on recent data.
- Completeness. Records-processed metric per pipeline. Drops indicate upstream issues or transformation bugs.
- Correctness. Sample-based output validation. Hardest to measure but the dimension that prevents silent garbage.
- Documented per-dimension targets. Explicit SLO values for every dimension of every pipeline. Honest reporting requires explicit numbers.
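One way to keep the three targets explicit is to store them as a small record per pipeline and evaluate each dimension separately. The sketch below is illustrative only: the class name `PipelineSLO`, the pipeline name `orders_hourly`, and every numeric target are assumptions, not values taken from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineSLO:
    """Explicit per-dimension targets for one pipeline (all values hypothetical)."""
    pipeline: str
    freshness_target: float       # fraction of records that must arrive in time
    freshness_threshold_s: float  # maximum acceptable data age, in seconds
    completeness_target: float    # minimum processed / expected ratio
    correctness_target: float     # minimum fraction of sampled outputs that validate


def evaluate(slo, fresh_fraction, completeness_ratio, sample_pass_rate):
    """Return pass/fail per dimension so one breach is not masked by the others."""
    return {
        "freshness": fresh_fraction >= slo.freshness_target,
        "completeness": completeness_ratio >= slo.completeness_target,
        "correctness": sample_pass_rate >= slo.correctness_target,
    }


# Hypothetical targets: 95% within 30 minutes, 99.9% complete, 99% of samples valid.
orders_slo = PipelineSLO("orders_hourly", 0.95, 30 * 60, 0.999, 0.99)
report = evaluate(orders_slo, fresh_fraction=0.97,
                  completeness_ratio=0.9985, sample_pass_rate=0.995)
print(report)  # freshness and correctness pass; completeness fails
```

Reporting the dimensions separately, rather than as one blended score, keeps a completeness shortfall visible even when freshness and correctness look healthy.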
Freshness SLO mechanics
Freshness is expressed in time. “95 percent of records arrive within 30 minutes” is the standard form: specific, comparable across pipelines, and tied to consumer need.
- Express in time. Use the “95 percent within 30 minutes” form. Specific and comparable.
- Continuous lag metric. Per-pipeline lag tracked as a standing signal. Alert when sustained lag exceeds threshold.
- Match SLO to consumer. Real-time consumers under 5 minutes; daily reports under 1 hour; nightly batches same-day.
- Burn-rate alarm. Watching the lag trend catches degrading freshness early, before the SLO is breached.
Completeness SLO mechanics
Completeness compares actual records processed to expected. Drops have many causes; the worst class is silent drops via swallowed exceptions. Dead-letter audits surface what try/except logging hides.
- Actual versus expected counts. Records-processed compared per partition. Alert on shortfall.
- Causes of incompleteness. Upstream-missing, schema-validation-drop, transformation-error, dead-letter. Each needs a different fix.
- Monthly dead-letter audit. Review the dead-letter queue (DLQ) every month. Silent drops via try/except are the worst pattern; fix at source.
- No-silent-drop rule. Explicit error path per pipeline. Catches the data loss that does not announce itself.
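The no-silent-drop rule and the actual-versus-expected check can be sketched together. This is a minimal example under stated assumptions: `run_batch`, `parse_amount`, and the in-memory dead-letter list are hypothetical stand-ins for a real pipeline stage and a real DLQ.

```python
def run_batch(records, transform):
    """Apply transform to every record; failures go to a dead-letter list, never vanish."""
    processed, dead_letters = [], []
    for record in records:
        try:
            processed.append(transform(record))
        except Exception as exc:  # broad on purpose: every failure must be accounted for
            dead_letters.append({"record": record, "error": repr(exc)})
    return processed, dead_letters


def completeness_ratio(processed_count, expected_count):
    """Actual records processed divided by the expected count for the partition."""
    if expected_count == 0:
        return 1.0  # assumption: an empty partition counts as complete
    return processed_count / expected_count


def parse_amount(record):
    """Hypothetical transformation that can fail on bad or missing input."""
    return {"id": record["id"], "amount": float(record["amount"])}


batch = [
    {"id": 1, "amount": "10.50"},
    {"id": 2, "amount": "n/a"},   # fails with ValueError
    {"id": 3, "amount": "7"},
    {"id": 4},                    # fails with KeyError
]
processed, dead_letters = run_batch(batch, parse_amount)
ratio = completeness_ratio(len(processed), expected_count=len(batch))

# Every input record is either processed or dead-lettered; nothing is lost silently.
assert len(processed) + len(dead_letters) == len(batch)
print(f"completeness={ratio:.2f} dead_letters={len(dead_letters)}")
```

The accounting check at the end is the point: the `except` branch records the failure instead of swallowing it, so a shortfall in completeness can always be traced to specific dead-lettered records.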
Operating pipeline SLOs
Operating pipeline SLOs is its own discipline: a standing dashboard, autoscaling coupled to lag, and a quarterly review against workload growth.
- Per-pipeline dashboard. All three dimensions plus burn-rate on one view. Single source of truth.
- Couple lag SLO to autoscale. Lag-breach triggers scale-up. Removes human latency from response.
- Quarterly SLO review. Workload grows; old SLOs become tight. Re-baseline against actual capacity.
- Named owner per pipeline. Responsible team named explicitly. Prevents the “everyone-and-no-one” SLO failure mode.
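Coupling the lag SLO to autoscaling could look like the sketch below. The proportional rule, the `headroom` parameter, and the function name `desired_workers` are assumptions for illustration; a real autoscaler would have its own policy, cooldowns, and scale-down logic, none of which are modeled here.

```python
import math


def desired_workers(current_workers, lag_s, lag_slo_s, max_workers, headroom=0.8):
    """Scale up in proportion to lag once it passes a fraction of the SLO threshold."""
    trigger = lag_slo_s * headroom  # act before the SLO itself is breached
    if lag_s <= trigger:
        return current_workers      # within budget: leave capacity unchanged
    scaled = math.ceil(current_workers * lag_s / trigger)
    # Add at least one worker on a breach, but never exceed the configured cap.
    return min(max_workers, max(scaled, current_workers + 1))


# Hypothetical pipeline: 30-minute lag SLO, 4 workers, capped at 20.
print(desired_workers(4, lag_s=600, lag_slo_s=1800, max_workers=20))    # no change
print(desired_workers(4, lag_s=2400, lag_slo_s=1800, max_workers=20))   # scales up
print(desired_workers(4, lag_s=14400, lag_slo_s=1800, max_workers=20))  # hits the cap
```

Triggering at a fraction of the threshold rather than at the threshold itself gives new capacity time to take effect before the SLO is breached.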