SLOs on Data Pipelines
Pipelines need different SLOs than APIs: request latency and availability do not capture batch or streaming behavior.
Three pipeline SLO dimensions
Freshness. How old is the data the pipeline produced? Critical for downstream consumers that depend on recent data.
Completeness. Did the pipeline process all expected records? Drops indicate upstream issues or transformation bugs.
Correctness. Sample-based: pick a small subset of inputs, verify the outputs match expected values. Hardest to measure but most important.
Freshness SLO mechanics
Express in time: 95% of partitions arrive within 30 minutes of source. Specific; comparable across pipelines.
Per-pipeline lag tracked continuously. Alert when sustained lag exceeds threshold.
Match SLO to downstream needs. Real-time dashboards: under 5 minutes. Daily reports: under 1 hour. Nightly batches: same day.
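A freshness SLI of the form "95% of partitions arrive within 30 minutes" reduces to counting how many per-partition lags fall under the threshold. A minimal sketch, assuming lags are already measured in minutes per partition:

```python
def freshness_sli(arrival_lags_min, threshold_min=30.0):
    # arrival_lags_min: per-partition lag in minutes between source
    # event time and the pipeline producing the partition.
    # Returns the fraction of partitions within the freshness threshold.
    within = sum(1 for lag in arrival_lags_min if lag <= threshold_min)
    return within / len(arrival_lags_min)

# SLO check: at least 95% of partitions within 30 minutes.
def freshness_slo_met(arrival_lags_min, threshold_min=30.0, target=0.95):
    return freshness_sli(arrival_lags_min, threshold_min) >= target
```

The same function works for any tier: pass threshold_min=5 for real-time dashboards or threshold_min=60 for daily reports.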
Completeness SLO mechanics
Expected record counts per partition or run. Compare actual to expected; alert on shortfall.
Causes of incompleteness: upstream missing data, schema validation drops, transformation errors, dead-letter queue arrivals.
Audit dead-letter rates monthly. Silent drops, where a try/except swallows records and merely logs, are the worst pattern; fix them at the source.
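Comparing actual to expected counts and alerting on shortfall can be sketched like this; the function name, return shape, and default tolerance are illustrative assumptions:

```python
def completeness_check(actual_count, expected_count, tolerance=0.001):
    # Compare the records a run actually processed against the expected
    # count for that partition. A shortfall beyond the tolerated fraction
    # (here 0.1% by default) is an SLO breach worth alerting on.
    shortfall = max(0, expected_count - actual_count) / expected_count
    return {
        "ratio": actual_count / expected_count,
        "breach": shortfall > tolerance,
    }
```

Expected counts can come from upstream row counts, source-system watermarks, or a forecast from recent history; whichever source is used, it must be independent of the pipeline being measured.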
Operating pipeline SLOs
Per-pipeline dashboard with all three dimensions plus burn-rate.
Couple lag SLO to autoscale where possible. Lag breach triggers consumer scale-up.
Quarterly review. Pipeline workloads grow; SLOs that fit last year may be too tight today.
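The burn-rate figure on the dashboard is the ratio of the observed error rate to the error budget over a window. A minimal sketch, assuming event counts are available per window:

```python
def burn_rate(bad_events, total_events, slo_target=0.95):
    # Burn rate over a measurement window: the observed SLI-violation
    # rate divided by the error budget (1 - target). A value of 1.0
    # means the budget is being consumed exactly at the provisioned
    # rate; 2.0 means it will be exhausted in half the SLO period.
    error_budget = 1.0 - slo_target
    observed_error_rate = bad_events / total_events
    return observed_error_rate / error_budget
```

Alerting on burn rate rather than raw error rate makes one threshold meaningful across pipelines with different SLO targets.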