SLOs for Streaming Systems
Streaming: throughput, lag, errors.
The three dimensions of streaming SLOs
Streaming SLOs cannot be a single “is the pipeline up” signal. Three dimensions move independently, and a healthy pipeline must meet its target on all three at once: above the throughput floor, under the lag ceiling, and under the error-rate ceiling.
- Throughput. Events processed per second per partition or shard. The SLO sets a floor below which the pipeline is failing, even if no single message is lost.
- Lag. Consumer offset versus producer offset, expressed in time. A 10-minute lag commitment serves freshness-sensitive downstream consumers; a 1-hour commitment fits batch-shaped consumers.
- Per-event errors. Events hitting dead-letter queues, failing schema validation, or dropped by a faulty consumer. This is the quality dimension that throughput and lag alone miss.
- Coupled view. A pipeline can be at full throughput, low lag, and still failing 5 percent of events to the DLQ. Track all three or you ship a half-broken stream.
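The coupled view can be sketched as one check that reports each breached dimension separately instead of collapsing them into a single up/down flag. This is a minimal illustration, not a prescribed implementation: `PipelineSnapshot`, `SloTargets`, and `breached_dimensions` are hypothetical names, and the numeric targets in the usage lines are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PipelineSnapshot:
    events_per_sec: float  # observed throughput
    lag_seconds: float     # consumer lag expressed in time
    error_rate: float      # fraction of events failing or dead-lettered


@dataclass
class SloTargets:
    min_events_per_sec: float  # throughput floor
    max_lag_seconds: float     # lag ceiling
    max_error_rate: float      # error-rate ceiling


def breached_dimensions(snap: PipelineSnapshot, targets: SloTargets) -> List[str]:
    """Return the name of every dimension that is outside its target."""
    breaches = []
    if snap.events_per_sec < targets.min_events_per_sec:
        breaches.append("throughput")
    if snap.lag_seconds > targets.max_lag_seconds:
        breaches.append("lag")
    if snap.error_rate > targets.max_error_rate:
        breaches.append("errors")
    return breaches


# Full throughput, low lag, but 5 percent of events going to the DLQ.
snapshot = PipelineSnapshot(events_per_sec=6000, lag_seconds=20, error_rate=0.05)
targets = SloTargets(min_events_per_sec=5000, max_lag_seconds=600, max_error_rate=0.01)
print(breached_dimensions(snapshot, targets))
```

A pipeline that looks healthy on throughput and lag still reports the `errors` breach here, which is the case a single health signal would hide.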
Throughput SLO mechanics
Throughput SLOs anchor to incoming traffic, not to arbitrary numbers. The floor is what the consumer must sustain to stop lag from growing.
- Window mechanics. Events processed per second over a rolling window. 99 percent of 1-minute windows must exceed the floor.
- Per-partition floor. Floors are per partition; otherwise one slow partition gets averaged away by healthy ones.
- Anchor to producer rate. If producer rate averages 5,000 events per second, the consumer floor sits at 5,000 per second plus headroom. Anything below incoming rate guarantees lag growth.
- Multi-window burn alerts. Alert on coupled 1-hour, 6-hour, and 3-day windows. Page when all three burn faster than the error budget allows; a spike confined to a single window is usually noise.
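The mechanics above can be sketched in a few small functions: derive the floor from the producer rate, measure per-partition compliance over 1-minute windows, and page only when every alert window burns faster than budget. All function names are hypothetical, and both the 10 percent headroom and the burn-rate threshold of 1.0 are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def floor_from_producer_rate(producer_rate: float, headroom: float = 0.10) -> float:
    """Consumer floor = incoming rate plus headroom (headroom value is an assumption)."""
    return producer_rate * (1.0 + headroom)


def window_compliance(rates_by_partition: Dict[int, List[float]], floor: float) -> Dict[int, float]:
    """Per partition, the fraction of 1-minute windows at or above the floor."""
    return {
        partition: sum(1 for r in rates if r >= floor) / len(rates)
        for partition, rates in rates_by_partition.items()
        if rates
    }


def burn_rate(bad_windows: int, total_windows: int, slo_target: float = 0.99) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total_windows == 0:
        return 0.0
    budget = 1.0 - slo_target
    return (bad_windows / total_windows) / budget


def should_page(counts: Dict[str, Tuple[int, int]], slo_target: float = 0.99,
                threshold: float = 1.0) -> bool:
    """Page only when every alert window burns faster than the threshold."""
    return all(burn_rate(bad, total, slo_target) > threshold
               for bad, total in counts.values())


# Per-partition view: partition 0 dips below the floor in one window of four.
rates = {0: [5200, 5100, 4900, 5300], 1: [5600, 5600, 5600, 5600]}
print(window_compliance(rates, floor=5000))

# (bad 1-minute windows, total 1-minute windows) for each alert window.
print(should_page({"1h": (3, 60), "6h": (10, 360), "3d": (50, 4320)}))
```

Keeping compliance keyed by partition preserves the slow partition that an aggregate average would hide.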
Lag SLO mechanics
Lag is the dimension operators feel directly. Express it in time, anchor it to product expectations, and watch the per-partition view.
- Time, not offsets. A lag of 1,000 offsets is meaningless without context; 10 minutes of accumulated work is meaningful. Many observability tools estimate time lag from the offset gap and the partition's processing rate.
- Pick the ceiling by downstream tolerance. Real-time dashboards need lag under 30 seconds, hourly reports under 5 minutes, nightly batch under 30 minutes. Match the SLO to the actual product need.
- Partition skew. Per-partition lag tells you whether the workload is balanced. One partition at 10-minute lag while others sit at zero means the partition key is hot.
- Recovery target. When lag breaches, define how fast it must drain. “Back under the SLO within 15 minutes” gives autoscale a clear stop condition.
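As a rough sketch of these mechanics, the functions below convert an offset gap into time, flag partitions that look hot, and estimate how long a backlog takes to drain. The names are hypothetical, the rates are assumed to be steady over the measurement period, and `drain_seconds` estimates time to an empty backlog, which is a conservative stand-in for "back under the SLO".

```python
from statistics import median
from typing import Dict, List


def time_lag_seconds(producer_offset: int, consumer_offset: int,
                     consume_rate_per_sec: float) -> float:
    """Estimate lag in seconds from the offset gap and the consume rate."""
    backlog = max(producer_offset - consumer_offset, 0)
    if backlog == 0:
        return 0.0
    if consume_rate_per_sec <= 0:
        return float("inf")  # stalled consumer: the backlog never clears
    return backlog / consume_rate_per_sec


def hot_partitions(lag_by_partition: Dict[int, float], slo_seconds: float) -> List[int]:
    """Partitions over the SLO while the median partition is within it (likely key skew)."""
    if not lag_by_partition or median(lag_by_partition.values()) > slo_seconds:
        return []  # empty input, or lag is broad rather than skewed
    return sorted(p for p, lag in lag_by_partition.items() if lag > slo_seconds)


def drain_seconds(backlog_events: int, consume_rate: float, produce_rate: float) -> float:
    """Time to clear the backlog given the surplus of consume rate over produce rate."""
    surplus = consume_rate - produce_rate
    if backlog_events <= 0:
        return 0.0
    if surplus <= 0:
        return float("inf")  # consuming no faster than producing: lag cannot shrink
    return backlog_events / surplus


print(time_lag_seconds(1_000_000, 997_000, consume_rate_per_sec=5.0))
print(hot_partitions({0: 0.0, 1: 0.0, 2: 600.0, 3: 5.0}, slo_seconds=300.0))
print(drain_seconds(600_000, consume_rate=6000, produce_rate=5000))
```

Comparing `drain_seconds` against the recovery target gives autoscaling a concrete test: if the estimate exceeds 15 minutes, the current consumer count is not enough.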
Quality SLOs for streaming
Quality is the SLO most teams skip because it is harder to measure. Skipping it is also why streaming pipelines lose data without anyone noticing.
- Per-event error rate. Schema-rejected events, deserialisation failures, and dead-letter queue arrivals all count toward the rate.
- Sample-based correctness. Take 1 percent of events, run them through a known-correct downstream, compare. The diff rate is the quality SLO.
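- Silent drops. A consumer that wraps parsing in try-except and only logs a warning looks healthy in metrics but is dropping data. Audit dead-letter rates monthly, and reconcile events consumed against events written downstream so that drops which never reach the DLQ still surface.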
- Owner per dimension. Quality belongs to whoever owns the schema; throughput and lag belong to the platform team. Without explicit ownership, quality regressions fall through the cracks.
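The two measurements above, per-event error rate and sample-based diff rate, can be sketched as follows. The function names are hypothetical. The sketch assumes the failure categories are counted disjointly (an event is not counted both as schema-rejected and as dead-lettered), and it picks the 1 percent sample by hashing an event ID so that the pipeline and the reference path select the same events.

```python
import hashlib
from typing import Any, Dict

_MISSING = object()  # sentinel so a legitimate None value is not mistaken for "absent"


def error_rate(processed_ok: int, schema_rejected: int,
               deserialisation_failed: int, dead_lettered: int) -> float:
    """Failed events over all events seen; assumes the failure counts do not overlap."""
    failed = schema_rejected + deserialisation_failed + dead_lettered
    total = processed_ok + failed
    return failed / total if total else 0.0


def in_sample(event_id: str, percent: int = 1) -> bool:
    """Deterministically select roughly `percent` percent of events by hashing the ID."""
    digest = hashlib.sha256(event_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < percent


def diff_rate(pipeline_out: Dict[str, Any], reference_out: Dict[str, Any]) -> float:
    """Fraction of reference results that the pipeline got wrong or dropped."""
    if not reference_out:
        return 0.0
    mismatches = sum(
        1 for event_id, expected in reference_out.items()
        if pipeline_out.get(event_id, _MISSING) != expected
    )
    return mismatches / len(reference_out)


print(error_rate(processed_ok=9500, schema_rejected=200,
                 deserialisation_failed=100, dead_lettered=200))
# "b" differs and "c" is missing from the pipeline output: two mismatches out of three.
print(diff_rate({"a": 1, "b": 2}, {"a": 1, "b": 3, "c": 4}))
```

Counting a missing result as a mismatch means silent drops show up in the diff rate even when no error was ever logged.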
Operating streaming SLOs
Operating the SLO is what makes it real. Without dashboards, autoscale coupling, and review cadence, the SLO is a slide.
- Per-pipeline dashboard. All three SLO panels in one place, burn rate visible, recent breaches annotated with cause.
- Autoscale coupling. When lag breaches, scale consumers. When lag recovers and stays healthy for 10 minutes, scale back. Hysteresis prevents flapping.
- Quarterly review. Streaming workloads grow, partitioning changes, downstream needs shift. The SLO that fit last year may be loose or tight today.
- Tie to budget. An SLO that consumes none of its error budget gets tightened; one that regularly burns through it gets platform attention. Treat the budget as the prioritisation signal.
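The autoscale coupling described above can be sketched as a small state machine: scale out as soon as lag breaches, and scale in only after lag has stayed healthy for the full hold period. `LagAutoscaler` and its fields are hypothetical names, and adding or removing one replica per observation is a simplification; a real controller would typically also rate-limit scale-out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LagAutoscaler:
    lag_slo_seconds: float
    healthy_hold_seconds: float = 600.0  # 10 minutes of healthy lag before scaling in
    min_replicas: int = 1
    max_replicas: int = 32
    replicas: int = 1
    _healthy_since: Optional[float] = None  # timestamp when lag last became healthy

    def observe(self, now: float, lag_seconds: float) -> int:
        """Feed one lag observation; return the desired replica count."""
        if lag_seconds > self.lag_slo_seconds:
            # Breach: reset the healthy timer and scale out immediately.
            self._healthy_since = None
            self.replicas = min(self.replicas + 1, self.max_replicas)
        elif self._healthy_since is None:
            # First healthy observation: start the hold timer, do not scale yet.
            self._healthy_since = now
        elif now - self._healthy_since >= self.healthy_hold_seconds:
            # Healthy for the whole hold period: scale in one step, restart the timer.
            self.replicas = max(self.replicas - 1, self.min_replicas)
            self._healthy_since = now
        return self.replicas


scaler = LagAutoscaler(lag_slo_seconds=600, replicas=2)
history = [
    scaler.observe(now=0, lag_seconds=900),    # breach: scale out
    scaler.observe(now=60, lag_seconds=300),   # healthy: timer starts
    scaler.observe(now=360, lag_seconds=300),  # healthy for 5 minutes: hold
    scaler.observe(now=660, lag_seconds=300),  # healthy for 10 minutes: scale in
]
print(history)
```

The asymmetry is the hysteresis: scale-out reacts to a single breach, while scale-in waits out the hold period, so a pipeline hovering near the SLO does not flap.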