By Samson Tanimawo, PhD. Published Aug 1, 2025.

SLOs for Streaming Systems

Streaming SLOs cover three things: throughput, lag, and per-event errors.

The three dimensions of streaming SLOs

Throughput. Events processed per second per partition or shard. The SLO sets a floor below which the pipeline is failing its commitment, even if no single message is lost.

Lag. Consumer offset versus producer offset, expressed in time. A 10-minute lag commitment is meaningful for downstream consumers that depend on freshness; a 1-hour lag commitment suits batch-shaped consumers.

Per-event errors. Events that hit dead-letter queues, fail schema validation, or get dropped by a faulty consumer. This is the quality dimension that throughput and lag alone miss.

Throughput SLO mechanics

Measure throughput as events processed per second over a rolling window. 99% of 1-minute windows must exceed the floor. Per-partition floors prevent one slow partition from being averaged away.
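A minimal sketch of that compliance check, assuming you already have per-partition event counts for each 1-minute window. The function names, the input shape, and the choice to treat a window exactly at the floor as passing are illustrative assumptions, not part of any particular tool.

```python
def throughput_compliance(window_counts, floor_per_sec, window_seconds=60):
    """Return, per partition, the fraction of windows that met the floor.

    window_counts: dict of partition id -> list of event counts, one per window
    (hypothetical input shape; adapt to whatever your metrics store returns).
    """
    result = {}
    for partition, counts in window_counts.items():
        if not counts:
            continue  # no data: leave the partition out rather than guess
        good = sum(1 for c in counts if c / window_seconds >= floor_per_sec)
        result[partition] = good / len(counts)
    return result


def breaching_partitions(compliance, target=0.99):
    """Partitions whose share of good windows is below the SLO target."""
    return sorted(p for p, frac in compliance.items() if frac < target)


# Example: p1 misses the 1,000 events/sec floor in 2 of 100 windows.
counts = {"p0": [60_000] * 100, "p1": [60_000] * 98 + [30_000] * 2}
compliance = throughput_compliance(counts, floor_per_sec=1_000)
slow = breaching_partitions(compliance)
```

Evaluating each partition separately is what keeps p1 from being hidden by a healthy average across the topic.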

Set the floor based on incoming traffic, not arbitrary numbers. If the producer rate averages 5k events/sec, the consumer floor should be at least 5k events/sec plus headroom. When consumers process below the incoming rate, lag grows.
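The arithmetic behind this can be sketched in a few lines. The 20% headroom below is an assumed example value, and both function names are hypothetical.

```python
def consumer_floor(producer_rate_per_sec, headroom=0.2):
    """Throughput floor: the incoming rate plus a headroom fraction (assumed 20%)."""
    return producer_rate_per_sec * (1 + headroom)


def backlog_after(producer_rate_per_sec, consumer_rate_per_sec, seconds,
                  initial_backlog=0.0):
    """Unprocessed events after `seconds`, assuming both rates stay constant."""
    growth = (producer_rate_per_sec - consumer_rate_per_sec) * seconds
    return max(0.0, initial_backlog + growth)


floor = consumer_floor(5_000)                        # floor for a 5k events/sec producer
backlog = backlog_after(5_000, 4_500, seconds=600)   # consumer 10% too slow for 10 minutes
lag_seconds = backlog / 5_000                        # backlog expressed as time at the producer rate
```

Under these assumptions, a consumer that is only 500 events/sec short accumulates 300,000 events, or one minute of work, in ten minutes.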

Alert when the SLO is at risk, not on every slow window. Use multi-window burn-rate alerts over 1-hour, 6-hour, and 3-day windows, and alert when all three are burning faster than the budget allows.
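One way to sketch the rule, assuming a 99% target and counting "bad" 1-minute windows in each lookback period. Burn rate here is the observed bad fraction divided by the error budget, so a value above 1.0 means the budget is being spent faster than allowed. The names and input shape are illustrative.

```python
def burn_rate(bad_windows, total_windows, slo_target=0.99):
    """Observed bad fraction divided by the error budget (1 - target)."""
    if total_windows == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_windows / total_windows) / error_budget


def should_alert(window_stats, slo_target=0.99, threshold=1.0):
    """Fire only when every lookback period burns faster than the threshold.

    window_stats: dict of label -> (bad_windows, total_windows), one entry per
    lookback period (hypothetical shape).
    """
    if not window_stats:
        return False  # no data is not evidence of a breach
    return all(
        burn_rate(bad, total, slo_target) > threshold
        for bad, total in window_stats.values()
    )


# 1-minute windows: 60 in an hour, 360 in six hours, 4,320 in three days.
sustained = {"1h": (3, 60), "6h": (9, 360), "3d": (50, 4320)}
brief_spike = {"1h": (3, 60), "6h": (9, 360), "3d": (20, 4320)}
```

With these numbers `sustained` alerts and `brief_spike` does not, because the 3-day window is still within budget.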

Lag SLO mechanics

Express lag in time units, not raw offsets. A lag of 1k offsets means little on its own; 10 minutes of accumulated work is meaningful. Many observability tools compute time lag from the offset lag and the partition's event rate.
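A rough sketch of the conversion, assuming "partition rate" means the rate at which events arrive on the partition. Real tools may instead use message timestamps or a smoothed rate; this function is a hypothetical illustration, not any tool's API.

```python
def time_lag_seconds(producer_offset, consumer_offset, partition_rate_per_sec):
    """Approximate time lag: offsets behind divided by the partition's event rate."""
    offset_lag = max(0, producer_offset - consumer_offset)
    if offset_lag == 0:
        return 0.0
    if partition_rate_per_sec <= 0:
        # Backlog exists but nothing is flowing: the time lag is unbounded.
        return float("inf")
    return offset_lag / partition_rate_per_sec


# The same 1k-offset lag is 1 second on a busy partition, 500 seconds on a quiet one.
busy = time_lag_seconds(1_000, 0, partition_rate_per_sec=1_000)
quiet = time_lag_seconds(1_000, 0, partition_rate_per_sec=2)
```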

Pick the lag ceiling by what downstream tolerates. Real-time dashboards: under 30 seconds. Hourly reports: under 5 minutes. Nightly batch: under 30 minutes. Match the SLO to the actual product need.

Watch for partition skew. Per-partition lag tells you whether the workload is balanced. If one partition is at 10-minute lag while others are at 0, the partition key is hot.
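A simple skew check might compare each partition's lag with the median across partitions. The 60-second minimum and the 5x ratio are assumed thresholds chosen only for illustration.

```python
from statistics import median


def hot_partitions(lag_seconds_by_partition, min_lag_seconds=60.0, ratio=5.0):
    """Partitions lagging far behind the median partition.

    A partition is flagged when its lag is at least `min_lag_seconds` and more
    than `ratio` times the median lag (both thresholds are assumptions).
    """
    if not lag_seconds_by_partition:
        return []
    typical = median(lag_seconds_by_partition.values())
    return sorted(
        p for p, lag in lag_seconds_by_partition.items()
        if lag >= min_lag_seconds and lag > ratio * typical
    )


skewed = hot_partitions({"p0": 0, "p1": 0, "p2": 600, "p3": 1})
uniform = hot_partitions({"p0": 600, "p1": 610, "p2": 590})
```

In the second case every partition is behind by a similar amount, which points at overall capacity rather than a hot key, so nothing is flagged.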

Quality SLOs for streaming

Per-event error rate is harder to measure but matters more. Schema-rejected events, deserialisation failures, and dead-letter queue arrivals all count.

Use sample-based correctness checks for high-volume streams. Take 1% of events, run them through a known-correct downstream, and compare the results with what the pipeline produced. The diff rate is the quality SLO.
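A sketch of how the sampling and comparison could look. Hashing the event ID to choose the sample is an assumption made here so that both paths select the same events; `pipeline_fn` and `reference_fn` are hypothetical stand-ins for the production path and the known-correct one.

```python
import hashlib


def in_sample(event_id, sample_percent=1.0):
    """Deterministically select roughly `sample_percent` of events by hashing the ID."""
    digest = hashlib.sha256(event_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets of 0.01%
    return bucket < sample_percent * 100


def diff_rate(sampled_events, pipeline_fn, reference_fn):
    """Fraction of sampled events where the pipeline and the reference disagree."""
    sampled = 0
    mismatches = 0
    for event in sampled_events:
        sampled += 1
        if pipeline_fn(event) != reference_fn(event):
            mismatches += 1
    return mismatches / sampled if sampled else 0.0


# Toy example: the "pipeline" mishandles every tenth event.
reference = lambda e: e * 2
pipeline = lambda e: e * 2 if e % 10 else -1
rate = diff_rate(range(100), pipeline, reference)
```

A deterministic sample has the side benefit that a mismatch can be replayed later, since the same event IDs are selected on every run.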

Don't forget silent drops. A consumer that wraps parsing in a try/except and only logs a warning on parse errors looks healthy in metrics but is dropping data. Audit dead-letter rates monthly.
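One way to make such drops visible, sketched with in-memory stand-ins: count every parse failure and route the raw message to a dead-letter list instead of only logging. The `consume` function, the `metrics` counter, and the list-based dead-letter queue are all hypothetical placeholders for a real client and metrics library.

```python
import json
from collections import Counter


def consume(raw_messages, handle, dead_letters, metrics):
    """Process messages; failures are counted and dead-lettered, never swallowed."""
    for raw in raw_messages:
        metrics["received"] += 1
        try:
            event = json.loads(raw)
        except json.JSONDecodeError as exc:
            metrics["parse_errors"] += 1
            dead_letters.append({"raw": raw, "error": str(exc)})
            continue
        handle(event)
        metrics["processed"] += 1


def error_rate(metrics):
    """Per-event error rate: parse failures over everything received."""
    received = metrics["received"]
    return metrics["parse_errors"] / received if received else 0.0


metrics, dead_letters, handled = Counter(), [], []
consume(['{"a": 1}', "not json", '{"b": 2}'], handled.append, dead_letters, metrics)
```

After a run, `received` should equal `processed` plus `parse_errors`; a gap between them is a sign that events are being lost somewhere uncounted.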

Operating streaming SLOs

Give each pipeline its own dashboard with all three SLO panels. Keep the burn rate visible, and list recent SLO breaches with their cause.

Couple the lag SLO to autoscaling. When lag breaches, scale consumers out. When lag recovers and stays healthy for 10 minutes, scale back in. Use hysteresis to avoid flapping.
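The scaling rule can be sketched as a small state machine. The 10-minute hold comes from the text; the two lag thresholds, the replica limits, and the one-replica step size are assumptions chosen for illustration, and `LagAutoscaler` is a hypothetical class rather than any orchestrator's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LagAutoscaler:
    scale_up_lag_s: float = 60.0     # assumed breach threshold
    scale_down_lag_s: float = 20.0   # assumed "healthy" threshold, below the breach one
    healthy_hold_s: float = 600.0    # lag must stay healthy for 10 minutes
    min_replicas: int = 1
    max_replicas: int = 10
    replicas: int = 1
    healthy_since: Optional[float] = None

    def observe(self, now_s: float, lag_s: float) -> int:
        """Feed one lag reading; return the desired replica count."""
        if lag_s > self.scale_up_lag_s:
            # Breach: add a replica and forget any healthy streak.
            self.healthy_since = None
            self.replicas = min(self.max_replicas, self.replicas + 1)
        elif lag_s < self.scale_down_lag_s:
            if self.healthy_since is None:
                self.healthy_since = now_s
            elif now_s - self.healthy_since >= self.healthy_hold_s:
                self.replicas = max(self.min_replicas, self.replicas - 1)
                self.healthy_since = now_s  # restart the hold before the next step down
        else:
            # Between the two thresholds: hold steady and reset the streak.
            self.healthy_since = None
        return self.replicas


scaler = LagAutoscaler(replicas=2)
history = [scaler.observe(t, lag) for t, lag in
           [(0, 120), (60, 10), (300, 10), (660, 10), (700, 40), (720, 10)]]
```

The gap between the two thresholds and the hold timer together provide the hysteresis: a reading that hovers near the breach threshold neither adds nor removes replicas.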

Quarterly review: is the SLO still right? Streaming workloads grow, partitioning changes, downstream needs shift. The SLO that fit last year may be too loose or too tight today.