Alert Batch vs Stream
Streaming alerts fire fast; batch alerts wait for windows.
Streaming vs batch alert evaluation
Streaming and batch are distinct evaluation patterns suited to different signals. Streaming alerts evaluate on every new sample and detect problems within seconds, which suits high-cardinality real-time signals such as error rates and saturation. Batch alerts evaluate over a fixed window (5 minutes, 1 hour, 1 day), so detection latency is at least one window, which suits slower-moving signals such as daily revenue or weekly SLO budgets. Most production stacks need both.
- Streaming evaluates per sample. Detection latency in seconds; high-cardinality real-time signals.
- Batch evaluates per window. 5 min, 1 hour, 1 day; latency is at least one window.
- Different signal types. Streaming: error rates, saturation; batch: daily revenue, weekly SLO.
- Most stacks need both. Not one or the other; the patterns coexist.
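The difference in detection latency can be made concrete with a minimal sketch. The function names, the `(timestamp, value)` sample shape, and the use of a window mean are assumptions for illustration, not any monitoring system's API.

```python
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, observed value)


def streaming_alerts(samples: Iterable[Sample], threshold: float) -> List[float]:
    """Evaluate every sample as it arrives; return the timestamps that breach."""
    return [ts for ts, value in samples if value > threshold]


def batch_alerts(samples: Iterable[Sample], window_s: float, threshold: float) -> List[float]:
    """Evaluate the mean of each fixed window; return the end times of breaching windows.

    A window can only be judged after it closes, so the alert time is the
    window end, not the time of the first offending sample.
    """
    buckets: Dict[int, List[float]] = {}
    for ts, value in samples:
        buckets.setdefault(int(ts // window_s), []).append(value)

    fired = []
    for index in sorted(buckets):
        values = buckets[index]
        if sum(values) / len(values) > threshold:
            fired.append((index + 1) * window_s)
    return fired
```

With 30-second samples and an error-rate spike that starts at t=300, the streaming check flags the first bad sample at t=300, while a 300-second batch check over the same data cannot fire until that window closes at t=600.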
When streaming wins
Streaming wins when the latency budget is tight. Three cases stand out: user-facing latency (a p99 above 200ms sustained for 5 minutes should not wait for a 1-hour batch window); revenue impact (checkout error rate spikes need detection in under 60 seconds, not at the next hourly job); and high-cardinality services, where waiting compounds the blast radius. Prometheus rules evaluated every 30 seconds are a common streaming setup; Prometheus's built-in default for evaluation_interval is 1m, so a 30s interval has to be configured explicitly.
- User-facing latency. p99 > 200ms for 5 minutes; can’t wait for 1-hour batch.
- Revenue impact. Checkout error rate spikes; detection in under 60 seconds.
- High-cardinality services. Waiting compounds blast radius; streaming default.
- 30s Prometheus evaluation. A common streaming-rule interval (the built-in default is 1m); tunable per rule group.
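A condition such as "p99 above 200ms for 5 minutes" is evaluated on every sample but only fires once the breach has been sustained. The sketch below shows one way to express that; the class name `SustainedThreshold` and its parameters are hypothetical, and it only loosely mirrors how a hold duration on a streaming rule behaves.

```python
from typing import Optional


class SustainedThreshold:
    """Fires once a value has stayed above `threshold` for at least `hold_s` seconds."""

    def __init__(self, threshold: float, hold_s: float) -> None:
        self.threshold = threshold
        self.hold_s = hold_s
        self._breach_start: Optional[float] = None  # when the current breach began

    def observe(self, ts: float, value: float) -> bool:
        """Feed one sample; return True if the alert is firing at this sample."""
        if value <= self.threshold:
            self._breach_start = None  # condition cleared, reset the timer
            return False
        if self._breach_start is None:
            self._breach_start = ts
        return ts - self._breach_start >= self.hold_s
```

Feeding this a p99 of 250ms every 30 seconds with `threshold=200.0` and `hold_s=300.0` yields the first firing sample 300 seconds after the breach began; a single sample back under the threshold resets the timer.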
When batch wins
Batch wins when the signal is slow-moving. Daily and weekly aggregates (SLO burn over 30 days is a batch query by definition); cost-driven signals (cloud spend and data warehouse query cost are reported daily and best alerted on daily); anomaly detection on long windows (STL or Prophet models need batches large enough to fit a seasonal component).
- Daily/weekly aggregates. SLO burn over 30 days is batch by definition.
- Cost-driven signals. Cloud spend, warehouse query cost; reported daily, alerted daily.
- Long-window anomaly detection. STL or Prophet need batches large enough to fit a seasonal component.
- Per-signal cadence. The signal’s natural rhythm drives the evaluation cadence.
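A daily cost alert is a natural batch job: aggregate the day, then compare it with a trailing baseline. The sketch below assumes cost records arrive as `(date, cost)` pairs with no missing days; the 7-day median baseline and the 1.5x ratio are illustrative choices, not values from the text.

```python
from collections import defaultdict
from datetime import date
from statistics import median
from typing import Dict, Iterable, List, Tuple


def daily_totals(records: Iterable[Tuple[date, float]]) -> Dict[date, float]:
    """Sum individual cost records into one total per day."""
    totals: Dict[date, float] = defaultdict(float)
    for day, cost in records:
        totals[day] += cost
    return dict(totals)


def spend_alerts(totals: Dict[date, float], lookback_days: int = 7,
                 ratio: float = 1.5) -> List[date]:
    """Flag days whose total exceeds `ratio` times the median of the prior days.

    Assumes one entry per consecutive day; days without a full lookback
    window are skipped rather than judged against a partial baseline.
    """
    days = sorted(totals)
    flagged = []
    for i in range(lookback_days, len(days)):
        baseline = median(totals[d] for d in days[i - lookback_days:i])
        if totals[days[i]] > ratio * baseline:
            flagged.append(days[i])
    return flagged
```

Run once per day after the cost report lands, this evaluates at the signal's own cadence: there is nothing to gain from checking it every 30 seconds.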
Hybrid pattern
Burn-rate SLO alerts use both. Fast burn (5m / 1h windows) triggers a page and slow burn (6h / 1d windows) opens a ticket: same SLO, two evaluation cadences. More generally, stream the leading indicator and batch the lagging metric, so that streaming alerts catch the spike and batch alerts confirm the trend. Thanos or Mimir support long-window batch queries against the same Prometheus data the streaming rules use.
- Burn-rate dual cadence. Fast 5m/1h pages; slow 6h/1d tickets; same SLO.
- Stream leading, batch lagging. Streaming catches spike; batch confirms trend.
- Thanos or Mimir. Long-window batch queries on the same Prometheus data.
- Per-SLO hybrid config. Document both cadences for each SLO; this supports investigation when either one fires.
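The dual cadence can be sketched as a burn-rate check over the four windows named above. Burn rate here is the observed error ratio divided by the error budget (1 minus the SLO target). The function names and the thresholds of 14.4 and 3.0 are illustrative assumptions, not values taken from the text, and the error ratios per window are assumed to be computed elsewhere.

```python
from typing import Dict


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def classify(error_ratios: Dict[str, float], slo_target: float,
             page_threshold: float = 14.4, ticket_threshold: float = 3.0) -> str:
    """Return "page", "ticket", or "ok" for one SLO.

    `error_ratios` maps the window labels "5m", "1h", "6h", "1d" to the
    observed error ratio over that window. Both windows of a pair must
    exceed the threshold, so a brief blip in the short window alone does
    not escalate.
    """
    rates = {window: burn_rate(ratio, slo_target)
             for window, ratio in error_ratios.items()}
    if rates["5m"] > page_threshold and rates["1h"] > page_threshold:
        return "page"
    if rates["6h"] > ticket_threshold and rates["1d"] > ticket_threshold:
        return "ticket"
    return "ok"
```

For a 99.9% SLO, error ratios of 2% over 5m and 1.6% over 1h correspond to burn rates of about 20 and 16, which this sketch classifies as a page; a steady 0.35% to 0.4% over 6h and 1d is classified as a ticket.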
Pick by latency budget
The decision is driven by the latency budget. If a detection delay of more than 5 minutes is unacceptable, you need streaming; otherwise batch is cheaper and quieter. Avoid streaming evaluation on signals that change less than once per hour, because the query load is wasted. Typical streaming intervals are 30 seconds for Prometheus (whose built-in default is 1m) and 60 seconds for Datadog; batch runs hourly or daily depending on the signal.
- 5-minute latency threshold. If a delay over 5 minutes is unacceptable: streaming; if it is tolerable: batch.
- No streaming for signals that change less than hourly. Query load is wasted; batch is fine.
- 30s Prometheus, 60s Datadog. Typical streaming intervals (Prometheus's built-in default is 1m); tunable per signal.
- Per-signal documented cadence. The choice committed to the rule config; supports investigation.
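The rule of thumb above can be written down as a small helper so the choice is explicit and reviewable. The function name and parameters are hypothetical, and checking the change rate before the latency budget is an assumption about precedence that the text does not spell out.

```python
def choose_cadence(max_detection_delay_s: float, changes_per_hour: float) -> str:
    """Pick "streaming" or "batch" for one signal.

    - Signals that change less than once per hour go to batch, since
      frequent evaluation would mostly re-read the same value.
    - Otherwise, a tolerated delay of 5 minutes (300 s) or less means
      streaming; anything looser defaults to batch.
    """
    if changes_per_hour < 1:
        return "batch"
    if max_detection_delay_s <= 300:
        return "streaming"
    return "batch"
```

A checkout error rate with a 60-second budget comes out as streaming, while daily cloud spend comes out as batch regardless of the budget written next to it.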