Alerting: Batch vs Stream
Streaming alerts fire fast; batch alerts wait for windows.
Streaming vs batch alert evaluation
Streaming alerts evaluate continuously, on every new sample or on a short fixed interval. Detection latency is seconds. Use them for high-cardinality, real-time signals like error rates and saturation.
Batch alerts evaluate on a fixed window: 5 minutes, 1 hour, 1 day. Detection latency is at least one full window. Use them for slower-moving signals like daily revenue or weekly SLO budgets.
Both are valid. Most production stacks need both, not one or the other.
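In Prometheus terms the split is just the rule group's evaluation interval. A minimal sketch of the two shapes, assuming hypothetical metrics http_requests_total and revenue_usd_total:

```yaml
groups:
  - name: streaming-rules
    interval: 30s                # evaluated every 30 seconds
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 2m                  # must hold for 2 minutes before firing
        labels:
          severity: page
  - name: batch-rules
    interval: 1h                 # evaluated once per hour
    rules:
      - alert: DailyRevenueLow
        expr: sum(increase(revenue_usd_total[1d])) < 50000
        labels:
          severity: ticket
```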
When streaming wins
User-facing latency. p99 latency above 200ms for 5 minutes should not wait for a 1-hour batch window (rule sketch after this list).
Revenue impact. Checkout error rate spikes need detection in under 60 seconds, not at the next hourly job.
High-cardinality services where waiting compounds blast radius. Streaming Prometheus rules with 30s evaluation are the default.
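The p99 case from the first item as a streaming rule; a sketch assuming a hypothetical histogram metric http_request_duration_seconds_bucket on the checkout service:

```yaml
groups:
  - name: checkout-streaming
    interval: 30s                       # streaming cadence
    rules:
      - alert: CheckoutP99LatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket{service="checkout"}[5m]))
          ) > 0.2
        for: 5m                         # matches "above 200ms for 5 minutes"
        labels:
          severity: page
```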
When batch wins
Daily and weekly aggregates. SLO burn over 30 days is a batch query by definition (sketch after this list).
Cost-driven signals. Cloud spend and data warehouse query cost are reported daily, so daily is the right alerting cadence.
Anomaly detection on long windows. STL or Prophet models need enough history to fit a seasonal component, which forces batch evaluation.
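The 30-day burn from the first item as a batch rule. A sketch assuming a 99.9% availability SLO and enough retention (via Thanos or Mimir) to serve a 30d range:

```yaml
groups:
  - name: slo-batch
    interval: 1h                        # batch cadence: one evaluation per hour
    rules:
      - alert: ErrorBudget30dNearlyGone
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30d]))
            / sum(rate(http_requests_total[30d]))
          > (0.9 * 0.001)               # >90% of a 99.9% SLO's error budget
        labels:
          severity: ticket
```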
Hybrid pattern
Burn-rate SLO alerts use both. Fast burn (5m / 1h windows) triggers a page; slow burn (6h / 1d) opens a ticket. Same SLO, two evaluation cadences (sketched below).
Stream the leading indicator, batch the lagging metric. Streaming alerts catch the spike; batch alerts confirm the trend.
Use Thanos or Mimir for long-window batch queries against the same Prometheus data the streaming rules use.
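The two cadences as two rule groups. A sketch for a 99.9% SLO; error_ratio:rate5m and the other error_ratio:* series are hypothetical recording rules that precompute the error ratio over each window with the same label set on both sides:

```yaml
groups:
  - name: slo-fast-burn
    interval: 30s                       # streaming arm: catches the spike
    rules:
      - alert: SLOFastBurn
        expr: |
          error_ratio:rate5m > (14.4 * 0.001)
            and
          error_ratio:rate1h > (14.4 * 0.001)
        labels:
          severity: page
  - name: slo-slow-burn
    interval: 1h                        # batch arm: confirms the trend
    rules:
      - alert: SLOSlowBurn
        expr: |
          error_ratio:rate6h > (3 * 0.001)
            and
          error_ratio:rate1d > (3 * 0.001)
        labels:
          severity: ticket
```

The 14.4x factor is the standard fast-burn threshold (2% of a 30-day budget consumed in one hour); the slow arm trades a lower threshold for longer confirmation windows.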
Pick by latency budget
If a detection delay longer than 5 minutes is unacceptable, you need streaming. Otherwise batch is cheaper and quieter.
Avoid streaming evaluation on signals that change less than once per hour. The query load is wasted.
Sensible default streaming intervals: 30 seconds for Prometheus rules, 60 seconds for Datadog monitors. Batch: hourly or daily, depending on the signal.
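Where the Prometheus default lives, as a sketch of prometheus.yml; batch groups override it with a per-group interval as in the earlier snippets:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 30s   # streaming default for any rule group without its own interval
```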