Symptoms of a Saturated OTel Collector
Saturated collectors drop telemetry silently. The symptoms, the metrics to watch, and the mitigations.
Symptoms
Saturated OpenTelemetry collectors fail in three observable ways. Receiver drops at the source, exporter retries climbing as the backend falls behind, and memory pressure leading to OOM. The queue-depth gauge catches saturation before drops start.
- Receiver dropped batches.
otelcol_processor_dropped_spansrate above zero. Telemetry lost at the source. - Exporter retry rates climbing. Backend cannot keep up. Data buffered or dropped depending on configuration.
- Memory at the cap. OOM imminent. Hard kill arrives if mitigation does not.
- Queue-depth gauge. In-flight queue size. The early-warning signal before drops start.
Metrics to alert on
The alert set is small and specific. Sustained drops are data loss; climbing failure rate is a backend or saturation issue. Memory and CPU triggers fire before OOM.
otelcol_processor_dropped_spansrate > 0 sustained. Data-loss alarm. Any sustained drop is real data lost.otelcol_exporter_send_failed_spans_totalrate increasing. Backend-or-saturation alarm. Distinguish vendor-side from collector-side at investigation time.- Memory and CPU utilisation. Resource-saturation alarm. Preempt OOM with auto-scaling triggers.
- Per-cluster dashboard panel. At-a-glance collector-health view. Investigation starts here.
Mitigations
Three mitigation modes cover most saturation events: scale out, tune batching, drop low-value telemetry. Each has a named command in the runbook.
- Scale horizontally. Add collector instances. Round-robin or sticky distribution per cluster.
- Tune batching. Larger batches, fewer exports. Reduces per-batch overhead at the cost of slightly higher latency.
- Drop low-value telemetry first. Increase head-sampling rate temporarily. Restore after the saturation event resolves.
- Documented runbook. Per-mitigation the named command and rollback. Fast response under pressure.
Design for saturation
Saturation handled at design time avoids the 3am page. Right-size collectors, configure auto-scaling, and prefer backpressure over silent drops.
- Right-size collectors. Traffic-based sizing per cluster. Default sizing is usually undersized for production volume.
- Auto-scaling thresholds. Memory and CPU triggers. Fixed-size deployments fail at peak.
- Backpressure to producers. Slow producers rather than drop data. Better to backpressure than to lose telemetry silently.
- Load-test calibration. Synthetic-load capacity test per cluster. Catches under-sizing before production does.