Symptoms of a Saturated OTel Collector
Saturated collectors drop telemetry silently. This note covers the symptoms, the metrics to watch, and the mitigations.
Symptoms
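Receivers refusing or pipelines dropping batches. Telemetry is lost at the ingest edge: otelcol_receiver_refused_spans rises when a receiver rejects data, and otelcol_processor_dropped_spans rises when a processor discards it.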
Exporter retry rates climbing. The backend cannot keep up, so data is buffered in the exporter queue and dropped once the queue fills.
Memory utilisation at the cap. An OOM kill of the collector is imminent.
Metrics to alert on
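otelcol_processor_dropped_spans rate > 0, sustained. Any sustained drop is data loss.
otelcol_exporter_send_failed_spans_total rate increasing. Points to a backend issue or exporter saturation. Whether counter names carry the _total suffix depends on the Collector version and export path, so match the names your collector actually emits.
Memory and CPU utilisation. Alert as usage approaches the configured limits so you can act before an OOM kill.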
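The "rate > 0 sustained" condition can be sketched in a few lines. This is a minimal illustration, not how any particular alerting system evaluates rules: the function names, the sample format (timestamp in seconds, counter value), and the three-window threshold are all assumptions made for the example.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp_seconds, counter_value); assumed format


def counter_rates(samples: List[Sample]) -> List[float]:
    """Per-second rates between consecutive counter samples.

    A counter that goes backwards is treated as a process restart, so the
    new value is taken as the increase since the reset.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip duplicate or out-of-order scrapes
        increase = v1 - v0 if v1 >= v0 else v1
        rates.append(increase / elapsed)
    return rates


def sustained_drop(samples: List[Sample], min_windows: int = 3) -> bool:
    """True when the most recent `min_windows` rates are all above zero."""
    rates = counter_rates(samples)
    if len(rates) < min_windows:
        return False
    return all(rate > 0 for rate in rates[-min_windows:])
```

Requiring several consecutive non-zero windows keeps a single transient blip from paging anyone, while any drop that persists still fires.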
Mitigations
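Scale horizontally: add collector instances behind a load balancer. Round-robin distribution suits stateless pipelines; sticky distribution (for example by trace ID) is needed when a processor must see every span of a trace.
Tune batching: larger batches mean fewer exports, which reduces per-batch overhead.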
Drop low-value telemetry first. Temporarily lower the head sampling percentage so fewer traces are kept; restore it after recovery.
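Shedding load through head sampling amounts to keeping a smaller share of traces for a while. The sketch below shows one way to make that decision deterministic per trace ID, so every span of a trace gets the same verdict. `keep_trace` and its hashing scheme are hypothetical and written for illustration; they are not the algorithm used by the Collector's own sampling processors.

```python
import hashlib


def keep_trace(trace_id: str, sampling_percentage: float) -> bool:
    """Decide whether to keep a trace at the given sampling percentage.

    The trace ID is hashed into one of 10,000 buckets; the trace is kept
    when its bucket falls below the threshold. Because the bucket depends
    only on the trace ID, a trace kept at a low percentage is also kept
    at every higher percentage.
    """
    if not 0.0 <= sampling_percentage <= 100.0:
        raise ValueError("sampling_percentage must be between 0 and 100")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < sampling_percentage * 100


# During saturation, drop from the normal percentage to a reduced one,
# then switch back after recovery. Both values are example numbers.
NORMAL_PERCENTAGE = 50.0
SATURATED_PERCENTAGE = 10.0
```

Since the reduced set is a subset of the normal set, traces that survive during the incident are ones that would have been kept anyway, which keeps before-and-after comparisons meaningful.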
Design for saturation
Right-size collectors based on measured traffic. Default sizing is usually too small for production.
Auto-scale on memory and CPU thresholds. Don't rely on fixed-size deployments at peak.
Apply backpressure to producers when the collector is saturated. It is better to slow producers than to drop data.
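The difference between backpressure and dropping can be shown with a bounded buffer. This is a minimal in-process sketch under stated assumptions: `BoundedBuffer`, `offer`, and `drain` are hypothetical names, and a real collector signals saturation to producers over the wire rather than through a shared queue.

```python
import queue
from typing import Any, List


class BoundedBuffer:
    """Fixed-capacity buffer that pushes back on producers instead of dropping."""

    def __init__(self, capacity: int) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue(maxsize=capacity)
        self.rejected = 0  # number of offers that timed out

    def offer(self, item: Any, timeout: float = 0.05) -> bool:
        """Wait up to `timeout` seconds for space.

        Returns False when the buffer stays full. The item is not lost:
        the producer still holds it and is expected to back off and retry.
        """
        try:
            self._queue.put(item, timeout=timeout)
            return True
        except queue.Full:
            self.rejected += 1
            return False

    def drain(self, max_items: int) -> List[Any]:
        """Remove and return up to `max_items` items in arrival order."""
        items: List[Any] = []
        while len(items) < max_items:
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```

A producer that receives False keeps the item and retries after a delay, so a full buffer slows the sender down rather than losing data. The `rejected` counter is the kind of signal worth exporting as a saturation metric.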