The OTel Collector Deployment Pattern That Scales
Sidecar, DaemonSet, or gateway? The deployment topology that handles 10M+ spans per minute without falling over.
OpenTelemetry Collector deployment patterns determine where collectors run. Sidecar, DaemonSet, and gateway are the three primary patterns, and each has different operational characteristics. Mature deployments often combine them, so understanding each pattern is the foundation for choosing well.
Sidecar: per-pod collector
What sidecar provides:
- One collector per app pod: Each application pod has its own collector running as a sidecar container. The application sends telemetry to localhost; the collector handles the rest.
- Cheap to debug: Each collector handles only one pod's telemetry, so any misbehavior can be traced to a single workload and investigation is straightforward.
- Failure isolated: A failed sidecar affects only its own pod. Other pods keep producing telemetry, so the system as a whole stays robust.
- Cost is high at scale: Every pod carries a collector, so aggregate CPU and memory consumption grows with pod count. In large deployments, this makes sidecar the most expensive of the three patterns.
- Each pod runs a collector even if it produces little telemetry: The cost is per pod, regardless of telemetry volume. A pod that emits a handful of spans reserves the same collector resources as a busy one.
Sidecar is good for development, small clusters, or specific high-isolation use cases. The cost makes it impractical at scale.
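A rough way to see the per-pod cost is to add it up. The sketch below is illustrative only: the `Pod` type, the pod names, and the 128 MiB / 100 millicore per-sidecar requests are assumptions, not measurements from a real cluster.

```python
from dataclasses import dataclass

# Hypothetical per-sidecar resource requests; real values depend on the pipeline.
SIDECAR_MEMORY_MIB = 128
SIDECAR_CPU_MILLICORES = 100


@dataclass
class Pod:
    name: str
    spans_per_min: int


def sidecar_footprint(pods):
    """Aggregate resources reserved when every pod runs its own collector."""
    return {
        "collectors": len(pods),
        "memory_mib": len(pods) * SIDECAR_MEMORY_MIB,
        "cpu_millicores": len(pods) * SIDECAR_CPU_MILLICORES,
    }


def memory_per_1k_spans(pod):
    """MiB of collector memory reserved per 1,000 spans/minute for one pod."""
    if pod.spans_per_min == 0:
        return float("inf")  # the pod pays for a collector it never uses
    return SIDECAR_MEMORY_MIB * 1000 / pod.spans_per_min


pods = [Pod("checkout", 50_000), Pod("cron-report", 100)]
footprint = sidecar_footprint(pods)
for pod in pods:
    print(pod.name, memory_per_1k_spans(pod))
```

Under these assumed numbers, the busy pod reserves about 2.6 MiB of collector memory per 1,000 spans per minute, while the quiet one reserves 1,280 MiB for the same volume. The reservation is identical; only the amount of telemetry it serves differs.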
DaemonSet: per-node collector
DaemonSet deploys one collector per node. The collector serves all pods on the node; the resource cost is amortized across the pods. The pattern is the standard for production deployments.
- One collector per node, shared across pods: Each Kubernetes node runs one collector. All pods on the node send their telemetry to the local collector, which forwards it to the next layer.
- Reduces collector count by 10 to 100x: A node typically hosts 10 to 100 pods. Running one collector per node instead of one per pod cuts the collector count by the same factor.
- Failure affects all pods on the node: The trade-off is reduced isolation. A failed DaemonSet collector takes telemetry down for every pod on its node, so the blast radius is the node, not the pod.
- Debug is harder: Investigating a DaemonSet collector means considering every pod that sends to it, which makes the investigation more complex than for a sidecar.
- Standard pattern for production: Despite the trade-offs, DaemonSet is the production standard. The cost savings are too significant to ignore, and the failure modes are manageable with proper monitoring.
DaemonSet is the production answer. The pattern balances cost and operational characteristics for typical workloads.
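The count reduction and the wider blast radius are two sides of the same arithmetic. A minimal sketch, assuming pods are spread evenly across nodes; the function names and the example cluster size are hypothetical:

```python
import math


def collector_count(pattern, pods, pods_per_node):
    """Number of collectors needed for a cluster under each pattern."""
    if pattern == "sidecar":
        return pods  # one collector per pod
    if pattern == "daemonset":
        return math.ceil(pods / pods_per_node)  # one collector per node
    raise ValueError(f"unknown pattern: {pattern}")


def blast_radius(pattern, pods_per_node):
    """Pods that lose telemetry when a single collector fails."""
    if pattern == "sidecar":
        return 1
    if pattern == "daemonset":
        return pods_per_node
    raise ValueError(f"unknown pattern: {pattern}")


# Assumed cluster: 2,000 pods at 20 pods per node.
sidecars = collector_count("sidecar", 2000, 20)
node_collectors = collector_count("daemonset", 2000, 20)
reduction = sidecars / node_collectors
print(sidecars, node_collectors, reduction, blast_radius("daemonset", 20))
```

For this assumed cluster, 2,000 sidecars become 100 node-level collectors, a 20x reduction, and the price is that one collector failure now silences 20 pods instead of one.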
Gateway: centralized pipeline
Gateway is the centralized aggregation layer. After per-pod or per-node collectors, telemetry flows to the gateway for final processing before reaching the vendor.
- Final aggregation before vendor send: The gateway is the last stop before telemetry leaves the cluster for the vendor. It can consist of many collector instances; together they form the gateway tier.
- Handles routing: The gateway routes telemetry to the right vendor based on type and source: logs to the log vendor, metrics to the metric vendor, traces to the trace vendor. The routing rules are part of the gateway configuration.
- Sampling: Tail sampling happens at the gateway, where the decision can use the complete trace. That only holds if every span of a trace reaches the same gateway instance; when the tier has several instances, the upstream collectors need to route spans by trace ID (for example, with the Collector's load-balancing exporter).
- Transformation: The gateway can transform telemetry: redact PII, enrich with metadata, drop unwanted attributes. What comes out of the transformation is what the vendor receives.
- Required at scale: Without a gateway, sampling and routing happen in the collectors closer to the application. Those decentralized decisions are hard to coordinate; the gateway centralizes them.
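Tail sampling at a multi-instance gateway tier depends on all spans of a trace landing on the same instance. The sketch below shows one way to get that property, by hashing the trace ID to pick an instance. It is an illustration, not the Collector's implementation: the span dictionaries, the instance names, and the "keep traces with an error" policy are all assumptions.

```python
import hashlib
from collections import defaultdict


def gateway_for(trace_id, instances):
    """Pick a gateway instance deterministically from the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]


def route(spans, instances):
    """Group spans by the gateway instance that should receive them."""
    buckets = defaultdict(list)
    for span in spans:
        buckets[gateway_for(span["trace_id"], instances)].append(span)
    return buckets


def keep_trace(trace_spans):
    """Hypothetical tail-sampling policy: keep any trace containing an error."""
    return any(span["status"] == "error" for span in trace_spans)


instances = ["gateway-0", "gateway-1", "gateway-2"]  # hypothetical names
spans = [
    {"trace_id": "a1", "name": "GET /cart", "status": "ok"},
    {"trace_id": "a1", "name": "db.query", "status": "error"},
    {"trace_id": "b2", "name": "GET /health", "status": "ok"},
]
buckets = route(spans, instances)
trace_a1 = [s for s in spans if s["trace_id"] == "a1"]
trace_b2 = [s for s in spans if s["trace_id"] == "b2"]
print(keep_trace(trace_a1), keep_trace(trace_b2))
```

Plain modulo hashing reshuffles most traces whenever the number of instances changes, so a real deployment would more likely use consistent hashing. The point here is only that the instance choice depends on the trace ID and nothing else.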
The OTel collector deployment pattern is one of those architectural decisions that compound over the lifetime of a team's observability stack. Nova AI Ops integrates with OTel collector deployments, surfaces deployment-pattern characteristics, and helps platform teams choose patterns that match their operational needs.