The OTel Collector Config Discipline
OTel collector configs sprawl. This is the discipline that keeps them maintainable and tested.
Version control
The OpenTelemetry Collector configuration is one of the most operationally significant pieces of an observability stack. The config determines what telemetry is collected, what is filtered, what is transformed, and where it goes. Without discipline, the config drifts: changes happen in production without review; debugging changes are forgotten; the config becomes opaque. Discipline is what keeps the configuration manageable as it grows.
What good version control looks like:
- Collector configs in git.: The config lives in git, alongside other infrastructure code. The history is preserved; the changes are tracked; the source of truth is clear. The git copy is what gets deployed; nothing reaches production that has not been committed.
- Reviewed like code.: Changes to the config go through pull request review. The reviewer checks for correctness, performance impact, and security implications. The review catches issues before they reach production.
- Deployed via CI.: The config is deployed through a CI pipeline. The pipeline validates, tests, and applies the change. The pipeline produces an audit trail of every change.
- No clicked-together configs in production.: Production should not have configs that exist only because someone clicked them together in an emergency. Every production config originates from version-controlled source. Emergency changes are followed by a commit that records what was done.
- Tagged releases.: Major config versions are tagged. Rollback to a known-good version is fast: deploy the tagged release. Without tags, rollback requires identifying the right git SHA.
Version control is the foundation. Everything else assumes that the config is in git and that production matches what is in git.
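The requirement that production matches git can be checked mechanically. The following is a minimal sketch, not a prescribed tool: it assumes the committed config and the deployed config are both available as text (how the deployed copy is fetched, for example from a ConfigMap or a file on the host, is left out), and the function names are hypothetical.

```python
import hashlib


def config_digest(text: str) -> str:
    """Return a SHA-256 digest of a config, ignoring line-ending and
    trailing-whitespace differences that do not change its meaning."""
    lines = text.replace("\r\n", "\n").split("\n")
    normalized = "\n".join(line.rstrip() for line in lines).strip("\n")
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_drifted(committed: str, deployed: str) -> bool:
    """True when the deployed config differs from the committed one."""
    return config_digest(committed) != config_digest(deployed)


if __name__ == "__main__":
    committed = "receivers:\n  otlp:\n"
    deployed = "receivers:\r\n  otlp:   \r\n"  # same content, different whitespace
    print("drift detected" if has_drifted(committed, deployed) else "in sync")
```

Comparing digests rather than raw text keeps the check cheap to run on a schedule, and the normalization avoids false alarms from editors or deployment tools that rewrite line endings.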
Test the config
An untested config is a guess about behavior. Testing produces confidence that the config does what is intended. The investment in testing is small relative to the cost of broken telemetry pipelines.
- Unit tests.: Send sample telemetry through the collector; verify the processors apply correctly; verify the exporters receive the right data. The tests run in CI; broken processors fail the build before reaching production.
- Send sample telemetry through.: The test harness produces synthetic logs, metrics, and traces. The harness pushes them through the collector and observes the output. The output is compared to expectations.
- Verify processors and exporters do the right thing.: Did the filter drop what it should drop? Did the transform apply correctly? Did the exporter route to the right backend? Each transformation step has explicit verification.
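- Linting.: A lint step catches common errors: invalid syntax, missing required fields, deprecated configurations. The collector binary's own `validate` subcommand (for example, `otelcol validate --config=config.yaml`) is a common baseline, since it loads the config and reports errors without running the collector; a YAML linter or schema check can sit alongside it. The lint runs on every commit, so common errors are caught before they reach the test stage.
- Lint catches common errors before deploy.: The lint is fast (seconds) and catches most syntactic errors. The cost is small, and rejecting a bad config before deploy is worth far more than the time it takes.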
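One class of error a lint pass looks for is a pipeline that references a component nobody defined. A minimal sketch of that check follows, assuming the config has already been parsed into a dictionary (for example by a YAML library) and follows the usual collector layout of top-level `receivers`, `processors`, `exporters`, and `connectors` sections plus `service.pipelines`. The function name is hypothetical, and this does not replace the collector's own validation.

```python
def undefined_references(config: dict) -> list[str]:
    """List pipeline component references that have no top-level definition."""
    # A connector can act as an exporter in one pipeline and a receiver in another.
    connectors = set(config.get("connectors") or {})
    defined = {
        "receivers": set(config.get("receivers") or {}) | connectors,
        "processors": set(config.get("processors") or {}),
        "exporters": set(config.get("exporters") or {}) | connectors,
    }
    problems = []
    pipelines = (config.get("service") or {}).get("pipelines") or {}
    for name, pipeline in pipelines.items():
        for kind, known in defined.items():
            for component in pipeline.get(kind) or []:
                if component not in known:
                    problems.append(f"{name}: {kind} '{component}' is not defined")
    return problems


if __name__ == "__main__":
    example = {
        "receivers": {"otlp": {}},
        "processors": {"batch": {}},
        "exporters": {"debug": {}},
        "service": {
            "pipelines": {
                "traces": {
                    "receivers": ["otlp"],
                    # "filter/health" is referenced but never defined above.
                    "processors": ["batch", "filter/health"],
                    "exporters": ["debug"],
                }
            }
        },
    }
    for problem in undefined_references(example):
        print(problem)
```

A CI job can fail the build when the returned list is non-empty, which gives the reviewer a precise message instead of a collector that refuses to start.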
Testing is what turns config changes from gambles into deliberate, validated changes. Without testing, every production deploy carries the risk that the config does something unexpected.
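The compare-to-expectations step of such a test can be sketched as follows. This assumes the collector's output has already been captured as JSON lines, one record per line (for example by an exporter that writes to a file), and it uses flat, hypothetical records rather than the full OTLP JSON shape. Volatile fields such as timestamps are dropped before comparing; the field name `timestamp` and the function names are assumptions for illustration.

```python
import json
from collections import Counter


def load_records(jsonl: str, ignore: tuple[str, ...] = ("timestamp",)) -> list[str]:
    """Parse JSON lines into canonical strings, dropping volatile fields."""
    records = []
    for line in jsonl.splitlines():
        if not line.strip():
            continue  # tolerate blank lines in captured output
        record = json.loads(line)
        for field in ignore:
            record.pop(field, None)
        # sort_keys makes the string independent of key order in the input.
        records.append(json.dumps(record, sort_keys=True))
    return records


def compare_output(expected_jsonl: str, actual_jsonl: str) -> dict:
    """Report records missing from, or unexpected in, the actual output.

    Counters are used so that duplicated records are compared by count,
    and record order does not matter.
    """
    expected = Counter(load_records(expected_jsonl))
    actual = Counter(load_records(actual_jsonl))
    return {
        "missing": sorted((expected - actual).elements()),
        "unexpected": sorted((actual - expected).elements()),
    }


if __name__ == "__main__":
    expected = '{"service": "web", "route": "/checkout", "timestamp": 1}\n'
    # The health-check record should have been dropped by a filter.
    actual = (
        '{"route": "/checkout", "service": "web", "timestamp": 7}\n'
        '{"route": "/healthz", "service": "web", "timestamp": 8}\n'
    )
    print(compare_output(expected, actual))
```

An empty `missing` list and an empty `unexpected` list together mean the filter, transform, and routing steps produced exactly the expected records; anything else points at the step that misbehaved.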
Scale the discipline
The config grows over time. New services, new requirements, new exporters. Without scaling discipline, the config becomes monolithic and unmanageable. Per-environment overlays and modular configuration patterns keep the config navigable.
- Per-environment overlays.: A base config is shared across environments; per-environment overlays apply differences. Production might have additional exporters; dev might have looser sampling; staging might have additional processors. The overlay structure isolates differences.
- Base config plus dev/staging/prod overrides.: The pattern is well-established. Tools like Helm, Kustomize, and the OpenTelemetry Operator support this structure, and the collector itself merges its configuration sources when more than one `--config` is supplied. The team picks the tooling and applies it consistently.
- Each overlay reviewed.: Changes to per-environment overlays go through the same review as base config changes. Production-specific changes are not exempt from review; they are arguably more important to review carefully.
- No environment-specific magic.: No configuration that exists only in production and not in version control. No "we set this in production directly because dev does not need it"; that pattern produces unmanageable configurations over time.
- Config that lives only in production is a smell.: Any config that lives only in production warrants investigation. If it is correct, it should be in version control. If it is wrong, it should be removed.
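The base-plus-overlay pattern above comes down to a recursive merge. A minimal sketch follows, under one stated assumption about merge semantics: nested mappings merge key by key, while lists and scalar values in the overlay replace the base value wholesale. Real tools differ in how they treat lists, so check the behavior of whichever tool is chosen. The function name and the endpoint URL are hypothetical.

```python
import copy


def apply_overlay(base: dict, overlay: dict) -> dict:
    """Return a new config: nested mappings merge key by key,
    everything else in the overlay replaces the base value."""
    merged = copy.deepcopy(base)  # never mutate the shared base
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


if __name__ == "__main__":
    base = {
        "receivers": {"otlp": {}},
        "processors": {"probabilistic_sampler": {"sampling_percentage": 100}},
        "exporters": {"debug": {}},
        "service": {"pipelines": {"traces": {
            "receivers": ["otlp"],
            "processors": ["probabilistic_sampler"],
            "exporters": ["debug"],
        }}},
    }
    # Production overlay: tighter sampling and a different exporter.
    prod_overlay = {
        "processors": {"probabilistic_sampler": {"sampling_percentage": 10}},
        "exporters": {"otlphttp": {"endpoint": "https://telemetry.example.com"}},
        "service": {"pipelines": {"traces": {"exporters": ["otlphttp"]}}},
    }
    prod = apply_overlay(base, prod_overlay)
    print(prod["processors"]["probabilistic_sampler"])
    print(prod["service"]["pipelines"]["traces"]["exporters"])
```

Because the overlay contains only the differences, a reviewer reading the production overlay sees exactly what production does differently from the base, which is the isolation the overlay structure is meant to provide.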
OTel collector config discipline is one of the foundational practices that makes observability sustainable at scale. Nova AI Ops integrates with collector deployments, surfaces config drift, and produces the change-tracking visibility that the platform team uses to maintain configuration health across environments.