The OTel SDK Version Discipline
OTel SDKs evolve fast. Version discipline keeps them current without breaking the fleet.
Pin versions
OpenTelemetry SDK versioning is a small discipline that pays off in proportion to fleet size. Different SDK versions emit slightly different telemetry, debugging across services on mixed versions is harder, and vendor compatibility varies by version. Pinning the SDK version organization-wide eliminates these inconsistencies.
What pinning looks like:
- All services use the same SDK version: Every service in the organization uses the same version of the OpenTelemetry SDK. The telemetry shape is consistent; the same fields appear; the same semantics apply.
- Drift causes inconsistent telemetry: Different SDK versions sometimes emit different attributes for the same operation. Some versions emit deprecated attributes; some emit new ones. The mixed telemetry makes dashboards inconsistent and queries unreliable.
- Version is policy: The approved SDK version is documented and treated as policy. Engineering leadership endorses the version; new services adopt it; existing services migrate to it.
- Approved versions are listed: The list typically includes the current approved version and the immediately previous version (for grace-period transitions). Services on older versions are out of policy and need to migrate.
- New services use the approved one: New service templates and starter kits include the approved SDK version. Developers do not need to research version choices; the template makes the right choice for them.
Pinning is the foundation. Without it, the SDK version landscape becomes a long tail of versions that nobody can fully support.
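As a minimal sketch of how the policy could be checked mechanically, the Python below reads a service's requirements file and classifies its SDK pin. The version numbers, the `APPROVED_VERSIONS` mapping, and both function names are illustrative assumptions, not part of any OpenTelemetry API. The only real name is `opentelemetry-sdk`, the Python SDK's package on PyPI; services in other languages would need their own manifest parser.

```python
import re
from typing import Optional

# Hypothetical policy: the current approved version plus the previous
# one, which stays valid during the grace period.
APPROVED_VERSIONS = {"current": "1.27.0", "previous": "1.26.0"}

# Matches an exact pin such as "opentelemetry-sdk==1.27.0".
PIN_PATTERN = re.compile(r"^opentelemetry-sdk==(?P<version>[0-9][^\s;#]*)")


def pinned_sdk_version(requirements_text: str) -> Optional[str]:
    """Return the exactly pinned SDK version, or None if there is no exact pin."""
    for line in requirements_text.splitlines():
        match = PIN_PATTERN.match(line.strip())
        if match:
            return match.group("version")
    return None


def policy_status(version: Optional[str]) -> str:
    """Classify a pinned version against the approved list."""
    if version is None:
        return "unpinned"
    if version == APPROVED_VERSIONS["current"]:
        return "current"
    if version == APPROVED_VERSIONS["previous"]:
        return "grace"
    return "out-of-policy"
```

A check like this could run in CI for every service, failing the build on `unpinned` or `out-of-policy` and warning on `grace`. Treating a range such as `>=1.0` as unpinned is a deliberate choice in this sketch: a range lets two services resolve to different versions, which is the drift the policy is meant to prevent.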
Upgrade cadence
Pinning is not freezing. The approved version evolves, and the team upgrades on a regular cadence: fast enough to capture improvements and security fixes, slow enough to avoid churn.
- Bump the approved version quarterly: Once per quarter, the team evaluates the latest SDK version. If it represents real improvement, the approved version is bumped. The cadence matches the SDK release pace.
- New version tested in pre-prod first: Before promotion to approved, the new version is tested in pre-production. The telemetry shape, the dashboards and queries, and vendor compatibility are all verified. Promotion happens only after testing.
- Backport critical fixes: If the SDK ships a critical fix between quarterly upgrades, the fix is adopted out of cycle. The approved version is updated to include the fix, and services adopt the patched version. The discipline is responsive to real issues.
- Most upgrades are routine: Most quarterly upgrades are uneventful. The SDK is mature; backward compatibility is the norm; the upgrade is mechanical. Some upgrades involve significant changes; those get extra attention.
- Document each upgrade: Each version bump produces release notes documenting what changed and what to watch for. Teams using the SDK can reference the notes, which minimizes surprises.
The cadence is what keeps the version current without creating constant migration work. Quarterly upgrades match the typical SDK release rhythm.
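The pre-production check on telemetry shape can be partly automated. The sketch below assumes the team has already collected, for each span name, the set of attribute keys emitted under the approved version and under the candidate version; how that data is captured is left out. The function name `shape_diff` and the sample span are hypothetical. The attribute names in the usage note come from the HTTP semantic conventions, where `http.method` was renamed to `http.request.method`.

```python
from typing import Dict, List, Set


def shape_diff(
    baseline: Dict[str, Set[str]],
    candidate: Dict[str, Set[str]],
) -> Dict[str, Dict[str, List[str]]]:
    """Report attribute keys removed or added per span name.

    baseline:  span name -> attribute keys seen on the approved version
    candidate: span name -> attribute keys seen on the version under test
    Span names whose attribute keys are identical are omitted.
    """
    report: Dict[str, Dict[str, List[str]]] = {}
    for span_name in sorted(set(baseline) | set(candidate)):
        old_keys = baseline.get(span_name, set())
        new_keys = candidate.get(span_name, set())
        removed = sorted(old_keys - new_keys)
        added = sorted(new_keys - old_keys)
        if removed or added:
            report[span_name] = {"removed": removed, "added": added}
    return report
```

For example, comparing `{"GET /orders": {"http.method"}}` against `{"GET /orders": {"http.request.method"}}` reports `http.method` as removed and `http.request.method` as added. Every removed key is a candidate for a broken dashboard or query, so a report like this is a reasonable starting point for the release notes that accompany the bump.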
Track drift
Even with policy and cadence, drift happens: services miss upgrade windows, upgrades cause issues that are not fixed promptly, and legacy services go without active maintenance. Drift tracking surfaces these cases.
- Dashboard showing which SDK version each service runs: The dashboard enumerates services and their SDK versions. The view is at-a-glance; outliers (services on old versions) are immediately visible.
- Outliers are surfaced: Services significantly behind the approved version are flagged. The flag goes to the service owner; remediation is tracked.
- Goal of 95% of services on the current approved version: Perfect is unrealistic; 95% is a meaningful target. It accommodates legitimate edge cases (services in maintenance mode, services in hand-off, etc.) while keeping the long tail bounded.
- Stragglers get individual attention: The 5% that are not on the current version are handled individually. Each has a reason; each has a plan to migrate; each has a target date. The discipline prevents the long tail from growing.
- Track over time: The percentage on the current version is a tracked metric. The trend matters: a stable value around 95% means the discipline is working; a declining trend means upgrades need more attention.
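The numbers behind such a dashboard are simple to compute. The sketch below assumes an inventory that maps each service name to the SDK version it runs; where that inventory comes from (build manifests, resource attributes on the telemetry itself, or something else) is not covered here. The function name, the report fields, and the default `target` of 0.95 are illustrative, with the target taken from the 95% goal above.

```python
from collections import Counter
from typing import Any, Dict


def drift_report(
    service_versions: Dict[str, str],
    approved_current: str,
    target: float = 0.95,
) -> Dict[str, Any]:
    """Summarize SDK version drift across a fleet.

    service_versions: service name -> SDK version it currently runs
    approved_current: the current approved SDK version
    target:           share of services expected on the approved version
    """
    total = len(service_versions)
    on_current = sum(
        1 for version in service_versions.values() if version == approved_current
    )
    # An empty inventory is reported as 0.0 rather than dividing by zero.
    share = on_current / total if total else 0.0
    stragglers = sorted(
        name
        for name, version in service_versions.items()
        if version != approved_current
    )
    return {
        "share_on_current": share,
        "meets_target": share >= target,
        "stragglers": stragglers,
        "by_version": dict(Counter(service_versions.values())),
    }
```

Recording `share_on_current` on each run gives the trend line, and the `stragglers` list is the natural input for per-service follow-up: each entry needs an owner, a reason, and a target date.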
OTel SDK version discipline is an observability hygiene practice that pays off in consistency. Nova AI Ops integrates with service telemetry and SDK version data, surfaces drift across the fleet, and produces the per-service migration queue that the platform team uses to keep versions in sync.