The Observability Maturity Model in Five Stages
An honest map of where teams actually are and what the next step is. Most teams sit between stages 2 and 3 and overestimate their position.
Stage 1: logs and ssh
Engineers ssh into boxes; tail logs; sometimes write a dashboard nobody else can read. The system survives because someone always knows.
Cost: low until an incident, then very high. Recovery depends on tribal knowledge.
Stage 2: dashboards and alerts
Centralised dashboards in Grafana or Datadog; on-call alerts that page. The team has a runbook for the top 5 alerts.
Cost: tooling subscriptions plus alert fatigue. Recovery time is bounded, but on-call burns out.
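A stage-2 alert is typically a static threshold over a short window. A minimal sketch of the idea (the function names and the 5% threshold are illustrative, not taken from any particular tool):

```python
# Hypothetical stage-2 alert rule: page when the windowed error rate
# crosses a static threshold. Names and values are illustrative.

def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests in the window."""
    return errors / requests if requests else 0.0

def should_page(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Static-threshold alert: fires whenever the rate exceeds threshold."""
    return error_rate(errors, requests) > threshold
```

A brief spike of 6 errors in 100 requests pages even if the service recovers on its own, which is one source of the alert fatigue noted above.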
Stage 3: SLOs and tracing
SLOs defined per service; tracing instrumentation across the request path; postmortems with action items.
Cost: meaningful engineering investment in observability as a discipline. Recovery time and burnout both improve.
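What makes an SLO operational rather than aspirational is the error budget it implies. A minimal sketch, assuming a simple availability SLO (the 99.9% target and 30-day window are example values):

```python
# Sketch of an SLO error budget. Target and window are example values.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total minutes of allowed unavailability in the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime; spending
# 20 of them leaves roughly 54% of the budget for the rest of the window.
```

The budget turns incident review into arithmetic: a postmortem action item is urgent when the remaining fraction is low, deferrable when it is not.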
Stage 4: unified telemetry
All telemetry flows through one pipeline (OpenTelemetry); queries span metrics, logs, and traces; dashboards are composed from one data source.
Cost: migration from prior tools. Payback is faster diagnosis and lower aggregate spend.
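The practical payoff of one pipeline is that every signal carries the same correlation key. A minimal sketch of that property (the record shape is illustrative; OpenTelemetry's actual data model is far richer):

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    """One telemetry record; kind is 'metric', 'log', or 'trace'."""
    kind: str
    trace_id: str
    body: dict = field(default_factory=dict)

def correlate(signals: list[Signal], trace_id: str) -> list[Signal]:
    """Cross-signal query: everything that touched one request."""
    return [s for s in signals if s.trace_id == trace_id]
```

With a shared trace_id, one query returns the latency metric, the error log, and the span for the same request, which separate per-signal tools cannot do without manual timestamp matching.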
Stage 5: agentic remediation
Agents consume the unified telemetry of stage 4 to diagnose incidents and propose or execute guarded remediations; humans review decisions instead of firefighting.
Cost: engineering the guardrails and review loops. Payoff: shorter incidents and quieter on-call.
Failure modes
- Skipping stages. The discipline of stage N is what makes N+1 possible. Trying to deploy agents on chaotic telemetry fails predictably.
- Buying tools without changing practice. Datadog at stage 1 is just expensive logs.
The test that places you
Three moves. (1) Self-assess honestly: which stage describes your team in week 3 of an incident-heavy quarter? (2) Identify the next-stage capability you do not have. (3) Plan a quarter to add it; do not skip.