The Observability Maturity Model in Five Stages
An honest map of where teams actually are and what the next step is. Most teams sit between stages 2 and 3 and overestimate their position.
Stage 1: logs and ssh
Engineers ssh into boxes; tail logs; sometimes write a dashboard nobody else can read. The system survives because someone always knows.
Cost: low until an incident, then very high. Recovery depends on tribal knowledge.
Stage 2: dashboards and alerts
Centralised dashboards in Grafana or Datadog; on-call alerts that page. The team has a runbook for the top 5 alerts.
Cost: tooling subscriptions plus alert fatigue. Recovery time is bounded, but on-call burns out.
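A stage-2 alert is typically a static threshold over a short window. A minimal sketch of the idea (the function names and the 5% threshold are illustrative, not taken from any particular tool):

```python
# Hypothetical stage-2 alert rule: page when the windowed error rate
# crosses a static threshold. Names and values are illustrative.

def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests in the window."""
    return errors / requests if requests else 0.0

def should_page(errors: int, requests: int, threshold: float = 0.05) -> bool:
    """Static-threshold alert: fires whenever the rate exceeds threshold."""
    return error_rate(errors, requests) > threshold
```

A brief spike of 6 errors in 100 requests pages even if the service recovers on its own, which is one source of the alert fatigue noted above.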
Stage 3: SLOs and tracing
SLOs defined per service; tracing instrumentation across the request path; postmortems with action items.
Cost: meaningful engineering investment in observability as a discipline. Recovery time and burnout both improve.
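What makes an SLO operational rather than aspirational is the error budget it implies. A minimal sketch, assuming a simple availability SLO (the 99.9% target and 30-day window are example values):

```python
# Sketch of an SLO error budget. Target and window are example values.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Total minutes of allowed unavailability in the window."""
    return (1.0 - slo) * window_days * 24 * 60

def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (can go negative)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget

# A 99.9% SLO over 30 days allows 43.2 minutes of downtime; spending
# 20 of them leaves roughly 54% of the budget for the rest of the window.
```

The budget turns incident review into arithmetic: a postmortem action item is urgent when the remaining fraction is low, deferrable when it is not.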
Stage 4: unified telemetry
All telemetry flows through one pipeline (OpenTelemetry); queries span metrics, logs, and traces; dashboards are composed from one data source.
Cost: migration from prior tools. Payback is faster diagnosis and lower aggregate spend.
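The practical payoff of one pipeline is that every signal carries the same correlation key. A minimal sketch of that property (the record shape is illustrative; OpenTelemetry's actual data model is far richer):

```python
from dataclasses import dataclass, field

@dataclass
class Signal:
    """One telemetry record; kind is 'metric', 'log', or 'trace'."""
    kind: str
    trace_id: str
    body: dict = field(default_factory=dict)

def correlate(signals: list[Signal], trace_id: str) -> list[Signal]:
    """Cross-signal query: everything that touched one request."""
    return [s for s in signals if s.trace_id == trace_id]
```

With a shared trace_id, one query returns the latency metric, the error log, and the span for the same request, which separate per-signal tools cannot do without manual timestamp matching.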
Stage 5: agentic remediation
Agents consume the unified telemetry of stage 4 to diagnose incidents and propose or execute guarded remediations; humans review decisions instead of firefighting.
Cost: engineering the guardrails and review loops. Payoff: shorter incidents and quieter on-call.
Failure modes
- Skipping stages. The discipline of stage N is what makes N+1 possible. Trying to deploy agents on chaotic telemetry fails predictably.
- Buying tools without changing practice. Datadog at stage 1 is just expensive logs.
The test that places you
Three moves. (1) Self-assess honestly: which stage describes your team in week 3 of an incident-heavy quarter? (2) Identify the next-stage capability you do not have. (3) Plan a quarter to add it; do not skip.