What observability is, and how it differs from monitoring
Observability is the property of a system that lets you understand its internal state from the outside, using only the telemetry it emits. The working test is simple: a system is observable if you can answer a brand-new question about its behavior without shipping new code to instrument it. If you have to add a log line, redeploy, and wait for the problem to recur before you can diagnose it, you do not have observability. You have logging.
The word comes from control theory, where a system is observable if its internal state can be inferred from its outputs. Applied to software, that means the metrics, logs, and traces a service emits must be rich enough, and queryable across enough dimensions, that an engineer can reconstruct what happened during an incident they did not anticipate.
That last clause is the whole game. The classic framing contrasts known-unknowns with unknown-unknowns. Monitoring is built for known-unknowns: you already know the failure modes you care about (disk fills up, p99 latency crosses 500ms, error rate exceeds 1%), so you build a dashboard and an alert for each one. Monitoring answers "is the thing I predicted happening?" Observability is built for unknown-unknowns: the failures nobody predicted, the ones that emerge only at the intersection of three conditions you never thought to graph together. Observability answers "what is happening, and why, when the dashboards all look green but customers are complaining?"
Concretely, the difference shows up in cardinality and queryability. A monitoring system pre-aggregates: it decides in advance which dimensions to keep (region, status code) and throws away the rest to keep storage cheap. An observability system keeps high-cardinality context (the exact request ID, user tier, build hash, feature flag state) so you can slice and filter after the fact, during the incident, by a dimension you only just realized matters. The two philosophies do not need to be enemies. In practice observability is the superset that makes monitoring more useful: you still want pre-built dashboards and alerts for the failures you can predict, and you want the raw observability data underneath for everything you could not.
The three pillars: metrics, logs, and traces
The canonical model of observability is three signal types, each strong exactly where the others are weak. The mistake teams make is betting on one and discovering the blind spot during an incident. Mature observability uses all three together.
| Pillar | Strong at | Weak at |
|---|---|---|
| Metrics | Cheap, always-on health; alerting; trends over time | Explaining a single request; high-cardinality slicing |
| Logs | Detailed per-event context on one component | Cross-service latency; cost at high volume |
| Traces | Where time and errors go across many services | Always-on coverage (usually sampled); raw cost |
Metrics: cheap numbers that tell you something is wrong
A metric is a numeric value measured over time: request rate, error count, p99 latency, CPU utilization, queue depth. Metrics are cheap because they are pre-aggregated into time series, a long run of (timestamp, value) pairs, so a year of one metric costs almost nothing to store and is instant to query. That makes metrics the natural foundation for dashboards and alerts. Their weakness is the flip side of their strength: aggregation throws away the per-request detail. A metric tells you error rate jumped to 4%, but not which requests failed or why. And every label you add (per-endpoint, per-customer, per-region) multiplies the number of time series, so metrics cannot carry high-cardinality context without exploding in cost. Reach for metrics when you want a constant, cheap pulse on system health and the signal to fire an alert.
Logs: timestamped records of what happened
A log is a timestamped record of a discrete event: a request was served, a query ran, an exception was thrown. Logs are where the detail lives: the stack trace, the offending SQL, the exact parameters. The single most valuable upgrade most teams can make is moving from unstructured text logs to structured logging: emit each log line as JSON with named fields (request_id, user_id, latency_ms, status) instead of a human-readable sentence. Structured logs are queryable and aggregatable; text logs force you into fragile grep. Logs are bad at two things: cost at high volume (verbose logging at scale gets expensive fast), and answering cross-service questions, because each log line describes one component in isolation and nothing stitches them into the story of a single request. Reach for logs when a metric or trace has told you which component to look at and you need the gory detail of what it did.
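As a rough illustration of structured logging, the sketch below uses only Python's standard `logging` and `json` modules to emit one JSON object per log line. The `JsonFormatter` class, the `build_logger` helper, and the "checkout" logger name are hypothetical names chosen for this example; the context fields (request_id, user_id, latency_ms, status) are the ones named in the text.

```python
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    # Field names borrowed from the examples in the text.
    CONTEXT_FIELDS = ("request_id", "user_id", "latency_ms", "status")

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for name in self.CONTEXT_FIELDS:
            if hasattr(record, name):
                payload[name] = getattr(record, name)
        return json.dumps(payload)


def build_logger(name: str = "checkout") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger()
    log.info(
        "request served",
        extra={"request_id": "req-123", "user_id": "u-42", "latency_ms": 87, "status": 200},
    )
```

Because every line is a JSON object with named fields, a log backend can filter on `status` or aggregate `latency_ms` directly instead of parsing free text.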
Traces: following one request across many services
A trace follows a single request as it moves through every service that touches it. In a monolith you barely need traces. In a microservices architecture where one user click fans out to a dozen backend calls, traces are the only pillar that can tell you which downstream call made the request slow. A trace shows the full call tree with timing on each hop, so the answer to "why is checkout slow?" goes from a multi-hour cross-team investigation to a glance. Traces are usually sampled (you cannot afford to trace every request at scale), which is why they complement rather than replace metrics for alerting. Reach for traces when latency or errors span service boundaries and you need to know where, in a distributed call path, the problem actually lives.
The honest take on "three pillars." Treating metrics, logs, and traces as three separate stores you query separately is the legacy model, and it has a real cost: you context-switch between three tools mid-incident and correlate their output by hand. The modern direction, covered in the next section, is to derive all three from one rich event stream so they are correlated by construction. Think of the three pillars as three views of your system, not three databases you maintain in isolation.
Beyond the three pillars: events, profiles, and OpenTelemetry
The three-pillars model is a useful teaching frame, but it is not the frontier. Three ideas matter for where observability is going in 2026.
Wide structured events
The most important shift is from three separate signal types toward wide structured events: a single record per unit of work that carries dozens or even hundreds of dimensions as queryable fields. One event for "handled this HTTP request" might carry the request ID, user tier, region, build version, feature-flag state, cache hit/miss, downstream latencies, and the final status, all on one row. Metrics, logs, and traces then become different ways of looking at that same event stream rather than three databases you populate independently. The payoff is twofold: you escape the cardinality ceiling of pre-aggregated metrics, and you can ask genuinely new questions after the fact ("show me requests from enterprise-tier users in eu-west on build 4471 that were cache misses"), which is the literal definition of observability.
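To make the idea concrete, here is a minimal sketch of wide events as plain Python dictionaries, with an ad hoc filter standing in for a real query engine. The sample events, field names, and the `where` helper are all hypothetical; a production system would run this kind of query in a columnar event store, not in application code.

```python
from typing import Any, Iterable

Event = dict[str, Any]

# Hypothetical wide events: one record per handled HTTP request,
# each carrying many dimensions on the same row.
EVENTS: list[Event] = [
    {"request_id": "r1", "user_tier": "enterprise", "region": "eu-west",
     "build": 4471, "cache_hit": False, "duration_ms": 912, "status": 200},
    {"request_id": "r2", "user_tier": "free", "region": "eu-west",
     "build": 4471, "cache_hit": False, "duration_ms": 130, "status": 200},
    {"request_id": "r3", "user_tier": "enterprise", "region": "us-east",
     "build": 4471, "cache_hit": True, "duration_ms": 45, "status": 200},
    {"request_id": "r4", "user_tier": "enterprise", "region": "eu-west",
     "build": 4470, "cache_hit": False, "duration_ms": 88, "status": 200},
    {"request_id": "r5", "user_tier": "enterprise", "region": "eu-west",
     "build": 4471, "cache_hit": False, "duration_ms": 1040, "status": 500},
]


def where(events: Iterable[Event], **conditions: Any) -> list[Event]:
    """Return the events whose fields equal every given condition."""
    return [
        event for event in events
        if all(event.get(key) == value for key, value in conditions.items())
    ]


# The question from the text, asked after the fact with no new instrumentation.
matches = where(EVENTS, user_tier="enterprise", region="eu-west",
                build=4471, cache_hit=False)
print([event["request_id"] for event in matches])  # ['r1', 'r5']
```

The point of the sketch is that none of the filter dimensions had to be chosen in advance: any field on the event can become a query predicate later.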
Continuous profiling
Profiling is sometimes called a fourth pillar. Continuous profiling samples where your code actually spends CPU and allocates memory, in production, all the time, not just in a one-off profiling session. When a trace tells you a service is slow but the service makes no obvious slow calls, a profile shows you the hot function inside it. It closes the last gap between "which service" (traces) and "which line of code" (profiles).
OpenTelemetry: the unifying standard
OpenTelemetry (OTel) is the open, vendor-neutral standard for generating and collecting all of this telemetry, and it is the single most consequential development in observability over the last few years. OTel defines one set of APIs, SDKs, and wire formats for metrics, logs, and traces, plus a Collector that receives, processes, samples, and exports telemetry to any backend. The strategic value is that you instrument your code once, against OTel, and stay free to switch or run multiple backends without re-instrumenting. Before OTel, choosing a vendor meant choosing their proprietary agent and getting locked in. With OTel, the instrumentation is yours, the backend is a swappable detail, and the Collector becomes the control point where you shape and budget your telemetry. If you are starting an observability practice in 2026, instrument with OpenTelemetry first and pick backends second.
Distributed tracing in depth
Distributed tracing deserves its own section because it is the pillar teams most often skip and most often regret skipping. Here is the model in full.
Spans, traces, and the call tree
The atomic unit is a span: one named, timed operation, for example "GET /checkout", "SELECT from orders", or "publish to Kafka". Each span records a start time, a duration, a status, and a bag of key/value attributes. A trace is the tree of spans for one request: a root span (the inbound request) with child spans for every operation it triggered, nested to show causality and timing. Render a trace as a waterfall and you can read, at a glance, that the request took 800ms, that 600ms of it was one database call, and that the call returned an error. That is the diagnosis, in one picture, that used to take an hour of cross-team log spelunking.
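The span-and-tree model can be sketched in a few lines of Python. The `Span` dataclass, the `waterfall` renderer, and the `slowest_leaf` helper below are hypothetical names for illustration only, not the data model of any particular tracing library; the sample trace mirrors the 800ms request with a 600ms failing database call described above, with start offsets invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One named, timed operation (illustrative model, not a library type)."""
    name: str
    start_ms: int          # offset from the start of the trace
    duration_ms: int
    status: str = "ok"
    attributes: dict[str, str] = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)


def waterfall(span: Span, depth: int = 0) -> list[str]:
    """Render the span tree as indented lines, children ordered by start time."""
    lines = [
        f"{'  ' * depth}{span.name}  +{span.start_ms}ms  "
        f"{span.duration_ms}ms  [{span.status}]"
    ]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        lines.extend(waterfall(child, depth + 1))
    return lines


def slowest_leaf(span: Span) -> Span:
    """Return the leaf span with the longest duration in the tree."""
    if not span.children:
        return span
    return max((slowest_leaf(c) for c in span.children),
               key=lambda s: s.duration_ms)


root = Span("GET /checkout", 0, 800, children=[
    Span("SELECT from orders", 120, 600, status="error",
         attributes={"db.system": "postgresql"}),
    Span("publish to Kafka", 740, 40),
])

print("\n".join(waterfall(root)))
culprit = slowest_leaf(root)
print(f"slowest leaf: {culprit.name} ({culprit.duration_ms}ms, {culprit.status})")
```

Real tracing backends store spans flat with parent IDs and rebuild the tree at query time; the nested form here is just the simplest way to show the waterfall reading.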
Context propagation: the part that actually matters
The mechanism that makes tracing work across service boundaries is context propagation. When service A calls service B, it injects the trace ID and the current span ID into the outbound request headers (the W3C Trace Context standard, the traceparent header, is what OTel uses). Service B reads those headers, continues the same trace, and its spans link back to A's. Propagate that context through every hop, including across async boundaries like message queues, and the spans from a dozen independent services assemble into one coherent tree. Miss it on a single hop and the trace breaks into disconnected fragments, which is the single most common tracing failure in practice. Getting propagation right everywhere is most of the work of adopting tracing.
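The propagation step can be sketched without any tracing library. The code below builds and parses a `traceparent` header in the W3C Trace Context version `00` layout (`00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>`). The `SpanContext` type and the `inject`, `extract`, and `handle_in_service_b` functions are hypothetical names for this sketch, and headers are modeled as a plain dict with lowercase keys; in practice an OpenTelemetry SDK propagator does this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 only; other versions are treated as unparseable in this sketch.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str   # 32 lowercase hex characters
    span_id: str    # 16 lowercase hex characters
    sampled: bool


def inject(ctx: SpanContext, headers: dict[str, str]) -> None:
    """Service A: write the current context into the outbound headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def extract(headers: dict[str, str]) -> Optional[SpanContext]:
    """Service B: read the caller's context, or None if missing or invalid."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid under the specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    return SpanContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def handle_in_service_b(headers: dict[str, str]) -> SpanContext:
    """Continue the caller's trace if possible, otherwise start a new one."""
    parent = extract(headers)
    if parent is None:
        # Broken propagation: this span becomes a disconnected fragment.
        return SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)
    # Same trace ID, fresh span ID: this span links back to the caller's.
    return SpanContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


if __name__ == "__main__":
    a = SpanContext(secrets.token_hex(16), secrets.token_hex(8), True)
    outbound: dict[str, str] = {}
    inject(a, outbound)
    b = handle_in_service_b(outbound)
    print(outbound["traceparent"])
    print("same trace:", a.trace_id == b.trace_id)
```

The fallback branch in `handle_in_service_b` is exactly the failure mode described above: one hop that drops the header silently starts a new trace, and the tree splits.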
Why it matters for microservices
In a monolith, a stack trace already tells you the full call path; tracing adds little. The value of distributed tracing scales with the number of network hops per request. Once a single user action fans out across many services, no log on any one service can tell you where the latency went, because each service only sees its own slice. Tracing is the only pillar that reconstructs the whole journey. It is also what turns "the site is slow" from a finger-pointing meeting into a fact: the waterfall names the responsible service and the responsible call, so the right team owns it immediately. For teams running microservices, tracing is not optional polish; it is the difference between a diagnosable and an undiagnosable architecture.
Observability vs monitoring vs APM
These three terms get used interchangeably and they should not be. The distinctions are practical, not pedantic.
| Term | What it is | Where it fits |
|---|---|---|
| Monitoring | Watching predefined dashboards and alerts for known failure modes | Answers "is the thing I predicted happening?" A subset of observability. |
| Observability | The capability to ask arbitrary new questions of high-cardinality telemetry | The goal. Spans apps, infra, networks, and business events. |
| APM | A commercial product category for app-level latency, errors, and traces | One tool that helps reach observability, scoped to the application layer. |
Monitoring is the practice of watching for conditions you defined in advance. It is necessary and not going anywhere: you will always want a dashboard that turns red when the database is down. Monitoring is a subset of observability, the part that handles the predictable failures.
Observability is the broader capability to interrogate your system about things you did not predict. It spans more than applications: infrastructure, networks, queues, and business-level events all belong in an observability strategy. Observability is the goal; monitoring is one practice inside it.
APM (application performance monitoring) is a commercial product category, not a property of your system. APM tools focus on the application layer: request latency, error rates, and traces, usually delivered through a vendor agent and pre-built dashboards. A good APM tool is genuinely useful and often the fastest way to get tracing and latency visibility for your services. But it is scoped to the application, and it is one of several tools you assemble into an observability strategy, not the strategy itself. The clean way to hold it: observability is the outcome you want, monitoring and APM are practices and products that help you get there.
The 2026 observability tooling landscape and the cost problem
The 2026 landscape is defined by a strong open-standards core surrounded by commercial platforms that add convenience, scale, and analysis on top.
The open-standards core
OpenTelemetry is the instrumentation and collection standard, and it has effectively won; it is the safe default for generating telemetry. Prometheus is the de facto open standard for metrics: a pull-based time-series database with the PromQL query language, paired almost universally with Grafana for dashboards and visualization. For logs and traces, open options like Loki, Tempo, and Jaeger round out a fully open stack. A team can build a credible observability practice entirely on open tooling, which is a genuine change from a decade ago when serious observability meant a six-figure commercial contract on day one.
Where commercial platforms fit
Commercial platforms (the large observability vendors) earn their place in three areas: they remove the operational burden of running the storage and query tier at scale, they unify metrics, logs, and traces in one correlated interface, and they layer on analysis like automated anomaly detection and service maps. The reasonable 2026 posture is to instrument with OpenTelemetry so your data stays portable, then choose a backend, open or commercial, on the merits, knowing you can change your mind without re-instrumenting. That decoupling is exactly what OTel was designed to give you.
The cost problem: cardinality and data volume
The defining operational challenge of observability in 2026 is cost, and it has two engines: data volume and cardinality. Volume is the sheer quantity of logs and spans a busy system produces. Cardinality is subtler and more dangerous: every unique combination of label values on a metric creates a separate time series, so adding a single high-cardinality label like raw user ID or request ID to a metric can multiply your series count by millions and blow up both storage and query cost. Teams that ignore this wake up to an observability bill that rivals their compute bill.
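The multiplication is easy to check with a few lines of arithmetic. In this sketch the label names and their distinct-value counts are invented for illustration, and the product is an upper bound: a real system only creates series for label combinations that actually occur.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(label_cardinalities.values())


# Hypothetical labels on a single request-count metric.
base_labels = {"region": 5, "status_code": 8, "endpoint": 40}
print(series_count(base_labels))  # 1600

# Add one unbounded label and the bound grows by a factor of a million.
with_user_id = {**base_labels, "user_id": 1_000_000}
print(series_count(with_user_id))  # 1600000000
```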
The levers that bring it under control are well understood. Sample traces intelligently so you keep the interesting ones (errors, slow requests) and drop the boring majority. Process at the Collector: drop, aggregate, or down-sample low-value telemetry before it ever reaches paid storage. Tier your retention so hot, queryable data lives for days while cheap cold storage holds the rest. And keep high cardinality off your metrics: push it into events and traces, where it belongs, rather than onto pre-aggregated time series. The principle behind all four: pay for signal, not for raw volume.
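The first lever, keeping the interesting traces and dropping most of the rest, might look like the decision function below. This is a simplified sketch of a tail-style sampling rule, not the behavior of any specific Collector processor: the `keep_trace` name, the 500ms threshold, and the 1% baseline rate are assumptions for the example.

```python
import hashlib


def keep_trace(
    trace_id: str,
    had_error: bool,
    duration_ms: float,
    slow_threshold_ms: float = 500.0,   # assumed threshold for "slow"
    baseline_rate: float = 0.01,        # assumed share of ordinary traces kept
) -> bool:
    """Decide, once a trace is complete, whether to keep it."""
    # Always keep the interesting traces: errors and slow requests.
    if had_error or duration_ms >= slow_threshold_ms:
        return True
    # Keep a small, deterministic fraction of the boring majority.
    # Hashing the trace ID means every component makes the same decision
    # for the same trace without coordinating.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < baseline_rate


if __name__ == "__main__":
    print(keep_trace("trace-a", had_error=True, duration_ms=20))    # True
    print(keep_trace("trace-b", had_error=False, duration_ms=900))  # True
```

The decision needs the whole trace (its final status and duration), which is why this style of sampling runs after spans are collected rather than at the start of the request.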
From observability to action
Here is the point the tooling vendors underplay: data alone is not reliability. A perfect observability stack that nobody acts on prevents zero outages. Observability is the raw signal; reliability is what you do with the signal. The value is realized only at the point where something consumes the telemetry and changes an outcome.
That signal flows into four downstream practices. It feeds alerting, and here observability is double-edged: rich telemetry makes it trivial to create too many alerts, which is exactly how teams end up drowning in noise (see our guide to alert fatigue and how to fix it). It feeds SLOs and error budgets, where the metrics you collect become the basis for the reliability targets you commit to (see SLOs and error budgets). It feeds incident response, where traces and logs are what an on-call engineer, or an AI agent, reads to find root cause and cut MTTR (the full flow is covered in AI incident response). And increasingly it feeds autonomous remediation, where the signal does not stop at a human dashboard but drives an action.
This is where Nova AI Ops sits in the stack. Nova is not another place to store metrics; it is the layer that consumes the observability signal you already produce. It ingests your metrics, logs, and traces, correlates them across AWS, GCP, Azure, Linux, and Windows in one model rather than five disconnected views, identifies the probable root cause with provenance, and auto-resolves routine incidents within a policy envelope you define. Observability tells you what is happening and gives you the evidence; Nova reads that evidence, reaches the diagnosis, and acts on the well-understood cases so a human only sees the genuinely novel ones. The observability stack and the action layer are complementary: you keep your OpenTelemetry instrumentation and your backend of choice, and Nova turns their output into resolved incidents. For the foundational practice this all rests on, see site reliability engineering and the broader AIOps category.
A 90-day plan and a 10-point maturity checklist
Here is a practical sequence for standing up observability that minimizes wasted spend and shows value early. The principle throughout: instrument with open standards first, prove value on one service, then scale.
Days 1-14: Instrument one service with OpenTelemetry
Pick one important service and instrument it end to end with OpenTelemetry: metrics, structured JSON logs, and traces. Stand up the Collector and point it at a backend (open or commercial, your choice). Goal: prove the pipeline works and the team can read a trace waterfall. Do not boil the ocean; one well-instrumented service teaches more than ten half-instrumented ones.
Days 15-45: Establish the three pillars and structured logging everywhere
Roll instrumentation across the critical request path. Convert text logs to structured logging so they are queryable. Make sure trace context propagates across every hop, including async queues, because a broken propagation chain is the most common early failure. Stand up Grafana dashboards for the golden signals: latency, traffic, errors, saturation.
Days 46-75: Define SLOs and control cardinality
Turn the metrics you now collect into SLOs and error budgets so the data drives commitments, not just dashboards. At the same time, audit cardinality: find the labels exploding your series count, move high-cardinality context off metrics and into events and traces, and configure sampling and Collector-side processing so cost tracks signal. This is the phase where a sloppy rollout starts getting expensive; do the cost hygiene now.
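For the SLO half of this phase, the underlying arithmetic is small. The sketch below computes a request-based error budget from counts you would read out of your metrics; the `BudgetReport` type, the `error_budget` function, and the sample numbers are hypothetical, and real SLO tooling usually evaluates this over a rolling window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetReport:
    allowed_failures: float     # failures the SLO tolerates in the window
    consumed_fraction: float    # share of the budget already spent
    remaining_failures: float   # negative once the budget is blown


def error_budget(slo_target: float, total_requests: int,
                 failed_requests: int) -> BudgetReport:
    """Compute a request-based error budget for one window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed > 0 else 0.0
    return BudgetReport(allowed, consumed, allowed - failed_requests)


if __name__ == "__main__":
    # Example: a 99.9% availability SLO over 1,000,000 requests tolerates
    # roughly 1,000 failures; 400 failures spends about 40% of the budget.
    report = error_budget(0.999, 1_000_000, 400)
    print(f"allowed:   {report.allowed_failures:.0f}")
    print(f"consumed:  {report.consumed_fraction:.0%}")
    print(f"remaining: {report.remaining_failures:.0f}")
```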
Days 76-90: Connect observability to action
Wire the signal into alerting (tuned to avoid noise), incident response, and an action layer. This is where a platform like Nova AI Ops consumes the telemetry to correlate, diagnose, and auto-resolve routine incidents, so the observability investment converts into fewer pages and faster resolution rather than just prettier graphs. Document the before/after MTTR and page count to justify expanding coverage to remaining services.
The 10-point observability maturity checklist
Score yourself honestly. Each "yes" is a level of maturity; the gaps are your roadmap.
- Are all three pillars in place? Metrics, logs, and traces for your critical services, not just one or two.
- Is your instrumentation vendor-neutral? Built on OpenTelemetry, so you can switch backends without re-instrumenting.
- Are your logs structured? Queryable JSON with named fields, not grep-only free text.
- Does trace context propagate end to end? Across every service hop and async boundary, with no broken traces.
- Can you ask new questions without deploying? The true test of observability: high-cardinality data you can slice after the fact.
- Do you have SLOs derived from your telemetry? Error budgets that turn observability data into reliability commitments.
- Is cardinality under control? No unbounded labels on metrics; high-cardinality context lives in events and traces.
- Is your telemetry cost predictable? Sampling, Collector-side processing, and retention tiers in place, so cost tracks signal.
- Are the three pillars correlated? You can jump from a metric spike to the relevant traces and logs without manual stitching.
- Does the signal drive action? Observability feeds alerting, incident response, and ideally autonomous remediation, not just dashboards nobody watches.
Most teams sit around five or six of these. The gap between six and ten is where observability stops being a cost center and starts measurably preventing outages and cutting resolution time.
Related guides
This page covers foundational observability: metrics, logs, and traces for all systems. For the LLM and AI-agent-specific angle, monitoring prompts, token cost, hallucination rate, and model drift, see the sibling guide to AI observability, which applies these ideas to AI workloads rather than general infrastructure. Go deeper on what consumes the signal: AI SRE, Agentic SRE (the architecture), AIOps, AI incident response, incident management, root cause analysis, and self-healing infrastructure. On the operational metrics and practices the signal drives: MTTR, alert fatigue, on-call management, and DevOps automation. On the foundations: site reliability engineering, SLOs and error budgets, blameless postmortems, and chaos engineering. For teams shipping AI systems: the AI engineer's guide to production reliability and LLMOps. On turning telemetry into signal and capacity: anomaly detection, capacity planning, and eliminating toil. See the full platform on Nova features.
Turn your observability data into resolved incidents.
Nova AI Ops is the Multi-Agent OS for SRE & DevOps. It consumes your existing metrics, logs, and traces, correlates them across AWS, GCP, Azure, Linux, and Windows, finds root cause, and auto-resolves routine incidents within your policy envelope. Free tier available for small teams.