The Multi-Agent OS for SRE & DevOps

AI Observability: Seeing Inside Your LLM and Agent Apps in 2026

The classic three pillars of observability (metrics, logs, and traces) were built for deterministic systems. LLM and agent applications are not deterministic, and a 200 OK in 800ms can still be a completely wrong answer. This is the definitive 2026 guide to AI observability: what it is, the new signals that matter, why generic APM cannot debug an LLM app alone, the tooling landscape, how observability feeds incident response, and a 90-day rollout plan.

17 min read · Published May 2026 · By Dr. Samson Tanimawo, Nova AI Ops
[Figure: AI observability diagram showing LLM tracing, prompts, completions, token usage, tool calls, and eval scores feeding into detection and incident response]

What is AI observability?

AI observability is the practice of capturing, tracing, and evaluating the behavior of LLM and agent applications so you can debug, improve, and operate them in production. It extends the classic observability discipline with a set of AI-specific signals: the prompts and completions, the token usage, the latency of each step in a chain, the tool calls an agent makes, the context retrieved for a RAG query, and eval scores that judge whether the output was correct, grounded, and safe. The mission is unchanged from the day the term was coined: answer why the system behaved the way it did. Only the questions are new: they are now about model decisions, not only request paths.

The reason this is its own category, rather than a feature of your existing monitoring, is that LLM apps break a foundational assumption of the three pillars. Metrics, logs, and traces were designed for deterministic software: the same input produces the same output, and a failure shows up as an error, a timeout, or a saturated resource. An LLM can produce a different output for the same input, and its worst failures are silent. The service returns a clean 200, the latency is fine, no exception is logged, and the answer is confidently wrong. None of the classic pillars sees that, because nothing technically failed.

So AI observability is not a replacement for your APM. It is a layer on top of it that captures the model's reasoning and judges its quality. If you are building or operating LLM systems, this sits next to the broader practice of LLMOps, the operational discipline for shipping and running LLM apps. Observability is the part of LLMOps that answers "is it working, and if not, why."

Traditional observability vs the new AI signals

The three pillars (metrics, logs, traces) still apply to the infrastructure your model runs on. What changes is that a whole second layer of signal sits on top, capturing what the model actually did. The table below maps each classic question to its AI-era counterpart.

| Question | Traditional observability | AI observability adds |
|---|---|---|
| What happened? | Logs and traces of service calls | Prompt, completion, and tool calls per request |
| How slow was it? | Request latency, p50/p95/p99 | Latency per step in a chain or agent loop |
| What did it cost? | CPU, memory, infra spend | Token usage split by input and output |
| Was it correct? | Not captured (200 OK looks healthy) | Eval scores for correctness and groundedness |
| Did it hallucinate? | Invisible to APM | Hallucination and groundedness scoring |
| Which version regressed? | Deploy markers on a graph | Prompt and model version diff on the span |
| What did the agent retrieve? | Not modeled | Retrieved context attached to the trace |
| Is the infra healthy? | Metrics and traces (still the source of truth) | Correlated via OpenTelemetry, not replaced |

Read top-to-bottom, the pattern is clear: the infrastructure questions stay with classic observability; the model questions need new signals. The two are not competitors. The best setups correlate them so a single trace shows both the service path and the model's reasoning, stitched together by OpenTelemetry's GenAI semantic conventions.
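To make the correlation concrete, here is a minimal sketch of the attribute set one LLM span might carry, built as a plain dictionary so it runs without any SDK. The `gen_ai.*` keys follow the OpenTelemetry GenAI semantic conventions as we understand them at the time of writing; those conventions are still evolving, so check the current spec before relying on exact names. The `app.*` keys, the function name, and the sample values are hypothetical.

```python
from typing import Dict, Union

AttrValue = Union[str, int, float]


def llm_span_attributes(
    model: str,
    input_tokens: int,
    output_tokens: int,
    groundedness: float,
    prompt_version: str,
) -> Dict[str, AttrValue]:
    """Build attributes for one LLM call span.

    gen_ai.* keys: OpenTelemetry GenAI semantic conventions (assumed current).
    app.* keys: hypothetical, application-defined extensions.
    """
    return {
        "gen_ai.operation.name": "chat",
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        "app.eval.groundedness": groundedness,
        "app.prompt.version": prompt_version,
    }


attrs = llm_span_attributes("example-model", 1850, 240, 0.92, "support-v14")
total_tokens = attrs["gen_ai.usage.input_tokens"] + attrs["gen_ai.usage.output_tokens"]
print(total_tokens)  # 2090
```

Because a span like this would share a trace ID with the service spans around it, the model questions and the infrastructure questions can be answered from the same trace.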

The one-line distinction. Traditional observability answers deterministic questions (which service is slow, which request errored). AI observability answers probabilistic ones (was the answer correct, was it grounded, which prompt version regressed quality, how much did this conversation cost). Same discipline, new signals. You need both layers, wired together.

Why you cannot debug an LLM app with Datadog alone

Datadog, Grafana, New Relic, and the rest of the APM field are genuinely excellent at what they were built for: latency, error rates, resource saturation, and distributed traces of your service calls. If your problem is an infrastructure problem, these tools find it fast. The trouble is that an LLM failure is usually not an infrastructure failure.

Picture the failure mode. A user asks your support assistant a question. The retrieval step pulls the wrong documents. The model, working from bad context, produces a fluent, confident, completely incorrect answer. The HTTP response is a 200. The latency is 800ms, well inside SLO. No exception is thrown, no log line says "error," no metric moves. Every dashboard is green. And the answer was wrong in a way that, at scale, erodes user trust or creates real liability.

To catch that, you need four things attached to the span that generic APM does not capture by default:

1. The prompt

The full assembled prompt: system instructions, user input, few-shot examples, and any injected context. Without it you are debugging blind, because the model's behavior is a function of the prompt you cannot see.

2. The retrieved context

For RAG apps, the chunks the retriever returned. Most "the model hallucinated" bugs are actually "the retriever fetched the wrong context" bugs, and you cannot tell them apart without the retrieval payload on the trace.

3. The completion

The exact output the model produced, including tool calls and their arguments. For multi-step agents, every intermediate step, because a final wrong answer often traces back to a single bad step five hops earlier.

4. An eval score

A judgment of quality (correctness, groundedness, safety) attached to the span. This is the signal that turns "200 OK" into "200 OK but the groundedness score was 0.2," which is the line generic APM cannot draw.
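The four items above can be sketched as a single record per span, plus a check for the "green dashboard, wrong answer" case. This is an illustrative shape, not any vendor's schema: the class name, field names, the 2000ms SLO, and the sample data are all assumptions, and the 0.7 groundedness floor is just an example threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class LLMSpanRecord:
    """Hypothetical record of what one LLM span should carry."""
    http_status: int
    latency_ms: float
    prompt: str                           # full assembled prompt
    retrieved_context: List[str]          # chunks returned by the retriever
    completion: str                       # exact model output
    groundedness: Optional[float] = None  # eval score in [0, 1], if scored


def is_silent_failure(span: LLMSpanRecord, slo_ms: float = 2000.0,
                      min_groundedness: float = 0.7) -> bool:
    """True when infra signals look healthy but the eval score is poor."""
    infra_healthy = span.http_status == 200 and span.latency_ms <= slo_ms
    quality_bad = span.groundedness is not None and span.groundedness < min_groundedness
    return infra_healthy and quality_bad


span = LLMSpanRecord(
    http_status=200,
    latency_ms=800.0,
    prompt="System: answer from the context only.\nUser: What is the refund window?",
    retrieved_context=["Shipping policy: orders ship within 2 business days."],
    completion="Refunds are accepted within 90 days.",
    groundedness=0.2,
)
print(is_silent_failure(span))  # True
```

Note that the retrieved context in the sample is about shipping, not refunds: with the retrieval payload on the record, the bad answer is visibly a retriever problem rather than a model problem.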

The fix is not to rip out Datadog. The fix is to layer LLM-native tracing and evals on top of it, and correlate the two through OpenTelemetry so one trace tells the whole story. Teams that try to force-fit an LLM app into pure infrastructure monitoring end up flying blind on exactly the failures that matter most. If you operate the systems that run on this telemetry, see how it fits into AI SRE and the broader reliability practice.


The 2026 AI observability landscape

The market splits into three layers. Most production teams run one tool from the tracing-and-eval layer, optionally a gateway, and their existing APM underneath, all stitched together with OpenTelemetry. Each layer below lists its strengths and tradeoffs so you can choose within it.

Layer 1: LLM tracing and eval platforms

These are the heart of AI observability: capture prompts, completions, and tool calls; trace multi-step chains and agent loops; run evals offline and online. Examples: LangSmith (managed, deep eval tooling, tight LangChain integration but framework-agnostic via its SDK), Arize Phoenix (open-source, OpenTelemetry-native, strong for self-hosting and trace analysis), and Langfuse (open-source, self-hostable, good prompt management and cost tracking). The strength is they speak the LLM domain natively; the tradeoff is you still need an infra APM underneath for the deterministic layer.

Layer 2: Gateways and cost observability

Proxy-based tools that sit in front of your model calls and capture usage, latency, and cost without code changes deep in the app. Examples: Helicone (one-line proxy, request logging, cost analytics, caching) and Portkey (gateway with routing, fallbacks, and observability). The strength is near-zero integration effort and excellent cost visibility; the tradeoff is the proxy sees the request boundary, not always the full internal step graph of a complex agent, so pair it with Layer 1 for deep tracing.

Layer 3: The classic APM underneath

Your existing infrastructure observability: Datadog, Grafana (with Tempo/Loki/Prometheus), and the OpenTelemetry collector that increasingly glues all three layers together. This layer does not go away. It is the source of truth for the deterministic part of the system, and OpenTelemetry's GenAI semantic conventions are what let an LLM span from Layer 1 correlate with a service trace here.

The right pick depends on your priorities: open-source self-hosting (Phoenix, Langfuse), managed eval depth (LangSmith), or proxy-based cost control (Helicone, Portkey). Whatever you choose, standardize on OpenTelemetry so the layers correlate instead of living in three disconnected dashboards. For the full operational picture this observability stack feeds, see our guide to the Nova platform features.

How AI observability feeds incident response

Observability that only renders dashboards is half a system. The other half is what happens when a signal goes bad. AI observability is the detection and diagnosis layer; incident response is everything after.

The connection works in three moves. First, an eval threshold becomes an alert. Groundedness drops below 0.7 on sampled traffic, hallucination rate doubles week-over-week, or cost per request jumps after a deploy. Each of those is a numeric signal you can wire to alerting exactly like a latency SLO. Second, the trace becomes the diagnosis evidence. Because the prompt, retrieved context, completion, and eval score are all attached to the span that fired the alert, the on-call engineer (or an agent) opens the alert and the smoking gun is right there. There is no reproduction step and no "can you get me the prompt": the forensic detail is already captured.
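The three example signals in the first move can each be reduced to a small numeric check. The sketch below is one way to express them, assuming you already have sampled scores and rates from your eval pipeline; the function names and the 25% cost-increase limit are hypothetical, while the 0.7 groundedness floor and the "doubles week-over-week" rule come from the text above.

```python
from typing import List


def groundedness_alert(scores: List[float], threshold: float = 0.7) -> bool:
    """Fire when mean groundedness on sampled traffic drops below threshold."""
    if not scores:
        return False  # no samples, no signal
    return sum(scores) / len(scores) < threshold


def hallucination_alert(rate_this_week: float, rate_last_week: float) -> bool:
    """Fire when the hallucination rate has at least doubled week-over-week."""
    if rate_last_week <= 0:
        return rate_this_week > 0
    return rate_this_week >= 2 * rate_last_week


def cost_alert(cost_before: float, cost_after: float,
               max_increase: float = 0.25) -> bool:
    """Fire when cost per request rises more than max_increase after a deploy."""
    if cost_before <= 0:
        return False
    return (cost_after - cost_before) / cost_before > max_increase


print(groundedness_alert([0.9, 0.5, 0.6]))  # True
print(hallucination_alert(0.05, 0.02))      # True
print(cost_alert(0.010, 0.014))             # True
```

In practice you would likely add a minimum sample count and a time window so a handful of bad responses does not page anyone, but the shape is the same as a latency SLO check.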

Third, the diagnosis drives a remediation. A quality regression usually traces back to one of a small set of causes: a prompt change, a model version bump, a retriever config change, or a misbehaving tool. An agentic platform can correlate the regression to the offending change and remediate inside a policy envelope: roll back the prompt to the last known-good version, switch the model route, or disable a flaky tool, with every action recorded in an audit ledger. This is exactly the loop described in our guide to AI incident response across the incident lifecycle: detect from the eval signal, diagnose from the trace, remediate within policy, audit the action.
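As a rough sketch of that third move, the code below picks the most recent change deployed before the regression, maps it to an allowed action, and appends the decision to a ledger. It is a deliberately simple stand-in, not how Nova or any other platform implements it: the `Change` type, the `POLICY` table, and the "most recent change is the suspect" heuristic are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Change:
    kind: str           # "prompt", "model", "retriever", or "tool"
    deployed_at: float  # seconds since epoch
    target: str


# Hypothetical policy envelope: change kinds that may be auto-remediated, and how.
POLICY = {
    "prompt": "rollback_prompt",
    "model": "switch_model_route",
    "tool": "disable_tool",
}


def suspect_change(changes: List[Change], regression_at: float) -> Optional[Change]:
    """Pick the most recent change deployed before the regression began."""
    earlier = [c for c in changes if c.deployed_at <= regression_at]
    return max(earlier, key=lambda c: c.deployed_at) if earlier else None


def remediate(changes: List[Change], regression_at: float, ledger: List[dict]) -> str:
    """Choose an action within policy and append it to the audit ledger."""
    change = suspect_change(changes, regression_at)
    if change is None:
        action = "escalate_to_human"
    else:
        action = POLICY.get(change.kind, "escalate_to_human")
    ledger.append({
        "action": action,
        "target": change.target if change else None,
        "regression_at": regression_at,
    })
    return action


audit_ledger: List[dict] = []
recent = [Change("model", 100.0, "route-a"), Change("prompt", 200.0, "support-v14")]
print(remediate(recent, 250.0, audit_ledger))  # rollback_prompt
```

Anything outside the policy table (here, a retriever config change) falls through to a human, and every decision lands in the ledger whether or not an automatic action was taken.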

The takeaway: do not treat AI observability as a separate project from reliability. Wire eval thresholds to your alerting from day one, and make sure the trace that fires the alert carries the evidence the responder needs. Observability without that link is a museum of dashboards nobody acts on.

The 10-point AI observability checklist

Use this to audit your current setup or evaluate a tool. A mature AI observability practice answers all 10 concretely; gaps tell you exactly where you are flying blind.

  1. Do you capture full prompts and completions? Not a redacted summary, the actual assembled prompt and the actual output, attached to every LLM span.
  2. Do you trace every step of multi-step chains and agent loops? A final wrong answer often traces to a single bad intermediate step; you need each one as its own span.
  3. Do you account for token usage and cost per request? Split by input and output tokens, rolled up per conversation and per feature, not just a monthly bill.
  4. Do you version and diff prompts? When quality regresses, can you see exactly which prompt version shipped and what changed from the last good one?
  5. Do you run offline evals in CI before deploy? A golden dataset scored on every change, so a quality regression is caught before it reaches users.
  6. Do you run online evals on live traffic? Sampled scoring of production responses, because offline sets never cover the full distribution of real inputs.
  7. Do you detect hallucination and groundedness regressions? A specific score for "is this answer supported by the retrieved context," trended over time.
  8. Do AI spans correlate with infrastructure traces? Via OpenTelemetry, so one trace shows both the model reasoning and the service path it ran on.
  9. Are eval thresholds wired to alerting? A groundedness drop or cost spike should page or notify, exactly like a latency SLO breach.
  10. Do alerts connect to a remediation and audit path? The trace carries the evidence, and there is a defined way to roll back the prompt, switch the model, or disable a tool, all logged.
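Checklist item 4 (version and diff prompts) is one of the cheapest to start on, because prompts are text and the standard library already diffs text. The sketch below uses Python's `difflib.unified_diff`; the prompt contents and version labels are made up for the example.

```python
import difflib
from typing import List


def diff_prompts(old: str, new: str, old_label: str, new_label: str) -> List[str]:
    """Return a unified diff between two prompt versions, line by line."""
    return list(difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    ))


last_good = "You are a support assistant.\nAnswer only from the provided context."
shipped = "You are a support assistant.\nAnswer helpfully and completely."

for line in diff_prompts(last_good, shipped, "support-v13", "support-v14"):
    print(line)
```

In this example the diff shows the grounding instruction being dropped, which is the kind of change that plausibly explains a groundedness regression and is hard to spot without a stored last-good version.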

The economics: tooling cost vs token savings

AI observability is usually pitched as a debugging tool. The stronger financial case is that it pays for itself in token savings before it pays for itself in saved engineering time.

Cost line 1: tooling. Open-source tracing (Arize Phoenix or Langfuse, self-hosted) is near-free in license but costs engineering time to run and scale. Managed platforms run from roughly $0 on a free tier up to a few thousand dollars per month at startup volume, scaling with trace count and retention. For most teams the tooling spend is a rounding error next to the model bill.

Cost line 2 (the big one): the token spend observability lets you control. Once you have cost-per-request tracing, the waste becomes visible: a prompt that grew to 12K tokens of context when 2K would do, a verbose system prompt copied into every call, a simple classification task being routed to the most expensive model. Teams that add this visibility routinely cut LLM spend 20-40% in the first quarter just by trimming oversized context, caching repeated calls, and routing simple tasks to cheaper models. On a workload spending $30K-$150K a year on tokens, that is $6K-$60K back, which dwarfs the observability tooling cost.
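The arithmetic behind those figures is easy to reproduce. The sketch below computes cost per request split by input and output tokens, then the annual savings range quoted above. The per-token prices are hypothetical placeholders, not any provider's real rates, and the function names are ours.

```python
from typing import Tuple


def request_cost(input_tokens: int, output_tokens: int,
                 usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Cost of one request in USD, split by input and output token price."""
    return (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000


def savings_range(annual_low: int, annual_high: int,
                  cut_low_pct: int, cut_high_pct: int) -> Tuple[int, int]:
    """Annual savings in USD for a spend range and a percentage cut range."""
    return (annual_low * cut_low_pct // 100, annual_high * cut_high_pct // 100)


# Hypothetical prices: $3 per 1M input tokens, $15 per 1M output tokens.
bloated = request_cost(12_000, 300, 3.0, 15.0)  # 12K tokens of context
trimmed = request_cost(2_000, 300, 3.0, 15.0)   # 2K tokens of context
print(round(bloated, 4), round(trimmed, 4))     # 0.0405 0.0105

print(savings_range(30_000, 150_000, 20, 40))   # (6000, 60000)
```

The per-request figures apply only to the one oversized call in this example, not to a whole workload; the 20-40% range is the workload-level number from the text.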

The honest framing: buy AI observability for the debugging, but make the budget case on the token savings. The cost graph is the line item a finance reviewer signs off on without a second meeting; the "we can debug faster" argument is harder to quantify.

A 90-day AI observability rollout plan

This is a tested phasing: it ships value in week one and matures the quality and alerting layers over the quarter.

Days 1-14: tracing and cost capture

Wrap your LLM client with a tracing SDK or route calls through a gateway. Capture prompts, completions, tool calls, latency per step, and token cost per request. No evals yet. Goal: get full forensic visibility so the next "the bot said something weird" report is a two-minute trace lookup instead of a guessing game. Time-to-value: one to two days for tracing, the rest of the window to instrument every call path.
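"Wrap your LLM client" can be as small as a decorator. The sketch below records prompt, completion, latency, and token counts for each call into an in-memory list. Everything here is a stand-in: `TRACES` takes the place of a real tracing backend, `fake_llm` takes the place of a real model client, and word counts are used as a crude proxy for token counts.

```python
import time
from functools import wraps
from typing import Callable, Dict, List

TRACES: List[Dict] = []  # stand-in for a tracing backend


def traced(step: str) -> Callable:
    """Decorator that records prompt, completion, latency, and tokens per call."""
    def decorator(fn: Callable[[str], Dict]) -> Callable[[str], Dict]:
        @wraps(fn)
        def wrapper(prompt: str) -> Dict:
            start = time.perf_counter()
            response = fn(prompt)
            TRACES.append({
                "step": step,
                "prompt": prompt,
                "completion": response["text"],
                "input_tokens": response["input_tokens"],
                "output_tokens": response["output_tokens"],
                "latency_ms": (time.perf_counter() - start) * 1000,
            })
            return response
        return wrapper
    return decorator


@traced("answer")
def fake_llm(prompt: str) -> Dict:
    """Hypothetical client: a real one would call a model API here."""
    text = "stub answer"
    return {"text": text, "input_tokens": len(prompt.split()),
            "output_tokens": len(text.split())}


fake_llm("What is the refund window?")
print(TRACES[0]["step"], TRACES[0]["input_tokens"])  # answer 5
```

A tracing SDK or gateway gives you the same capture with less code and a real backend; the point of the sketch is only to show how little has to change at the call site.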

Days 15-45: build a golden set and offline evals

Assemble a representative dataset of inputs with expected behavior, then write scoring criteria for correctness, groundedness, and safety. Wire these evals into CI so every prompt or model change is scored before it ships. This is the phase that takes real work; a useful golden set is two to four weeks of curation, not an afternoon.
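A minimal version of the CI gate might look like the sketch below: a golden set, a scoring function, and a pass-rate threshold. Substring matching is a deliberately crude stand-in for real correctness and groundedness scoring, and the dataset, the 0.9 threshold, and the canned `candidate_app` are all hypothetical.

```python
from typing import Callable, Dict, List

GOLDEN_SET: List[Dict[str, str]] = [
    {"input": "refund window", "must_contain": "30 days"},
    {"input": "shipping time", "must_contain": "2 business days"},
    {"input": "support hours", "must_contain": "9am"},
]


def pass_rate(app: Callable[[str], str], golden: List[Dict[str, str]]) -> float:
    """Fraction of golden cases whose output contains the expected phrase."""
    passed = sum(1 for case in golden
                 if case["must_contain"].lower() in app(case["input"]).lower())
    return passed / len(golden)


def ci_gate(app: Callable[[str], str], golden: List[Dict[str, str]],
            min_pass_rate: float = 0.9) -> bool:
    """Return True if the change may ship; a CI job would fail on False."""
    return pass_rate(app, golden) >= min_pass_rate


def candidate_app(question: str) -> str:
    """Hypothetical app under test with canned answers."""
    answers = {
        "refund window": "Refunds are accepted within 30 days.",
        "shipping time": "Orders ship within 2 business days.",
        "support hours": "Support is open around the clock.",  # regression
    }
    return answers[question]


print(round(pass_rate(candidate_app, GOLDEN_SET), 2))  # 0.67
print(ci_gate(candidate_app, GOLDEN_SET))              # False
```

The harness itself is an afternoon of work; the two to four weeks go into curating cases and scoring criteria that actually reflect your users' inputs.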

Days 46-75: online evals and dashboards

Turn on sampled scoring of live production traffic, because offline sets never cover the full distribution of real inputs. Build the trend dashboards: eval pass rate, hallucination rate, token cost per request, latency per step. By the end of this phase you should be able to see quality and cost trends day-over-day, not just react to complaints.
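Sampling live traffic for scoring is simplest when the decision is deterministic per trace, so the same trace is always either in or out of the sample. One way to sketch that is to hash the trace ID into a bucket between 0 and 1; the function name, the trace ID format, and the 10% rate are assumptions for the example.

```python
import hashlib


def should_score(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly sample_rate of traces for scoring."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


# Count how many of 10,000 synthetic trace IDs fall into a 10% sample.
sampled = sum(should_score(f"trace-{i}", 0.10) for i in range(10_000))
print(sampled)
```

Hash-based sampling keeps eval cost proportional to the sample rate and lets you re-derive which traces were scored without storing the decision.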

Days 76-90: alerting and the remediation link

Wire eval thresholds to alerting (groundedness drop, hallucination spike, cost jump), and define the remediation path: how to roll back a prompt, switch a model route, or disable a tool, with the action logged. This is the step that turns observability from a dashboard into a reliability system, and it is where AI observability meets AI incident response.

Skipping straight to alerting without the tracing and eval foundation produces noisy, low-trust alerts nobody acts on. The phasing exists so each alert, when it finally fires, carries the evidence and the score behind it.

Frequently asked questions

What is AI observability?
AI observability is the practice of capturing, tracing, and evaluating the behavior of LLM and agent applications so you can debug, improve, and operate them in production. It extends classic observability (metrics, logs, traces) with AI-specific signals: prompts, completions, token usage, latency per step, tool calls, retrieval context, and eval scores for correctness, groundedness, and hallucination. The goal is the same as ever (answer why the system behaved the way it did), but the questions are now about model decisions, not just request paths.
How is AI observability different from traditional observability?
Traditional observability answers deterministic questions: which service is slow, which request threw a 500, where did the request spend its time. AI observability has to answer probabilistic ones: was the answer correct, was it grounded in retrieved context, which prompt version regressed quality, how much did this conversation cost in tokens. The classic three pillars (metrics, logs, traces) still apply to the infrastructure, but on top of them you need prompt and completion capture, per-step LLM tracing, token and cost accounting, and offline plus online evals. Same discipline, new signals.
Why can't I debug an LLM app with Datadog alone?
Datadog and similar APM tools are excellent at the infrastructure layer: latency, error rates, resource usage, and distributed traces of your service calls. But an LLM failure is usually not an infrastructure failure. The service returned a 200 OK in 800ms; it just returned a wrong, ungrounded, or unsafe answer. To see that, you need the prompt, the retrieved context, the completion, and an eval score attached to the span, which generic APM does not capture by default. The fix is layering LLM-native tracing and evals on top of your existing APM, not replacing it.
What are the best AI observability tools in 2026?
The 2026 landscape splits into LLM tracing and eval platforms (LangSmith, Arize Phoenix, Langfuse), gateway and cost-observability tools (Helicone, Portkey), and the classic APM layer underneath (Datadog, Grafana, OpenTelemetry). Most production teams run one tracing/eval tool plus their existing APM, wired together with OpenTelemetry GenAI semantic conventions so spans correlate. The right pick depends on whether you prioritize open-source self-hosting (Phoenix, Langfuse), managed eval depth (LangSmith), or proxy-based cost control (Helicone, Portkey).
What new signals does AI observability capture?
Seven that classic observability does not: the full prompt (system, user, and few-shot context), the completion or tool call the model produced, token usage split by input and output, latency per step in a multi-step chain or agent loop, the tool calls and their arguments and results, the retrieved context for RAG apps, and eval scores for correctness, groundedness, hallucination, and safety. These attach to spans so you can replay any single LLM call end to end.
How does AI observability connect to incident response?
Observability is the detection and diagnosis layer; incident response is what happens after. An eval score crossing a threshold (groundedness drops, hallucination rate spikes, cost per request jumps) becomes an alert. The trace that triggered it becomes the diagnosis evidence: the exact prompt, context, and completion are right there. From there an agentic platform can correlate the regression to a prompt change or model version, then remediate by rolling back the prompt, switching the model route, or disabling a tool, all within a policy envelope and recorded in an audit ledger.
What should an AI observability checklist include?
A 10-point checklist: (1) capture full prompts and completions, (2) trace every step of multi-step chains and agent loops, (3) account for token usage and cost per request, (4) version and diff prompts, (5) run offline evals in CI before deploy, (6) run online evals on live traffic, (7) detect hallucination and groundedness regressions, (8) correlate AI spans with infrastructure traces via OpenTelemetry, (9) wire eval thresholds to alerting, and (10) connect alerts to a remediation and audit path.
What is the cost of AI observability?
Two cost lines. Tooling: open-source tracing (Phoenix, Langfuse self-hosted) is near-free in license but costs engineering time to run, while managed platforms run roughly $0 to a few thousand dollars per month at startup volume, scaling with trace count. The bigger line is the token cost AI observability lets you control: teams that add cost-per-request tracing routinely cut LLM spend 20-40% by catching runaway prompts, oversized context, and the wrong model handling simple tasks. The observability spend pays for itself in token savings well before it pays for itself in debugging time.
Do I need evals or is tracing enough?
Tracing tells you what happened; evals tell you whether it was good. Tracing alone lets you replay a bad answer after a user complains, which is reactive. Evals (scoring correctness, groundedness, and safety, either offline in CI or online on sampled live traffic) are what let you catch a quality regression before users do. Production AI observability needs both: tracing for the forensic detail and evals for the quality signal that drives alerts.
How long does it take to set up AI observability?
Basic tracing is fast: wrapping your LLM client with a tracing SDK or routing through a gateway is usually a day or two. Useful evals take longer because you need a representative dataset and scoring criteria, typically two to four weeks to build a golden set and wire offline evals into CI. Rolling out full online evals with alerting and a remediation path is a quarter-long effort. Most teams ship tracing in week one and mature the eval and alerting layers over the following 90 days.
What metrics should I track for AI observability?
Five that matter: eval pass rate (correctness and groundedness over time), hallucination rate on sampled traffic, token cost per request and per conversation, latency per step in agent and chain workflows, and the rate of prompt or model regressions caught before release versus in production. Skip vanity metrics like raw trace volume, which measure activity, not quality.

Go deeper into the LLM operations stack: LLMOps and the full lifecycle of shipping and running LLM apps; AI incident response across the incident lifecycle; AI SRE and the broader reliability practice; and the AI engineer's guide to production reliability for teams shipping AI systems. For the platform that ties observability to remediation, see the Nova features.


Nova AI Ops is the Multi-Agent OS for SRE & DevOps. 100 specialized AI agents across 12 teams that turn observability signals into detection, diagnosis, and auto-remediation across AWS, GCP, Azure, Linux, and Windows. Free tier available for small teams.