What is AI observability?
AI observability is the practice of capturing, tracing, and evaluating the behavior of LLM and agent applications so you can debug, improve, and operate them in production. It extends the classic observability discipline with a set of AI-specific signals: prompts and completions, token usage, the latency of each step in a chain, the tool calls an agent makes, the context retrieved for a RAG query, and eval scores that judge whether the output was correct, grounded, and safe. The mission is unchanged since the term observability was coined: answer why the system behaved the way it did. Only the questions are new: they are now about model decisions, not only request paths.
The reason this is its own category, rather than a feature of your existing monitoring, is that LLM apps break a foundational assumption of the three pillars. Metrics, logs, and traces were designed for deterministic software: the same input produces the same output, and a failure shows up as an error, a timeout, or a saturated resource. An LLM can produce a different output for the same input, and its worst failures are silent. The service returns a clean 200, the latency is fine, no exception is logged, and the answer is confidently wrong. None of the classic pillars sees that, because nothing technically failed.
So AI observability is not a replacement for your APM. It is a layer on top of it that captures the model's reasoning and judges its quality. If you are building or operating LLM systems, this sits next to the broader practice of LLMOps, the operational discipline for shipping and running LLM apps. Observability is the part of LLMOps that answers "is it working, and if not, why."
Traditional observability vs the new AI signals
The three pillars (metrics, logs, traces) still apply to the infrastructure your model runs on. What changes is that a second layer of signal sits on top, capturing what the model actually did. The table below maps each classic question to its AI-era counterpart.
| Question | Traditional observability | AI observability adds |
|---|---|---|
| What happened? | Logs and traces of service calls | Prompt, completion, and tool calls per request |
| How slow was it? | Request latency, p50/p95/p99 | Latency per step in a chain or agent loop |
| What did it cost? | CPU, memory, infra spend | Token usage split by input and output |
| Was it correct? | Not captured (200 OK looks healthy) | Eval scores for correctness and groundedness |
| Did it hallucinate? | Invisible to APM | Hallucination and groundedness scoring |
| Which version regressed? | Deploy markers on a graph | Prompt and model version diff on the span |
| What did the agent retrieve? | Not modeled | Retrieved context attached to the trace |
| Is the infra healthy? | Metrics and traces (still the source of truth) | Correlated via OpenTelemetry, not replaced |
Read top-to-bottom, the pattern is clear: the infrastructure questions stay with classic observability; the model questions need new signals. The two are not competitors. The best setups correlate them so a single trace shows both the service path and the model's reasoning, stitched together by OpenTelemetry's GenAI semantic conventions.
The one-line distinction. Traditional observability answers deterministic questions (which service is slow, which request errored). AI observability answers probabilistic ones (was the answer correct, was it grounded, which prompt version regressed quality, how much did this conversation cost). Same discipline, new signals. You need both layers, wired together.
Why you cannot debug an LLM app with Datadog alone
Datadog, Grafana, New Relic, and the rest of the APM field are genuinely excellent at what they were built for: latency, error rates, resource saturation, and distributed traces of your service calls. If your problem is an infrastructure problem, these tools find it fast. The trouble is that an LLM failure is usually not an infrastructure failure.
Picture the failure mode. A user asks your support assistant a question. The retrieval step pulls the wrong documents. The model, working from bad context, produces a fluent, confident, completely incorrect answer. The HTTP response is a 200. The latency is 800 ms, well inside SLO. No exception is thrown, no log line says "error," no metric moves. Every dashboard is green. And the answer was wrong in a way that, at scale, erodes user trust or creates real liability.
To catch that, you need four things attached to the span that generic APM does not capture by default:
1. The prompt. The full assembled prompt: system instructions, user input, few-shot examples, and any injected context. Without it you are debugging blind, because the model's behavior is a function of the prompt you cannot see.
2. The retrieved context. For RAG apps, the chunks the retriever returned. Most "the model hallucinated" bugs are actually "the retriever fetched the wrong context" bugs, and you cannot tell them apart without the retrieval payload on the trace.
3. The completion. The exact output the model produced, including tool calls and their arguments. For multi-step agents, every intermediate step, because a final wrong answer often traces back to a single bad step five hops earlier.
4. An eval score. A judgment of quality (correctness, groundedness, safety) attached to the span. This is the signal that turns "200 OK" into "200 OK but the groundedness score was 0.2," which is the line generic APM cannot draw.
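As a rough illustration of those four attachments, here is a minimal sketch of a span record in plain Python. The `LLMSpan` class and its field names are hypothetical, not the schema of any particular tracing library or of the OpenTelemetry conventions; the 0.7 threshold is just an example value.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LLMSpan:
    """Hypothetical span record; field names are illustrative, not a standard."""
    trace_id: str
    prompt: str                       # full assembled prompt
    retrieved_context: list           # chunks returned by the retriever (RAG)
    completion: str                   # exact model output
    tool_calls: list = field(default_factory=list)
    groundedness: Optional[float] = None  # eval score in [0, 1], set by an evaluator
    http_status: int = 200

    def looks_healthy_to_apm(self) -> bool:
        # Generic APM sees only the transport outcome (latency omitted here).
        return self.http_status < 400

    def is_quality_failure(self, threshold: float = 0.7) -> bool:
        # A silent failure: the request succeeded but the eval score is low.
        return self.groundedness is not None and self.groundedness < threshold


span = LLMSpan(
    trace_id="trace-001",
    prompt="System: answer from the docs.\nUser: How do I reset my password?",
    retrieved_context=["Billing FAQ: invoices are issued monthly."],
    completion="Go to Billing and click Reset.",
    groundedness=0.2,
)
print(span.looks_healthy_to_apm(), span.is_quality_failure())  # True True
```

The same request is healthy by the transport check and a failure by the eval check, which is the gap the four attachments are meant to close.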
The fix is not to rip out Datadog. The fix is to layer LLM-native tracing and evals on top of it, and correlate the two through OpenTelemetry so one trace tells the whole story. Teams that try to force-fit an LLM app into pure infrastructure monitoring end up flying blind on exactly the failures that matter most. If you operate the systems that run on this telemetry, see how it fits into AI SRE and the broader reliability practice.
The 2026 AI observability landscape
The market splits into three layers. Most production teams run one tool from the tracing-and-eval layer, optionally a gateway, and their existing APM underneath, all stitched together with OpenTelemetry. The descriptions below cover each layer's strengths and tradeoffs so you can choose within it.
Layer 1: LLM tracing and eval platforms
These are the heart of AI observability: capture prompts, completions, and tool calls; trace multi-step chains and agent loops; run evals offline and online. Examples: LangSmith (managed, deep eval tooling, tight LangChain integration but framework-agnostic via its SDK), Arize Phoenix (open-source, OpenTelemetry-native, strong for self-hosting and trace analysis), and Langfuse (open-source, self-hostable, good prompt management and cost tracking). The strength is they speak the LLM domain natively; the tradeoff is you still need an infra APM underneath for the deterministic layer.
Layer 2: Gateways and cost observability
Proxy-based tools that sit in front of your model calls and capture usage, latency, and cost without code changes deep in the app. Examples: Helicone (one-line proxy, request logging, cost analytics, caching) and Portkey (gateway with routing, fallbacks, and observability). The strength is near-zero integration effort and excellent cost visibility; the tradeoff is the proxy sees the request boundary, not always the full internal step graph of a complex agent, so pair it with Layer 1 for deep tracing.
Layer 3: The classic APM underneath
Your existing infrastructure observability: Datadog, Grafana (with Tempo/Loki/Prometheus), and the OpenTelemetry collector that increasingly glues all three layers together. This layer does not go away. It is the source of truth for the deterministic part of the system, and OpenTelemetry's GenAI semantic conventions are what let an LLM span from Layer 1 correlate with a service trace here.
The right pick depends on your priorities: open-source self-hosting (Phoenix, Langfuse), managed eval depth (LangSmith), or proxy-based cost control (Helicone, Portkey). Whatever you choose, standardize on OpenTelemetry so the layers correlate instead of living in three disconnected dashboards. For the full operational picture this observability stack feeds, see our guide to the Nova platform features.
How observability connects to detection and remediation
Observability that only renders dashboards is half a system. The other half is what happens when a signal goes bad. AI observability is the detection and diagnosis layer; incident response is everything after.
The connection works in three moves. First, an eval threshold becomes an alert. Groundedness drops below 0.7 on sampled traffic, hallucination rate doubles week-over-week, or cost per request jumps after a deploy. Each of those is a numeric signal you can wire to alerting exactly like a latency SLO. Second, the trace becomes the diagnosis evidence. Because the prompt, retrieved context, completion, and eval score are all attached to the span that fired the alert, the on-call engineer (or an agent) opens the alert and the smoking gun is right there. There is no reproduction step and no "can you get me the prompt" request: the forensic detail is already captured.
Third, the diagnosis drives a remediation. A quality regression usually correlates to a small set of causes: a prompt change, a model version bump, a retriever config change, or a misbehaving tool. An agentic platform can correlate the regression to the offending change and remediate inside a policy envelope: roll back the prompt to the last known-good version, switch the model route, or disable a flaky tool, with every action recorded in an audit ledger. This is exactly the loop described in our guide to AI incident response across the incident lifecycle: detect from the eval signal, diagnose from the trace, remediate within policy, audit the action.
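A remediation step with a policy envelope and an audit ledger could be sketched roughly as follows. Everything here is hypothetical: the cause and action names, the `Remediator` class, and the choice to escalate any cause without a mapped action (such as a retriever config change) are illustrative assumptions, not a description of a specific platform.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical mapping from a diagnosed cause to a remediation action.
REMEDIATIONS = {
    "prompt_change": "rollback_prompt",
    "model_version_bump": "switch_model_route",
    "misbehaving_tool": "disable_tool",
}


@dataclass
class Remediator:
    allowed_actions: set                      # the policy envelope
    audit_ledger: list = field(default_factory=list)

    def remediate(self, cause: str, trace_id: str) -> str:
        action = REMEDIATIONS.get(cause)
        if action is None:
            outcome = "escalated_unknown_cause"
        elif action not in self.allowed_actions:
            outcome = "escalated_outside_policy"
        else:
            # A real system would call its rollback, routing, or tool API here.
            outcome = "executed"
        # Every decision is recorded, including the ones that were not executed.
        self.audit_ledger.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "trace_id": trace_id,
            "cause": cause,
            "action": action,
            "outcome": outcome,
        })
        return outcome


remediator = Remediator(allowed_actions={"rollback_prompt", "disable_tool"})
print(remediator.remediate("prompt_change", "trace-001"))            # executed
print(remediator.remediate("model_version_bump", "trace-002"))       # escalated_outside_policy
print(remediator.remediate("retriever_config_change", "trace-003"))  # escalated_unknown_cause
```

Recording escalations as well as executed actions keeps the ledger useful for review: it shows what the automation declined to do, not only what it did.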
The takeaway: do not treat AI observability as a separate project from reliability. Wire eval thresholds to your alerting from day one, and make sure the trace that fires the alert carries the evidence the responder needs. Observability without that link is a museum of dashboards nobody acts on.
The 10-point AI observability checklist
Use this to audit your current setup or evaluate a tool. A mature AI observability practice answers all 10 concretely; gaps tell you exactly where you are flying blind.
- Do you capture full prompts and completions? Not a redacted summary, the actual assembled prompt and the actual output, attached to every LLM span.
- Do you trace every step of multi-step chains and agent loops? A final wrong answer often traces to a single bad intermediate step; you need each one as its own span.
- Do you account for token usage and cost per request? Split by input and output tokens, rolled up per conversation and per feature, not just a monthly bill.
- Do you version and diff prompts? When quality regresses, can you see exactly which prompt version shipped and what changed from the last good one?
- Do you run offline evals in CI before deploy? A golden dataset scored on every change, so a quality regression is caught before it reaches users.
- Do you run online evals on live traffic? Sampled scoring of production responses, because offline sets never cover the full distribution of real inputs.
- Do you detect hallucination and groundedness regressions? A specific score for "is this answer supported by the retrieved context," trended over time.
- Do AI spans correlate with infrastructure traces? Via OpenTelemetry, so one trace shows both the model reasoning and the service path it ran on.
- Are eval thresholds wired to alerting? A groundedness drop or cost spike should page or notify, exactly like a latency SLO breach.
- Do alerts connect to a remediation and audit path? The trace carries the evidence, and there is a defined way to roll back the prompt, switch the model, or disable a tool, all logged.
The economics: tooling cost vs token savings
AI observability is usually pitched as a debugging tool. The stronger financial case is that it pays for itself in token savings before it pays for itself in saved engineering time.
Cost line 1: tooling. Open-source tracing (Arize Phoenix or Langfuse, self-hosted) is near-free in license but costs engineering time to run and scale. Managed platforms run from roughly $0 on a free tier up to a few thousand dollars per month at startup volume, scaling with trace count and retention. For most teams the tooling spend is a rounding error next to the model bill.
Cost line 2 (the big one): the token spend observability lets you control. Once you have cost-per-request tracing, the waste becomes visible: a prompt that grew to 12K tokens of context when 2K would do, a verbose system prompt copied into every call, a simple classification task being routed to the most expensive model. Teams that add this visibility routinely cut LLM spend 20-40% in the first quarter just by trimming oversized context, caching repeated calls, and routing simple tasks to cheaper models. On a workload spending $30K-$150K a year on tokens, that is $6K-$60K back, which dwarfs the observability tooling cost.
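The arithmetic behind those figures can be checked with a short sketch. The per-token prices below are hypothetical placeholders (real prices vary by model and provider), and both function names are made up for illustration; only the spend range and the 20-40% cut come from the text above.

```python
def request_cost_usd(input_tokens, output_tokens, usd_per_m_input, usd_per_m_output):
    """Cost of one request, split by input and output tokens."""
    input_cost = input_tokens * usd_per_m_input / 1_000_000
    output_cost = output_tokens * usd_per_m_output / 1_000_000
    return input_cost, output_cost


def annual_savings_range(spend_low, spend_high, cut_pct_low=20, cut_pct_high=40):
    """Low end: smallest spend at the smallest cut. High end: largest at the largest."""
    return spend_low * cut_pct_low / 100, spend_high * cut_pct_high / 100


# Hypothetical prices: $3 per million input tokens, $15 per million output tokens.
bloated = request_cost_usd(12_000, 300, 3, 15)   # 12K tokens of context
trimmed = request_cost_usd(2_000, 300, 3, 15)    # 2K tokens of context
print(f"bloated: ${sum(bloated):.4f}  trimmed: ${sum(trimmed):.4f}")

low, high = annual_savings_range(30_000, 150_000)
print(f"annual savings: ${low:,.0f} to ${high:,.0f}")  # annual savings: $6,000 to $60,000
```

Splitting the cost by input and output makes the oversized-context case obvious: in this example almost all of the per-request difference sits on the input side.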
The honest framing: buy AI observability for the debugging, but make the budget case on the token savings. The cost graph is the line item a finance reviewer signs off on without a second meeting; the "we can debug faster" argument is harder to quantify.
A 90-day AI observability rollout plan
A tested phasing that ships value in week one and matures the quality and alerting layers over the quarter.
Days 1-14: tracing and cost capture
Wrap your LLM client with a tracing SDK or route calls through a gateway. Capture prompts, completions, tool calls, latency per step, and token cost per request. No evals yet. Goal: get full forensic visibility so the next "the bot said something weird" report is a two-minute trace lookup instead of a guessing game. Time-to-value: one to two days for tracing, the rest of the window to instrument every call path.
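To make "wrap your LLM client" concrete, here is a minimal sketch of a tracing decorator in plain Python. `TRACE_LOG`, `traced`, and `fake_llm_call` are all hypothetical stand-ins: a real setup would use a tracing SDK or gateway and a real model client, and the response shape (a dict with `text`, `input_tokens`, `output_tokens`) is an assumption made for the example.

```python
import time
from functools import wraps

TRACE_LOG = []  # stand-in for a real tracing backend


def traced(step_name):
    """Record prompt, completion, token counts, and latency for one step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(prompt, **kwargs):
            start = time.perf_counter()
            result = fn(prompt, **kwargs)
            TRACE_LOG.append({
                "step": step_name,
                "prompt": prompt,
                "completion": result["text"],
                "input_tokens": result["input_tokens"],
                "output_tokens": result["output_tokens"],
                "latency_s": time.perf_counter() - start,
            })
            return result
        return wrapper
    return decorator


@traced("answer")
def fake_llm_call(prompt, **kwargs):
    # Hypothetical model client: returns a canned response and counts
    # whitespace-separated words as a crude proxy for tokens.
    return {"text": "stub answer", "input_tokens": len(prompt.split()), "output_tokens": 2}


fake_llm_call("How do I reset my password?")
print(TRACE_LOG[-1]["step"], TRACE_LOG[-1]["input_tokens"])  # answer 6
```

Applying the same decorator to each step of a chain gives one record per step, which is what makes per-step latency and cost visible.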
Days 15-45: build a golden set and offline evals
Assemble a representative dataset of inputs with expected behavior, then write scoring criteria for correctness, groundedness, and safety. Wire these evals into CI so every prompt or model change is scored before it ships. This is the phase that takes real work; a useful golden set is two to four weeks of curation, not an afternoon.
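A CI gate over a golden set might look roughly like the sketch below. The exact-match scorer, the 0.9 pass-rate bar, the four sample cases, and the `candidate` function are all illustrative assumptions; real correctness, groundedness, and safety criteria are usually richer than string equality.

```python
def exact_match(expected: str, actual: str) -> float:
    """Simplest possible scorer; real evals usually need richer criteria."""
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_golden_set(golden, generate, scorer=exact_match, min_pass_rate=0.9):
    """Score every case and report whether the change may ship."""
    if not golden:
        raise ValueError("golden set is empty")
    scores = [scorer(case["expected"], generate(case["input"])) for case in golden]
    pass_rate = sum(scores) / len(scores)
    return pass_rate, pass_rate >= min_pass_rate


GOLDEN = [
    {"input": "capital of France?", "expected": "Paris"},
    {"input": "2 + 2?", "expected": "4"},
    {"input": "opposite of hot?", "expected": "cold"},
    {"input": "color of a clear daytime sky?", "expected": "blue"},
]


def candidate(prompt: str) -> str:
    # Hypothetical stand-in for the prompt/model version under test.
    canned = {"capital of France?": "paris", "2 + 2?": "4", "opposite of hot?": "Cold"}
    return canned.get(prompt, "unknown")


pass_rate, ok = run_golden_set(GOLDEN, candidate)
print(pass_rate, ok)  # 0.75 False
```

In a CI job, a failing result would typically translate into a non-zero exit code so the change is blocked before deploy.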
Days 46-75: online evals and dashboards
Turn on sampled scoring of live production traffic, because offline sets never cover the full distribution of real inputs. Build the trend dashboards: eval pass rate, hallucination rate, token cost per request, latency per step. By the end of this phase you should be able to see quality and cost trends day-over-day, not just react to complaints.
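One common way to sample live traffic is to hash the trace ID, so the decision is stable for a given trace and needs no shared state. The sketch below assumes string trace IDs and a 10% sample rate; both are illustrative choices, and `should_score` is a hypothetical helper, not part of any library.

```python
import hashlib


def should_score(trace_id: str, sample_rate: float) -> bool:
    """Deterministically pick roughly sample_rate of traces for online evals."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
sampled = sum(should_score(t, 0.10) for t in trace_ids)
print(f"sampled {sampled} of {len(trace_ids)} traces")
```

Because the decision depends only on the trace ID, every service that sees the same trace makes the same choice, which keeps sampled traces complete.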
Days 76-90: alerting and the remediation link
Wire eval thresholds to alerting (groundedness drop, hallucination spike, cost jump), and define the remediation path: how to roll back a prompt, switch a model route, or disable a tool, with the action logged. This is the step that turns observability from a dashboard into a reliability system, and it is where AI observability meets AI incident response.
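The three alert conditions named above can be expressed as simple threshold checks. In this sketch the 0.7 groundedness floor and the "doubles" rule come from the earlier section; the 1.5x cost-jump ratio, the `EvalWindow` shape, and the alert names are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class EvalWindow:
    groundedness: float        # mean groundedness on sampled traffic, 0..1
    hallucination_rate: float  # fraction of sampled responses flagged
    cost_per_request: float    # USD


def eval_alerts(current, previous, min_groundedness=0.7, cost_jump_ratio=1.5):
    """Return the names of the alert conditions that fire for this window."""
    alerts = []
    if current.groundedness < min_groundedness:
        alerts.append("groundedness_below_threshold")
    if (previous.hallucination_rate > 0
            and current.hallucination_rate >= 2 * previous.hallucination_rate):
        alerts.append("hallucination_rate_doubled")
    if (previous.cost_per_request > 0
            and current.cost_per_request >= cost_jump_ratio * previous.cost_per_request):
        alerts.append("cost_per_request_jump")
    return alerts


last_week = EvalWindow(groundedness=0.82, hallucination_rate=0.03, cost_per_request=0.010)
this_week = EvalWindow(groundedness=0.64, hallucination_rate=0.07, cost_per_request=0.011)
print(eval_alerts(this_week, last_week))
# ['groundedness_below_threshold', 'hallucination_rate_doubled']
```

Each returned name would map to a page or notification in whatever alerting system already handles latency SLO breaches.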
Skipping straight to alerting without the tracing and eval foundation produces noisy, low-trust alerts nobody acts on. The phasing exists so each alert, when it finally fires, carries the evidence and the score behind it.
Related guides
Go deeper into the LLM operations stack: LLMOps and the full lifecycle of shipping and running LLM apps; AI incident response across the incident lifecycle; AI SRE and the broader reliability practice; and the AI engineer's guide to production reliability for teams shipping AI systems. For the platform that ties observability to remediation, see the Nova features.
Nova AI Ops is the Multi-Agent OS for SRE & DevOps. 100 specialized AI agents across 12 teams that turn observability signals into detection, diagnosis, and auto-remediation across AWS, GCP, Azure, Linux, and Windows. Free tier available for small teams.