The Multi-Agent OS for SRE & DevOps

Kubernetes Monitoring: The Complete Guide (2026)

Kubernetes does not break the way a server breaks. Pods come and go, the scheduler reshuffles workloads, and a single bad deploy can ripple through four layers of abstraction before a human notices. This is the complete 2026 guide to Kubernetes monitoring: the layers to watch, the metrics that actually catch incidents, the Prometheus toolchain, the classic failure modes, a 10-point checklist, and a 90-day rollout plan.

17 min read · Published May 2026 · By Dr. Samson Tanimawo, Nova AI Ops
[Figure: Kubernetes monitoring diagram showing the container, pod, node, and cluster layers, with Prometheus, kube-state-metrics, and Nova AI Ops correlating pod and node signals into one incident.]

Why Kubernetes monitoring is hard and different

Kubernetes monitoring is not just monitoring with extra steps. The thing you are watching is designed to be impermanent. A pod is born when the scheduler places it, lives for as long as the deployment wants, and is deleted the moment a rollout, a node drain, or an autoscaler decides otherwise. The host you alerted on at 3 p.m. may not exist at 3:05. Traditional monitoring assumes a stable inventory of named machines you can point a check at. Kubernetes breaks that assumption on purpose.

If you are coming from classic host or VM monitoring, our general monitoring guide covers the foundations: choosing signals, setting thresholds, and building dashboards for any system. Everything there still applies. This guide is the Kubernetes-specific layer on top: the part the general guide deliberately leaves shallow, because the failure modes and the entity model are different enough to deserve their own page. Think of observability as the discipline of being able to ask new questions of your system, and Kubernetes monitoring as applying that discipline to a platform whose defining property is constant change.

Four properties make Kubernetes uniquely hard to monitor. First, ephemerality. Pods are cattle, not pets. You cannot identify a workload by hostname because the hostname changes on every restart. You identify it by labels (app, version, team) and let the monitoring system aggregate across the churning set of pods that share those labels. Second, dynamic scheduling. The scheduler moves pods between nodes in response to pressure, taints, and bin-packing decisions you did not make. A latency spike might be a noisy neighbor that landed on the same node ten seconds ago, not anything wrong with your code.
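
To make label-based identity concrete, here is a minimal Python sketch of the aggregation a monitoring system performs. The sample data, pod names, and the sum_by_label helper are hypothetical; in Prometheus the equivalent would be a sum grouped by a label.

```python
from collections import defaultdict

# Hypothetical samples: (pod name, labels, CPU cores in use).
# Pod names churn on every restart; the labels stay stable.
samples = [
    ("checkout-7d9f-abc12", {"app": "checkout", "version": "v2"}, 0.25),
    ("checkout-7d9f-def34", {"app": "checkout", "version": "v2"}, 0.5),
    ("search-5c6b-ghi56", {"app": "search", "version": "v1"}, 1.0),
]


def sum_by_label(samples, label):
    """Sum a metric across all pods sharing a label value, ignoring pod names."""
    totals = defaultdict(float)
    for _pod_name, labels, value in samples:
        totals[labels.get(label, "<unlabeled>")] += value
    return dict(totals)


cpu_by_app = sum_by_label(samples, "app")
```

When a checkout pod is replaced, its name changes but the total for app=checkout keeps the same identity, which is why alerts should key on labels rather than pod names.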

Third, layered abstractions. A single request passes through a container, inside a pod, scheduled onto a node, that belongs to a cluster, possibly behind a service mesh. A problem can originate at any layer and present at any other. Memory pressure on a node evicts a pod; a bad image makes a container crash-loop; an etcd hiccup stalls the whole control plane. Monitoring has to span all four layers and let you trace a symptom down to its layer. Fourth, cardinality. Every metric is multiplied by pod, container, namespace, node, and deploy hash. A modest cluster easily produces millions of unique time series, and that explosion is both a cost problem and a query-performance problem that traditional monitoring never had to face.

The one-sentence version. Traditional monitoring asks "is this named server healthy?" Kubernetes monitoring asks "across this constantly churning set of labeled, scheduled, layered things, is the workload the user cares about healthy, and if not, which layer is to blame?"

The layers to monitor: container to cluster

The single most useful mental model for Kubernetes monitoring is the stack of layers, because every alert ultimately resolves to one of them. Monitor all five and you can trace any symptom to its source. Skip a layer and you get blind spots that turn into 2 a.m. mysteries.

Layer | What you watch | Primary source
Container | CPU, memory, throttling, restarts per container | cAdvisor (in kubelet)
Pod | Phase, readiness, restart reason, OOMKilled, scheduling | kube-state-metrics
Workload | Desired vs ready replicas, rollout progress, HPA state | kube-state-metrics
Node | Allocatable vs used, memory/disk/PID pressure, kubelet health | node-exporter + kubelet
Control plane | API server latency/errors, etcd health, scheduler, controllers | API server + etcd metrics

Container and pod

The container is the lowest layer with its own resource accounting: how much CPU and memory this one process group is using, whether it is being throttled, and how many times it has restarted. cAdvisor, built into the kubelet, exposes these. The pod wraps one or more containers and adds Kubernetes-level state: is it Pending, Running, or Failed; is it Ready (passing its readiness probe); and why did it last restart. That last-state reason, especially OOMKilled or a CrashLoopBackOff waiting reason, comes from kube-state-metrics and is some of the highest-signal data in the whole system.

Workload: deployment, statefulset, daemonset

One pod tells you about one replica. The workload object tells you whether the service is healthy: a deployment that wants 6 replicas but has only 3 ready is half-down even if those 3 pods look perfectly fine individually. Watch desired versus ready replicas, rollout progress (is a new version actually coming up, or is it stuck), and horizontal pod autoscaler state (is it pinned at max because it cannot keep up). For statefulsets and daemonsets the same logic applies with their own ordering and per-node guarantees.
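
The desired-versus-ready comparison is simple arithmetic. A minimal sketch follows; the function names and the 0.75 threshold are illustrative assumptions, not a recommended value.

```python
def workload_availability(desired: int, ready: int) -> float:
    """Fraction of desired replicas that are ready (1.0 means fully up)."""
    if desired <= 0:
        return 1.0  # scaled to zero: nothing is missing
    return min(ready, desired) / desired


def is_degraded(desired: int, ready: int, min_ratio: float = 0.75) -> bool:
    """True when too few replicas are ready; min_ratio is a placeholder threshold."""
    return workload_availability(desired, ready) < min_ratio
```

For the example above, workload_availability(6, 3) returns 0.5, so the deployment is flagged even though each of the 3 running pods looks healthy on its own.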

Node and control plane

Nodes are the machines, real or virtual, that actually run the pods. Watch allocatable versus requested and used CPU and memory (the scheduler places pods by their requests, so a node whose allocatable capacity is fully requested cannot accept new pods), and the node pressure conditions: MemoryPressure, DiskPressure, PIDPressure. A node under MemoryPressure will start evicting pods, which looks like random pod death until you correlate it to the node. The control plane, finally, is the part teams most often forget to monitor: the API server (latency and error rate, because everything talks to it), etcd (the cluster's source of truth, sensitive to disk latency), the scheduler, and the controller manager. When the control plane degrades, every other signal gets unreliable, so it belongs on the dashboard even on managed clusters where the cloud runs it for you.

Key Kubernetes metrics and signals

You could collect thousands of Kubernetes metrics. A small set does most of the incident-catching work. Here are the ones that earn their place.

Resource requests and limits versus actual usage

This is the highest-signal pairing in Kubernetes monitoring, because it explains two of the most common pain classes at once. A container's request is what the scheduler reserves for it; its limit is the ceiling the kernel enforces. Set requests too low and the scheduler over-packs nodes, causing contention. Set the CPU limit too low and the kernel throttles the container, adding latency that looks like a code problem but is really a config problem. Set the memory limit too low and the pod gets OOMKilled. Always graph usage against both request and limit; the gaps are where your incidents and your wasted spend both live. This is also where Kubernetes monitoring feeds directly into capacity planning.
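
As a rough sketch of what "graph usage against both request and limit" looks like as a check, the following compares one container sample against both numbers. The ContainerSample type, the finding names, and the 0.9 and 0.5 thresholds are assumptions made for illustration; all three values must be in the same unit (for example MiB or CPU cores).

```python
from dataclasses import dataclass


@dataclass
class ContainerSample:
    usage: float    # observed usage
    request: float  # what the scheduler reserves
    limit: float    # ceiling the kernel enforces


def assess(sample, near_limit=0.9, idle_fraction=0.5):
    """Return findings for one container; thresholds are placeholders."""
    findings = []
    if sample.usage > sample.request:
        findings.append("over_request")      # scheduler reserved too little
    if sample.usage >= near_limit * sample.limit:
        findings.append("near_limit")        # throttling or OOMKill risk
    if sample.usage < idle_fraction * sample.request:
        findings.append("overprovisioned")   # reserved capacity sits idle
    return findings
```

A container using 480 MiB with a 256 MiB request and a 512 MiB limit comes back as both over_request and near_limit; one using 100 MiB against a 512 MiB request comes back as overprovisioned.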

CPU throttling

CPU throttling is the most commonly missed Kubernetes signal. When a container hits its CPU limit, the kernel does not kill it; it simply pauses it until the next scheduling period. The application sees mysterious latency with no error, no crash, and no obvious cause. The ratio of throttled periods to total periods (container_cpu_cfs_throttled_periods_total divided by container_cpu_cfs_periods_total, both exposed by cAdvisor) tells you immediately. Any container being throttled more than a few percent of the time has a CPU limit that is too low for its real workload.
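
The throttling ratio is computed from two counters sampled at the start and end of a window. A minimal sketch, assuming you already have the raw counter values; the function name and sample numbers are illustrative.

```python
def throttle_ratio(throttled_start, throttled_end, periods_start, periods_end):
    """Fraction of CFS periods in the window during which the container was throttled.

    Inputs are counter readings at the start and end of the window, as exposed by
    cAdvisor in container_cpu_cfs_throttled_periods_total and
    container_cpu_cfs_periods_total.
    """
    periods = periods_end - periods_start
    if periods <= 0:
        return 0.0  # no CFS periods elapsed (or counter reset): nothing to report
    return (throttled_end - throttled_start) / periods


# 30 throttled periods out of 300 elapsed periods.
ratio = throttle_ratio(100, 130, 1000, 1300)
```

In PromQL the same calculation is typically written as rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]).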

Memory and OOMKilled

Memory does not throttle. When a container crosses its memory limit, the Linux kernel's out-of-memory killer terminates it, and Kubernetes records the last-state reason as OOMKilled before restarting it. Watch the memory working set against the limit, and alert on the OOMKilled reason directly. A pod that is OOMKilled once might have hit a transient spike; a pod OOMKilled repeatedly is either under-provisioned or leaking, and the fix differs, so the alert should surface the trend.
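
One way to surface the trend rather than the single event is to count OOMKills in a recent window. This is a sketch under assumed inputs: a list of Unix timestamps for past OOMKills of one workload, with a window and repeat threshold that are placeholders to tune.

```python
def classify_oom_history(oom_timestamps, now, window_seconds=3600, repeat_threshold=3):
    """Classify a workload's OOMKill history as 'none', 'isolated', or 'repeated'."""
    recent = [t for t in oom_timestamps if 0 <= now - t <= window_seconds]
    if not recent:
        return "none"
    if len(recent) >= repeat_threshold:
        return "repeated"   # under-provisioned or leaking: needs investigation
    return "isolated"       # possibly a transient spike
```

Telling under-provisioning from a leak still needs the memory working-set graph: a leak climbs steadily between restarts, while an under-provisioned container sits near the limit from the start.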

Pod restarts and CrashLoopBackOff

A rising restart count is one of the earliest signs of trouble. When a container fails repeatedly, Kubernetes enters CrashLoopBackOff, restarting it with exponential backoff and surfacing that as a waiting reason. Catching the restart count climbing and the CrashLoopBackOff reason lets you react before the downstream availability alert ever fires.
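
The backoff schedule itself is easy to reproduce. The sketch below assumes the kubelet's documented defaults of a 10-second initial delay that doubles on each restart and is capped at five minutes; the function name is illustrative.

```python
def crashloop_delays(restarts, base_seconds=10, cap_seconds=300):
    """Delay before each of the first `restarts` restarts under exponential backoff."""
    return [min(base_seconds * 2**i, cap_seconds) for i in range(restarts)]


delays = crashloop_delays(7)  # [10, 20, 40, 80, 160, 300, 300]
```

After about five failures the container is restarted only once every five minutes, so a restart-count alert that waits for a high absolute count can lag well behind the actual outage.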

Pending and unschedulable pods

A pod stuck in Pending means the scheduler cannot place it: not enough allocatable CPU or memory, a taint with no matching toleration, an unbound volume, or an anti-affinity rule it cannot satisfy. This is invisible to application-level monitoring because the pod never ran. kube-state-metrics exposes the pod phase, so a Pending pod that stays Pending for more than a couple of minutes should page.
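
A minimal sketch of the "stays Pending too long" check follows. The pod records are hypothetical dictionaries, and creation time is used as an approximation of when the pod entered Pending; the two-minute default mirrors the paragraph above and should be tuned.

```python
from datetime import datetime, timedelta, timezone


def stuck_pending(pods, now, max_pending=timedelta(minutes=2)):
    """Return the names of pods that have been Pending longer than max_pending."""
    return sorted(
        pod["name"]
        for pod in pods
        if pod["phase"] == "Pending" and now - pod["created"] > max_pending
    )


now = datetime(2026, 5, 1, 12, 0, tzinfo=timezone.utc)
pods = [
    {"name": "api-1", "phase": "Running", "created": now - timedelta(hours=1)},
    {"name": "api-2", "phase": "Pending", "created": now - timedelta(minutes=5)},
    {"name": "api-3", "phase": "Pending", "created": now - timedelta(seconds=30)},
]
stuck = stuck_pending(pods, now)
```

Here only api-2 is returned: api-3 is Pending but still inside the grace period that normal scheduling needs.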

Node pressure and etcd health

Node pressure conditions (MemoryPressure, DiskPressure, PIDPressure) predict pod evictions before they happen. etcd health (is it reaching quorum, what is its disk write latency, how big is its database) predicts control-plane instability. Both are leading indicators: by the time the symptom reaches your application, these have usually been red for a while. Watching them turns reactive pages into proactive ones.

The toolchain: Prometheus, kube-state-metrics, and friends

The de facto open-source Kubernetes monitoring stack has converged on a handful of components, usually deployed together as the kube-prometheus-stack bundle. Each does one job, and understanding the division of labor is what keeps you from collecting the same thing twice or missing it entirely.

The core components

  • Prometheus is the time-series database and the scraper. It pulls metrics from every exporter on a schedule, stores them, and answers PromQL queries. It is the hub everything else plugs into.
  • kube-state-metrics listens to the Kubernetes API and exposes object state as metrics: desired versus ready replicas, pod phase, container restart counts and termination reasons, node conditions, job status. It does not measure resource usage, it measures what the cluster declares and observes.
  • node-exporter runs on every node and exposes OS-level metrics: CPU, memory, disk, filesystem, and network for the machine itself, below the container boundary.
  • cAdvisor is built into the kubelet and exposes per-container resource usage: the CPU, memory, and throttling numbers for each running container.
  • metrics-server is the lightweight aggregator that powers kubectl top and the horizontal and vertical pod autoscalers. It holds live numbers only, not history, so it complements Prometheus rather than replacing it.
  • Grafana is the visualization layer that turns PromQL into the dashboards humans actually look at.
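
Everything downstream of Prometheus, Grafana included, reads data through its HTTP API. As a small sketch, this builds the URL for an instant query against the /api/v1/query endpoint; the host name is a made-up placeholder and the helper is illustrative.

```python
from urllib.parse import urlencode


def instant_query_url(base_url, promql):
    """Build a Prometheus instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


# "prometheus.example" is a placeholder host, not a real endpoint.
url = instant_query_url("http://prometheus.example:9090/", "up")
```

Fetching that URL with any HTTP client returns a JSON document containing the current value of each matching series.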

Increasingly, OpenTelemetry sits alongside this stack to carry traces and logs in a vendor-neutral format, so that a slow request can be followed across services rather than just counted. For the request-tracing side of the picture, see our guide to distributed tracing, and for monitoring service-to-service interactions specifically, microservices monitoring.

The cost and cardinality problem

The trap every team hits at scale is cardinality. Prometheus cost and performance scale with the number of active time series, not with raw data volume, and Kubernetes is a cardinality machine: every metric times pod name times container times namespace times node times deploy hash. A few hundred pods doing daily deploys can produce tens of millions of unique series, and the bill (or the self-hosted Prometheus memory footprint) climbs with it. The controls are well understood: drop high-cardinality labels you never query (a pod-hash label nobody filters on is pure cost), use recording rules to pre-aggregate the expensive queries, set retention deliberately rather than keeping everything forever, and sample traces instead of storing all of them. Cardinality discipline is the single biggest lever on monitoring cost in a large cluster.
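
The multiplication is worth doing on paper for your own cluster. This sketch uses hypothetical round numbers, and it models only the pod-name churn from rollouts; real counts also depend on each metric's own label set (histogram buckets, status codes, and so on).

```python
def active_series(metrics_per_container, containers_per_pod, pods):
    """Series alive at one moment: one per metric, per container, per pod."""
    return metrics_per_container * containers_per_pod * pods


def series_over_retention(active, rollouts_in_retention):
    """Each full rollout renames every pod, so it creates a fresh set of series."""
    return active * rollouts_in_retention


# Hypothetical cluster: 300 pods, 2 containers each, 100 metrics per container.
active = active_series(100, 2, 300)
# Daily deploys over a 30-day retention window.
stored = series_over_retention(active, 30)
```

With these inputs the cluster holds 60,000 active series but 1,800,000 distinct series across the retention window, which is why dropping a churning label or shortening retention moves the bill more than trimming a few metrics.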

Defining clusters and exporters as code, rather than clicking them into existence, keeps the monitoring stack reproducible across environments; this is where Kubernetes monitoring meets infrastructure as code and DevOps automation.

Health, probes, and the golden signals in K8s

Kubernetes has its own built-in health mechanism that predates any external monitoring you add: probes. Understanding them is essential, because a misconfigured probe is itself a common incident cause, and the probe state is a signal you should monitor.

Liveness, readiness, and startup probes

  • Liveness probe. Answers "is this container alive, or wedged?" If it fails, Kubernetes restarts the container. A too-aggressive liveness probe is a classic self-inflicted incident: a slow-but-healthy app gets killed and restarted in a loop.
  • Readiness probe. Answers "should this pod receive traffic right now?" If it fails, Kubernetes pulls the pod out of the service endpoints without killing it. A flapping readiness probe silently removes capacity, which looks like a performance problem with no crashing pods, so it must be on the dashboard.
  • Startup probe. Answers "has this slow-booting app finished starting?" It holds off the liveness checks until the app is up, preventing the restart loop that liveness alone would cause for legacy or JVM-heavy apps.
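
Probe timing is where most self-inflicted restart loops come from, and it is plain arithmetic on the probe fields periodSeconds, failureThreshold, and initialDelaySeconds. The helpers below are an approximation that ignores timeoutSeconds; the function names are illustrative.

```python
def startup_budget_seconds(period_seconds, failure_threshold, initial_delay_seconds=0):
    """Approximate time a startup probe gives an app to come up before it is killed."""
    return initial_delay_seconds + period_seconds * failure_threshold


def liveness_restart_after_seconds(period_seconds, failure_threshold):
    """Approximate time a wedged container keeps running before liveness restarts it."""
    return period_seconds * failure_threshold


# A startup probe checked every 10s that tolerates 30 failures allows about 5 minutes.
budget = startup_budget_seconds(period_seconds=10, failure_threshold=30)
```

If an app's worst-case boot time exceeds the liveness window and no startup probe is configured, the liveness probe will kill it before it ever becomes healthy.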

Probe state belongs on the same view as restart counts and OOMKills, because probe failures and resource problems are the two halves of most pod-level incidents.

The golden signals, applied to Kubernetes

The four golden signals (latency, traffic, errors, and saturation) translate cleanly onto Kubernetes, and they keep you focused on what users feel rather than drowning in raw pod metrics. Latency is request duration measured at the service or ingress, not pod CPU. Traffic is requests per second into the workload. Errors are HTTP 5xx and gRPC error rates, plus pod-level failures like restarts and OOMKills that predict user-facing errors. Saturation is the most Kubernetes-flavored of the four: how close are you to resource limits, node allocatable, and the throttling threshold. "What good looks like" is a dashboard where the golden signals sit at the top for the user view, and the layer metrics (container, pod, node, control plane) sit below for drill-down when a golden signal goes red. Catch a saturation trend early and you are doing anomaly detection rather than firefighting.

Common failure modes and how monitoring catches them

Kubernetes has a small set of recurring failure modes. Knowing them, and the exact signal each one produces, is most of practical Kubernetes monitoring. Wire these to alerts and you catch the large majority of cluster incidents early.

Failure mode | What happened | Signal that catches it
OOMKilled | Container exceeded its memory limit | Last-state reason OOMKilled + memory near limit
CrashLoopBackOff | Container keeps failing on start | Rising restart count + waiting reason
ImagePullBackOff | Bad image tag, missing registry secret | Pod stuck waiting, ImagePullBackOff reason
Resource starvation | Node packed, pods cannot schedule | Pending pods + node allocatable exhausted
Noisy neighbor | A pod hogs a shared node | Node saturation high, one pod dominates usage
Failed rollout | New version will not become ready | Desired > ready replicas, rollout stalled

OOMKilled and CrashLoopBackOff are the two most common, and both surface through kube-state-metrics before any downstream availability alert. ImagePullBackOff is almost always a deploy mistake (a typo in the image tag, or a missing pull secret) and is caught by the same pod-waiting-reason metric. Resource starvation shows up as Pending pods plus a node whose allocatable capacity is exhausted; it is invisible to app-level checks because the pods never ran. Noisy neighbors are the sneaky one: your pod is slow, nothing is wrong with your code, and the cause is another tenant's pod saturating the shared node, visible only when you correlate node saturation against per-pod usage. Failed rollouts show up as a persistent gap between desired and ready replicas, which is exactly why workload-layer monitoring matters: the individual new pods may be crash-looping or OOMKilled, but the deployment-level "half my replicas are missing" view is what tells you the rollout is the problem.
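
The mapping from signal to failure mode can be sketched as a small classifier. The input dictionary and its keys are hypothetical, and the ordering is a design choice: OOMKilled is checked first because an out-of-memory kill is often the root cause behind a crash loop. Noisy neighbors are left out because they need node-level correlation rather than a single pod's state.

```python
def classify(signal):
    """Map one pod or workload observation to a failure mode from the table above."""
    if signal.get("last_terminated_reason") == "OOMKilled":
        return "OOMKilled"
    waiting = signal.get("waiting_reason")
    if waiting in ("ImagePullBackOff", "ErrImagePull"):
        return "ImagePullBackOff"
    if waiting == "CrashLoopBackOff":
        return "CrashLoopBackOff"
    if signal.get("phase") == "Pending" and signal.get("node_allocatable_exhausted"):
        return "Resource starvation"
    if signal.get("desired_replicas", 0) > signal.get("ready_replicas", 0):
        return "Failed rollout"
    return "unclassified"
```

Anything that falls through to "unclassified" is a candidate for human triage rather than an automated response.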

Each of these is a clean candidate for automated remediation, which is the natural next step once monitoring is reliably catching them. A bad rollout has an obvious fix (roll back); a wedged pod has an obvious fix (restart); a node under pressure has an obvious fix (cordon and reschedule). The faster these connect from signal to action, the lower your MTTR and the calmer your incident management process.

From K8s signals to autonomous remediation

Here is the structural problem with Kubernetes monitoring at any real scale: a busy cluster generates more events than a human can read, let alone triage. Hundreds of pods, dozens of nodes, daily deploys, autoscalers reshuffling capacity, and a control plane emitting its own stream. During a single bad rollout you might get a CrashLoopBackOff alert, two OOMKilled alerts, a readiness-probe-failing alert, a desired-versus-ready-replicas alert, and a downstream latency alert, all describing one underlying problem. A human on call has to mentally correlate those six pages into one incident before they can even start fixing it.

This is the case for correlation and auto-remediation, and it is the same argument that drives AIOps and self-healing infrastructure generally, sharpened by the fact that Kubernetes produces signal faster than almost any other platform. The job is not to generate more alerts; it is to collapse many related signals into one incident, identify the layer at fault, and act.
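
To show what "collapse many related signals into one incident" means mechanically, here is a deliberately simple sketch: alerts that share a namespace and workload and arrive within a time window are grouped together. This is an illustrative heuristic with hypothetical field names, not a description of how any particular product correlates alerts.

```python
def correlate(alerts, window_seconds=300):
    """Group alerts by (namespace, workload) within a window of the first alert."""
    incidents = []
    open_by_key = {}
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        key = (alert["namespace"], alert["workload"])
        incident = open_by_key.get(key)
        # Start a new incident if none is open for this workload or the window expired.
        if incident is None or alert["ts"] - incident["started"] > window_seconds:
            incident = {"key": key, "started": alert["ts"], "alerts": []}
            open_by_key[key] = incident
            incidents.append(incident)
        incident["alerts"].append(alert["name"])
    return incidents


alerts = [
    {"ts": 0, "namespace": "shop", "workload": "checkout", "name": "CrashLoopBackOff"},
    {"ts": 20, "namespace": "shop", "workload": "checkout", "name": "OOMKilled"},
    {"ts": 45, "namespace": "shop", "workload": "checkout", "name": "ReadinessFailing"},
    {"ts": 60, "namespace": "shop", "workload": "checkout", "name": "ReplicaGap"},
    {"ts": 70, "namespace": "shop", "workload": "search", "name": "HighLatency"},
]
incidents = correlate(alerts)
```

The four checkout alerts become one incident and the search alert stays separate. Real correlation also has to follow dependencies between workloads, which simple label matching cannot do.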

This is where Nova AI Ops fits. Nova watches pods, nodes, and the control plane across clusters and across clouds (AWS, GCP, and Azure, with both Linux and Windows nodes) under one model. When a rollout goes bad, Nova correlates the pod-level CrashLoopBackOff, the OOMKilled events, the readiness-probe failures, and the downstream latency into a single incident rather than six pages. It finds the cause by reasoning across the layers (this deploy hash, this image, this resource limit) the way an experienced SRE would, but in seconds. And for routine cases inside a defined policy envelope (restarting a wedged pod, rolling back a bad deploy, cordoning a node under pressure, scaling a replica set), it can auto-resolve, while escalating the genuinely novel incidents to a human with the correlation and the evidence already attached.

Nova is an AI SRE platform built on an agentic architecture, so every autonomous action is bounded by policy, trust-scored per agent and per action, and written to an immutable audit ledger. That is the difference between "auto-remediation we are nervous about" and "auto-remediation we let run on routine Kubernetes incidents at 3 a.m." For teams operating clusters that also run AI workloads, this connects to LLMOps and AI-system reliability as well.

A 90-day rollout plan and a 10-point checklist

Standing up Kubernetes monitoring is a phased exercise, not a one-day install. The pattern below gets you to useful coverage fast, then layers on the depth that catches the subtle incidents.

Days 1–14: Deploy the core stack and get the layers visible

Install the kube-prometheus-stack: Prometheus, kube-state-metrics, node-exporter, and Grafana. Point it at one cluster. Goal: get the container, pod, workload, and node layers all rendering on a dashboard, and confirm metrics-server is feeding kubectl top. Do not write alerts yet, just make the data trustworthy. Most of this is configuration, and getting it reproducible as code now saves pain later.

Days 15–45: Wire the high-signal alerts

Add alerts for the failure modes that actually page: OOMKilled, CrashLoopBackOff, ImagePullBackOff, Pending pods that stay pending, desired-versus-ready replica gaps, node pressure, and CPU throttling above a threshold. Tune the thresholds against two weeks of real data so you are not paging on normal deploy churn. By the end of this phase your dashboard catches the common incidents and the noise is manageable.
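
As a starting point, the alerts in this phase can be written down as PromQL expressions over kube-state-metrics and cAdvisor metrics. The sketch below keeps them in a Python dictionary and renders each into the alert, expr, and for fields of a Prometheus alerting rule. The alert names, hold durations, and the 0.25 throttling threshold are placeholder assumptions to tune against your own data.

```python
# name -> (PromQL expression, how long it must hold before firing)
ALERTS = {
    "OOMKilled": (
        'kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1', "0m"),
    "CrashLoopBackOff": (
        'kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1', "5m"),
    "ImagePullBackOff": (
        'kube_pod_container_status_waiting_reason{reason="ImagePullBackOff"} == 1', "5m"),
    "PendingPod": (
        'kube_pod_status_phase{phase="Pending"} == 1', "2m"),
    # Uses available replicas, which also respects minReadySeconds.
    "ReplicaGap": (
        "kube_deployment_spec_replicas > kube_deployment_status_replicas_available", "15m"),
    "NodePressure": (
        'kube_node_status_condition{condition=~"MemoryPressure|DiskPressure|PIDPressure",'
        'status="true"} == 1', "5m"),
    "CPUThrottling": (
        "rate(container_cpu_cfs_throttled_periods_total[5m])"
        " / rate(container_cpu_cfs_periods_total[5m]) > 0.25", "15m"),
}


def render_rule(name):
    """Return one alert as a dict shaped like a Prometheus alerting rule."""
    expr, hold = ALERTS[name]
    return {"alert": name, "expr": expr, "for": hold}
```

The hold durations do most of the noise control: a pod that is Pending for ten seconds during a normal deploy never fires, while one stuck for minutes does.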

Days 46–75: Add tracing, control-plane, and the golden-signal view

Layer in OpenTelemetry traces so you can follow a slow request across services, add control-plane monitoring (API server latency, etcd health) even on managed clusters, and build the top-level golden-signals dashboard that sits above the per-layer detail. Now an on-call engineer can start at "users are seeing latency" and drill down to the exact layer and pod.

Days 76–90: Correlation and autonomous remediation on routine cases

Connect the signals to action. Start with correlation, collapsing related pod, node, and rollout alerts into single incidents, then enable autonomous remediation inside a tight policy envelope on the safest patterns: restart a wedged pod, roll back a failed deploy on a non-critical service. Measure auto-resolution rate and the reduction in pages per engineer, and use that to justify expanding coverage to critical clusters.

The 10-point checklist

  1. Are all five layers visible? Container, pod, workload, node, and control plane each have a metric source and a dashboard panel.
  2. Do you graph usage against both request and limit? Usage alone is not enough: the gaps to request and limit are where throttling and OOMKills live.
  3. Is CPU throttling alerted? The most commonly missed signal; latency with no errors is often throttling.
  4. Is OOMKilled alerted on the reason, not just memory? The termination reason from kube-state-metrics is the definitive signal.
  5. Do you catch CrashLoopBackOff and ImagePullBackOff by waiting reason? Before the downstream availability alert fires.
  6. Are Pending and unschedulable pods alerted? App monitoring cannot see a pod that never ran.
  7. Is node pressure monitored as a leading indicator? MemoryPressure, DiskPressure, and PIDPressure predict evictions.
  8. Is the control plane monitored, even on managed clusters? API server latency and etcd health degrade everything else when they slip.
  9. Is cardinality under control? Dropped unused labels, recording rules for expensive queries, deliberate retention.
  10. Do signals connect to action? Correlation into single incidents and a path to autonomous remediation within a policy envelope.

Work this checklist top to bottom and you move from "we have Prometheus installed" to "our Kubernetes monitoring actually catches and helps resolve incidents," which is the only version that matters at 3 a.m.

Frequently asked questions

Why is Kubernetes monitoring harder than traditional host monitoring?
Because the thing you are monitoring may not exist in five minutes. Pods are ephemeral, the scheduler moves workloads between nodes constantly, and every metric carries a pod name and a deploy hash that churns on every rollout. Traditional host monitoring assumes a stable set of named machines; Kubernetes monitoring has to track entities by label rather than by hostname, tolerate high cardinality, and reason about four nested layers (container, pod, node, cluster) at once.
What metrics should I monitor in Kubernetes?
Start with the few that catch real incidents: CPU usage versus the request and limit (and the CPU throttling that limits cause), memory usage versus the limit (and the OOMKilled events when a pod crosses it), pod restart counts and CrashLoopBackOff, pending or unschedulable pods, node pressure conditions (memory, disk, PID), and control-plane health (API server latency and error rate, etcd health and disk latency). Resource requests and limits versus actual usage is the single highest-signal pairing because it explains both throttling and eviction.
What is the standard Kubernetes monitoring stack?
Prometheus as the time-series database and scraper, kube-state-metrics for the state of Kubernetes objects (deployments, pods, replica counts, conditions), node-exporter for node-level OS metrics, cAdvisor (built into the kubelet) for per-container resource usage, the metrics-server for the live numbers that power kubectl top and autoscaling, and Grafana for dashboards. OpenTelemetry increasingly carries traces and logs alongside. Most teams run this as the kube-prometheus-stack bundle.
What is OOMKilled in Kubernetes and how do I monitor for it?
OOMKilled means the Linux kernel terminated a container because it exceeded its memory limit. You see it as a pod restart with the reason OOMKilled in the container's last state. Monitor for it by watching container memory working set against the configured limit and by alerting on the OOMKilled termination reason directly from kube-state-metrics. A pod that is OOMKilled repeatedly is either under-provisioned on its memory limit or leaking memory, and the fix is different for each, so the alert should point you at the trend, not just the event.
What causes CrashLoopBackOff and how does monitoring catch it?
CrashLoopBackOff means a container keeps starting, failing, and being restarted, with Kubernetes backing off exponentially between attempts. Common causes are a bad config or missing secret, a failing migration, a dependency that is not ready, or an OOMKill on startup. Monitoring catches it through a rising restart count and the explicit waiting reason CrashLoopBackOff exposed by kube-state-metrics, which is far faster than waiting for a downstream availability alert to fire.
What is kube-state-metrics and why do I need it?
kube-state-metrics is a service that listens to the Kubernetes API and exposes the state of objects as Prometheus metrics: how many replicas a deployment wants versus has, whether a pod is pending or running, container restart counts, termination reasons, node conditions, and job status. It does not measure resource usage (that is cAdvisor and node-exporter); it measures the declared and observed state of the cluster. Without it you can see CPU and memory but not whether a deploy is stuck, a pod is unschedulable, or a container was OOMKilled.
How do liveness, readiness, and startup probes relate to monitoring?
Probes are how Kubernetes itself monitors a container. A liveness probe failing restarts the container; a readiness probe failing pulls the pod out of the service load balancer without killing it; a startup probe gives slow-booting apps time before liveness checks begin. Probes are health signals in their own right, and you should monitor them, because a flapping readiness probe silently removes capacity and a misconfigured liveness probe can cause restart loops. Probe state belongs on the same dashboard as restarts and OOMKills.
How does Kubernetes monitoring differ from general monitoring?
General monitoring, covered in our monitoring guide, is about choosing signals, setting alerts, and building dashboards for any system. Kubernetes monitoring applies all of that to a platform whose defining property is constant change: short-lived pods, label-based identity, layered abstractions, and a control plane that is itself a system to watch. The principles are the same; the entities, the cardinality, and the failure modes are Kubernetes-specific, which is why it deserves a dedicated guide rather than a paragraph in the general one.
Why does Kubernetes monitoring get so expensive at scale?
Cardinality. Every metric multiplied by pod name, container, namespace, node, and deploy hash explodes into millions of unique time series, and Prometheus cost scales with active series, not with raw volume. A few hundred pods churning through daily deploys can generate tens of millions of series. The controls are dropping high-cardinality labels you never query, using recording rules to pre-aggregate, setting sane retention, and sampling traces rather than keeping everything. Cardinality discipline is the main cost lever in a large cluster.
Can Kubernetes monitoring be automated with AI?
Yes, and at cluster scale it increasingly has to be, because a busy cluster generates more pod, node, and control-plane events than a human can triage. Nova AI Ops watches pods, nodes, and the control plane across clusters and clouds (AWS, GCP, and Azure, with both Linux and Windows nodes), correlates a pod-level anomaly, a node-pressure signal, and a failing deploy into a single incident, finds the cause, and can auto-resolve routine cases such as restarting a wedged pod or rolling back a bad deploy within a policy envelope, escalating the novel cases to humans.

Start with the foundations this guide builds on: monitoring (the general parent guide) and observability (metrics, logs, and traces). For the signals and patterns: the four golden signals, microservices monitoring, anomaly detection, and distributed tracing. On the operational side: fighting alert fatigue, lowering MTTR, and running incident management well. On automation and the path to autonomy: AIOps, self-healing infrastructure, site reliability engineering, AI SRE, and Agentic SRE. On planning and platform discipline: SLOs and error budgets, capacity planning, infrastructure as code, DevOps automation, and for AI workloads LLMOps. See it all in action on the Nova AI Ops platform.
