Cluster Monitoring Coverage
What to monitor on every cluster.
Cluster monitoring coverage is the discipline of monitoring every layer of a Kubernetes cluster. Without it, gaps appear: a failure mode goes undetected, and the team's first signal is customer impact. With it, every layer has its own monitoring, and a failure surfaces at the layer where it starts.
Control plane
What control plane monitoring covers:
- API latency: The API server's response latency is monitored. Rising latency indicates control plane stress and puts the cluster's responsiveness at risk, so the team is alerted on it.
- etcd latency: etcd is the cluster's source of truth. Its latency directly affects API server latency, so monitoring it gives early warning of cluster-wide problems.
- Scheduler queue: Pods waiting in the scheduler queue indicate scheduling problems. The queue depth metric catches situations where the scheduler cannot keep up.
- Health of cluster operations: Taken together, the control plane metrics describe whether cluster operations are healthy. A slow control plane means slow operations, which makes these metrics a leading indicator.
- Per-component health: Each control plane component (API server, controller manager, scheduler, etcd) has its own health metrics. Component-level granularity enables targeted investigation.
Control plane monitoring is the foundation. Without it, control plane problems show up only as unexplained symptoms elsewhere in the cluster.
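Latency metrics for the API server and etcd are typically exported as cumulative histograms rather than raw samples, so alerting on "p99 latency" means estimating a quantile from bucket counts. The following is a minimal sketch of that estimate using linear interpolation inside the bucket, in the style of Prometheus's `histogram_quantile`. The function name and the bucket values in the usage note are illustrative assumptions, not part of any particular monitoring stack.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    `buckets` is a list of (upper_bound_seconds, cumulative_count) pairs,
    sorted by upper bound, with math.inf as the final upper bound.
    This is a hypothetical helper, not a library API.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if not buckets:
        raise ValueError("at least one bucket is required")

    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations, so no quantile

    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # The quantile lies beyond the last finite bucket;
                # the best available answer is that bucket's bound.
                return prev_bound
            if count == prev_count:
                return prev_bound  # empty bucket, nothing to interpolate
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound
```

For example, with made-up buckets `[(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]`, `histogram_quantile(0.5, ...)` lands on the 0.1 s boundary, while a 0.999 quantile falls into the open-ended bucket and is reported as 1.0 s. The second case is a reminder that the estimate is only as precise as the bucket layout allows.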
Nodes
Node monitoring captures the cluster's compute substrate. A node-level issue affects every pod running on that node; node-level monitoring catches such issues before they become pod incidents.
- CPU: Each node's CPU usage is monitored. On a saturated node, pods contend for CPU time and slow down; the metric catches this, and capacity decisions reference it.
- Memory: Node memory is monitored. Memory pressure leads to evictions; the metric is a leading indicator that gives the team time to respond before evictions cascade.
- Disk: Node disk usage is monitored. A full disk triggers evictions for ephemeral storage; the fill-rate trend catches the problem before the threshold is reached, so remediation is timely.
- Kubelet health: The kubelet on each node has its own health metrics. A kubelet failure affects every pod on the node; monitoring the kubelet catches these issues directly.
- Per-node and aggregate: Both per-node detail and cluster-wide aggregates matter. Outliers (one node with issues) need per-node visibility; trends (overall capacity) need aggregates.
Node monitoring is the operational layer. The metrics here drive capacity and operational decisions.
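The disk fill-rate idea above can be sketched as a small projection: fit a straight line to recent usage samples and estimate how long until the disk reaches capacity. This is a minimal sketch assuming roughly linear growth over the sampling window. The function name, the sample format, and the numbers in the usage note are hypothetical.

```python
def seconds_until_full(samples, capacity_bytes):
    """Project time until a disk fills, using a least-squares line fit.

    `samples` is a list of (timestamp_seconds, used_bytes) pairs.
    Returns the projected seconds remaining after the latest sample,
    or None when usage is flat or shrinking.
    Hypothetical helper for illustration only.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate a trend")

    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")

    # Slope of the fitted line: bytes written per second (the fill rate).
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None  # not filling, so there is no projected fill time

    # Use the fitted value at the latest timestamp, not the raw sample,
    # so a single noisy reading does not dominate the projection.
    latest_t = max(t for t, _ in samples)
    fitted_now = mean_u + slope * (latest_t - mean_t)
    return max((capacity_bytes - fitted_now) / slope, 0.0)
```

With samples `[(0, 50), (60, 56), (120, 62), (180, 68)]` and a capacity of 100 (same unit as the usage values), the fitted fill rate is 0.1 units per second and the projection is about 320 seconds. An alert on the projected time rather than on the raw percentage is what gives the team lead time before the eviction threshold.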
Pods
Pod monitoring captures the workload-level signal. Pod issues are the most user-visible failures; pod monitoring catches them at the right granularity.
- Restart counts: Pod restarts indicate problems. Crash loops (CrashLoopBackOff) and OOM kills show up as rising restart counts, which makes the metric the canary for pod health. Image-pull failures do not raise the count, because the container never starts; they appear as pods stuck in a waiting state and need their own check.
- Eviction events: Pod evictions are recorded. The reasons (memory pressure, disk pressure, voluntary disruption) call for different responses, and the events are the data that distinguishes them.
- Resource utilization: Per-pod CPU and memory usage are monitored. Comparing usage to requests and limits drives right-sizing; the data is the foundation for cost optimization.
- Per-namespace and aggregate: Per-namespace views support per-team accountability. Aggregate views show cluster-wide patterns. The combination of the two gives comprehensive coverage.
- Workload-specific metrics: Beyond Kubernetes-level metrics, the team monitors workload-specific signals. Application metrics, business metrics, and custom indicators all complement the cluster-level monitoring.
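The comparison of usage to requests, viewed both per pod and per namespace, can be sketched as follows. This is an illustrative sketch only: the `PodUsage` record, the function names, and the 30% and 90% thresholds are assumptions chosen for the example, and real right-sizing would look at usage over a longer window than a single reading.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodUsage:
    """One pod's CPU request and observed usage (hypothetical record)."""
    namespace: str
    pod: str
    cpu_request_cores: float
    cpu_usage_cores: float


def classify(pod, low=0.3, high=0.9):
    """Label a pod by how its usage compares to its request.

    The `low` and `high` ratios are example thresholds, not recommendations.
    """
    if pod.cpu_request_cores <= 0:
        return "no-request"  # nothing to compare against
    ratio = pod.cpu_usage_cores / pod.cpu_request_cores
    if ratio < low:
        return "over-provisioned"
    if ratio > high:
        return "under-provisioned"
    return "ok"


def namespace_summary(pods):
    """Aggregate usage and requests per namespace.

    Returns {namespace: (total_usage_cores, total_request_cores)}.
    """
    totals = {}
    for p in pods:
        usage, request = totals.get(p.namespace, (0.0, 0.0))
        totals[p.namespace] = (
            usage + p.cpu_usage_cores,
            request + p.cpu_request_cores,
        )
    return totals
```

The per-pod labels point at individual right-sizing candidates, while the per-namespace totals give each team a single number to own. Memory can be handled the same way with its own fields and thresholds.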
Cluster monitoring coverage is an operational discipline that pays off across many incidents and many years. Nova AI Ops integrates with cluster telemetry across all layers, surfaces patterns and anomalies, and produces the comprehensive view that the platform team uses to operate the cluster effectively.