Noisy Neighbor Alerts

In multi-tenant systems, one tenant's workload can degrade every other tenant on the same infrastructure. Alert on it.

What a noisy neighbor is

A noisy neighbor is a workload that consumes enough of a shared resource to degrade the other workloads on the same infrastructure: one workload spikes, and its neighbors see latency, errors, or throttling. The pattern is common on Kubernetes nodes (CPU and memory contention), in shared databases (lock contention), and on shared networks (bandwidth saturation). Noisy-neighbor alerts catch the contention before users see it.

Shared infrastructure. Multiple workloads on the same node, database, or network; one workload spike affects all neighbors.
Common surfaces. Kubernetes nodes (CPU/memory), shared databases (lock contention), shared network (bandwidth saturation).
Symptoms before failure. Latency and errors precede outright failure; contention surfaces at the application layer first, which is where users notice it.
Predictive value. Catching contention early avoids the cascade that follows when workloads start retrying against degraded neighbors.

What to alert on

The signals are well-understood. CPU throttling, memory pressure, database lock waits, connection pool exhaustion, network packet drops; each maps to a specific contention class, and each has a threshold that catches the contention before users do.

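CPU throttling. A throttled-period ratio (rate of container_cpu_cfs_throttled_periods_total over rate of container_cpu_cfs_periods_total) above 5% for 10 minutes is contention; container_cpu_cfs_throttled_seconds_total then shows how much CPU time was lost.
Memory pressure. node_memory_MemAvailable_bytes below 10% of node_memory_MemTotal_bytes on a node signals memory pressure; if it keeps falling, the kubelet can start evicting pods.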
Database contention. Lock waits, connection pool exhaustion; surfaces shared-database contention before the app times out.
Network drops. Packet drops on the node interface; surfaces bandwidth contention before the app sees retransmissions.
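
The CPU throttling rule can be sketched as a ratio of throttled CFS periods to total CFS periods, checked over a trailing window. This is a minimal illustration, not a production evaluator: the CfsSample type, the function names, and the one-sample-per-scrape input are assumptions made for the example, and in practice the same check is usually written as an alerting rule in the monitoring system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CfsSample:
    """One scrape of a container's CFS counters (hypothetical input type)."""
    timestamp: float          # scrape time in seconds
    periods: float            # value of container_cpu_cfs_periods_total
    throttled_periods: float  # value of container_cpu_cfs_throttled_periods_total


def throttle_ratios(samples: List[CfsSample]) -> List[float]:
    """Fraction of CFS periods throttled between each pair of consecutive scrapes."""
    ratios = []
    for prev, cur in zip(samples, samples[1:]):
        d_periods = cur.periods - prev.periods
        d_throttled = cur.throttled_periods - prev.throttled_periods
        if d_periods <= 0:
            # Counter reset or idle container: treat the interval as unthrottled.
            ratios.append(0.0)
        else:
            ratios.append(d_throttled / d_periods)
    return ratios


def sustained_throttling(samples: List[CfsSample],
                         threshold: float = 0.05,
                         window_seconds: float = 600.0) -> bool:
    """True if every interval in the trailing window exceeds the threshold.

    Returns False when the samples do not cover the full window, so a
    container that just started cannot fire the alert.
    """
    if len(samples) < 2:
        return False
    window_start = samples[-1].timestamp - window_seconds
    if samples[0].timestamp > window_start:
        return False  # not enough history yet
    ratios = throttle_ratios(samples)
    # ratios[i] covers samples[i] -> samples[i + 1]; keep intervals ending in the window.
    recent = [r for s, r in zip(samples[1:], ratios) if s.timestamp > window_start]
    return bool(recent) and all(r > threshold for r in recent)
```

Requiring every interval in the window to exceed the threshold mirrors the "for 10 minutes" condition: a single quiet scrape resets the alert, which keeps short bursts from paging anyone.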

Attributing the noise

An alert that says "a pod is throttled" is unhelpful. The on-call needs "pod X on node Y is throttling pod Z." Attribution is built from cgroup metrics plus node-level metrics, and the top-N consumers per node are surfaced in the alert payload so the on-call has a starting point in 30 seconds.

Per-pod cgroup metrics. CPU and memory consumption attributed per pod; the noisy pod is identified, not just the throttled one.
Node-level top-N. Top 5 CPU and memory consumers per node; the contention attribution is one query.
Surface in alert payload. The on-call sees the top consumers in the page; the starting point is data, not investigation.
Per-node trend view. Top consumers tracked over time; supports investigation when the noise is intermittent.
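
A minimal sketch of the top-N attribution described above, assuming per-pod usage has already been collected into simple records. The PodUsage type, the function names, and the payload field names are hypothetical and chosen only for illustration; a real alert payload follows whatever schema the paging tool expects.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class PodUsage:
    """Per-pod usage derived from cgroup metrics (hypothetical record type)."""
    node: str
    pod: str
    cpu_cores: float
    memory_bytes: int


def top_consumers(usages: List[PodUsage], n: int = 5) -> Dict[str, Dict[str, List[PodUsage]]]:
    """Group usage by node and return the top-n CPU and memory consumers on each."""
    by_node: Dict[str, List[PodUsage]] = defaultdict(list)
    for usage in usages:
        by_node[usage.node].append(usage)
    return {
        node: {
            "cpu": sorted(pods, key=lambda u: u.cpu_cores, reverse=True)[:n],
            "memory": sorted(pods, key=lambda u: u.memory_bytes, reverse=True)[:n],
        }
        for node, pods in by_node.items()
    }


def alert_payload(throttled_pod: str, node: str, usages: List[PodUsage], n: int = 5) -> dict:
    """Build an alert body that names the affected pod and the node's top consumers."""
    on_node = [u for u in usages if u.node == node]
    top = top_consumers(on_node, n).get(node, {"cpu": [], "memory": []})
    return {
        "summary": f"{throttled_pod} is throttled on {node}",
        "node": node,
        "throttled_pod": throttled_pod,
        "top_cpu": [{"pod": u.pod, "cpu_cores": u.cpu_cores} for u in top["cpu"]],
        "top_memory": [{"pod": u.pod, "memory_bytes": u.memory_bytes} for u in top["memory"]],
    }
```

Calling alert_payload("api-1", "node-a", usages) yields a dictionary whose top_cpu and top_memory lists give the on-call the likely noisy pods alongside the throttled one, so the page carries the attribution instead of requiring a follow-up query.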

Common remediations

The remediations are well understood: resource requests and limits, workload class separation, QoS classes, and per-tenant isolation. Each addresses a different aspect of contention, and the discipline is to apply them in sequence rather than as one-off responses to incidents.

Resource requests and limits. Without them, the scheduler cannot prevent contention; they are the baseline that makes the rest of the controls work.
Workload class separation. Schedule noisy workloads onto dedicated node pools; this keeps their contention domain separate from latency-sensitive workloads.
QoS classes. Guaranteed > Burstable > BestEffort, with BestEffort pods the first candidates for eviction under node pressure; critical workloads run as Guaranteed to survive contention.
Per-tenant isolation. Hard limits on namespace resource consumption; supports multi-tenant clusters where the contention attribution becomes a billing question.
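
How a pod lands in a QoS class follows from its requests and limits. The sketch below reproduces the Kubernetes assignment rules in simplified form so a team can check a pod spec before deploying it. The qos_class function and the dictionary shape are assumptions for the example, and quantities are compared as plain strings, so "1" and "1000m" would be treated as different values here even though Kubernetes treats them as equal.

```python
from typing import Dict, List

# Each container is described as {"requests": {...}, "limits": {...}} with
# optional "cpu" and "memory" keys (hypothetical, simplified shape).
Container = Dict[str, Dict[str, str]]


def qos_class(containers: List[Container]) -> str:
    """Return the QoS class implied by the containers' CPU and memory settings."""
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # When only a limit is set, Kubernetes defaults the request to the limit.
            request = requests.get(resource, limit)
            if limit is not None or request is not None:
                any_set = True
            # Guaranteed needs a limit on every resource, equal to its request.
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For example, a pod whose only container sets cpu and memory limits equal to its requests comes out as Guaranteed, while a pod that sets only a CPU request comes out as Burstable. Running a check like this in CI is one way to keep critical workloads from silently dropping out of the Guaranteed class.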

When to invest

Noisy-neighbor work is a real investment, and not every cluster needs it. Multi-tenant clusters and shared databases need attribution; single-tenant nodes do not. Above 50 services on shared infrastructure, attribution becomes load-bearing and the investment pays back fast.

Multi-tenant clusters. Attribution is required; without it, contention complaints have no starting point.
Above 50 shared services. Attribution becomes load-bearing; the cost of investigating without it grows faster than the cluster.
Single-tenant nodes can skip it. With one workload per node there is no noisy-neighbor problem; the investment does not pay back.
Start with CPU throttling. Cheap, accurate, explains 70% of latency complaints from app teams; the highest-leverage first alert.