Noisy Neighbor Alerts
In multi-tenant systems, one tenant's load can degrade the others. Alert on it.
What a noisy neighbor is
A noisy neighbor is a workload that degrades others on shared infrastructure: one workload spikes, and its neighbors see latency, errors, or throttling. The problem is common in Kubernetes (CPU and memory contention), shared databases (lock contention), and shared networks (bandwidth). Noisy-neighbor alerts catch the contention before users see it.
- Shared infrastructure. Multiple workloads on the same node, database, or network; one workload spike affects all neighbors.
- Common surfaces. Kubernetes nodes (CPU/memory), shared databases (lock contention), shared network (bandwidth saturation).
- User-visible before failure. Latency and errors precede outright failure; the contention surfaces at the application layer first.
- Predictive value. Catching contention early avoids the cascade that follows when workloads start retrying against degraded neighbors.
What to alert on
The signals are well-understood. CPU throttling, memory pressure, database lock waits, connection pool exhaustion, network packet drops; each maps to a specific contention class, and each has a threshold that catches the contention before users do.
- CPU throttling. A throttling rate derived from container_cpu_cfs_throttled_seconds_total above 5% for 10 minutes is contention.
- Memory pressure. node_memory_MemAvailable_bytes per node below 10% of total memory triggers scheduler pressure.
- Database contention. Lock waits, connection pool exhaustion; surfaces shared-database contention before the app times out.
- Network drops. Packet drops on the node interface; surfaces bandwidth contention before the app sees retransmissions.
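The CPU throttling rule above can be sketched as a small evaluator. This is a minimal sketch under stated assumptions, not a definitive implementation: it substitutes the cAdvisor period counters (container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total) for the seconds counter, because the ratio of throttled periods to total periods gives a bounded percentage. The Sample type, the function names, and the 60-second scrape interval in the usage example are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Sample:
    """One scrape of a container's CFS counters (hypothetical shape)."""
    ts: float                 # scrape time, unix seconds
    throttled_periods: float  # cumulative throttled CFS periods
    total_periods: float      # cumulative elapsed CFS periods


def throttle_ratios(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (timestamp, throttled fraction) for each scrape interval."""
    ratios = []
    for prev, cur in zip(samples, samples[1:]):
        d_total = cur.total_periods - prev.total_periods
        d_throttled = cur.throttled_periods - prev.throttled_periods
        # Skip counter resets and idle intervals; they carry no signal.
        if d_total <= 0 or d_throttled < 0:
            continue
        ratios.append((cur.ts, d_throttled / d_total))
    return ratios


def sustained_above(ratios: List[Tuple[float, float]],
                    threshold: float = 0.05,
                    duration_s: float = 600.0) -> bool:
    """True if the fraction stayed above threshold for duration_s without a dip."""
    streak_start = None
    for ts, ratio in ratios:
        if ratio > threshold:
            if streak_start is None:
                streak_start = ts
            if ts - streak_start >= duration_s:
                return True
        else:
            streak_start = None  # any dip below threshold resets the streak
    return False


# Usage: 16 scrapes, 60 s apart, with 10% of periods throttled throughout.
steady = [Sample(ts=i * 60, throttled_periods=i * 60, total_periods=i * 600)
          for i in range(16)]
should_page = sustained_above(throttle_ratios(steady))
```

Requiring an unbroken streak mirrors the "for 10 minutes" clause: a short burst of throttling does not page, while sustained throttling does.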
Attributing the noise
An alert that says "a pod is throttled" is unhelpful. The on-call needs "pod X on node Y is throttling pod Z." Attribution is built from cgroup metrics plus node-level metrics, and the top-N consumers per node are surfaced in the alert payload so the on-call has a starting point in 30 seconds.
- Per-pod cgroup metrics. CPU and memory consumption attributed per pod; the noisy pod is identified, not just the throttled one.
- Node-level top-N. Top 5 CPU and memory consumers per node; the contention attribution is one query.
- Surface in alert payload. The on-call sees the top consumers in the page; the starting point is data, not investigation.
- Per-node trend view. Top consumers tracked over time; supports investigation when the noise is intermittent.
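The attribution steps above can be sketched as follows. This is an illustrative sketch, assuming per-pod usage has already been collected from cgroup metrics into simple records; the PodUsage type, the function names, and the payload field names are hypothetical and not tied to any particular alerting tool.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class PodUsage(NamedTuple):
    """Per-pod consumption on a node (hypothetical record shape)."""
    node: str
    pod: str
    cpu_cores: float
    memory_bytes: int


def top_consumers(usages: List[PodUsage], n: int = 5) -> Dict[str, dict]:
    """Group usage by node and keep the top-n CPU and memory consumers."""
    by_node = defaultdict(list)
    for usage in usages:
        by_node[usage.node].append(usage)

    result = {}
    for node, pods in by_node.items():
        by_cpu = sorted(pods, key=lambda u: u.cpu_cores, reverse=True)[:n]
        by_mem = sorted(pods, key=lambda u: u.memory_bytes, reverse=True)[:n]
        result[node] = {
            "top_cpu": [{"pod": u.pod, "cpu_cores": u.cpu_cores} for u in by_cpu],
            "top_memory": [{"pod": u.pod, "memory_bytes": u.memory_bytes} for u in by_mem],
        }
    return result


def build_alert_payload(throttled_pod: str, node: str,
                        usages: List[PodUsage], n: int = 5) -> dict:
    """Attach the node's top consumers to the alert so the page carries attribution."""
    empty = {"top_cpu": [], "top_memory": []}
    top = top_consumers(usages, n).get(node, empty)
    return {
        "summary": f"pod {throttled_pod} is throttled on node {node}",
        "node": node,
        "throttled_pod": throttled_pod,
        **top,
    }


# Usage: the throttled pod is "api"; the payload shows which pods to look at first.
usages = [
    PodUsage("node-1", "api", 0.2, 100),
    PodUsage("node-1", "batch", 1.5, 50),
    PodUsage("node-2", "cache", 0.9, 10),
    PodUsage("node-1", "etl", 0.7, 300),
]
payload = build_alert_payload("api", "node-1", usages, n=2)
```

Keeping the throttled pod and the top consumers as separate fields preserves the distinction drawn above: the pod that suffers is not necessarily the pod that causes the contention.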
Common remediations
The remediations are well-understood: resource requests and limits, workload class separation, QoS classes. Each addresses a different aspect of contention, and the discipline is to apply them in sequence rather than as one-off responses to incidents.
- Resource requests and limits. Without them, the scheduler cannot prevent contention; they are the baseline that makes the rest of the controls work.
- Workload class separation. Bin-pack noisy workloads on dedicated node pools; separates the contention domain.
- QoS classes. Guaranteed > Burstable > BestEffort in eviction priority; critical workloads run as Guaranteed to survive contention.
- Per-tenant isolation. Hard limits on namespace resource consumption; supports multi-tenant clusters where the contention attribution becomes a billing question.
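Because the QoS class follows from how requests and limits are set, the link between the first and third remediation can be made concrete. The sketch below mirrors the Kubernetes classification rule in simplified form. It assumes containers are given as plain dicts with optional "requests" and "limits" maps, compares quantities as literal strings (so "500m" and "0.5" would not match), and ignores init containers. The qos_class function is a hypothetical helper, not a Kubernetes API.

```python
from typing import Dict, List

RESOURCES = ("cpu", "memory")


def qos_class(containers: List[Dict[str, Dict[str, str]]]) -> str:
    """Classify a pod as Guaranteed, Burstable, or BestEffort (simplified)."""
    any_set = False     # does any container set a CPU or memory request/limit?
    guaranteed = True   # does every container have request == limit for both?

    for container in containers:
        requests = container.get("requests") or {}
        limits = container.get("limits") or {}
        for resource in RESOURCES:
            limit = limits.get(resource)
            # Kubernetes defaults a missing request to the limit when one is set.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False

    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Usage: a critical workload with requests equal to limits for CPU and memory.
critical = [{
    "requests": {"cpu": "500m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}]
critical_class = qos_class(critical)
```

A check like this could run in CI against workload manifests to flag critical services that would not land in the Guaranteed class.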
When to invest
Noisy-neighbor work is a real investment, and not every cluster needs it. Multi-tenant clusters and shared databases need attribution; single-tenant nodes do not. Above 50 services on shared infrastructure, attribution becomes load-bearing and the investment pays back fast.
- Multi-tenant clusters. Attribution is required; without it, contention complaints have no starting point.
- Above 50 shared services. Attribution becomes load-bearing; the cost of investigating without it grows faster than the cluster.
- Skip single-tenant nodes. They have no noisy-neighbor problem, so the investment does not pay back.
- Start with CPU throttling. Cheap, accurate, explains 70% of latency complaints from app teams; the highest-leverage first alert.