Noisy Neighbor Alerts
In a multi-tenant system, one tenant's workload can degrade the others. Alert on it.
What a noisy neighbor is
A noisy neighbor appears when multiple workloads share infrastructure: one workload spikes and the others see latency, errors, or throttling.
Common in Kubernetes (CPU/memory contention), shared databases (lock contention), and shared network (bandwidth).
Noisy neighbor alerts catch the contention before users see it.
What to alert on
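CPU throttling: the share of CFS periods in which a container was throttled, computed as the rate of `container_cpu_cfs_throttled_periods_total` divided by the rate of `container_cpu_cfs_periods_total`. (`container_cpu_cfs_throttled_seconds_total` counts throttled time, not a percentage.) Above 5% throttling for 10 minutes is contention.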
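One way to read "5% throttling for 10 minutes" is as the share of CFS periods in which the container was throttled, held above the threshold across the whole window. The sketch below assumes the cAdvisor counters `container_cpu_cfs_periods_total` and `container_cpu_cfs_throttled_periods_total` have already been scraped into samples; the `Sample` type and function names are hypothetical, and in practice this logic would usually live in an alerting rule rather than application code.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One scrape of the two CFS counters for a single container (hypothetical type)."""
    timestamp: float          # Unix seconds
    periods: float            # container_cpu_cfs_periods_total
    throttled_periods: float  # container_cpu_cfs_throttled_periods_total


def throttle_ratios(samples):
    """Return (timestamp, ratio) for each pair of consecutive samples."""
    ratios = []
    for prev, cur in zip(samples, samples[1:]):
        delta_periods = cur.periods - prev.periods
        delta_throttled = cur.throttled_periods - prev.throttled_periods
        if delta_periods <= 0:
            # No CFS periods elapsed, or the counter reset; skip the interval.
            continue
        ratios.append((cur.timestamp, delta_throttled / delta_periods))
    return ratios


def is_contended(samples, threshold=0.05, window_seconds=600):
    """True if every interval in the trailing window exceeds the threshold."""
    ratios = throttle_ratios(samples)
    if not ratios:
        return False
    latest = ratios[-1][0]
    window = [r for ts, r in ratios if ts > latest - window_seconds]
    # Require the data to span the full window so a single bad scrape cannot fire.
    spans_window = latest - samples[0].timestamp >= window_seconds
    return spans_window and all(r > threshold for r in window)
```

Requiring every interval in the window to exceed the threshold mirrors a "for 10 minutes" alert condition: a short burst of throttling does not page anyone.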
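Memory pressure: `node_memory_MemAvailable_bytes` as a fraction of `node_memory_MemTotal_bytes`, per node. Below 10%, the node is approaching the point where the kubelet reports memory pressure and starts evicting pods.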
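The 10% figure only makes sense relative to total memory, so the check needs both `node_memory_MemAvailable_bytes` and `node_memory_MemTotal_bytes`. A minimal sketch, assuming the two gauges have been collected per node into a dictionary; the function names and input shape are illustrative, not part of any exporter's API.

```python
def memory_available_fraction(mem_available_bytes, mem_total_bytes):
    """Fraction of node memory still available, between 0 and 1."""
    if mem_total_bytes <= 0:
        raise ValueError("mem_total_bytes must be positive")
    return mem_available_bytes / mem_total_bytes


def nodes_under_pressure(nodes, threshold=0.10):
    """nodes maps node name -> (MemAvailable bytes, MemTotal bytes).

    Returns (name, fraction) pairs below the threshold, worst-off node first.
    """
    flagged = {}
    for name, (available, total) in nodes.items():
        fraction = memory_available_fraction(available, total)
        if fraction < threshold:
            flagged[name] = fraction
    return sorted(flagged.items(), key=lambda item: item[1])
```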
Database lock waits, connection pool exhaustion, network packet drops on the node.
Attributing the noise
An alert that says "a pod is throttled" is unhelpful. The on-call needs "pod X on node Y is starving pod Z."
Build attribution from cgroup metrics and node-level metrics: list the top consumers per node, ordered by CPU and memory.
Surface the top 5 consumers in the alert payload. The on-call has a starting point in 30 seconds.
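The attribution step can be sketched as a ranking over per-pod usage on the affected node, attached to the alert before it is sent. Everything here is an assumption for illustration: the `PodUsage` record, the payload keys, and the idea that usage has already been aggregated per pod from cgroup metrics.

```python
from dataclasses import dataclass


@dataclass
class PodUsage:
    """Per-pod usage aggregated from cgroup metrics (hypothetical record)."""
    pod: str
    node: str
    cpu_cores: float
    memory_bytes: int


def top_consumers(usages, node, key, limit=5):
    """Return the names of the heaviest pods on `node`, ranked by `key`."""
    on_node = [u for u in usages if u.node == node]
    ranked = sorted(on_node, key=lambda u: getattr(u, key), reverse=True)
    return [u.pod for u in ranked[:limit]]


def build_alert_payload(affected_pod, node, usages, limit=5):
    """Attach the top CPU and memory consumers on the node to the alert."""
    return {
        "summary": f"{affected_pod} is throttled on {node}",
        "node": node,
        "affected_pod": affected_pod,
        "top_cpu": top_consumers(usages, node, "cpu_cores", limit),
        "top_memory": top_consumers(usages, node, "memory_bytes", limit),
    }
```

Ranking CPU and memory separately keeps the payload useful when the heaviest CPU consumer is not the heaviest memory consumer, which is common for batch jobs versus caches.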
Common remediations
Resource requests and limits. The scheduler places pods by their requests, and limits cap usage at runtime; without them, nothing prevents contention.
Workload class separation. Bin-pack noisy workloads on dedicated node pools.
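QoS classes in Kubernetes. Under node pressure, eviction priority runs Guaranteed > Burstable > BestEffort, with BestEffort evicted first. Critical workloads run as Guaranteed, which requires CPU and memory requests equal to limits on every container.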
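To check which class a workload will land in before deploying it, the assignment rules can be approximated in a few lines. This is a simplified sketch, not the kubelet's implementation: it assumes each container is a plain dictionary with optional `requests` and `limits`, and it compares quantity strings literally, so `"500m"` and `"0.5"` would be treated as different values.

```python
RESOURCES = ("cpu", "memory")


def qos_class(containers):
    """Approximate the Kubernetes QoS class for a pod's containers.

    containers: list of dicts with optional "requests" and "limits" dicts.
    """
    any_set = False
    all_guaranteed = True
    for container in containers:
        limits = container.get("limits", {})
        # A missing request defaults to the limit when a limit is set.
        requests = {**limits, **container.get("requests", {})}
        for resource in RESOURCES:
            request = requests.get(resource)
            limit = limits.get(resource)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"
```

A pod with one fully specified container and one container with no resources set comes out as Burstable, which is a common way critical workloads lose Guaranteed status when a sidecar is added.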
When to invest
Multi-tenant clusters or shared databases. Single-tenant nodes don't have noisy-neighbor problems.
Above 50 services on shared infrastructure, attribution becomes load-bearing.
Start with CPU throttling alerts. They are cheap, accurate, and explain 70% of latency complaints from app teams.