Saturation vs Utilisation Alerts
Two types of resource alerts. Pick by what they catch.
Utilisation: trailing indicator
Utilisation describes how much of a resource has been consumed: CPU at 80%, memory at 70%, disk at 60%. It is useful for capacity planning and dashboards but a poor paging signal: a high percentage is either harmless or arrives after the workload is already feeling it.
- Trailing signal. Numbers describing past consumption; the user impact is already in flight by the time the threshold trips.
- Static thresholds rot. The right percentage for one workload is wrong for another; one-size cutoffs produce noise.
- Trend, not page. Utilisation belongs on dashboards and weekly capacity reviews, not in the pager rotation.
- Capacity planning input. Multi-week trend lines forecast saturation; the value is forward planning, not real-time response.
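The capacity-planning use of utilisation can be sketched as a simple trend extrapolation. This is a minimal illustration, assuming one utilisation sample per day and a linear trend; the function name `days_until_ceiling` and the 90% ceiling are hypothetical choices, not taken from any particular tool.

```python
from typing import Optional, Sequence


def days_until_ceiling(daily_utilisation: Sequence[float],
                       ceiling: float = 90.0) -> Optional[float]:
    """Fit a least-squares line to daily utilisation (percent) and return the
    number of days after the last sample at which the line reaches `ceiling`.

    Returns None when there are too few samples or the trend is flat or falling.
    """
    n = len(daily_utilisation)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(daily_utilisation) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_utilisation))
    slope = sxy / sxx  # percentage points per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_day = (ceiling - intercept) / slope
    # Clamp to zero if the fitted line is already at or above the ceiling.
    return max(0.0, crossing_day - (n - 1))
```

A result measured in weeks belongs in a capacity review. Real workloads are rarely linear, so the output is a planning estimate rather than a prediction.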
Saturation: leading indicator
Saturation describes pressure on the resource: queue depth growing, wait time increasing, throttle events firing. Saturation alerts fire earlier than utilisation alerts because pressure surfaces before the resource hits its ceiling, which makes saturation the right signal for predictive alerting.
- Pressure metrics. Queue depth, wait time, throttle events; these show the resource being stressed, not just consumed.
- Fires earlier than utilisation. CPU at 80% can be fine if the run-queue is empty; queue depth growing predicts trouble before any percent threshold trips.
- Pre-failure window. Saturation alerts buy minutes of lead time before user-visible failure; that lead time is the foundation of predictive monitoring.
- Workload-agnostic. Pressure thresholds transfer between workloads better than utilisation thresholds; the metric is the contention, not the consumption.
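The contrast between the two signals can be sketched with a small comparison. This is an illustration under stated assumptions: the `CpuSample` type, the 80% threshold, and the "grew by at least 3 without shrinking" rule are all hypothetical, chosen only to show a busy-but-healthy host next to one where pressure is building.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class CpuSample:
    cpu_percent: float  # utilisation: share of CPU time consumed
    run_queue: int      # saturation: runnable tasks waiting for a core


def utilisation_trips(samples: Sequence[CpuSample],
                      threshold: float = 80.0) -> bool:
    """Static percent check on the most recent sample."""
    return bool(samples) and samples[-1].cpu_percent >= threshold


def saturation_trips(samples: Sequence[CpuSample],
                     min_increase: int = 3) -> bool:
    """True when the run-queue never shrinks across the window and grows by
    at least `min_increase` from the first sample to the last."""
    depths = [s.run_queue for s in samples]
    if len(depths) < 2:
        return False
    never_shrinks = all(later >= earlier
                        for earlier, later in zip(depths, depths[1:]))
    return never_shrinks and depths[-1] - depths[0] >= min_increase


# High CPU percent, empty run-queue: the percent check trips, pressure does not.
busy_but_fine = [CpuSample(85.0, 0), CpuSample(86.0, 0),
                 CpuSample(84.0, 0), CpuSample(85.0, 0)]

# Moderate CPU percent, growing run-queue: pressure trips, the percent check does not.
pressure_building = [CpuSample(55.0, 0), CpuSample(58.0, 2),
                     CpuSample(61.0, 4), CpuSample(65.0, 7)]
```

On these two sample windows the checks disagree in both directions, which is the case the bullets describe: the percent threshold raises noise on the healthy host and stays silent on the one heading for trouble.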
Layer them in alerts
Saturation pages, utilisation informs. The two signals catch different failure modes and belong in different alert tiers. Together they cover incipient overload (saturation) and sustained capacity erosion (utilisation) without mixing the two roles.
- Saturation pages. Queue depth above threshold for 5 minutes; the on-call gets the lead time to act.
- Utilisation notifies. Dashboard panels and business-hours summaries; trends inform planning, not response.
- Different failure modes. Saturation for incipient overload, utilisation for slow capacity erosion; each tier matches the failure mode it catches.
- Per-resource policy. Document which resource gets saturation pages and which gets utilisation panels; the record supports incident investigation and alert review.
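The tiering above can be sketched as a sustained-breach check plus a routing rule. This is a minimal sketch, assuming samples arrive as `(unix_time, value)` pairs, oldest first; the names `sustained_above`, `route`, and `Tier` are hypothetical, and the 300-second hold mirrors the 5-minute window in the list.

```python
from enum import Enum
from typing import Sequence, Tuple


class Tier(Enum):
    PAGE = "page"      # wakes the on-call
    NOTIFY = "notify"  # dashboard panel or business-hours summary
    NONE = "none"


def sustained_above(points: Sequence[Tuple[float, float]],
                    threshold: float,
                    hold_seconds: float = 300.0) -> bool:
    """True when every sample in the trailing `hold_seconds` window exceeds
    `threshold` and the samples reach back far enough to cover that window."""
    if not points:
        return False
    window_start = points[-1][0] - hold_seconds
    if points[0][0] > window_start:
        return False  # not enough history to claim a sustained breach
    return all(value > threshold
               for t, value in points if t >= window_start)


def route(kind: str, breached: bool) -> Tier:
    """Saturation breaches page; utilisation breaches only notify."""
    if not breached:
        return Tier.NONE
    return Tier.PAGE if kind == "saturation" else Tier.NOTIFY
```

For example, `route("saturation", sustained_above(queue_depth_points, 10.0))` yields `Tier.PAGE` only after queue depth has stayed above 10 for the full five minutes, which suppresses pages for brief spikes.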
Concrete examples
Every shared resource has a saturation metric and a utilisation metric. The discipline is to identify the saturation metric for the resource, alert on it, and demote the utilisation metric to dashboards.
- Database. Connection pool wait time is saturation; connection count is utilisation; wait time predicts the problem first.
- Network. Packet drop rate is saturation; bandwidth utilisation is the trailing view; drops surface congestion before average bandwidth utilisation shows the link as full.
- Disk. I/O queue depth is saturation; disk space used and device busy time are utilisation; queue depth predicts I/O backpressure that neither percentage shows.
- Compute. CPU run-queue length is saturation; CPU percent is utilisation; the run-queue catches contention that CPU percent misses.
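The pairings above could be recorded as data so the paging policy is reviewable in one place. This is a sketch only: every metric name below is a hypothetical placeholder rather than a name from any real exporter, and `destination` is an illustrative helper.

```python
from typing import Dict, NamedTuple


class ResourcePolicy(NamedTuple):
    saturation_metric: str   # pages the on-call
    utilisation_metric: str  # dashboards and capacity reviews only


# Hypothetical metric names, one saturation/utilisation pair per resource.
POLICY: Dict[str, ResourcePolicy] = {
    "database": ResourcePolicy("db_pool_wait_seconds", "db_connections_open"),
    "network": ResourcePolicy("net_packet_drop_rate", "net_bandwidth_percent"),
    "disk": ResourcePolicy("disk_io_queue_depth", "disk_space_used_percent"),
    "compute": ResourcePolicy("cpu_run_queue_length", "cpu_percent"),
}


def destination(metric: str) -> str:
    """Return "page", "dashboard", or "unmapped" for a metric name."""
    for policy in POLICY.values():
        if metric == policy.saturation_metric:
            return "page"
        if metric == policy.utilisation_metric:
            return "dashboard"
    return "unmapped"
```

Keeping the mapping in one structure makes an unmapped metric visible during review instead of letting it drift into the pager by default.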