The Saturation-Hits-Disk-Full Pattern
Disk full is the most common saturation incident. This section covers its leading indicators, the alerts built on them, and prevention.
Leading indicators
Disk-full incidents are operationally distinct from most other saturation issues. The lead time before failure is real (hours, sometimes days); the failure mode is severe (write failures, application crashes); the recovery often requires careful intervention. The discipline is detecting disk-full early enough to act. A static threshold alert at 90% or 95% says nothing about how fast the remaining space is disappearing, so it often fires too late.
What good leading indicators look like:
- Disk fill rate (GB per hour): Calculate the growth in disk usage over a recent window. Dividing the available space by that rate gives an estimated time-to-full, and that estimate is the leading indicator.
- Alert when projected full in less than 4 hours: The 4-hour window is enough time to investigate and remediate before the disk is actually full. Shorter windows produce panic; longer windows accept too much risk.
- Inode usage in addition to bytes: A filesystem can exhaust its inodes while plenty of bytes remain free. Creating new files then fails even though byte usage looks low, so both metrics need monitoring.
- Some systems run out of inodes before bytes: Workloads that produce many small files (some logging patterns, certain application caches, mail spools) hit inode limits first. Without inode monitoring, the disk-full incident comes from an unexpected direction.
- Watch growth across all mounted filesystems: Different mount points fill at different rates. The leading indicator is per-mount; aggregate disk metrics hide per-mount patterns.
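As a rough illustration of collecting both byte and inode usage per mount, the sketch below reads them with Python's standard `os.statvfs` call (available on Unix-like systems only). The `MountUsage` class and the `snapshot` function are hypothetical names chosen for this example, not part of any particular monitoring tool.

```python
import os
from dataclasses import dataclass


@dataclass
class MountUsage:
    """Point-in-time usage for one mount (hypothetical helper type)."""
    mount: str
    bytes_total: int
    bytes_free: int    # space available to unprivileged processes
    inodes_total: int
    inodes_free: int   # inodes available to unprivileged processes

    @property
    def bytes_used_fraction(self) -> float:
        if not self.bytes_total:
            return 0.0
        return 1 - self.bytes_free / self.bytes_total

    @property
    def inodes_used_fraction(self) -> float:
        # A filesystem can report zero total inodes; treat that as "not tracked".
        if not self.inodes_total:
            return 0.0
        return 1 - self.inodes_free / self.inodes_total


def snapshot(mount: str) -> MountUsage:
    """Read byte and inode counters for a single mount point."""
    st = os.statvfs(mount)
    return MountUsage(
        mount=mount,
        bytes_total=st.f_blocks * st.f_frsize,
        bytes_free=st.f_bavail * st.f_frsize,
        inodes_total=st.f_files,
        inodes_free=st.f_favail,
    )
```

Calling `snapshot` once per mount point on a fixed schedule yields the per-mount samples that a fill-rate calculation needs, with inode usage tracked alongside bytes.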
Leading indicators give the team time to act. Without them, the alert comes when the disk is already full, which is too late.
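The time-to-full estimate described above can be sketched in a few lines. This is a minimal version that assumes usage samples arrive as `(timestamp, bytes_used)` pairs and that growth over the window is roughly linear; the function names are illustrative, and a production system might prefer a regression over all samples rather than the two endpoints used here.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, bytes used)


def fill_rate_bytes_per_hour(samples: Sequence[Sample]) -> float:
    """Average growth rate between the first and last sample in the window."""
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must span a positive time interval")
    return (used1 - used0) / ((t1 - t0) / 3600.0)


def hours_to_full(samples: Sequence[Sample], capacity_bytes: float) -> Optional[float]:
    """Projected hours until the mount is full, or None if usage is not growing."""
    rate = fill_rate_bytes_per_hour(samples)
    if rate <= 0:
        return None  # flat or shrinking usage never reaches capacity
    free = capacity_bytes - samples[-1][1]
    return max(free, 0.0) / rate
```

For example, a 50 GB mount that grew from 40 GB to 42 GB over the last hour has 8 GB free and a fill rate of 2 GB per hour, so `hours_to_full` returns 4.0.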
Alert
The alerting strategy converts the leading indicators into actionable signals. Multi-window thresholds catch both sudden spikes and gradual drift.
- Multi-window (1-hour rate, 4-hour rate): Two windows produce two different signals. The 1-hour rate catches sudden growth spikes; the 4-hour rate catches gradual drift. Each window has its own threshold and alert.
- 1-hour catches sudden spikes: A bug that suddenly starts writing huge volumes of data shows up in the 1-hour rate. The alert fires fast; the team responds before the disk fills.
- 4-hour catches drift: Slow growth over hours does not trigger the 1-hour alert but does trigger the 4-hour alert. Workloads that gradually fill the disk without spiking are caught.
- Thresholds (4 hours to full triggers, 1 hour to full pages): Different time-to-full thresholds map to different urgency. A projection of 4 hours to full opens a ticket and sends a Slack alert; 1 hour to full pages on-call. The escalation matches the severity.
- Per-mount alerts: Each mount has its own alerts. The team knows exactly which mount is filling; the response is targeted to the affected filesystem.
Multi-window alerting is more nuanced than a single static threshold; it catches more patterns with fewer false alarms.
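One way to read the multi-window scheme is sketched below: each window's fill rate is turned into its own time-to-full projection, and each projection is mapped to a severity. The `Severity` enum, the function names, and the strict "less than" comparisons are assumptions made for this example; the 1-hour and 4-hour thresholds come from the text above.

```python
from enum import Enum
from typing import Dict, Optional


class Severity(Enum):
    NONE = 0
    TICKET = 1  # ticket plus chat notification
    PAGE = 2    # page on-call


PAGE_HOURS = 1.0
TICKET_HOURS = 4.0


def classify(hours_to_full: Optional[float]) -> Severity:
    """Map a time-to-full projection to an alert severity."""
    if hours_to_full is None:
        return Severity.NONE
    if hours_to_full < PAGE_HOURS:
        return Severity.PAGE
    if hours_to_full < TICKET_HOURS:
        return Severity.TICKET
    return Severity.NONE


def evaluate_mount(free_bytes: float, rates_by_window: Dict[str, float]) -> Dict[str, Severity]:
    """rates_by_window maps a window name such as "1h" to a rate in bytes per hour."""
    results = {}
    for window, rate in rates_by_window.items():
        eta = free_bytes / rate if rate > 0 else None
        results[window] = classify(eta)
    return results


def worst(results: Dict[str, Severity]) -> Severity:
    """Most urgent severity across all windows for one mount."""
    return max(results.values(), key=lambda s: s.value, default=Severity.NONE)
```

Running `evaluate_mount` once per mount keeps the alerts per-mount. A spike shows up as a high rate in the `"1h"` window even when the `"4h"` average is still low, while steady drift can cross the threshold in the `"4h"` window during an hour that happens to be quiet.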
Prevention
The best disk-full incident is the one that does not happen. Prevention combines automated cleanup with capacity planning; the disk fills only when the team's planning is wrong, not when routine maintenance is missed.
- Auto-cleanup of tmp and old logs: Routine cleanup runs automatically. Files in /tmp older than a set number of days are removed; logs are rotated and compressed; archives are moved to cheaper storage. The cleanup runs every hour or every day, so disk usage stays bounded.
- Older than threshold: Each cleanup target has a retention threshold, for example /tmp at 7 days, logs at 30 days, and archives moved at 90 days. The thresholds are documented and reviewed.
- Capacity planning quarterly: Once per quarter, the team reviews disk capacity. Are growth trends sustainable? Is current capacity sufficient for 6 months? Should we provision more? The quarterly cadence catches issues before they become incidents.
- Provision before the team runs out: Adding storage is easier when it is planned. Rushed provisioning during an active incident is harder; the team's capacity to respond is limited; mistakes happen. Plan capacity ahead of need.
- Postmortem disk-full incidents: When a disk-full incident does happen, run a postmortem. What was the root cause? Why did the leading indicators not catch it? What changes to monitoring or process prevent recurrence? The discipline produces continuous improvement.
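The retention-threshold cleanup described in the first two bullets could look roughly like the sketch below. It is a simplified illustration, not a replacement for tools such as logrotate: the function names are hypothetical, age is judged by modification time only, and deletion is off by default so a run reports what it would remove.

```python
import os
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86400


def find_expired(root: Path, max_age_days: float, now: Optional[float] = None) -> List[Path]:
    """Return regular files under root whose mtime is older than max_age_days."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    expired = []
    # os.walk does not descend into symlinked directories by default.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            if path.is_symlink():
                continue  # never act on links, only on real files
            try:
                if path.stat().st_mtime < cutoff:
                    expired.append(path)
            except FileNotFoundError:
                continue  # file vanished between listing and stat
    return sorted(expired)


def cleanup(root: Path, max_age_days: float, dry_run: bool = True) -> List[Path]:
    """List expired files; delete them only when dry_run is False."""
    expired = find_expired(root, max_age_days)
    if not dry_run:
        for path in expired:
            path.unlink()
    return expired
```

A scheduled job might call `cleanup(Path("/tmp"), 7, dry_run=False)` to apply the 7-day example threshold, with a separate call and threshold for each cleanup target.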
The saturation-hits-disk-full pattern is one of the most preventable categories of incidents. Nova AI Ops integrates with disk telemetry, surfaces fill-rate trends, alerts on projected exhaustion, and produces the operational visibility that the platform team needs to keep disk-full off the incident list.