Disk-Full Remediation: From Page to Fix Without a Human
Detect. Identify the largest dirs. Pick a safe cleanup. Apply it. Verify. How an agent handles a full disk in 90 seconds, and the safety rails that keep it from deleting your prod logs.
Detection signal
The trigger is deliberately conservative: disk-percent-used above 90% for 5 minutes. Acting before the disk reaches 100% prevents service degradation. The metric comes from the host’s filesystem stats, and the agent receives the host name, mount point, and percent-used value.
- 90% for 5 minutes. Conservative trigger; intervention before 100%.
- Pre-100% intervention. Prevents service degradation; the safety margin matters.
- Metric source: filesystem stats. Host name, mount point, percent-used.
- Per-mount evaluation. Each mount independent; multi-mount hosts get per-mount triage.
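The trigger described above can be sketched as a small per-mount state machine. This is a minimal illustration, not the agent's actual implementation: the `MountTrigger` class, its `observe` method, and the constant names are hypothetical, and the sketch assumes the breach must be continuous (any sample at or below the threshold resets the timer). `shutil.disk_usage` is the standard-library call used to read filesystem stats; its used/total ratio may differ slightly from what `df` reports on filesystems with reserved blocks.

```python
import shutil
from dataclasses import dataclass, field

THRESHOLD_PERCENT = 90.0  # from the text: trigger above 90% used
SUSTAIN_SECONDS = 5 * 60  # from the text: sustained for 5 minutes


def percent_used(mount_point: str) -> float:
    """Read percent-used for one mount from the filesystem stats."""
    usage = shutil.disk_usage(mount_point)
    return 100.0 * usage.used / usage.total


@dataclass
class MountTrigger:
    """Tracks, per (host, mount), when the current breach began (hypothetical helper)."""
    breach_started: dict = field(default_factory=dict)

    def observe(self, host: str, mount: str, percent: float, now: float) -> bool:
        """Record one sample; return True once the breach has lasted 5 minutes."""
        key = (host, mount)
        if percent <= THRESHOLD_PERCENT:
            # Assumption: dropping back under the threshold resets the timer.
            self.breach_started.pop(key, None)
            return False
        started = self.breach_started.setdefault(key, now)
        return now - started >= SUSTAIN_SECONDS
```

Keying the state on the (host, mount) pair is what gives each mount its own independent evaluation: a full `/data` does not page for `/`, and the reverse holds too.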
Identifying the culprit
The agent identifies the dominant directories in three steps. It lists the top 5 directories by size; one or two usually dominate (logs, caches, dumps). It cross-references them with known-cleanable directories (log rotation paths, cache directories, temp directories). Finally, it flags the unexpected: a directory that grew 10x in the last hour is worth surfacing even when it is small in absolute terms.
- Top 5 by size. One or two usually dominant; logs, caches, dumps.
- Cross-reference cleanable. Log rotation, cache, temp; the agent knows the safe paths.
- Flag 10x growth. Sudden growth surfaces; even small absolute size matters.
- Per-host pattern detection. The agent learns the host’s normal pattern; deviation is signal.
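The triage steps above can be sketched as a pure function over two size snapshots. This is a hedged illustration: the function names, the snapshot dictionaries (path to bytes, now and one hour ago), and the example cleanable roots are all assumptions, and the sketch takes the sizes as given rather than measuring them.

```python
def is_cleanable(path: str, cleanable_roots: list[str]) -> bool:
    """True if path is one of the known-cleanable roots or sits inside one."""
    return any(
        path == root or path.startswith(root.rstrip("/") + "/")
        for root in cleanable_roots
    )


def triage(sizes_now: dict[str, int],
           sizes_hour_ago: dict[str, int],
           cleanable_roots: list[str],
           top_n: int = 5,
           growth_factor: float = 10.0):
    """Return (top directories by size, directories that grew 10x in the hour)."""
    top = sorted(sizes_now, key=sizes_now.get, reverse=True)[:top_n]
    report = [
        {"path": p, "bytes": sizes_now[p], "cleanable": is_cleanable(p, cleanable_roots)}
        for p in top
    ]
    # Growth is checked for every directory, not only the top N, so a small
    # but fast-growing directory still surfaces.
    surges = sorted(
        p for p, size in sizes_now.items()
        if sizes_hour_ago.get(p, 0) > 0 and size >= growth_factor * sizes_hour_ago[p]
    )
    return report, surges
```

Directories with no earlier snapshot are skipped by the growth check in this sketch; a real agent would probably want to treat brand-new directories as a signal of their own.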
The safe-cleanup decision
The decision is conservative. Logs older than 7 days in /var/log are safe to delete, as are temp files older than 24 hours. Application data is never auto-deleted; it always requires human approval, even when it looks like a cache. When in doubt, the agent proposes rather than acts: a full disk is uncomfortable, but deleting the wrong file is worse.
- Old logs (> 7 days). /var/log; safe to delete.
- Old temp files (> 24 hours). Safe to delete; the standard temp cleanup.
- Application data: human only. Even if it looks like a cache; safety margin.
- Propose when in doubt. Wrong-file deletion is worse than disk-full discomfort.
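The decision rules above fit in a few lines. The sketch below is illustrative only: the `decide` function and the `RULES` table are hypothetical, and `/tmp` is an assumed temp location (the text names `/var/log` but not a temp path). Anything that matches no rule, including all application data, falls through to "propose".

```python
from pathlib import PurePosixPath

DAY = 24 * 60 * 60

# (root, minimum age in seconds). /var/log and the ages come from the text;
# /tmp is an assumed temp directory.
RULES = [
    (PurePosixPath("/var/log"), 7 * DAY),
    (PurePosixPath("/tmp"), 1 * DAY),
]


def decide(path: str, mtime: float, now: float) -> str:
    """Return "delete" only for old files under a known-safe root, else "propose"."""
    p = PurePosixPath(path)
    if ".." in p.parts:
        # Unnormalized paths could escape the root; treat as doubtful.
        return "propose"
    age = now - mtime
    for root, min_age in RULES:
        if root in p.parents and age > min_age:
            return "delete"
    # Default: when in doubt, propose rather than act.
    return "propose"
```

The sketch compares path strings only. A real agent would also need to resolve symlinks before trusting that a file really lives under the root it appears to.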
The 90-second flow
The flow is fast and structured: detection at 0:00, list by size at 0:15, identify safe-to-delete at 0:30, propose to the operator (or auto-apply if pre-approved) at 0:45, execute at 1:00, verify at 1:30. Verification is a recheck of disk-percent-used: if it dropped, the action worked; if not, the agent escalates. Logging the entire timeline gives a clean audit trail.
- 0:00-0:30: detect, list, identify. First 30 seconds for assessment.
- 0:45-1:00: propose and execute. Operator sign-off (or pre-approval) then action.
- 1:30: verify. Recheck disk-percent-used; if dropped, worked; if not, escalate.
- Per-step audit log. Timeline captured; supports postmortem and compliance.
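One way to picture the flow is as a driver that calls each step in order and timestamps it into an audit list. This is a sketch under stated assumptions: `run_remediation` and every callable it takes (`list_dirs`, `pick_safe`, `approve`, `execute`, `read_percent`) are hypothetical stand-ins for the agent's real steps, and the sketch does not enforce the 0:15/0:30/0:45 timings, it only records elapsed time.

```python
import time


def run_remediation(percent_before, list_dirs, pick_safe, approve, execute,
                    read_percent, clock=time.monotonic):
    """Run detect -> list -> identify -> approve -> execute -> verify with an audit trail."""
    start = clock()
    audit = []

    def log(step, detail):
        # Each entry records seconds since detection, the step name, and its data.
        audit.append({"t": round(clock() - start, 3), "step": step, "detail": detail})

    log("detect", {"percent_used": percent_before})
    dirs = list_dirs()
    log("list", dirs)
    targets = pick_safe(dirs)
    log("identify", targets)

    # Operator sign-off, or a pre-approval check, gates the destructive step.
    if not approve(targets):
        log("escalate", "not approved")
        return "escalated", audit
    log("approve", targets)

    execute(targets)
    log("execute", targets)

    # Verify: recheck percent-used; a drop means the action worked.
    percent_after = read_percent()
    log("verify", {"percent_used": percent_after})
    if percent_after < percent_before:
        return "resolved", audit
    log("escalate", "percent-used did not drop")
    return "escalated", audit
```

Passing the clock in as a parameter keeps the timeline testable: a fake clock produces a deterministic audit trail without waiting 90 seconds.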
Safety rails
Three rails bound the agent’s blast radius. First, an allowlist of paths the agent can delete from; anything outside it requires human approval. Second, a maximum of 10GB deleted per action; beyond that, the agent escalates regardless of safety class. Third, a cooldown: the agent does not auto-clean the same path twice in 30 minutes, because repeated cleanups suggest a deeper problem and the right response is escalation.
- Path allowlist. Agent can only delete from allowlist; outside requires human approval.
- 10GB per-action cap. Beyond that, escalates regardless of safety class.
- 30-minute cooldown. Same path twice in 30 minutes: escalate, not re-clean.
- Per-rail logging. Each safety-rail engagement is captured; supports continued tuning.
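The three rails can be sketched as a single pre-flight check that either clears an action or names the rail that blocked it. The `SafetyRails` class and its method names are hypothetical. The sketch reads "10GB" as 10 GiB and keys the cooldown on the exact path string; both are assumptions.

```python
from pathlib import PurePosixPath

MAX_BYTES_PER_ACTION = 10 * 1024**3  # "10GB" from the text, read here as 10 GiB
COOLDOWN_SECONDS = 30 * 60           # from the text: 30-minute cooldown


class SafetyRails:
    """Pre-flight check for a proposed cleanup (illustrative sketch)."""

    def __init__(self, allowlist):
        self.allowlist = [PurePosixPath(p) for p in allowlist]
        self.last_cleaned = {}   # path -> time of last approved cleanup
        self.engagements = []    # (time, path, rail) for every block

    def _allowed(self, p: PurePosixPath) -> bool:
        if ".." in p.parts:
            return False  # refuse unnormalized paths outright
        return any(p == root or root in p.parents for root in self.allowlist)

    def check(self, path: str, nbytes: int, now: float) -> str:
        """Return "ok", or the name of the rail that requires escalation."""
        p = PurePosixPath(path)
        key = str(p)
        reason = None
        if not self._allowed(p):
            reason = "outside_allowlist"
        elif nbytes > MAX_BYTES_PER_ACTION:
            reason = "over_byte_cap"
        elif key in self.last_cleaned and now - self.last_cleaned[key] < COOLDOWN_SECONDS:
            reason = "cooldown"
        if reason is not None:
            # Every rail engagement is recorded so thresholds can be tuned later.
            self.engagements.append((now, key, reason))
            return reason
        self.last_cleaned[key] = now
        return "ok"
```

Any result other than "ok" means the agent hands the decision to a human instead of deleting; the `engagements` list is the record of how often each rail fires.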