Disk-Full Remediation: From Page to Fix Without a Human

Detect. Identify the largest directories. Pick a safe cleanup. Apply it. Verify. This is the agent that handles a full disk in 90 seconds, along with the safety rails that keep it from deleting your prod logs.

Detection signal

The detection signal is conservative. The trigger is disk-percent-used above 90% sustained for 5 minutes; the threshold sits well below 100% because intervening before the disk is completely full prevents service degradation. The metric source is the host’s filesystem stats, and the agent receives the host name, the mount point, and the percent-used value.

90% for 5 minutes. Conservative trigger; intervention before 100%.
Pre-100% intervention. Prevents service degradation; the safety margin matters.
Metric source: filesystem stats. Host name, mount point, percent-used.
Per-mount evaluation. Each mount independent; multi-mount hosts get per-mount triage.
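
The sustained-threshold check can be sketched as follows. This is a minimal illustration, not the agent's actual implementation: the function names and the sample format (timestamp, percent-used pairs for one mount) are assumptions, and percent_used relies on Python's shutil.disk_usage, whose ratio can differ slightly from what df reports because of reserved blocks.

```python
import shutil
from typing import Iterable, Optional, Tuple

THRESHOLD_PERCENT = 90.0   # from the text: trigger above 90% used
SUSTAIN_SECONDS = 5 * 60   # from the text: sustained for 5 minutes


def percent_used(mount_point: str) -> float:
    """Percent of the filesystem at mount_point that is in use."""
    usage = shutil.disk_usage(mount_point)
    return 100.0 * usage.used / usage.total


def should_trigger(samples: Iterable[Tuple[float, float]], now: float) -> bool:
    """Return True if one mount has stayed above the threshold long enough.

    samples holds (unix_timestamp, percent_used) pairs for a single mount
    (hypothetical format). Any sample at or below the threshold resets
    the breach window.
    """
    breach_start: Optional[float] = None
    for timestamp, percent in sorted(samples):
        if percent > THRESHOLD_PERCENT:
            if breach_start is None:
                breach_start = timestamp
        else:
            breach_start = None
    return breach_start is not None and now - breach_start >= SUSTAIN_SECONDS
```

Because evaluation is per mount, a multi-mount host would call should_trigger once per mount point with that mount's own samples.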

Identifying the culprit

The agent identifies the dominant directories in three steps. First, it lists the top 5 directories by size; one or two usually dominate, typically logs, caches, or dumps. Second, it cross-references them against known-cleanable directories: log rotation paths, cache directories, and temp directories. Third, it flags the unexpected: a directory that grew 10x in the last hour is worth surfacing even when it is small in absolute terms.

Top 5 by size. One or two usually dominant; logs, caches, dumps.
Cross-reference cleanable. Log rotation, cache, temp; the agent knows the safe paths.
Flag 10x growth. Sudden growth surfaces; even small absolute size matters.
Per-host pattern detection. The agent learns the host’s normal pattern; deviation is signal.
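
The ranking and growth check might look like the sketch below. It assumes the agent already has per-directory sizes in bytes for now and for one hour ago; the dictionary inputs, the function names, and the decision to skip directories with no baseline are all illustrative assumptions.

```python
from typing import Dict, List, Tuple

TOP_N = 5              # from the text: top 5 directories by size
GROWTH_FACTOR = 10.0   # from the text: 10x growth in the last hour


def top_directories(sizes: Dict[str, int], n: int = TOP_N) -> List[Tuple[str, int]]:
    """Return the n largest directories as (path, bytes), largest first."""
    return sorted(sizes.items(), key=lambda item: item[1], reverse=True)[:n]


def flag_growth(now: Dict[str, int], hour_ago: Dict[str, int],
                factor: float = GROWTH_FACTOR) -> List[str]:
    """Return paths whose size grew by at least `factor` since the baseline.

    Directories with a missing or zero baseline are skipped here because
    no ratio can be computed (an assumption; a real agent might surface
    brand-new directories separately).
    """
    flagged = []
    for path, size in now.items():
        previous = hour_ago.get(path, 0)
        if previous > 0 and size / previous >= factor:
            flagged.append(path)
    return sorted(flagged)
```

The growth check runs over every directory, not only the top 5, so a small but fast-growing directory still surfaces.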

The safe-cleanup decision

The decision is conservative. Logs older than 7 days in /var/log are safe to delete, and so are temp files older than 24 hours. Application data is never auto-deleted; it always requires human approval, even when it looks like a cache. When in doubt, the agent proposes rather than acts, because a full disk is uncomfortable but deleting the wrong file is worse.

Old logs (> 7 days). /var/log; safe to delete.
Old temp files (> 24 hours). Safe to delete; the standard temp cleanup.
Application data: human only. Even if it looks like a cache; safety margin.
Propose when in doubt. Wrong-file deletion is worse than disk-full discomfort.
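
One way to express these rules is a small decision function that defaults to proposing. The sketch below is hedged: only /var/log and the two age limits come from the text, while the /tmp root, the is_application_data flag, and the function names are assumptions made for illustration.

```python
import posixpath
from enum import Enum
from pathlib import PurePosixPath

LOG_ROOT = PurePosixPath("/var/log")      # from the text
TEMP_ROOTS = (PurePosixPath("/tmp"),)     # assumption: the text names no temp path
LOG_MAX_AGE_S = 7 * 24 * 3600             # from the text: older than 7 days
TEMP_MAX_AGE_S = 24 * 3600                # from the text: older than 24 hours


class Decision(Enum):
    DELETE = "delete"     # safe to auto-delete
    PROPOSE = "propose"   # needs human approval


def _is_under(path: str, root: PurePosixPath) -> bool:
    # Normalize first so "/var/log/../lib" does not count as under /var/log.
    # This does not resolve symlinks; a real agent would need to.
    normalized = PurePosixPath(posixpath.normpath(path))
    return normalized == root or root in normalized.parents


def decide(path: str, age_seconds: float, is_application_data: bool) -> Decision:
    """Classify one file; anything not clearly safe falls through to PROPOSE."""
    if is_application_data:
        return Decision.PROPOSE
    if _is_under(path, LOG_ROOT) and age_seconds > LOG_MAX_AGE_S:
        return Decision.DELETE
    if any(_is_under(path, root) for root in TEMP_ROOTS) and age_seconds > TEMP_MAX_AGE_S:
        return Decision.DELETE
    return Decision.PROPOSE
```

Putting the application-data check first and making PROPOSE the fall-through keeps the "when in doubt" rule structural rather than something each branch has to remember.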

The 90-second flow

The flow is fast and structured: detection at 0:00, listing by size at 0:15, identifying safe-to-delete candidates at 0:30, proposing to the operator (or auto-applying if pre-approved) at 0:45, executing at 1:00, and verifying at 1:30. Verification is a recheck of disk-percent-used: if it dropped, the action worked; if not, the agent escalates. Logging the entire timeline gives a clean audit trail.

0:00-0:30: detect, list, identify. First 30 seconds for assessment.
0:45-1:00: propose and execute. Operator sign-off (or pre-approval) then action.
1:30: verify. Recheck disk-percent-used; if dropped, worked; if not, escalate.
Per-step audit log. Timeline captured; supports postmortem and compliance.
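
The verify step and its audit entry could be sketched like this. The AuditLog class, the entry layout, and the outcome strings are hypothetical; the only rule taken from the text is that a drop in percent-used means success and anything else means escalation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class AuditLog:
    """In-memory timeline of (elapsed_seconds, step, detail) entries."""
    entries: List[Tuple[float, str, str]] = field(default_factory=list)

    def record(self, elapsed_s: float, step: str, detail: str) -> None:
        self.entries.append((elapsed_s, step, detail))


def verify(percent_before: float, percent_after: float,
           log: AuditLog, elapsed_s: float = 90.0) -> str:
    """Recheck percent-used after cleanup and log the outcome.

    Returns "resolved" if usage dropped, otherwise "escalate".
    """
    outcome = "resolved" if percent_after < percent_before else "escalate"
    log.record(elapsed_s, "verify",
               f"{percent_before:.1f}% -> {percent_after:.1f}%: {outcome}")
    return outcome
```

The earlier steps (detect, list, identify, propose, execute) would call log.record the same way, so the postmortem reads as one ordered timeline.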

Safety rails

Three rails bound the agent’s blast radius. The first is an allowlist of paths the agent can delete from; anything outside it requires human approval. The second is a cap on bytes deleted per action, set at 10GB; beyond that, the agent escalates regardless of safety class. The third is a cooldown: the agent does not auto-clean the same path twice in 30 minutes, because repeated cleanups suggest a deeper problem and the right response is escalation.

Path allowlist. Agent can only delete from allowlist; outside requires human approval.
10GB per-action cap. Beyond that, escalates regardless of safety class.
30-minute cooldown. Same path twice in 30 minutes: escalate, not re-clean.
Per-rail logging. Each safety rail engagement is captured when triggered; supports continued tuning.
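
The three rails can be checked in one place before any delete runs. This is a sketch under stated assumptions: the function name, the returned rail labels, treating 10GB as decimal gigabytes, and keying the cooldown by the exact normalized path are all illustrative choices, not details given in the text.

```python
import posixpath
from pathlib import PurePosixPath
from typing import Dict, Iterable, Optional

MAX_BYTES_PER_ACTION = 10 * 10**9   # from the text: 10GB (decimal is an assumption)
COOLDOWN_SECONDS = 30 * 60          # from the text: 30 minutes


def check_rails(path: str, bytes_to_delete: int, now: float,
                allowlist: Iterable[str],
                last_cleaned: Dict[str, float]) -> Optional[str]:
    """Return None if the action may proceed, else the name of the rail hit.

    last_cleaned maps a normalized path to the unix time of its last
    auto-clean (hypothetical bookkeeping).
    """
    normalized = posixpath.normpath(path)
    candidate = PurePosixPath(normalized)
    roots = [PurePosixPath(posixpath.normpath(root)) for root in allowlist]
    if not any(candidate == root or root in candidate.parents for root in roots):
        return "allowlist"
    if bytes_to_delete > MAX_BYTES_PER_ACTION:
        return "byte_cap"
    previous = last_cleaned.get(normalized)
    if previous is not None and now - previous < COOLDOWN_SECONDS:
        return "cooldown"
    return None
```

A non-None result means the agent stops and hands off to a human; logging the returned label each time gives the per-rail record used for tuning.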