Disk-Full Remediation: From Page to Fix Without a Human

Detect. Identify the largest dirs. Pick a safe cleanup. Apply it. Verify. The agent that handles a full disk in 90 seconds and the safety rails that keep it from deleting your prod logs.

Detection signal

The detection signal is conservative. Disk-percent-used > 90% for 5 minutes is the trigger; the threshold is conservative because intervention before 100% prevents service degradation. The metric source is the host’s filesystem stats and the agent gets the host name, mount point, percent-used value.

Identifying the culprit

The agent identifies the dominant directories. List the top 5 directories by size (one or two usually dominant: logs, caches, dumps); cross-reference with known-cleanable directories (log rotation paths, cache directories, temp directories); flag the unexpected because a directory that grew 10x in the last hour is worth surfacing even when small in absolute terms.

The safe-cleanup decision

The decision is conservative. Logs older than 7 days in /var/log: safe to delete; temp files older than 24 hours: safe; application data never auto-delete because it always requires human approval even when it looks like a cache; when in doubt the agent proposes rather than acts because disk-full is uncomfortable but deleting the wrong file is worse.

The 90-second flow

The flow is fast and structured. Detection at 0:00, list by size at 0:15, identify safe-to-delete at 0:30, propose to operator (or auto-apply if pre-approved) at 0:45, execute at 1:00, verify at 1:30. Verify is a recheck of disk-percent-used; if it dropped the action worked, if not escalate. Logging the entire timeline gives a clean audit trail.

Safety rails

Three rails bound the agent’s blast radius. Allowlist of paths the agent can delete; anything outside requires human approval. Maximum bytes deleted per action: 10GB; beyond that, the agent escalates regardless of safety class. Cooldown: do not auto-clean the same path twice in 30 minutes because multiple cleanups suggest a deeper problem and the right response is escalation.