Agentic SRE | Advanced | By Samson Tanimawo, PhD | Published May 21, 2026

Disk-Full Remediation: From Page to Fix Without a Human

Detect. Identify the largest directories. Pick a safe cleanup. Apply it. Verify. This is the agent that handles a full disk in 90 seconds, along with the safety rails that keep it from deleting your prod logs.

Detection signal

Disk-percent-used > 90% for 5 minutes is the trigger. The threshold is conservative; intervention before 100% prevents service degradation.

The metric comes from the host's filesystem stats. The agent receives the host name, the mount point, and the percent-used value.
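A minimal sketch of the trigger described above, assuming samples arrive as (timestamp, percent) pairs from whatever metrics pipeline is in use. The names `percent_used` and `SustainedThreshold` are hypothetical; only the 90% threshold and the 5-minute window come from the text.

```python
import shutil

THRESHOLD_PERCENT = 90.0  # from the article
WINDOW_SECONDS = 5 * 60   # from the article


def percent_used(mount_point: str) -> float:
    """Percent of the filesystem at mount_point that is in use."""
    usage = shutil.disk_usage(mount_point)
    return 100.0 * usage.used / usage.total


class SustainedThreshold:
    """Fires once the metric has stayed above the threshold for the full window."""

    def __init__(self, threshold: float = THRESHOLD_PERCENT,
                 window: float = WINDOW_SECONDS) -> None:
        self.threshold = threshold
        self.window = window
        self.breach_started = None  # timestamp of the first sample above threshold

    def observe(self, timestamp: float, percent: float) -> bool:
        if percent <= self.threshold:
            # Any sample at or below the threshold resets the timer.
            self.breach_started = None
            return False
        if self.breach_started is None:
            self.breach_started = timestamp
        return timestamp - self.breach_started >= self.window
```

Tracking only the start of the current breach keeps the detector stateless apart from one timestamp, and a single dip below the threshold restarts the 5-minute clock.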

Identifying the culprit

List the top 5 directories by size. The expected pattern: one or two are dominant (logs, caches, dumps).

Cross-reference with known-cleanable directories: log rotation paths, cache directories, temp directories.
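The listing and cross-reference steps could look roughly like the sketch below. The `KNOWN_CLEANABLE` paths are illustrative assumptions (the text names categories, not exact paths), and the function names are hypothetical. Sizes are apparent file sizes from `st_size`, not allocated blocks, and the walk does not stop at mount boundaries, so treat the numbers as approximate.

```python
import os
from pathlib import Path

# Assumed examples of log, cache, and temp locations; adjust per host.
KNOWN_CLEANABLE = (Path("/var/log"), Path("/var/cache"), Path("/tmp"))


def dir_size(path: Path) -> int:
    """Sum the apparent sizes of all files under path (symlinks not followed)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.lstat(os.path.join(root, name)).st_size
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


def top_directories(mount_point: Path, n: int = 5) -> list:
    """Return the n largest immediate subdirectories as (path, bytes) pairs."""
    sizes = []
    for entry in os.scandir(mount_point):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((Path(entry.path), dir_size(Path(entry.path))))
    sizes.sort(key=lambda item: item[1], reverse=True)
    return sizes[:n]


def is_known_cleanable(path: Path, cleanable=KNOWN_CLEANABLE) -> bool:
    """True if path is one of the known-cleanable directories or lies inside one."""
    return any(path == base or base in path.parents for base in cleanable)
```

A caller would run `top_directories(Path("/"))` on the alerting mount and then tag each result with `is_known_cleanable` before deciding anything.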

Flag the unexpected: a directory that grew 10x in the last hour is a problem worth surfacing even if it is small in absolute terms.
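One way to sketch the growth check, assuming the agent keeps a snapshot of directory sizes from an hour earlier (how that snapshot is stored is outside the text). The function name is hypothetical, and directories with no baseline are skipped here, which is a simplifying assumption.

```python
GROWTH_FACTOR = 10.0  # "grew 10x in the last hour", from the article


def unexpected_growth(previous: dict, current: dict,
                      factor: float = GROWTH_FACTOR) -> list:
    """Return paths whose size is at least `factor` times the earlier snapshot.

    previous and current map directory path -> size in bytes.
    """
    flagged = []
    for path, size_now in current.items():
        size_before = previous.get(path)
        if not size_before:
            continue  # no baseline (or zero bytes): a ratio is not meaningful
        if size_now >= factor * size_before:
            flagged.append(path)
    return sorted(flagged)


# Usage: a 2 MB spool that reached 25 MB is flagged; a 9 GB log dir is not.
hour_ago = {"/var/log": 8_000_000_000, "/opt/app/spool": 2_000_000}
now = {"/var/log": 9_000_000_000, "/opt/app/spool": 25_000_000}
flagged = unexpected_growth(hour_ago, now)
```

The comparison is relative on purpose: the spool directory is tiny next to `/var/log`, yet it is the one that changed behavior.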

The safe-cleanup decision

Logs older than 7 days in /var/log: safe to delete. Temp files older than 24 hours: safe.

Application data: never auto-delete. Always require human approval, even if it looks like a cache.

When in doubt, the agent should propose, not act. Disk-full is uncomfortable; deleting the wrong file is worse.
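The decision rules above can be sketched as a small classifier whose default is to propose. The `/var/log` path and both age limits come from the text; using `/tmp` for "temp files" is an assumption, and `Decision`, `RULES`, and `classify` are hypothetical names.

```python
from enum import Enum
from pathlib import Path

DAY = 24 * 60 * 60


class Decision(Enum):
    DELETE = "delete"    # safe to remove automatically
    PROPOSE = "propose"  # needs human approval


# (directory, minimum age in seconds before a file is safe to delete)
RULES = (
    (Path("/var/log"), 7 * DAY),  # logs older than 7 days
    (Path("/tmp"), 1 * DAY),      # temp files older than 24 hours (path assumed)
)


def classify(path: Path, mtime: float, now: float) -> Decision:
    """Classify one file; anything not matched by a rule is proposed, not deleted."""
    if ".." in path.parts:
        # Unnormalized paths could escape a rule directory; never auto-delete them.
        return Decision.PROPOSE
    for base, min_age in RULES:
        if base in path.parents and now - mtime > min_age:
            return Decision.DELETE
    return Decision.PROPOSE
```

Because `PROPOSE` is the fall-through, application data is never auto-deleted unless someone adds a rule for it, which matches the "propose, not act" default.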

The 90-second flow

Detection at 0:00. List by size at 0:15. Identify safe-to-delete at 0:30. Propose to operator (or auto-apply if pre-approved) at 0:45. Execute at 1:00. Verify at 1:30.

Verify is a recheck of the disk-percent-used metric. If it dropped, the action worked. If it did not, escalate.

Logging the entire timeline gives a clean audit trail.
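A rough sketch of the execute, verify, and audit steps, assuming the caller supplies three hypothetical callables: `read_percent` (rechecks the metric), `cleanup` (performs the approved deletion and returns bytes freed), and `escalate` (pages a human). None of these names come from the text.

```python
import time


def run_remediation(read_percent, cleanup, escalate, clock=time.time) -> list:
    """Run one cleanup and return the audit timeline as (timestamp, step, detail)."""
    timeline = []

    def record(step: str, detail: str) -> None:
        timeline.append((clock(), step, detail))

    before = read_percent()
    record("detect", f"percent_used={before:.1f}")

    freed = cleanup()
    record("execute", f"bytes_freed={freed}")

    after = read_percent()
    record("verify", f"percent_used={after:.1f}")

    if after < before:
        record("result", "ok")
    else:
        # The metric did not drop, so the action did not work: hand off to a human.
        escalate(f"disk still at {after:.1f}% after cleanup")
        record("result", "escalated")
    return timeline
```

Returning the timeline rather than printing it lets the caller ship the same records to whatever audit store is already in place.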

Safety rails

The agent may delete only from an allowlist of paths. Anything outside the allowlist requires human approval.

Maximum bytes deleted per action: 10 GB. Beyond that, the agent escalates regardless of safety class.

Cooldown: do not auto-clean the same path twice in 30 minutes. Multiple cleanups suggest a deeper problem; escalate instead.
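The three rails could be combined into a single pre-flight check, sketched below. `SafetyRails` and its methods are hypothetical names; the 10 GB cap is interpreted here as decimal gigabytes, which is an assumption, and the cooldown is tracked in memory only.

```python
from pathlib import Path

MAX_BYTES_PER_ACTION = 10 * 10**9  # 10 GB, read as decimal gigabytes (assumption)
COOLDOWN_SECONDS = 30 * 60         # 30 minutes, from the article


class SafetyRails:
    def __init__(self, allowlist, max_bytes: int = MAX_BYTES_PER_ACTION,
                 cooldown: float = COOLDOWN_SECONDS) -> None:
        self.allowlist = tuple(Path(p) for p in allowlist)
        self.max_bytes = max_bytes
        self.cooldown = cooldown
        self.last_cleaned = {}  # path -> timestamp of the last auto-clean

    def check(self, path: Path, bytes_to_delete: int, now: float) -> str:
        """Return "allow", or an "escalate: ..." reason for a human."""
        on_allowlist = any(path == base or base in path.parents
                           for base in self.allowlist)
        if ".." in path.parts or not on_allowlist:
            return "escalate: path not on allowlist"
        if bytes_to_delete > self.max_bytes:
            return "escalate: exceeds max bytes per action"
        last = self.last_cleaned.get(path)
        if last is not None and now - last < self.cooldown:
            return "escalate: cooldown active"
        return "allow"

    def record_cleanup(self, path: Path, now: float) -> None:
        """Remember when a path was auto-cleaned so the cooldown can apply."""
        self.last_cleaned[path] = now
```

The agent would call `check` before every deletion and `record_cleanup` after each successful one; a production version would likely persist `last_cleaned` so the cooldown survives a restart.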