Automation Debt: The Slow Drag You Cannot See
Automation debt is the gap between the operational tasks your team does manually and the ones a reasonable team your size has automated. It compounds quietly until a senior engineer leaves and the team realises that only they knew the script.
What automation debt is
Every operational task on your team is somewhere on a spectrum from "fully automated and self-service" to "only Sara knows how to do it, and she does it from her laptop using a notebook from 2021." Automation debt is the cumulative cost of every task that sits closer to the second end than to the first.
The debt accumulates invisibly. Each task that's "manual but Sara knows it" is fine in isolation. Across 50 tasks, the team has 50 single points of failure. Sara goes on vacation; 50 things stop working. Sara leaves the company; the team rediscovers each task from scratch, slowly, painfully.
Why this is its own kind of debt and not just "tech debt." Standard tech debt is in the code; you can read it. Automation debt is in people's heads; you can't read it until you need it. It is usually discovered during incidents, exactly when you most need things to work.
How it accrues
Almost always quietly. An engineer writes a script for a one-off migration; it works; nobody throws it away. The next time the situation arises, a different engineer cannot find it and rewrites it, subtly differently. A senior engineer leaves; the script that handled the quarterly secret rotation goes with them. The team has been working harder than it should have for months and has not noticed.
The accumulation pattern. Each individual instance is small (one engineer, one task, one shortcut). The pattern repeats hundreds of times. By a team's second or third year, the accumulated debt is structural: even routine work requires unusual effort because the underlying tasks have unwritten dependencies on individuals' tacit knowledge.
The signal that the debt is real. Standups where someone says "Sara's the one who knows that, let's wait for her to come back from vacation." Heard once: probably fine. Heard weekly: the team has automation debt that's affecting velocity.
Four classes
- One-off scripts: written for a specific incident, never productionised, never deleted.
- Undocumented procedures: in someone's head and nowhere else.
- Vendor-locked tooling: the only way to do X is in vendor Y's UI.
- Manual-only paths: tasks where the automation exists but is broken or unmaintained.
Each class has different reduction work. One-off scripts: review quarterly, productionise the useful ones, delete the rest. Undocumented procedures: write them down (the most common gap). Vendor-locked: find or build automation that bypasses the vendor's UI. Manual-only: fix the broken automation, or kill it and document the manual procedure.
The class with the best leverage is "undocumented procedures" because it is the most common gap and also the cheapest to fix. A 30-minute documentation session captures what's in someone's head; the captured doc outlives the person's time on the team. The team's resistance is cultural ("I'll write it down later"); leadership pressure is what makes it happen.
Tracking it
One spreadsheet. Three columns: task, current state, what automating it would require. The spreadsheet itself is the start of paying the debt down because it is the first inventory you have.
The spreadsheet's effect. Listing the tasks reveals the scale of debt. Most teams expect 20-30 entries; they find 80-150. The reaction is usually disbelief followed by acceptance — yes, the team has been carrying this much debt; now it's visible.
The "current state" column has three values: automated (good), manual-but-documented (acceptable), manual-and-undocumented (debt). Sort by state; the bottom is your work queue.
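The same inventory can be kept as data rather than a spreadsheet. The following is a minimal sketch, assuming the three columns and three states described above; the class names, field names, and example rows are illustrative and not taken from any real team.

```python
from dataclasses import dataclass
from enum import IntEnum


class State(IntEnum):
    # Ordered so that an ascending sort puts the debt at the bottom.
    AUTOMATED = 0
    MANUAL_DOCUMENTED = 1
    MANUAL_UNDOCUMENTED = 2


@dataclass
class Task:
    name: str
    state: State
    automation_needs: str  # free text: what automating it would require


def sort_by_state(tasks):
    """Return tasks ordered automated -> documented -> undocumented."""
    return sorted(tasks, key=lambda t: t.state)


def work_queue(tasks):
    """The bottom of the sorted sheet: manual-and-undocumented tasks."""
    return [t for t in sort_by_state(tasks) if t.state == State.MANUAL_UNDOCUMENTED]


# Hypothetical example rows.
inventory = [
    Task("quarterly secret rotation", State.MANUAL_UNDOCUMENTED, "runbook, then a scheduled job"),
    Task("deploy to production", State.AUTOMATED, "nothing"),
    Task("restore database backup", State.MANUAL_DOCUMENTED, "scripted restore with a test"),
]

for task in sort_by_state(inventory):
    print(f"{task.state.name:<20} {task.name}")
```

Using an ordered enum for the state keeps the sort trivial and stops typos from silently creating a fourth state.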
Which to pay first
Anything that is currently a single point of failure (only one engineer knows it) goes first. Then anything done weekly or more often. Then everything else.
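As a rough sketch, that ordering can be expressed as a sort key. It assumes two extra pieces of information per task that the three-column sheet does not carry by default: how many engineers can do the task and how often it runs. Both field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ManualTask:
    name: str
    people_who_know_it: int  # assumed extra column
    runs_per_year: int       # assumed extra column


def priority_bucket(task):
    """0 = single point of failure, 1 = weekly or more often, 2 = everything else."""
    if task.people_who_know_it <= 1:
        return 0
    if task.runs_per_year >= 52:
        return 1
    return 2


def pay_down_order(tasks):
    # sorted() is stable, so tasks in the same bucket keep their sheet order.
    return sorted(tasks, key=priority_bucket)


# Hypothetical example rows.
tasks = [
    ManualTask("rebuild staging", people_who_know_it=3, runs_per_year=12),
    ManualTask("weekly report export", people_who_know_it=2, runs_per_year=52),
    ManualTask("quarterly secret rotation", people_who_know_it=1, runs_per_year=4),
]

for task in pay_down_order(tasks):
    print(priority_bucket(task), task.name)
```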
The single-point-of-failure priority. SPOFs are bombs; they go off when the engineer leaves or is unavailable. The cost of clearing one is low (write a doc, share knowledge); the cost of NOT clearing one is open-ended (entire team blocked when SPOF engineer is unavailable).
The frequency priority. A task done weekly that takes 30 minutes consumes ~26 hours/year (52 × 0.5 hours). If automating it costs ~8 hours, it pays for itself after 16 runs, about 4 months. The same task done quarterly consumes only 2 hours/year and would take about 4 years to pay back. Lead with the high-frequency ones.
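The payback arithmetic is simple enough to put in a helper and run against every row of the inventory. This is a sketch using the figures from the paragraph above (30 minutes per run, about 8 hours to automate); the function names are illustrative.

```python
def hours_per_year(minutes_per_run, runs_per_year):
    """Time the manual task consumes in a year, in hours."""
    return minutes_per_run * runs_per_year / 60


def payback_weeks(automation_hours, minutes_per_run, runs_per_year):
    """Weeks until the time saved equals the time spent automating."""
    saved_per_week = hours_per_year(minutes_per_run, runs_per_year) / 52
    return automation_hours / saved_per_week


weekly = payback_weeks(automation_hours=8, minutes_per_run=30, runs_per_year=52)
quarterly = payback_weeks(automation_hours=8, minutes_per_run=30, runs_per_year=4)

print(f"weekly task: {hours_per_year(30, 52):.0f} h/year, payback in {weekly:.0f} weeks")
print(f"quarterly task: {hours_per_year(30, 4):.0f} h/year, payback in {quarterly:.0f} weeks")
```

The model ignores the maintenance cost of the automation and the risk reduction from removing a manual step, so treat the result as a ranking aid rather than a forecast.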
Common debt patterns
The "deploy script" pattern. The team's deploy is a series of manual steps documented in a wiki. Each engineer who does it deviates slightly. Six months later there are three different "ways" to deploy, each with adherents. Standardise the deploy in CI; one way, automated.
The "credentials rotation" pattern. Credentials rotate quarterly; one engineer remembers the procedure. They leave; the next rotation is improvised, possibly missing a step. Document the procedure in the runbook system; have the on-call run it, witnessed by another engineer.
The "vendor portal" pattern. To make a change in vendor X, log into their portal, navigate three menus, click a specific button. Nothing in the team's automation can do this. Find the vendor's API; build a small wrapper script. The portal becomes the fallback rather than the only path.
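A wrapper of that kind can be very small. The sketch below uses only the Python standard library. Everything vendor-specific in it is hypothetical: the base URL, the settings path, bearer-token authentication, and the JSON body shape are placeholders to be replaced with whatever the vendor's API documentation actually specifies.

```python
import json
import urllib.request

# Hypothetical endpoint; a real vendor will document its own URL scheme.
BASE_URL = "https://api.vendor.example/v1"


def build_setting_request(name, value, token, base_url=BASE_URL):
    """Build (but do not send) the HTTP request that replaces the portal clicks."""
    body = json.dumps({"value": value}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/settings/{name}",
        data=body,
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",  # assumed auth scheme
            "Content-Type": "application/json",
        },
    )


def apply_setting(name, value, token, timeout=10):
    """Send the request and return the HTTP status code."""
    request = build_setting_request(name, value, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Building the request needs no network access, so it is easy to inspect and test.
request = build_setting_request("session-timeout", 30, token="dummy-token")
print(request.get_method(), request.full_url)
```

Separating "build the request" from "send the request" means the change can be reviewed and unit-tested without touching the vendor's production system.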
The signal that you are winning
New engineers can do most operational tasks in their first month with the existing tools. If the answer to "how do I do X?" is reliably "ask Sara," you have not paid the debt down.
The onboarding experiment. A new hire's first month is the test. Can they do routine ops without paging veterans? Can they answer common questions from documentation? If yes, the debt is paid; if no, the debt is still there.
The trend metric. Track how often "ask Sara" comes up in standups. Decreasing means the documentation and automation are working; flat or increasing means the debt is accumulating faster than it's being paid.
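One way to read that trend, assuming someone tallies the "ask Sara" mentions once a week, is to fit a straight line through the weekly counts and look at the sign of the slope. The tolerance below is an arbitrary illustrative threshold for calling a slope "flat", not a recommended value.

```python
def trend_slope(weekly_counts):
    """Least-squares slope of the counts against the week index."""
    n = len(weekly_counts)
    if n < 2:
        raise ValueError("need at least two weeks of data")
    mean_x = (n - 1) / 2
    mean_y = sum(weekly_counts) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(weekly_counts))
    variance = sum((x - mean_x) ** 2 for x in range(n))
    return covariance / variance


def verdict(weekly_counts, tolerance=0.1):
    """Classify the trend; `tolerance` is an assumed cut-off for 'flat'."""
    if trend_slope(weekly_counts) < -tolerance:
        return "decreasing: documentation and automation are working"
    return "flat or increasing: debt is accumulating faster than it is paid"


# Hypothetical tallies for five consecutive weeks.
print(verdict([5, 4, 3, 2, 1]))
print(verdict([3, 3, 3, 3]))
```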
Common antipatterns
The "we'll do it during slow weeks" plan. Slow weeks don't exist. Allocate explicit capacity (10-20% of team) to debt reduction.
The 200-line documentation. Engineer writes a comprehensive doc. Nobody reads it. Documents must be short — half a page, screenshot-heavy, command-by-command. Comprehensive ≠ useful.
The script in someone's home directory. Engineer writes a script; saves it to ~/scripts/. The script works; nobody else can find it. Always commit scripts to a team repository; make them discoverable.
The "automation" that's a sprawling bash script. Bash scripts that grow past roughly 50 lines tend to become hard to maintain. Past that point, use a real language, version it, test it. The discipline of treating automation as software is what makes it durable.
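As an illustration of what "treating automation as software" can look like, here is a minimal sketch of a small operational script written so that it can be tested. The task itself (removing stale files from a directory) is a hypothetical stand-in; the point is the shape: a planning function that only reads state, a separate step that changes it, and a dry run by default.

```python
import argparse
import time
from pathlib import Path


def stale_files(directory, max_age_days, now=None):
    """Planning step: list files older than max_age_days. Deletes nothing."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.stat().st_mtime < cutoff
    )


def main(argv=None):
    parser = argparse.ArgumentParser(description="Remove stale files from a directory.")
    parser.add_argument("directory")
    parser.add_argument("--max-age-days", type=int, default=30)
    parser.add_argument("--apply", action="store_true",
                        help="actually delete; the default is a dry run")
    args = parser.parse_args(argv)

    targets = stale_files(args.directory, args.max_age_days)
    for path in targets:
        print(("deleting " if args.apply else "would delete ") + str(path))
        if args.apply:
            path.unlink()
    return len(targets)
```

Because `main` accepts an argument list, a test can call `main(["/some/dir"])` directly, and a command-line entry point only needs to call `main()` with no arguments.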
What to do this week
Three moves. (1) Start the spreadsheet. List 20 operational tasks; categorise each as automated, manual-but-documented, or manual-and-undocumented. The inventory takes about 2 hours and reveals the shape of the debt. (2) For each undocumented task, schedule a 30-minute "knowledge capture" session with the engineer who knows it. They walk through the task while a teammate writes the doc. Even if all 20 tasks turn out to be undocumented, that is 30 minutes × 20 tasks = 10 hours of work, and it produces durable documentation. (3) Set a recurring quarterly review of the spreadsheet. Without recurrence, the inventory becomes stale within months.