Measuring Toil: The First Step Is Counting It
Most teams know they have too much toil but have no idea how much. Counting is unglamorous, takes two weeks, and changes the conversation more than any automation project does.
What counts as toil
Toil is operational work that is manual, repetitive, automatable, tactical (not strategic), and scales linearly with service growth. Four out of five and you are probably looking at toil. All five and you definitely are.
The five-criteria framing comes from the Google SRE book and has held up well. Each criterion catches a different failure mode of "is this work worth doing?" Manual: a human is in the loop unnecessarily. Repetitive: it happens often enough that automating it would pay back. Automatable: the technology to automate exists; only the engineering effort is missing. Tactical: it doesn't move the team toward strategic goals. Linear-scaling: doing more of the work serves more load but doesn't make the system better.
The category that's commonly missed is "tactical not strategic." Engineers often defend toil as "important operational work." It is important; it's also not strategic. The distinction matters because strategic work compounds (each unit moves the system permanently forward) while toil doesn't (each unit just delays the next unit).
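As a concrete sketch, the five criteria can be expressed as a simple scoring check. The `Task` fields and the example task are illustrative assumptions; the 4-of-5 and 5-of-5 thresholds come from the framing above:

```python
from dataclasses import dataclass

# Hypothetical task record; the field names are assumptions, not a standard schema.
@dataclass
class Task:
    name: str
    manual: bool          # a human is in the loop unnecessarily
    repetitive: bool      # happens often enough that automating it pays back
    automatable: bool     # the technology exists; only engineering effort is missing
    tactical: bool        # does not move the team toward strategic goals
    linear_scaling: bool  # more load means more work, with no lasting improvement

def toil_score(task: Task) -> int:
    """Count how many of the five toil criteria a task meets."""
    return sum([task.manual, task.repetitive, task.automatable,
                task.tactical, task.linear_scaling])

def is_probably_toil(task: Task) -> bool:
    """Four of five: probably toil. Five of five: definitely."""
    return toil_score(task) >= 4

cert_rotation = Task("rotate TLS certs", manual=True, repetitive=True,
                     automatable=True, tactical=True, linear_scaling=True)
print(toil_score(cert_rotation), is_probably_toil(cert_rotation))  # prints: 5 True
```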
How to count it
For two weeks, every engineer logs every operational task with: minutes spent, what the task was, and whether they could imagine an automated version. No judgement; just data. Most teams discover toil is 60-80% of their operational time, sometimes more.
The discipline of the two-week count. Per-task logging throughout the day is too granular and creates friction. Weekly aggregation loses fidelity (engineers forget what happened Monday by Friday). An end-of-day entry is the sweet spot: five minutes of writing while the day is fresh.
What the count surfaces. Most teams discover that operational work they thought was 30% of their time is actually 60-70%. The discrepancy is real; engineers under-report toil because they perceive it as "small" tasks that don't add up. The two-week count makes the addition unavoidable.
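A minimal sketch of the end-of-count aggregation, assuming entries are logged as (engineer, minutes, task, looks-automatable) tuples and an eight-hour day over the ten working days of the count. All names and numbers below are illustrative:

```python
from collections import defaultdict

# Hypothetical log entries: (engineer, minutes, task, looks_automatable).
log = [
    ("ana", 45, "restart stuck worker", True),
    ("ana", 90, "investigate why checkout is slow", True),
    ("ben", 30, "rotate TLS certificates", True),
    ("ben", 120, "manual failover drill", False),
]

WORKDAY_MIN = 8 * 60   # assumed eight-hour day
DAYS = 10              # two working weeks

def toil_share(entries, engineers):
    """Fraction of each engineer's two-week time logged as operational work."""
    minutes = defaultdict(int)
    for who, mins, _task, _auto in entries:
        minutes[who] += mins
    return {e: minutes[e] / (WORKDAY_MIN * DAYS) for e in engineers}

shares = toil_share(log, ["ana", "ben"])
```

The per-engineer percentages are what make the "small tasks don't add up" illusion visible: the log turns scattered minutes into one number per person.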
The 50% target
Aim for under 50% of an engineer's time on toil. Above 60% and the team is not doing engineering anymore; it is doing ops with engineering titles. Below 30% may mean undermaintained services, not victory.
Why 50%, not 0%. Some toil is irreducible — the operational work of running a system that hasn't been automated yet (and may never economically be automated for tiny use cases). Targeting 0% is unrealistic and demoralising. 50% is achievable for most teams within 1-2 quarters of focused effort.
The under-30% concern. A team with very low toil is often a team that's neglecting operational work. Things that should be done — runbook updates, dashboard maintenance, dependency upgrades — are being skipped. Below 30% toil with rising incident rates is a warning, not a victory.
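The bands above collapse into a small check. The 50/60/30 thresholds come from the text; the function name and return labels are ours:

```python
def assess_toil(pct: float, incidents_rising: bool = False) -> str:
    """Map a measured toil percentage to the bands described in the text."""
    if pct > 60:
        return "ops with engineering titles"
    if pct > 50:
        return "above target"
    if pct < 30 and incidents_rising:
        return "warning: possible neglect"
    if pct < 30:
        return "check for undermaintained services"
    return "healthy"
```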
Four classes of toil
- Reactive: pages, alerts, customer escalations.
- Manual ops: scaling, restarting, rotating, certificate updates.
- Investigation: someone asks "why is X slow?" and an engineer disappears for an afternoon.
- Coordination: standups, status updates, cross-team handoffs.
Each class has different reduction strategies. Reactive: fix the alert noise (covered separately). Manual ops: automation playbooks. Investigation: better observability so questions answer themselves. Coordination: tooling that surfaces status without requiring meetings.
The classification matters because teams often try to fix the wrong class first. A team drowning in reactive toil that invests in investigation tooling spends a quarter on the wrong project. Classify before you optimise.
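One way to classify before optimising is a keyword pass over the toil-log task descriptions. The keyword lists below are rough illustrative assumptions, not a validated taxonomy:

```python
# Keyword heuristic for the four classes; tune the lists to your own log.
CLASSES = {
    "reactive": ("page", "alert", "escalation"),
    "manual_ops": ("scale", "restart", "rotate", "certificate"),
    "investigation": ("why", "slow", "investigate", "debug"),
    "coordination": ("standup", "status", "handoff", "meeting"),
}

def classify(task: str) -> str:
    """Assign a logged task description to one of the four toil classes."""
    t = task.lower()
    for cls, keywords in CLASSES.items():
        if any(k in t for k in keywords):
            return cls
    return "unclassified"
```

Running this over two weeks of log entries gives a per-class minute count, which is the input to the "which class to attack first" decision.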
Which class to attack first
Reactive almost always. The cost of a 3am page is not the 30 minutes on the call; it is the next day's reduced output and the chronic fatigue of being on the rotation. Killing one chronic page typically gives back hours of effective engineering per week.
The leverage math. A page costs ~3 hours when you account for: the call itself (30-60 minutes), the recovery time the next day (1-2 hours of degraded focus), and the cumulative on-call dread (small but real, multiplied across the team). A page that fires twice a week costs ~6 engineer-hours per week. Killing it pays back the alert-tuning work in days.
The exception. If reactive toil is already low (under 5 pages/week) and manual ops is high (engineers spending 4+ hours/day on routine operations), attack manual ops first. The leverage is comparable; the work is more straightforward.
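The leverage math and the decision rule can be made explicit. The cost defaults below are assumptions chosen to match the ~3 hours-per-page figure above:

```python
def weekly_page_cost(pages_per_week: float, call_hours: float = 0.75,
                     recovery_hours: float = 2.0, dread_hours: float = 0.25) -> float:
    """Approximate engineer-hours a chronic page costs per week.

    Defaults are assumptions: 30-60 min call (midpoint), 1-2 h recovery
    (upper end), plus a small dread term, totalling ~3 h per page.
    """
    return pages_per_week * (call_hours + recovery_hours + dread_hours)

def first_target(pages_per_week: float, manual_ops_hours_per_day: float) -> str:
    """Attack reactive toil first, unless pages are already low (<5/week)
    and manual ops dominates (4+ hours/day), per the exception above."""
    if pages_per_week < 5 and manual_ops_hours_per_day >= 4:
        return "manual_ops"
    return "reactive"
```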
How to reduce each class
Reactive toil: tune alert thresholds, add debouncing, delete chronic-flap alerts entirely. Move from symptom-based alerts to SLI-based alerts. Each chronic page killed saves hours per week.
Manual ops: automate one operation per sprint. Pick the most-frequent ones. The first automation is hard (you build the framework); subsequent automations get easier as the framework matures.
Investigation: invest in observability. Better dashboards, better trace propagation, better log search. The metric: how often does an investigation that used to take an afternoon now take 15 minutes? Track the ratio.
Coordination: tooling that pushes status to the people who need it instead of pulling them into meetings. Async standup bots, automated status pages, Slack-integrated incident notifications. Each meeting eliminated saves the meeting time multiplied by the number of attendees.
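The meeting arithmetic is worth making explicit, since it is what justifies the tooling investment (the function name and example numbers are ours):

```python
def meeting_cost_hours(duration_min: int, attendees: int, per_week: int) -> float:
    """Engineer-hours per week a recurring meeting consumes:
    duration times attendee count times weekly occurrences."""
    return duration_min / 60 * attendees * per_week

# A daily 30-minute standup with 6 attendees costs 15 engineer-hours a week.
cost = meeting_cost_hours(30, 6, 5)
```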
Common antipatterns
The "we'll automate next sprint" deferral. Toil reduction always loses to feature work in sprint planning unless leadership explicitly protects time. Allocate 20% of capacity to toil reduction; protect it.
Automating bad processes. An engineer automates the existing manual process, and the result is the same bad process, just faster. The right move: question whether the process should exist before automating it. Sometimes the answer is "delete the process," which is the cheapest possible automation.
The "elite engineer" automation that nobody else can maintain. A senior engineer writes a bash script that's clever and undocumented. Six months later they leave; nobody can modify the script. The toil is back, but now disguised as "the script broke." Automation must be maintainable by the team, not just the author.
Re-measure quarterly
The two-week toil log is not a one-time exercise; it is a quarterly check. Most teams find the absolute hours rise (because the service grows) while the percentage falls (because automation compounds). That is a healthy direction.
The trend matters more than the absolute. A team at 60% toil this quarter that was at 75% last quarter is winning, even though 60% is technically still high. A team at 50% that was at 40% last quarter is losing, even though 50% is technically OK. The slope is the signal.
What to do when the percentage rises. Investigate. Usually one of: a major incident pulled the team into reactive mode (one-quarter blip, recovers next quarter); a new service shipped without automation (build the automation now); team grew and new engineers don't know the automation (training gap). Each has a different fix.
What to do this week
Three moves. (1) Start the two-week toil log this Monday. End-of-day five-minute entries; aggregate at end of week 2. (2) Once you have data, identify the single biggest toil source and dedicate one engineer-sprint to reducing it. The reduction has to be real (measurable hours saved), not aspirational. (3) Schedule the next quarterly re-measurement on the team calendar. The recurring discipline is what prevents toil from reaccumulating after the initial reduction.