What toil is (and what it is not)
Toil is operational work that is manual, repetitive, automatable, tactical rather than strategic, devoid of enduring value, and that scales linearly as the service grows. Google's SRE practice coined the term to put a name on the specific kind of busywork that consumes operations teams: not hard work, not important work, but the grind of doing the same thing over and over to keep the lights on. The defining test is simple. When the task is finished, is the service in any better state than before you started? If the answer is no, and the service is exactly where it was with only your time gone, you were doing toil.
The original definition lists six characteristics, and the more of them a task has, the more clearly it is toil. It is manual, done by a human rather than a machine. It is repetitive, the same work performed again and again rather than something done once. It is automatable, meaning a machine could do it if someone built the automation, which separates toil from genuinely human judgment work. It is tactical, reactive and interrupt-driven rather than strategic work that moves the system forward. It is devoid of enduring value, leaving the service no better than before. And it scales linearly with service growth: as traffic, users, or hosts double, the toil doubles with them. That last property is the one that makes toil dangerous, because it means the load grows on its own.
It is worth being precise about what makes something toil, because the label gets misused. Toil is not defined by difficulty. Plenty of toil is easy, even pleasant in small doses, which is exactly why it slips past unnoticed. Nor is toil defined by being unpleasant. Some genuinely valuable engineering work is tedious. What makes work toil is the combination above: it is automatable busywork with no lasting value that grows with scale. A novel debugging session on a failure nobody has seen before is hard and tedious, but it is not toil, because it requires judgment and it produces enduring knowledge.
What is NOT toil: overhead
The most common classification mistake is lumping overhead in with toil. Overhead is the administrative work of being on a team that is not tied to running a production service: team meetings, email and chat, expense reports, interviewing candidates, performance reviews, planning rituals, and HR paperwork. This work can be annoying and it can eat real hours, but it is fundamentally different from toil for two reasons. First, it is not tied to the production service, so it does not scale linearly with traffic. Second, and more importantly, it usually cannot and should not be automated away, because it involves human coordination, judgment, and relationships.
The distinction is not academic. If you classify overhead as toil, you inflate your toil number with work that does not belong there and you waste engineering effort trying to automate things that should be managed instead. If you try to fix a meeting problem with a script, you have misdiagnosed the issue. The clean rule: toil is automatable production work that scales with the service; overhead is administrative work that does not. Measure and attack toil with automation. Manage overhead with better process and saner calendars, but do not count it against your toil budget.
The one-line test for toil. Ask three questions of any recurring task: Could a machine do this? Does it produce nothing of lasting value? Does the amount of it grow as the service grows? Three yeses means it is toil and it belongs on your elimination list. If the work requires genuine human judgment, or it leaves the system permanently better, or it does not grow with scale, it is something else (engineering, or overhead) and a different playbook applies.
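The three questions can be written down as a small predicate. This is a minimal sketch, not a standard tool: the `RecurringTask` type, its field names, and the example tasks are all hypothetical, chosen only to mirror the questions above.

```python
from dataclasses import dataclass


@dataclass
class RecurringTask:
    name: str
    machine_could_do_it: bool   # Could a machine do this?
    leaves_lasting_value: bool  # Does it leave the system permanently better?
    grows_with_service: bool    # Does the amount of it grow as the service grows?


def is_toil(task: RecurringTask) -> bool:
    """Three yeses: automatable, no lasting value, grows with scale."""
    return (
        task.machine_could_do_it
        and not task.leaves_lasting_value
        and task.grows_with_service
    )


# Hypothetical examples of each outcome.
restart = RecurringTask("restart stuck worker", True, False, True)
novel_debug = RecurringTask("debug unseen failure", False, True, False)
standup = RecurringTask("weekly team meeting", False, False, False)

print(is_toil(restart), is_toil(novel_debug), is_toil(standup))
# prints: True False False
```

Anything that returns `False` here is engineering or overhead, and the predicate deliberately does not try to tell those two apart.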
Why toil is the enemy
Toil feels productive in the moment, which is precisely what makes it so corrosive. You close the ticket, restart the service, approve the access request, and you feel like you got something done. But step back a quarter and the team has shipped nothing that reduced future load, and the load itself has grown. Toil is the enemy of reliability work for five connected reasons, and they compound.
Career stagnation
Engineers are hired to build, and a role that turns into ticket-processing erodes the skills and the motivation that made the hire valuable. An SRE who spends most of the week on routine operations is not growing, is not shipping a portfolio of work they are proud of, and is watching their market value drift while peers on greenfield teams advance. Stagnation is not just a morale issue; it is a direct pipeline to the next problem.
Attrition
Toil is a leading cause of burnout on operations teams, and burnout is a leading cause of attrition. The 3 a.m. pages, the interrupt-driven days, the sense that the work never compounds, all of it drives good engineers to leave. Attrition is often the most expensive consequence of toil, because the cost of losing and replacing a senior SRE dwarfs the cost of the tooling that would have prevented it. Every quarter of unbounded toil raises the odds that your most experienced person walks.
Slower delivery
Every hour spent on toil is an hour not spent shipping reliability improvements, building automation, or hardening the architecture. A team that is underwater on toil has no slack to do the engineering that would dig it out, so delivery slows precisely when the team most needs to move faster. This is the trap that turns a temporary spike into a permanent state.
Higher error rates
Repetitive manual work performed under fatigue invites mistakes. A hand-typed remediation at 3 a.m. is where a missing flag takes down a second service, where the wrong host gets restarted, where a copy-paste error turns a small incident into a large one. The more toil a team carries, the more manual production changes it makes, and the more chances there are for human error to create the next incident.
Opportunity cost and the linear-scaling trap
The deepest reason toil is the enemy is that it scales linearly with the service while engineering scales sublinearly. When you automate a task, you pay the cost once and reap the benefit forever, even as the service grows. When you do a task by hand, you pay the cost every single time, and the number of times grows with the service. So a team that does not attack toil is on a treadmill that speeds up: as traffic doubles, the toil doubles, and the team falls further behind. This is the heart of site reliability engineering discipline. Toil bounded and attacked is a solvable problem; toil unmeasured and unbounded eventually consumes the entire team.
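The pay-once versus pay-every-time arithmetic can be made concrete with a toy model. The sketch below assumes a fixed cost per manual run, a run count that grows at a constant monthly rate, and a one-time build cost for the automation. It ignores the cost of maintaining the automation, and every number in it is made up for illustration.

```python
def manual_hours(months, runs_per_month, minutes_per_run, monthly_growth=0.0):
    """Cumulative hours spent doing a task by hand over `months`."""
    total = 0.0
    runs = runs_per_month
    for _ in range(months):
        total += runs * minutes_per_run / 60
        runs *= 1 + monthly_growth  # the run count tracks service growth
    return total


def break_even_month(build_hours, runs_per_month, minutes_per_run,
                     monthly_growth=0.0, horizon=120):
    """First month in which cumulative manual cost reaches the one-time
    build cost, or None if that never happens within `horizon` months."""
    for month in range(1, horizon + 1):
        spent = manual_hours(month, runs_per_month, minutes_per_run, monthly_growth)
        if spent >= build_hours:
            return month
    return None


# 40 runs a month at 15 minutes each is 10 hours a month with no growth.
print(manual_hours(12, 40, 15))      # 120.0 hours in a year
print(break_even_month(40, 40, 15))  # a 40-hour build pays back in month 4
```

With any positive `monthly_growth`, the manual total curves upward while the build cost stays flat, which is the treadmill described above.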
The SRE 50% rule
The single most important guardrail against the linear-scaling trap is the 50% rule: no more than half of an SRE's time should be spent on toil. The other half is protected for engineering work that reduces future toil or improves reliability: automation, tooling, better runbooks, architecture, and reliability projects. The cap is not arbitrary. It is the mechanism that keeps the treadmill from winning.
Why the cap exists
The logic follows directly from how toil and engineering scale. Toil scales linearly with the service, so if you do nothing it grows on its own. Engineering scales sublinearly, because automation you build once keeps paying off as the service grows. The 50% cap reserves enough engineering capacity that the team can always build the automation that brings toil back down faster than the service can push it up. If toil is allowed to exceed 50%, the team has too little engineering time left to keep up, toil grows further, and the team enters a death spiral where it spends ever more time on operations and ever less on the work that would fix operations. The cap is the boundary that keeps the team on the right side of that dynamic.
How to enforce it
A cap you do not measure is a wish. Enforcing the 50% rule means tracking the percentage of team time spent on toil as a first-class metric, the same way you track an error budget. Make it visible on a dashboard, review it every sprint, and treat a breach as an event that triggers action rather than a number to feel bad about. Practical enforcement levers include protecting engineering time on the calendar so it cannot be eaten by interrupts, rotating a dedicated interrupt handler so the rest of the team stays focused, and refusing to onboard new toil-generating responsibilities until the existing toil is automated down.
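Treating a breach as an event rather than a feeling can be as simple as a check in the sprint review. The following is a minimal sketch: the per-sprint toil fractions are assumed to come from whatever measurement the team already runs, and the "two consecutive sprints" trigger is an illustrative assumption, not part of the 50% rule itself.

```python
from typing import List

TOIL_CAP = 0.50  # the 50% rule


def sprints_in_breach(toil_fractions: List[float], cap: float = TOIL_CAP) -> List[int]:
    """Return the indexes of sprints whose toil fraction exceeded the cap."""
    return [i for i, fraction in enumerate(toil_fractions) if fraction > cap]


def needs_action(toil_fractions: List[float], cap: float = TOIL_CAP,
                 consecutive: int = 2) -> bool:
    """True when the most recent `consecutive` sprints all exceeded the cap."""
    recent = toil_fractions[-consecutive:]
    return len(recent) == consecutive and all(f > cap for f in recent)


history = [0.42, 0.55, 0.61]  # hypothetical toil fractions, oldest first
print(sprints_in_breach(history))  # [1, 2]
print(needs_action(history))       # True
```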
What happens when teams blow past it
When a team consistently runs above 50% toil, there are only two honest responses, and quietly accepting it is not one of them. The first is to reduce toil through automation, funding the engineering work to eliminate the largest sources. The second is to add headcount to restore the engineering capacity needed to attack the backlog. What kills teams is the third option that organizations drift into by default: accepting ever-rising toil as the new normal, which guarantees burnout, attrition, and eventually a reliability incident born of an overloaded team. The cap exists precisely so that breaching it forces a decision rather than a slow slide.
How to measure toil
You cannot manage what you do not measure, and toil is invisible by default because it hides inside normal days as a dozen small interruptions. Making toil visible is the prerequisite for every other step. There are three complementary measurement methods, and mature teams use all of them.
Surveys
The fastest way to a baseline is to ask. A short recurring survey that asks each engineer what fraction of their week went to toil, and which tasks were the worst offenders, gives you a directional number and, more valuably, a ranked list of the tasks people hate most. Surveys are subjective and people misremember, so they are a starting point rather than the final word, but they surface the pain that time-tracking alone can miss and they get the whole team thinking in terms of toil.
Time-tracking
To turn the survey's directional number into something defensible, sample actual time. This does not require heavyweight timesheets; a lightweight tag on tickets, or a periodic time-use diary for a representative week, is enough to get an honest percentage. The goal is a trustworthy baseline figure for the team's toil percentage that you can put on a dashboard and watch over time. A number on a dashboard is a number a team can be held accountable for shrinking.
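A tagged time-use diary reduces to a short calculation. This sketch assumes each entry carries one of three hypothetical tags (`toil`, `engineering`, `overhead`) and a number of hours. Overhead stays out of the toil numerator but remains in the total, since the percentage is of all team time.

```python
from collections import defaultdict

# Hypothetical diary entries for one engineer's week: (tag, hours).
entries = [
    ("toil", 3.0),           # manual restarts
    ("toil", 5.5),           # access tickets
    ("engineering", 18.0),   # automation project
    ("overhead", 6.0),       # meetings and email
    ("toil", 7.5),           # alert triage
]


def hours_by_tag(entries):
    """Sum recorded hours per tag."""
    totals = defaultdict(float)
    for tag, hours in entries:
        totals[tag] += hours
    return dict(totals)


def toil_percentage(entries):
    """Toil hours as a percentage of all recorded hours."""
    totals = hours_by_tag(entries)
    all_hours = sum(totals.values())
    if all_hours == 0:
        raise ValueError("no time recorded")
    return 100 * totals.get("toil", 0.0) / all_hours


print(toil_percentage(entries))  # 40.0
```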
The toil inventory
The most useful artifact is a toil inventory: a living list of every recurring operational task the team performs, with each task scored on three axes. Frequency is how often the task happens (per day, per week, per month). Time is how long a single instance takes. Toll is the human cost beyond raw minutes: does it interrupt deep focus, does it fire at 3 a.m., does it carry stress or risk? Frequency multiplied by time, with frequency normalized to occurrences per month and time converted to hours, gives you the raw hours per task per month, which is the number you prioritize on, and the toll tells you which tasks to weight above their hours because of their human cost.
| Axis | What it captures | How to score | Why it matters |
|---|---|---|---|
| Frequency | How often the task recurs | Times per day, week, or month | High frequency means high cumulative cost even for small tasks |
| Time | Duration of one instance | Minutes per occurrence | Long tasks waste the most per event |
| Toll | Human cost beyond minutes | Focus break, off-hours, stress, risk | A 3 a.m. page costs far more than its clock time |
| Hours/month | Frequency times time | Computed, ranked descending | The number you prioritize automation on |
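The scoring in the table can be sketched in a few lines. The `ToilItem` type and the sample tasks are hypothetical, and representing toll as a numeric multiplier is one possible assumption for turning "weight above their hours" into a ranking. A team could just as well keep toll as a qualitative note.

```python
from dataclasses import dataclass


@dataclass
class ToilItem:
    name: str
    times_per_month: float    # Frequency, normalized to a month
    minutes_each: float       # Time for one occurrence
    toll_weight: float = 1.0  # Toll as a multiplier (assumption); 1.0 = no extra cost

    @property
    def hours_per_month(self) -> float:
        return self.times_per_month * self.minutes_each / 60

    @property
    def weighted_hours(self) -> float:
        return self.hours_per_month * self.toll_weight


inventory = [
    ToilItem("access requests", times_per_month=200, minutes_each=6),
    ToilItem("3 a.m. disk-full page", times_per_month=8, minutes_each=45, toll_weight=3.0),
    ToilItem("manual cert rotation", times_per_month=1, minutes_each=90, toll_weight=1.5),
]

# Rank descending so the top of the list is the first automation target.
ranked = sorted(inventory, key=lambda item: item.weighted_hours, reverse=True)
for item in ranked:
    print(f"{item.name}: {item.hours_per_month:.1f} h/month "
          f"(weighted {item.weighted_hours:.1f})")
```

In this made-up data the off-hours page consumes only 6 raw hours a month, yet its toll weight places it just behind the 20-hour access queue.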
Re-run the measurement every quarter. Toil reduction is not a one-time project, and a baseline that is a year old tells you nothing about whether your automation is working or whether new toil has quietly replaced what you removed. Quarterly re-measurement closes the loop: it proves the reclaim and it catches the new toil that always arrives as the system grows.
The toil-elimination playbook
With toil measured and inventoried, elimination becomes a disciplined, repeatable process rather than a heroic spasm of scripting. The playbook has five steps, run in order.
1. Identify. Use the survey and the toil inventory to enumerate every recurring operational task. You cannot eliminate what you have not named, and the act of listing toil out loud is often the first time a team sees how much of it there is.
2. Quantify. Score each task by frequency times time to get real hours per month, and weight by toll for the human cost. Now every item on the list has a number, and the team can argue about facts instead of feelings.
3. Prioritize by hours against effort. Rank candidates by the monthly hours they consume (frequency times time), weighed against the effort and risk of automating them, so you automate the highest-volume, lowest-risk tasks first. A task that happens fifty times a week and is trivial to automate beats a task that happens twice a year even if the rare one feels more painful. Prioritization is where most toil programs succeed or fail.
4. Automate or eliminate. For each prioritized task, decide whether to automate it or remove it entirely. Sometimes the highest-leverage move is not to automate a task but to delete the need for it: a self-service portal that removes the request, a default that removes the manual choice, an architecture change that removes the failure class. Elimination beats automation when it is available.
5. Measure the reclaim. After shipping the fix, re-measure the toil hours and prove the reclaim. This closes the loop, justifies the engineering time spent, and builds the case for funding the next round. A toil program that cannot show its reclaim will lose its budget.
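The prioritize and measure steps come down to two small calculations. The sketch below ranks candidates by payback period (effort to automate divided by hours saved per month), which is one reasonable way to weigh volume against effort, not the only one. The candidate list and all figures are invented for illustration.

```python
def payback_months(effort_hours: float, hours_per_month: float) -> float:
    """Months until the hours saved repay the effort spent automating."""
    return effort_hours / hours_per_month


def reclaim(hours_before: float, hours_after: float) -> float:
    """Monthly toil hours removed, measured after the fix ships."""
    return hours_before - hours_after


# Hypothetical candidates: (task, toil hours per month, effort hours to automate).
candidates = [
    ("twice-a-year failover drill", 0.5, 40),
    ("password resets, fifty a week", 18.0, 12),
    ("manual deploys", 30.0, 120),
]

# Shortest payback first: high volume and low effort rise to the top.
ranked = sorted(candidates, key=lambda c: payback_months(c[2], c[1]))
for name, hours, effort in ranked:
    print(f"{name}: payback {payback_months(effort, hours):.1f} months")

print(reclaim(hours_before=18.0, hours_after=1.5))  # 16.5 hours a month reclaimed
```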
Build versus buy
A recurring decision in step four is whether to build the automation yourself or buy a platform. The honest rule of thumb: build when the task is specific to your environment and the logic is simple and deterministic (a cron job, a small script, a deploy hook), because that automation is cheap to write and you understand it completely. Buy when the toil is operate-and-remediate work that must span clouds, react to live incidents, and exercise judgment under uncertainty, because that class of automation is expensive to build, harder to maintain, and never quite finished in-house. The trap is building bespoke incident-response automation that becomes its own source of toil to maintain. For the broader automation discipline, see DevOps automation.
Underneath all five steps is one cultural requirement: treat toil reduction as funded engineering work, not as something that happens in the spare time that never exists. The 50% cap is what creates that funded time. Without protected capacity, every elimination project loses to the next interrupt, and the team stays on the treadmill.
Common toil sources and their fixes
Most teams carry the same handful of toil sources, and each has a well-understood fix. The pattern across all of them is identical: the task is known, safe, and repetitive, which is the textbook definition of work a machine should own. Here are the seven that show up most often.
Manual deploys and rollbacks
Hand-running a deploy, or worse, hand-typing a rollback during an incident, is high-frequency, high-risk toil. The fix is a deploy pipeline with automated rollback on health-check failure, so shipping and reverting are buttons rather than command sequences. This single change removes one of the largest and riskiest toil sources on most teams.
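The shape of automated rollback on health-check failure is small enough to sketch. Here `deploy` and `is_healthy` are hypothetical callables standing in for whatever deploy tool and health endpoint a team actually uses. A real pipeline would also wait between checks and handle a failed rollback.

```python
from typing import Callable


def deploy_with_rollback(
    new_version: str,
    previous_version: str,
    deploy: Callable[[str], None],
    is_healthy: Callable[[], bool],
    checks: int = 3,
) -> str:
    """Deploy `new_version`; revert to `previous_version` if any health check fails.

    Returns the version left running.
    """
    deploy(new_version)
    for _ in range(checks):
        if not is_healthy():
            deploy(previous_version)  # automated rollback, no hand-typed commands
            return previous_version
    return new_version


# Usage with in-memory stand-ins: the health check always fails here.
released = []
result = deploy_with_rollback("v2", "v1", released.append, lambda: False)
print(result, released)  # v1 ['v2', 'v1']
```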
Ticket-driven operations
When humans act as a queue for routine requests (provision this, grant that, reset the other), the team is doing pure toil that scales linearly with the organization. The fix is self-service: a portal or API that lets requesters serve themselves within guardrails, removing the human from the loop entirely. Elimination beats automation here, because the best ticket queue is the one that no longer exists.
Alert triage
Acknowledging, grouping, and routing a flood of alerts is interrupt-driven toil that destroys focus and drives alert fatigue. The fix is correlation and auto-grouping so that one root cause becomes one incident instead of two hundred pages, plus suppression of the alerts that never required human action. Less noise is less toil and a healthier on-call rotation.
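Grouping and suppression can be illustrated with a deliberately simple model. The sketch assumes every alert already carries a `correlation_key` naming its suspected root cause. Deriving that key (from topology, time windows, or similar signals) is the hard part and is not shown. The field names and alert names are hypothetical.

```python
from collections import defaultdict


def correlate(alerts, suppressed_names=frozenset()):
    """Group alerts into incidents by a shared correlation key.

    Each alert is a dict with "name" and "correlation_key" fields.
    """
    incidents = defaultdict(list)
    for alert in alerts:
        if alert["name"] in suppressed_names:
            continue  # never required human action, so it never pages
        incidents[alert["correlation_key"]].append(alert)
    return dict(incidents)


# Two hundred latency alerts caused by one database failure, plus one noisy alert.
alerts = [{"name": "HighLatency", "correlation_key": "db-primary-down"}
          for _ in range(200)]
alerts.append({"name": "DiskUsageInfo", "correlation_key": "host-17"})

incidents = correlate(alerts, suppressed_names={"DiskUsageInfo"})
print(len(alerts), "alerts ->", len(incidents), "incident(s)")
# prints: 201 alerts -> 1 incident(s)
```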
Capacity changes and scaling
Manually adding or removing capacity in response to load is repetitive, automatable work that grows with the service. The fix is autoscaling driven by real demand signals, so the system right-sizes itself within bounds you set once. The engineering work is defining the policy; after that, the toil disappears.
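"Bounds you set once" usually means a target for some demand signal plus a floor and a ceiling. The sketch below uses a proportional rule of the same form as the one documented for the Kubernetes Horizontal Pod Autoscaler, clamped to the bounds. The function name and example numbers are illustrative, and a production autoscaler would add stabilization windows and tolerances.

```python
import math


def desired_replicas(current: int, metric: float, target: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Scale in proportion to observed load, then clamp to the policy bounds."""
    if target <= 0:
        raise ValueError("target must be positive")
    raw = math.ceil(current * metric / target)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas at 90% CPU against a 60% target: scale out to 6.
print(desired_replicas(4, metric=90, target=60, min_replicas=2, max_replicas=10))  # 6
# Demand would call for 16 replicas, but the ceiling holds at 10.
print(desired_replicas(8, metric=100, target=50, min_replicas=2, max_replicas=10))  # 10
```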
Access and permission requests
Hand-granting access is both toil and a security risk, because manual grants drift and rarely get revoked. The fix is policy-as-code with approval workflows and time-bound grants, so access is requested, approved, and expired automatically. The human reviews policy, not every individual request.
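Time-bound grants are the piece of this fix that is easiest to show in code. In the sketch below, the `POLICY` table, role names, and durations are hypothetical, and the approval workflow is reduced to a policy lookup. A real system would route some roles to a human approver and persist the grants.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional

# Hypothetical policy: role -> maximum grant duration. Unlisted roles are denied.
POLICY = {
    "db-readonly": timedelta(hours=8),
    "prod-admin": timedelta(hours=1),
}


@dataclass(frozen=True)
class Grant:
    user: str
    role: str
    expires_at: datetime


def request_access(user: str, role: str, now: datetime) -> Optional[Grant]:
    """Issue a grant that expires on its own, or None if policy has no such role."""
    duration = POLICY.get(role)
    if duration is None:
        return None
    return Grant(user, role, now + duration)


def active_grants(grants: List[Grant], now: datetime) -> List[Grant]:
    """Expired grants simply drop out; nobody has to remember to revoke them."""
    return [g for g in grants if g.expires_at > now]


now = datetime(2026, 1, 1, 9, 0, tzinfo=timezone.utc)
grant = request_access("ana", "prod-admin", now)
print(len(active_grants([grant], now + timedelta(minutes=30))))  # 1
print(len(active_grants([grant], now + timedelta(hours=2))))     # 0
```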
Certificate and secret rotation
Manually rotating certificates and secrets is rare enough to feel low-priority and dangerous enough to cause an outage when forgotten. The fix is automated lifecycle management that issues, rotates, and retires credentials on a schedule without a human touching them, removing both the toil and the expired-cert incident class at once.
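The scheduling half of lifecycle management is a date comparison. This sketch assumes the expiry timestamps are already known (reading them from real certificates or a secret store is not shown) and uses a hypothetical 30-day renewal window. The credential names are placeholders.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List

RENEW_BEFORE = timedelta(days=30)  # hypothetical renewal window


def due_for_rotation(expires_at: datetime, now: datetime,
                     renew_before: timedelta = RENEW_BEFORE) -> bool:
    """True once a credential is inside its renewal window."""
    return now >= expires_at - renew_before


def rotation_queue(expiries: Dict[str, datetime], now: datetime) -> List[str]:
    """Names of credentials due for rotation, soonest expiry first."""
    due = [name for name, exp in expiries.items() if due_for_rotation(exp, now)]
    return sorted(due, key=lambda name: expiries[name])


now = datetime(2026, 3, 1, tzinfo=timezone.utc)
expiries = {
    "api.example.internal": now + timedelta(days=10),
    "db-password": now + timedelta(days=90),
    "ingress-wildcard": now + timedelta(days=3),
}
print(rotation_queue(expiries, now))
# prints: ['ingress-wildcard', 'api.example.internal']
```

Run on a schedule, a check like this turns the forgotten-expiry outage into a routine queue item.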
Runbook execution
Hand-executing a runbook during an incident, step by careful step under pressure, is the most expensive toil because it happens at the worst time. The fix is a progression: first codify the runbook so it is consistent, then automate the safe steps so they run on a button, then make execution agentic so the system runs the known-safe remediation itself. This is the bridge to self-healing infrastructure, where routine remediations happen without a human in the loop.
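The middle of that progression, a codified runbook whose safe steps run without waiting, might look like the sketch below. The `Step` type, the `safe` flag, and the `approve` callback are assumptions made for illustration. Which steps count as safe is a decision each team has to make for itself.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    description: str
    action: Callable[[], None]
    safe: bool  # safe steps may run without a human decision


def execute(runbook: List[Step], approve: Callable[[Step], bool]) -> List[str]:
    """Run steps in order; unsafe steps run only if `approve` says yes."""
    log = []
    for step in runbook:
        if step.safe or approve(step):
            step.action()
            log.append(f"ran: {step.description}")
        else:
            log.append(f"stopped before: {step.description}")
            break  # later steps may depend on the one that was declined
    return log


# Usage with stand-in actions that only record what they did.
done = []
runbook = [
    Step("capture diagnostics", lambda: done.append("diag"), safe=True),
    Step("restart worker pool", lambda: done.append("restart"), safe=True),
    Step("fail over database", lambda: done.append("failover"), safe=False),
]
for line in execute(runbook, approve=lambda step: False):
    print(line)
```

With the approver declining, the two safe steps run and the failover waits for a human, so consistency and speed improve without handing over the risky decision.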
From scripted automation to autonomous operations
For two decades the answer to toil was scripting, and scripting genuinely removed a huge amount of it. But scripting has a hard limit, and the 2026 shift is about crossing it. Understanding that limit is the key to eliminating the toil that has stubbornly survived every automation push.
What scripted automation can and cannot do
Scripted automation handles the deterministic, fully specifiable parts of toil. A cron job, a deploy pipeline, an autoscaler, a rotation job: each removes a task you can describe completely in advance. If you can write down every step and every branch, you can script it, and you should. But a large and growing category of toil resists scripting because it requires judgment under uncertainty. Triaging a novel alert, correlating signals across a dozen systems, choosing which of several safe remediations fits this particular incident, and verifying that recovery actually happened: none of these can be reduced to a fixed sequence, because the situation is different every time. This is the operate-and-remediate toil that survived two decades of scripting, and on most teams it is now the largest remaining source.
The shift to autonomous operations
Autonomous operations is the move from replaying a pre-written script to reasoning about the incident. An agentic system detects an issue, diagnoses it by correlating signals the way a human would, selects a remediation appropriate to the situation, executes it within a policy envelope that bounds what it is allowed to do, and verifies recovery. Because the agent reasons rather than replays, it covers the judgment-under-uncertainty toil that scripts could never reach: auto-triage that groups and ranks instead of just forwarding, and auto-remediation that chooses the right fix instead of running a fixed one. This eliminates whole categories of toil at once, not by automating individual tasks but by owning the operational loop those tasks lived inside. For the broader category, see AIOps and agentic SRE.
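The policy envelope is the part of this loop that lends itself to a concrete sketch. In the code below, diagnosis and remediation selection are stubbed out as a plain lookup table, so it illustrates only the envelope gate and the verify step, not agentic reasoning and not any particular product's implementation. All names (`PolicyEnvelope`, `handle_incident`, the action strings) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet


@dataclass(frozen=True)
class PolicyEnvelope:
    allowed_actions: FrozenSet[str]
    max_hosts_affected: int

    def permits(self, action: str, hosts_affected: int) -> bool:
        return (action in self.allowed_actions
                and hosts_affected <= self.max_hosts_affected)


def handle_incident(
    diagnosis: str,
    hosts_affected: int,
    remediations: Dict[str, str],  # diagnosis -> action name (stub for selection)
    envelope: PolicyEnvelope,
    execute: Callable[[str], None],
    recovered: Callable[[], bool],
) -> str:
    """Act only inside the envelope, then verify; otherwise hand off to a human."""
    action = remediations.get(diagnosis)
    if action is None or not envelope.permits(action, hosts_affected):
        return "escalate"
    execute(action)
    return "resolved" if recovered() else "escalate"


envelope = PolicyEnvelope(frozenset({"restart_service"}), max_hosts_affected=2)
remediations = {"worker-hung": "restart_service"}
ran = []
print(handle_incident("worker-hung", 1, remediations, envelope, ran.append, lambda: True))
print(handle_incident("worker-hung", 5, remediations, envelope, ran.append, lambda: True))
# prints: resolved, then escalate (five hosts exceeds the envelope)
```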
Where Nova removes the toil scripts could not
This is where Nova AI Ops fits. Nova is the agentic layer that owns the operate-and-remediate toil scripting left behind. When an incident fires, Nova's 100 specialized agents across 12 teams detect it, diagnose it by correlating signals across AWS, GCP, Azure, Linux, and Windows, and resolve the known-safe class of issues within a policy envelope, so the repetitive triage-and-fix work that dominates on-call toil happens without a human typing commands at 3 a.m. It does not replace your monitoring or your paging tool; it operates on top of them, so humans do engineering instead of toil. The payoff is structural: the on-call load that scales linearly with traffic, the toil that grows fastest as a service grows, stops growing with traffic. For how this compresses recovery time, see MTTR reduction, and for the on-call experience specifically, see on-call management.
A 90-day toil-reduction program
This is a staged program that makes toil visible first, removes the biggest source second, and automates the operate-and-remediate span third. You get a trustworthy toil number in the first month; the rest is compounding it down. The program ends with a 10-point audit checklist you can run every quarter.
Days 1-30: Make toil visible
You cannot reduce a number you do not have. Run a toil survey, sample actual time for a representative week, and build the toil inventory with every recurring task scored on frequency, time, and toll. Compute hours per month per task and rank the list descending. Establish the team's baseline toil percentage and put it on a dashboard next to your reliability metrics. Most teams are surprised by the number; that surprise is the point, because an invisible problem cannot be funded and a measured one can.
Days 31-60: Eliminate the biggest source
Take the top item from the ranked inventory and remove it, by automation or by elimination, whichever has higher leverage. If ticket-driven ops dominates, ship self-service. If alert triage dominates, add correlation and suppression. If manual deploys dominate, build the pipeline with automated rollback. Resist the urge to attack everything at once; remove the single biggest source, measure the reclaim, and prove the model works before scaling the effort. One decisively eliminated source builds more momentum than ten half-finished ones.
Days 61-90: Automate the operate-and-remediate span
With the deterministic toil scripted away, turn to the operate-and-remediate work that scripting could not reach. Codify your top incident runbooks, automate the safe steps, and layer in agentic auto-resolution for the classes you trust, bounded by a policy envelope. This is where an agentic platform like Nova slots in on top of the measurement and elimination work from the first two phases, owning the routine triage-and-fix loop so the linearly scaling on-call toil stops growing with the service. Feed every quarter's re-measurement back into the inventory so the program compounds rather than resets.
The 10-point toil-audit checklist
Run this checklist every quarter to keep toil bounded below the 50% cap. Each item is a yes/no the team should be able to answer honestly.
- Baseline measured. Do you know your current team toil percentage, and is it on a dashboard?
- Under the cap. Is toil below 50% of team time, and trending down rather than up?
- Inventory current. Is the toil inventory re-scored within the last quarter?
- Ranked by hours. Are tasks prioritized by frequency times time, not by which one feels worst?
- Overhead excluded. Have you kept meetings, email, and admin out of the toil number?
- Deploys automated. Are deploys and rollbacks pipelines, not hand-run commands?
- Self-service in place. Have routine requests been moved off the human queue?
- Alerts correlated. Does one root cause produce one incident instead of a flood of pages?
- Runbooks codified. Are your top incident runbooks codified and the safe steps automated?
- Reclaim proven. Can you show the hours reclaimed by last quarter's toil work?
The classic mistake is skipping the first phase: teams buy automation before they can measure honestly, then cannot tell whether anything helped because the toil number was never established. Measure first; every reduction downstream depends on a trustworthy baseline and a ranked inventory to aim it at.
Related guides
Go deeper into the reliability stack: site reliability engineering for the discipline toil lives inside; DevOps automation for the deterministic toil scripts remove; self-healing infrastructure for automating the remediation span; on-call management and alert fatigue for the on-call toil that scales fastest; MTTR for how collapsing diagnosis compounds the reclaim. For the broader category: AIOps, agentic SRE, and AI SRE for how agents operate your systems; AI incident response and incident management for the lifecycle; root cause analysis for the diagnose span. On the telemetry and planning that target toil: observability, anomaly detection, and capacity planning. See the Nova AI Ops feature set across detection, diagnosis, and auto-resolution.