The manual, repetitive, automatable work that scales linearly with service growth, the operational debt SRE was invented to eliminate.
Toil, in Google's SRE definition, is operational work that is manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows. Examples: manually rotating IAM keys monthly, hand-editing a runbook every time a service is added, paging the on-call to scale replicas every Black Friday. Google's SRE book recommends keeping toil under 50% of an SRE's time; teams above that threshold cannot keep up with growth and burn out.
Every hour an SRE spends on toil is an hour not spent on work that compounds, automation, postmortem fixes, capacity planning. A toil budget of 50% sounds generous; in practice most teams are closer to 80% and shrinking the share by even 10 points unlocks dramatically more reliability investment per quarter.
See the part of the platform that handles toil in production.