Service Ownership: The On-Call Tax Nobody Calculates
When a team adds a new service, the operational cost is real and almost always undercounted. The on-call tax is the work that comes with owning the service forever, not the work to ship it.
The hidden tax
Shipping a service is a one-time cost. Owning it is forever. Most teams plan for the first and forget the second, which is why the team that owned five services two years ago now owns 30 and has the same headcount.
The compounding problem. Each service shipped adds operational cost, but the team's capacity doesn't grow proportionally. Two years in, the team is operationally saturated: feature work is slower than the team expects, reliability work is squeezed, and engineers consider leaving.
Invisibility to leadership. Engineering managers see the deliverables (services shipped) but rarely the operational tax, which is paid in 2-hour interruptions, weekend pages, and a slow degradation of velocity. By the time the tax becomes visible (attrition, sluggish releases), it is already structural.
Four cost drivers
The on-call tax for a single service is roughly: alert volume + runbook maintenance + dependency upgrade load + knowledge maintenance. Each contributes a portion of an engineer's time per month.
The decomposition's value. Knowing where the tax comes from lets the team target reduction work. Heavy alert volume → fix the alerts. Heavy runbook load → improve the runbooks. Heavy upgrade load → manage dependencies more strategically. Without decomposition, "we're overloaded" has no specific fix.
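The decomposition can be written down as a small per-service record. The following is a minimal sketch, not a prescribed model: the `ServiceTax` class, its field names, and the example numbers for a hypothetical `checkout` service are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class ServiceTax:
    """Estimated engineer-hours per quarter for one service (illustrative)."""
    alert_hours: float
    runbook_hours: float
    upgrade_hours: float
    knowledge_hours: float

    def total(self) -> float:
        # Sum of the four cost drivers.
        return sum(getattr(self, f.name) for f in fields(self))

    def dominant_driver(self) -> str:
        # The driver with the most hours is where reduction work pays off first.
        return max(fields(self), key=lambda f: getattr(self, f.name)).name


# Hypothetical service: 5 pages at 3 h each, 6 h of runbook upkeep,
# 1.5 h/month of upgrades, 1.5 h of documentation.
checkout = ServiceTax(alert_hours=15, runbook_hours=6,
                      upgrade_hours=4.5, knowledge_hours=1.5)
print(checkout.total())            # 27.0
print(checkout.dominant_driver())  # alert_hours
```

Here the dominant driver is alert volume, so alert tuning would be the first reduction target for this service.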
Alert volume
Even a quiet service typically pages once or twice a quarter. Each page costs roughly 2-4 hours of engineering time (the call plus the followup). A service that pages ten times a quarter consumes 20-40 hours, a meaningful chunk of an engineer's time.
The math at scale. A team with 30 services × 5 pages/quarter average = 150 pages/quarter. At 3 hours/page, that is 450 engineer-hours/quarter on alert response alone. That is nearly 3 engineer-months, or most of one full-time engineer for the whole quarter (a quarter is roughly 520 working hours). For a 10-engineer team, it's about 9% of capacity, before any of the other three drivers are counted.
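The same arithmetic can be parameterized so a team can plug in its own numbers. This is a rough sketch: the function name `alert_load` is hypothetical, and the 520-hour engineer-quarter (13 weeks × 40 hours) is an assumption, not a figure from the text.

```python
# Assumption: one engineer-quarter is 13 weeks x 40 hours = 520 hours.
HOURS_PER_ENGINEER_QUARTER = 13 * 40


def alert_load(services, pages_per_service, hours_per_page, team_size):
    """Return (hours, FTE equivalent, share of team capacity) for one quarter."""
    hours = services * pages_per_service * hours_per_page
    fte = hours / HOURS_PER_ENGINEER_QUARTER
    return hours, fte, fte / team_size


# The example above: 30 services, 5 pages each, 10 engineers,
# evaluated at the low, middle, and high end of the 2-4 hour range.
for hours_per_page in (2, 3, 4):
    hours, fte, share = alert_load(30, 5, hours_per_page, 10)
    print(f"{hours_per_page} h/page: {hours} h, {fte:.2f} FTE, {share:.1%} of capacity")
```

The result is sensitive to the hours-per-quarter assumption, so it is worth substituting the team's own figure.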
The reduction work. Noise filtering reduces alert volume; better SLIs reduce false positives; auto-remediation handles the recurring pages. Each reduction translates directly into engineering capacity. Two weeks invested in alert tuning often pays for itself within the first quarter.
Runbook maintenance
Runbooks rot. Keeping them useful means validating them, updating them, and training new on-callers on them. Budget roughly 4-8 hours per quarter per service to keep a B-grade runbook from sliding to D.
The compounding cost. 30 services × 6 hours/quarter = 180 hours/quarter on runbook maintenance. That is about a third of a full-time engineer's quarter (a little over one engineer-month), just keeping documentation accurate. The team that doesn't budget for this discovers its runbooks are unusable exactly when it needs them.
The reduction strategy. Auto-validated runbooks (the runbook itself is executable, run by automation periodically). Inline links to live data (instead of static screenshots, link to the dashboard). Each reduces the manual maintenance burden.
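One way to picture an auto-validated runbook is as a list of steps, each paired with a check that automation can run on a schedule. The sketch below is illustrative only: `RunbookStep`, `validate_runbook`, and the two example checks are hypothetical names, and real checks would probe actual systems rather than return constants.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RunbookStep:
    description: str
    check: Callable[[], bool]  # returns True if the step still works as documented


def validate_runbook(steps: list[RunbookStep]) -> list[str]:
    """Run every step's check and return descriptions of the steps that failed."""
    failures = []
    for step in steps:
        try:
            ok = step.check()
        except Exception:
            ok = False  # a check that raises counts as a stale step
        if not ok:
            failures.append(step.description)
    return failures


# Hypothetical checks; in practice these might probe a health endpoint
# or confirm that a referenced dashboard still exists.
steps = [
    RunbookStep("Health endpoint responds", lambda: True),
    RunbookStep("Failover script is present", lambda: False),
]
print(validate_runbook(steps))  # ['Failover script is present']
```

A scheduled job that runs this and opens a ticket for each failure turns runbook rot from a surprise during an incident into routine maintenance.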
Dependency upgrade load
Every dependency in the service has its own release cadence. Security patches, language upgrades, framework migrations. A service with 50 dependencies typically requires 1-2 hours of upgrade work a month even when nothing breaks.
The 1-2 hours assumes upgrades go smoothly. When they don't (breaking changes, deprecated APIs, transitive dependency conflicts), a single upgrade can consume a day. Across 30 services with 50 dependencies each, the routine work alone is 30-60 hours a month, or 90-180 hours a quarter, and every upgrade that goes wrong adds a day on top.
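To see how quickly failed upgrades move the total, the routine and failure costs can be separated. This is a back-of-the-envelope sketch: the function name is hypothetical, a "day" is taken as 8 hours, and the rate of 0.5 bad upgrades per service per quarter is an invented illustration, not a measured figure.

```python
def quarterly_upgrade_hours(services, routine_hours_per_month,
                            bad_upgrades_per_service=0.0,
                            hours_per_bad_upgrade=8.0):
    """Estimate upgrade hours per quarter across a fleet of services."""
    routine = services * routine_hours_per_month * 3  # three months per quarter
    failures = services * bad_upgrades_per_service * hours_per_bad_upgrade
    return routine + failures


# Routine work only, at the low and high end of 1-2 hours/month.
print(quarterly_upgrade_hours(30, 1))  # 90.0
print(quarterly_upgrade_hours(30, 2))  # 180.0

# Hypothetical: each service hits one day-long bad upgrade every other quarter.
print(quarterly_upgrade_hours(30, 1.5, bad_upgrades_per_service=0.5))  # 255.0
```

Under that hypothetical failure rate, the failed upgrades cost almost as much as all the routine work combined.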
The reduction strategies. Dependency consolidation (fewer dependencies = fewer upgrades). Automated upgrade tooling (Renovate, Dependabot). Quarterly upgrade sprints (consolidate the work into focused time rather than scattered hours).
Knowledge maintenance
Onboarding new engineers, documenting what changed, keeping the architecture diagram up to date. Often the first cost cut and the first to bite back when a senior leaves.
The cost is invisible until it isn't. The team that doesn't invest in documentation runs fine while the senior engineers are present. When a senior leaves, the team discovers that critical knowledge wasn't documented; new engineers ramp slowly; ownership of the senior's services is unclear.
The investment. 1-2 hours per service per quarter on documentation upkeep: architecture diagram, runbook updates, onboarding guides. Across 30 services, that is roughly 30-60 hours/quarter. Cheap compared to the cost of a poorly documented service after attrition.
When to merge or archive
If a service costs more in on-call tax than it produces in business value, it is time to merge it into a sibling, hand it off, or archive it. The decision is rarely made because the cost was never measured. Measure it; the case writes itself.
The measurement. For each service, estimate quarterly cost (alert hours + runbook hours + upgrade hours + knowledge hours). Estimate quarterly business value (revenue, user activity, strategic importance). The ratio tells the story.
The hard cases. Services that are operationally expensive but strategically important — keep, but invest in reducing the operational cost. Services that are operationally cheap but strategically irrelevant — archive. Services in the middle — the decision depends on team capacity.
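The cases above can be expressed as a small triage rule. Treat this as a sketch under stated assumptions: the 0-1 value score, the three thresholds, and the names `Decision` and `triage` are hypothetical, and a real review would weigh more than two numbers.

```python
from enum import Enum


class Decision(Enum):
    KEEP = "keep"
    KEEP_AND_REDUCE = "keep, but invest in reducing operational cost"
    RETIRE = "merge, hand off, or archive"
    REVIEW = "depends on team capacity"


def triage(cost_hours: float, value: float,
           expensive_at: float = 40.0,
           irrelevant_below: float = 0.3,
           important_at: float = 0.7) -> Decision:
    """Classify a service from quarterly cost (hours) and a 0-1 value score."""
    if value >= important_at:
        # Strategically important: keep, and cut the tax if it is high.
        return Decision.KEEP_AND_REDUCE if cost_hours >= expensive_at else Decision.KEEP
    if value < irrelevant_below:
        # Strategically irrelevant: retire regardless of cost.
        return Decision.RETIRE
    return Decision.REVIEW


print(triage(60, 0.9).value)  # keep, but invest in reducing operational cost
print(triage(5, 0.1).value)   # merge, hand off, or archive
print(triage(25, 0.5).value)  # depends on team capacity
```

The thresholds matter less than the fact that both inputs are written down; once they are, the borderline services are easy to list for discussion.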
Common antipatterns
The team that ships and never archives. Each new service adds tax, and no service is ever retired, so the team grows operationally heavier every quarter. The fix: hold quarterly archival reviews, and deprecate any service that isn't pulling its weight.
Underestimated tax in capacity planning. Roadmap planning assumes engineers are 100% on features, but operational tax is 20-40% of capacity in mature teams. Budget for it; otherwise feature delivery slows mysteriously.
The senior who leaves with all the context. Documentation is skipped, the senior departs, and the team has to rediscover what the senior knew. Always have at least 2 engineers who can take on-call for any service; if you can't, the documentation is inadequate.
Tax invisible to leadership. The EM doesn't surface the operational cost, so leadership thinks the team is unproductive. Surface the tax in regular reporting; "30% of team capacity goes to operations" is information leadership needs to make the right tradeoffs.
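The two-engineer rule from the third antipattern is easy to check mechanically once on-call readiness is recorded per service. A minimal sketch follows; the service and engineer names are made up, and `under_covered` is a hypothetical helper.

```python
def under_covered(on_call_ready: dict[str, set[str]], minimum: int = 2) -> list[str]:
    """Return services with fewer than `minimum` engineers able to take on-call."""
    return sorted(
        service
        for service, engineers in on_call_ready.items()
        if len(engineers) < minimum
    )


# Hypothetical coverage map: service -> engineers who can handle its pages.
coverage = {
    "billing": {"ana", "raj"},
    "search": {"ana"},
    "exports": set(),
}
print(under_covered(coverage))  # ['exports', 'search']
```

Each service on the resulting list is a candidate for a documentation sprint or a shadowing rotation.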
What to do this week
Three moves. (1) Inventory your services. For each one, count pages last quarter, note when the runbook was last validated, and size the dependency upgrade backlog. The inventory takes about 2 hours and reveals where the tax is. (2) For your most operationally expensive service, propose either a tax-reduction project (a sprint focused on alert quality and runbook fixes) or an archival/merge proposal. (3) Add an operational-tax estimate to your roadmap planning. Engineers can't be 100% on features; budget the tax explicitly.
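The inventory in step (1) fits in a spreadsheet, but a few lines of code make the ranking repeatable. The sketch below is one possible shape: the `InventoryRow` fields mirror the three counts named above, while the service names, dates, and the 90-day staleness threshold are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class InventoryRow:
    service: str
    pages_last_quarter: int
    runbook_last_validated: date
    upgrade_backlog: int  # number of pending dependency upgrades


def rank_by_pages(rows):
    """Most-paged services first: the likeliest tax-reduction targets."""
    return sorted(rows, key=lambda r: r.pages_last_quarter, reverse=True)


def stale_runbooks(rows, today, max_age_days=90):
    """Services whose runbook has not been validated within max_age_days."""
    return [r.service for r in rows
            if (today - r.runbook_last_validated).days > max_age_days]


# Hypothetical inventory.
rows = [
    InventoryRow("search", 3, date(2023, 11, 15), 12),
    InventoryRow("billing", 12, date(2024, 5, 1), 4),
    InventoryRow("exports", 0, date(2024, 1, 10), 20),
]
today = date(2024, 6, 30)

print([r.service for r in rank_by_pages(rows)])  # ['billing', 'search', 'exports']
print(stale_runbooks(rows, today))               # ['search', 'exports']
```

The top of the ranked list is the natural candidate for step (2), and the per-service counts feed the estimate in step (3).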