Calculating ROI for an SRE Agent Project
Four cost lines, three benefit lines, and the one assumption that ruins the math if you get it wrong. Plus a calculator, with defaults, that gets you to a defensible number.
Benefits
The benefit side has three quantifiable terms: MTTR reduction (minutes saved per incident times incidents per month times cost per minute of downtime), on-call burden reduction (hours saved per week times hourly cost times engineers), and postmortem speed (hours saved per postmortem times postmortems per month). Each is a defensible line item.
- MTTR reduction. Minutes saved per incident × incidents per month × cost per minute of downtime.
- On-call burden reduction. Hours saved per week × hourly cost × engineers; the human-hour benefit.
- Postmortem speed. Hours saved per postmortem × postmortems per month; the writeup acceleration.
- A defensible source per benefit. Each line item should trace to a measurement source, which supports the math during budget review.
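The three benefit terms above can be sketched as a small function. This is a minimal illustration, not a definitive model: the field names, the annualization factors (12 months, 52 weeks), and the choice to convert postmortem hours to dollars with the same hourly cost are all assumptions added here.

```python
from dataclasses import dataclass

MONTHS_PER_YEAR = 12   # assumption: benefits are annualized
WEEKS_PER_YEAR = 52


@dataclass
class BenefitInputs:
    # All field names are hypothetical; they mirror the terms in the text.
    minutes_saved_per_incident: float
    incidents_per_month: float
    downtime_cost_per_minute: float
    oncall_hours_saved_per_week: float  # per engineer
    hourly_cost: float
    engineers: int
    hours_saved_per_postmortem: float
    postmortems_per_month: float


def annual_benefits(b: BenefitInputs) -> dict:
    """Return each benefit line in dollars per year, plus the total."""
    mttr = (b.minutes_saved_per_incident * b.incidents_per_month
            * b.downtime_cost_per_minute * MONTHS_PER_YEAR)
    oncall = (b.oncall_hours_saved_per_week * b.hourly_cost
              * b.engineers * WEEKS_PER_YEAR)
    # Assumption: postmortem hours are priced at the same hourly cost.
    postmortem = (b.hours_saved_per_postmortem * b.postmortems_per_month
                  * b.hourly_cost * MONTHS_PER_YEAR)
    return {
        "mttr": mttr,
        "oncall": oncall,
        "postmortem": postmortem,
        "total": mttr + oncall + postmortem,
    }
```

Keeping the three lines separate in the result makes it easier to attach a measurement source to each one during budget review.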
Costs
The cost side has four terms: engineering build/maintenance (FTE-equivalent for the team), vendor or compute spend (model API calls, infrastructure), onboarding cost (training the rest of the team), and risk cost (occasional agent errors that require remediation). Each must be modelled, not waved away.
- Engineering build/maintenance. FTE-equivalent; the team building and operating the agent.
- Vendor and compute spend. Model API calls, infrastructure; the variable cost.
- Onboarding cost. Training the rest of the team to work with the agent; non-trivial.
- Risk cost. Occasional agent errors that require remediation; modelled, not ignored.
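The four cost terms can be sketched the same way. The parameter names are hypothetical, and two modelling choices are assumptions made here rather than anything the text prescribes: onboarding is priced as training hours times an hourly cost, and risk is priced as an expected value (expected errors per year times remediation cost per error).

```python
MONTHS_PER_YEAR = 12


def annual_costs(engineers_fte: float,
                 loaded_cost_per_fte: float,
                 vendor_compute_per_month: float,
                 onboarding_hours: float,
                 hourly_cost: float,
                 expected_agent_errors_per_year: float,
                 remediation_cost_per_error: float) -> dict:
    """Return each cost line in dollars per year, plus the total."""
    engineering = engineers_fte * loaded_cost_per_fte
    vendor_compute = vendor_compute_per_month * MONTHS_PER_YEAR
    # Assumption: onboarding is training time priced at an hourly cost.
    onboarding = onboarding_hours * hourly_cost
    # Assumption: risk is an expected value, not a worst case.
    risk = expected_agent_errors_per_year * remediation_cost_per_error
    return {
        "engineering": engineering,
        "vendor_compute": vendor_compute,
        "onboarding": onboarding,
        "risk": risk,
        "total": engineering + vendor_compute + onboarding + risk,
    }
```

Giving the risk line an explicit number, even a rough one, keeps it in the model instead of leaving it as a footnote.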
The assumption that ruins the math
Most ROI calculators assume the agent handles every relevant incident. It does not, so apply a coverage multiplier. Year 1 coverage is realistically 30-50% of in-scope incidents; the other 50-70% still need humans. Year 3 coverage approaches 70-90% with mature workflows. Don’t model year-1 numbers as steady-state.
- Coverage multiplier mandatory. Most calculators ignore it; the agent doesn’t handle every incident.
- Year 1: 30-50%. Realistic in-scope coverage; the rest still needs humans.
- Year 3: 70-90%. With mature workflows; the steady-state target.
- Don’t use year-1 as steady-state. The math breaks if the coverage ramp is ignored.
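A coverage ramp along these lines might be sketched as follows. The year-1 and year-3 bands come from the text; the linear interpolation for the years in between is an assumption, since the text gives no year-2 figure, and the function names are hypothetical.

```python
YEAR_1_COVERAGE = (0.30, 0.50)  # from the text: 30-50% of in-scope incidents
YEAR_3_COVERAGE = (0.70, 0.90)  # from the text: 70-90% with mature workflows


def coverage_range(year: float) -> tuple:
    """Return (low, high) coverage for a given project year."""
    if year <= 1:
        return YEAR_1_COVERAGE
    if year >= 3:
        return YEAR_3_COVERAGE
    # Assumption: coverage ramps linearly between year 1 and year 3.
    t = (year - 1) / 2
    low = YEAR_1_COVERAGE[0] + t * (YEAR_3_COVERAGE[0] - YEAR_1_COVERAGE[0])
    high = YEAR_1_COVERAGE[1] + t * (YEAR_3_COVERAGE[1] - YEAR_1_COVERAGE[1])
    return (low, high)


def covered_benefit_range(full_annual_benefit: float, year: float) -> tuple:
    """Scale the full benefit by the coverage band for that year."""
    low, high = coverage_range(year)
    return (full_annual_benefit * low, full_annual_benefit * high)
```

Returning a band rather than a single number keeps the uncertainty visible when the figure is presented.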
The calculator
The formula is simple: ROI equals annual benefit times coverage minus annual cost. Default inputs: 3 engineers building at $300k each, fully loaded; 100 relevant incidents per month; 30 minutes saved per handled incident; $500 per minute of downtime. For most teams the result clusters around break-even in 18 months; for regulated industries with much higher downtime cost, break-even is 6-9 months.
- ROI = (benefit × coverage) − cost. The single-line formula; everything else is inputs.
- Default inputs. 3 engineers, $300k each, 100 incidents/month, 30 min saved per incident, $500/min downtime.
- 18-month break-even cluster. Most teams; the calibration anchor.
- Regulated industry sensitivity. Higher downtime cost shrinks break-even to 6-9 months.
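The single-line formula can be sketched with the listed defaults. This is an illustration under stated assumptions: only the MTTR benefit term and the engineering cost term are filled in, because those are the only defaults the text lists, and the 0.40 coverage is the midpoint of the year-1 range. The output therefore shows the arithmetic and should not be expected to reproduce the 18-month break-even figure, which depends on inputs the text does not list.

```python
MONTHS_PER_YEAR = 12

# Defaults from the text.
ENGINEERS = 3
LOADED_COST_PER_ENGINEER = 300_000
INCIDENTS_PER_MONTH = 100
MINUTES_SAVED_PER_INCIDENT = 30
DOWNTIME_COST_PER_MINUTE = 500

# Assumption: midpoint of the 30-50% year-1 coverage range.
YEAR_1_COVERAGE = 0.40


def annual_roi(annual_benefit: float, coverage: float, annual_cost: float) -> float:
    """ROI as defined in the text: (benefit x coverage) - cost."""
    return annual_benefit * coverage - annual_cost


def default_annual_benefit() -> float:
    # Only the MTTR term; on-call and postmortem terms are omitted here.
    return (MINUTES_SAVED_PER_INCIDENT * INCIDENTS_PER_MONTH
            * DOWNTIME_COST_PER_MINUTE * MONTHS_PER_YEAR)


def default_annual_cost() -> float:
    # Only engineering cost; vendor, onboarding, and risk are omitted here.
    return ENGINEERS * LOADED_COST_PER_ENGINEER


if __name__ == "__main__":
    for coverage in (0.30, YEAR_1_COVERAGE, 0.50):
        net = annual_roi(default_annual_benefit(), coverage, default_annual_cost())
        print(f"coverage={coverage:.0%} net=${net:,.0f}")
```

Running the loop over several coverage values shows how sensitive the result is to the coverage multiplier compared with the other inputs.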
Be conservative on day one
Conservative claims age better. Don’t promise an 80% MTTR reduction in year one: the team won’t trust it and the data won’t justify it. Promise 30% in year one and 50% in year two, track actual numbers monthly, and adjust the public ROI claim as data arrives. Underpromising and overdelivering makes budget approval more reliable.
- Avoid 80% year-one promise. Team won’t trust; data won’t justify.
- 30% year one, 50% year two. The conservative ramp; ages well.
- Monthly actual tracking. Adjust public claims as data arrives; the math stays honest.
- Per-quarter ROI update. The narrative updates with the data; supports continued credibility.
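Monthly tracking against the promised reduction can be sketched as a simple comparison. The function names and the sample figures in the usage note are hypothetical; the 30% default is the year-one promise from the text.

```python
def mttr_reduction(baseline_minutes: float, observed_minutes: float) -> float:
    """Fractional MTTR reduction relative to the pre-agent baseline."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return (baseline_minutes - observed_minutes) / baseline_minutes


def track_against_promise(baseline_minutes: float,
                          monthly_observed_minutes: list,
                          promised_reduction: float = 0.30) -> list:
    """Return (month, reduction, met_promise) for each month of data."""
    report = []
    for month, observed in enumerate(monthly_observed_minutes, start=1):
        reduction = mttr_reduction(baseline_minutes, observed)
        report.append((month, reduction, reduction >= promised_reduction))
    return report
```

For example, `track_against_promise(60, [45, 36])` reports a 25% reduction in month 1 (below the promise) and 40% in month 2 (above it), which is the kind of evidence the quarterly ROI update can cite.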