The average time from an incident starting to the service being restored to normal, the headline reliability metric for SRE teams.
MTTR, Mean Time to Resolve (sometimes Repair, sometimes Recover), is the average elapsed time between an incident starting and the service being restored to its normal state. MTTR includes detection time (MTTD), acknowledgment time, diagnosis time, action time, and verification time. Each phase is independently optimizable: faster monitoring, smaller on-call rotations, better runbooks, agent-driven remediation, post-fix smoke tests.
MTTR is the headline number on every SRE dashboard because it directly maps to user pain. A 47-minute MTTR means 47 minutes of customers seeing errors, 47 minutes of on-call engineers in war rooms, 47 minutes of reputation damage. Industry-leading MTTR for routine pages is now under 10 minutes, and Agentic SRE platforms are pushing it under 3 for known patterns.
See the part of the platform that handles mttr (mean time to resolve) in production.