Detection Time vs Response Time vs Resolution Time
MTTR is one number. The minute you decompose it, the right work to do becomes obvious. It's the decomposition almost every team is missing.
MTTR is the wrong number
MTTR, mean time to recovery, is the headline. It tells you how long incidents lasted on average. It does not tell you why. Two teams with identical MTTR can have entirely different problems: one might be slow to detect but fast to mitigate, the other fast to detect but slow to fix. Same headline number, completely different work to do.
The deeper issue: MTTR is an outcome, not a lever. You can't directly improve MTTR; you can only improve the things that compose it. Treating MTTR as the metric you optimise leads to weird incentives: teams game the number by closing incidents prematurely, downgrading severity to take incidents off the count, or marking flaps as "not real incidents."
The right framing: MTTR is the dashboard tile, but the work happens at the level of its components. The team that knows where its time bleeds is the team that improves; the team that just stares at MTTR is the team that argues about whether the number is going up or down.
The four sub-metrics
Decompose MTTR into four pieces, each measured separately. Each maps to a different kind of improvement work; mixing them is how teams end up doing the wrong project for two quarters.
- MTTD (detect): incident start → alert fires. Measures monitoring quality.
- MTTA (ack): alert fires → human acknowledges. Measures on-call rotation health and alert routing.
- MTTM (mitigate): ack → user impact stops. Measures runbook quality and operational muscle.
- MTTR (resolve): mitigate → all-clear posted. Measures verification confidence.
Track all four in a single dashboard with the same time range. The proportions tell you where the work is. Teams looking at this dashboard for the first time are usually surprised: what they thought was a detection problem turns out to be a mitigation problem, or vice versa.
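A minimal sketch of the decomposition, assuming each incident record carries five timestamps (start, detect, ack, mitigate, resolve). The record shape and field names are illustrative, not a real incident-tool export:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    # Hypothetical record shape; real incident tools export something similar.
    started: datetime       # when impact began (often back-filled in the postmortem)
    detected: datetime      # when the alert fired
    acknowledged: datetime  # when a human ack'd the page
    mitigated: datetime     # when user impact stopped
    resolved: datetime      # when the all-clear was posted

def _minutes(a: datetime, b: datetime) -> float:
    return (b - a).total_seconds() / 60

def components(inc: Incident) -> dict:
    """Decompose one incident into the four sub-metrics, in minutes."""
    return {
        "MTTD": _minutes(inc.started, inc.detected),
        "MTTA": _minutes(inc.detected, inc.acknowledged),
        "MTTM": _minutes(inc.acknowledged, inc.mitigated),
        "MTTR": _minutes(inc.mitigated, inc.resolved),
    }

def averages(incidents: list) -> dict:
    """Average each component across incidents; the proportions show where time bleeds."""
    per_incident = [components(i) for i in incidents]
    return {k: round(mean(p[k] for p in per_incident), 1)
            for k in ("MTTD", "MTTA", "MTTM", "MTTR")}
```

The averages are what the dashboard tile shows; the per-incident breakdown is what you read in the postmortem.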
Detection (MTTD)
If MTTD is high, you have a monitoring problem. The fix is better SLOs, faster anomaly detection, or end-user synthetic monitoring. No amount of on-call training will help; the problem is upstream of the on-call rotation.
Common cases. (1) The metric exists but the alert threshold is too loose ("error rate over 5% for 5 minutes" misses a 4.8% rate that's killing customers). Tighten the threshold. (2) The metric doesn't exist yet, symptoms come from a code path nobody instrumented. Instrument it. (3) The alert fires but goes to the wrong place (an unread Slack channel, a deprecated email list). Fix the routing.
The leverage move on MTTD: synthetic checks against your critical user journeys. Synthetic checks fire 1-2 minutes after a regression starts; they don't depend on real users hitting the bad path. Most teams that move from "monitoring real traffic" to "monitoring + synthetic" cut MTTD by 30-60%.
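A sketch of what such a check might look like; the journey names and URLs are placeholders, and the alerting hook is left as a print statement:

```python
import time
import requests

# Hypothetical critical user journeys; replace with your own endpoints.
JOURNEYS = {
    "login":    "https://example.com/api/login/health",
    "checkout": "https://example.com/api/checkout/health",
    "search":   "https://example.com/api/search/health",
}

def run_synthetic_checks(timeout_s: float = 5.0) -> list:
    """Probe each journey; return the names of the ones that look broken."""
    failures = []
    for name, url in JOURNEYS.items():
        try:
            resp = requests.get(url, timeout=timeout_s)
            if resp.status_code >= 400:
                failures.append(name)
        except requests.RequestException:
            failures.append(name)
    return failures

if __name__ == "__main__":
    # Run every minute so a regression surfaces within 1-2 minutes,
    # whether or not real users have hit the bad path yet.
    while True:
        broken = run_synthetic_checks()
        if broken:
            print(f"ALERT: synthetic check failing for {broken}")  # wire this to your pager
        time.sleep(60)
```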
Acknowledgement (MTTA)
If MTTA is high, you have an alerting problem. Pages going to the wrong person, escalation policies that don't escalate, or alerts so noisy the team has tuned them out. Fix the routing and the noise before anything else.
Healthy MTTA is under 5 minutes for SEV1, under 10 for SEV2. If your numbers are higher, audit the last 20 alerts that took longer than the target. The pattern is usually one of three things: (1) the on-call schedule had a gap (someone went on vacation, the rotation didn't cover it), (2) the page went to a stale device (old phone, old number), (3) the on-call engineer was in a meeting and didn't see the page.
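A small sketch of that audit, assuming you can export fired/acknowledged timestamps from your paging tool; the record shape is hypothetical:

```python
from dataclasses import dataclass
from datetime import datetime

# Targets from the text: under 5 minutes for SEV1, under 10 for SEV2.
ACK_TARGET_MIN = {"SEV1": 5, "SEV2": 10}

@dataclass
class Alert:
    # Hypothetical export from your paging tool.
    fired: datetime
    acknowledged: datetime
    severity: str   # "SEV1", "SEV2", ...
    responder: str

def slow_acks(alerts: list) -> list:
    """Return (alert, ack minutes) pairs for every page that blew its target, worst first."""
    out = []
    for a in alerts:
        ack_min = (a.acknowledged - a.fired).total_seconds() / 60
        target = ACK_TARGET_MIN.get(a.severity)
        if target is not None and ack_min > target:
            out.append((a, round(ack_min, 1)))
    # Read the worst offenders against the three common causes above.
    return sorted(out, key=lambda pair: pair[1], reverse=True)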
The leverage move on MTTA: tested escalation. Once a quarter, send a real test page at 2am and let it run through the normal chain: it escalates to the next level only if no one acknowledges within the budgeted time. The test reveals every silent failure mode in one round.
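A sketch of the drill, assuming your paging provider exposes an API to trigger a test page and poll its acknowledgement state; the endpoints and response fields below are illustrative only, not a real provider's API:

```python
import time
import requests

# Illustrative URLs; check your paging provider's API for the real endpoints.
TRIGGER_URL = "https://pager.example.com/api/test-page"
STATUS_URL = "https://pager.example.com/api/test-page/{page_id}"

ACK_BUDGET_S = 5 * 60  # total acknowledgement budget for the drill

def run_escalation_drill() -> bool:
    """Send one real test page and verify someone acknowledges within the budget."""
    page = requests.post(TRIGGER_URL, json={"note": "quarterly escalation drill"}).json()
    deadline = time.monotonic() + ACK_BUDGET_S
    while time.monotonic() < deadline:
        status = requests.get(STATUS_URL.format(page_id=page["id"])).json()
        if status.get("acknowledged"):
            print(f"Drill passed: acknowledged by {status.get('responder')}")
            return True
        time.sleep(15)
    print("Drill failed: no acknowledgement within budget; the chain has a silent gap")
    return False
```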
Mitigation (MTTM)
If MTTM is high, you have an operational problem. Runbooks are missing, the team can't access the systems they need, the rollback procedure is theoretical. Fix the muscle memory.
The diagnostic. Pull the last 10 SEV2+ postmortems and look at the time between "engineer ack'd" and "user impact stopped." If it averages over 20 minutes, you're not in good shape. Read what the team was doing during that window; almost always you'll find: looking up runbooks, requesting access to a system, debugging the alert because the alert message wasn't actionable, or making decisions that should have been baked into the runbook.
The leverage move on MTTM: a runbook quality grading exercise. For each on-call runbook, ask: "could a competent engineer who has never seen this run it at 3am?" If no, the runbook is decoration. Most teams find half their runbooks fail this bar.
Resolution (MTTR)
If MTTM is fast but MTTR (mitigation-to-all-clear) is long, the team is struggling to confirm the incident is over. The user impact has stopped but the team isn't sure, so they leave the bridge open another 30 minutes "just in case." Multiplied across incidents, that's hours per month.
Common pattern: the alert that fired during the incident is still flapping during recovery, even though customers are no longer affected. The team can't tell from the alert whether they're truly clear or in a brief recovery dip. The fix: a separate "verification" check that's stricter than the alert, monitors only the critical user paths, and goes green only when the system is genuinely clear.
The leverage move on resolution: synthetic check + user-traffic check + dashboard with explicit "all clear" criteria, agreed in advance. Without it, "are we clear?" is a vibes-based assessment that takes 20 minutes and ends with the most confident person winning.
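A sketch of what pre-agreed all-clear criteria might look like in code: N consecutive green synthetic probes plus a user-traffic error rate below a threshold. The thresholds are illustrative, and both callables stand in for whatever your monitoring stack actually exposes:

```python
import time
from typing import Callable

# Pre-agreed all-clear criteria (illustrative values; set your own in advance).
CONSECUTIVE_GREEN_PROBES = 5    # synthetic probes in a row with no failure
MAX_USER_ERROR_RATE = 0.005     # 0.5% over the trailing window

def all_clear(probe_is_failing: Callable[[], bool],
              user_error_rate: Callable[[], float]) -> bool:
    """True only when the synthetic check and the user-traffic check both agree.

    Both callables are placeholders for whatever your monitoring stack exposes.
    """
    for _ in range(CONSECUTIVE_GREEN_PROBES):
        if probe_is_failing():          # any failing critical-journey probe means not clear
            return False
        time.sleep(60)                  # one probe per minute
    return user_error_rate() < MAX_USER_ERROR_RATE
```

The point isn't the specific numbers; it's that the criteria are written down before the incident, so "are we clear?" has a yes/no answer.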
Which to attack first
Look at where time is bleeding. Detection bleeders fix monitoring; ack bleeders fix paging; mitigation bleeders fix runbooks. The work is different in each case. Knowing which one your team has is half the project.
The sequencing rule. MTTD before MTTA before MTTM before MTTR. Reason: improving MTTM (runbooks) when MTTD is broken means the runbooks get invoked late; improving MTTR (verification) when MTTM is broken means the team is still flailing on the fix. Each component depends on the one before it.
The component that's most commonly the worst is MTTD. Most teams underinvest in monitoring quality and overinvest in incident-response training. The training matters, but without good monitoring it just makes the team faster at responding to alerts that fire too late.
What to do this week
Three moves. (1) Pull last quarter's incidents, compute MTTD/MTTA/MTTM/MTTR for each, plot the four as stacked bars. The visual tells you which component dominates. (2) Pick the worst component. Schedule a single sprint of work targeted at it. Don't try to fix all four at once; sequencing wins. (3) Set explicit targets per component (e.g., "MTTD < 3 min for critical paths, MTTA < 5 min, MTTM < 20 min, total MTTR < 45 min for SEV2"). Measure them weekly. The metric that has a target is the metric that improves.
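A sketch of move (1) and move (3) together: stacked bars of the four components per incident, plus a count of target misses per component. The incident numbers and targets below are illustrative; feed in the output of the decomposition sketch above:

```python
import matplotlib.pyplot as plt

# Per-incident components in minutes, as produced by the earlier decomposition
# sketch; the numbers below are illustrative only.
incidents = {
    "INC-101": {"MTTD": 12, "MTTA": 4, "MTTM": 25, "MTTR": 15},
    "INC-102": {"MTTD": 30, "MTTA": 2, "MTTM": 10, "MTTR": 8},
    "INC-103": {"MTTD": 6,  "MTTA": 9, "MTTM": 40, "MTTR": 20},
}
targets = {"MTTD": 3, "MTTA": 5, "MTTM": 20}  # example per-component targets

labels = list(incidents)
bottom = [0.0] * len(labels)
for component in ("MTTD", "MTTA", "MTTM", "MTTR"):
    values = [incidents[i][component] for i in labels]
    plt.bar(labels, values, bottom=bottom, label=component)
    bottom = [b + v for b, v in zip(bottom, values)]

plt.ylabel("minutes")
plt.legend()
plt.title("Where incident time goes")
plt.savefig("mttr_components.png")

# Count how often each component missed its target; the worst one is this sprint's project.
misses = {c: sum(incidents[i][c] > t for i in labels) for c, t in targets.items()}
print("target misses per component:", misses)
```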