Incident Replay From Traces
Re-running incidents from captured traces to validate fixes. The pattern, the tooling, and the high-stakes incidents that warrant it.
The idea
Capture the trace context of an incident: the services involved, the sequence of events, and the key timings between them.
Recreate the incident in pre-prod from that capture, apply the proposed fix, and verify that the fix would have prevented it.
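The capture-and-replay loop above can be sketched roughly as follows. This is a minimal illustration, not a description of any particular tracing tool: the TraceEvent schema, the replay and fix_prevents_incident helpers, and the handler and is_failure callbacks are all hypothetical names chosen for the example. The handler stands in for whatever drives the pre-prod system under test.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class TraceEvent:
    """One captured step of the incident (hypothetical schema)."""
    offset_ms: int   # time since the first captured event
    service: str     # service that handled the step
    operation: str   # what was invoked
    payload: dict = field(default_factory=dict)


# Stand-in for "send this event to the pre-prod system and return its response".
Handler = Callable[[TraceEvent], Any]


def replay(
    events: Iterable[TraceEvent],
    handler: Handler,
    sleep: Optional[Callable[[float], None]] = None,
) -> List[Tuple[TraceEvent, Any]]:
    """Feed events to the handler in captured order and collect the results.

    If a sleep function is given, the gaps between events are preserved.
    """
    results = []
    previous_ms = None
    for event in sorted(events, key=lambda e: e.offset_ms):
        if sleep is not None and previous_ms is not None:
            sleep((event.offset_ms - previous_ms) / 1000.0)
        previous_ms = event.offset_ms
        results.append((event, handler(event)))
    return results


def fix_prevents_incident(
    events: Iterable[TraceEvent],
    handler: Handler,
    is_failure: Callable[[TraceEvent, Any], bool],
) -> bool:
    """True if no replayed event shows the incident's failure signature."""
    return not any(is_failure(e, r) for e, r in replay(events, handler))
```

Under these assumptions, the same event list would be run twice: once against the unfixed build to confirm the incident reproduces, and once against the fixed build. Passing time.sleep as the sleep argument keeps the captured spacing; leaving it as None replays as fast as possible, which is only appropriate when timing played no part in the incident.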
When to do it
Sev 1 and Sev 2 incidents. Sev 3 and Sev 4 incidents usually do not justify the engineering time.
Recurring incident classes. Replay tooling pays for itself when the same class of incident hits more than once.
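The two criteria above can be written down as a small triage rule. This is only a sketch: the function name warrants_replay and the default recurrence threshold of 2 are assumptions made for illustration, and a team would set its own threshold.

```python
def warrants_replay(
    severity: int,
    class_occurrences: int,
    recurrence_threshold: int = 2,  # assumed default, not from any standard
) -> bool:
    """Decide whether an incident justifies building a replay.

    severity: 1 (most severe) through 4.
    class_occurrences: how many times this class of incident has been
        seen so far, counting the current one.
    """
    if severity < 1 or class_occurrences < 1:
        raise ValueError("severity and class_occurrences must be >= 1")
    # Sev 1 and Sev 2 always qualify.
    if severity <= 2:
        return True
    # Lower severities qualify only when the class keeps recurring.
    return class_occurrences >= recurrence_threshold
```

A first-time Sev 3 would return False here, while the third occurrence of the same Sev 4 class would return True.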
Limits
Stateful systems are hard to replay. Database state at replay time generally differs from the state at capture time, so the same events may not reproduce the same behavior.
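One way to at least detect this kind of drift before trusting a replay is to fingerprint the rows the incident touched, once at capture time and once in pre-prod, and compare the two. The sketch below assumes the relevant rows can be exported as plain dictionaries; state_fingerprint and state_matches are hypothetical helpers, not part of any replay tool.

```python
import hashlib
import json
from typing import Iterable


def state_fingerprint(rows: Iterable[dict]) -> str:
    """Hash a set of rows so that row order and key order do not matter."""
    # Serialize each row with sorted keys, then sort the serialized rows.
    # default=str covers values json cannot encode directly (e.g. datetimes).
    canonical_rows = sorted(
        json.dumps(row, sort_keys=True, default=str) for row in rows
    )
    digest = hashlib.sha256()
    for line in canonical_rows:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def state_matches(captured_rows: Iterable[dict], replay_rows: Iterable[dict]) -> bool:
    """True if the pre-prod rows match the rows recorded at capture time."""
    return state_fingerprint(captured_rows) == state_fingerprint(replay_rows)
```

A mismatch does not make the replay useless, but it does mean a clean replay result is weaker evidence that the fix works.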
Replay is a complement to root-cause analysis, not a substitute.