Incident Replay From Logs
Some incidents are reproducible. The replay technique that makes fix-validation reliable.
Capture inputs
Replay starts with reconstructing the request sequence that caused the failure, pulled directly from production logs. The work that matters here is preserving the timing, ordering, and payload shape while stripping anything that should not leave production.
- Identify the failing request sequence. Pulled from the production log stream around the incident window. The exact set of inputs that triggered the fault.
- Strip PII before storage. User IDs, emails, payment data redacted. The failure pattern survives; the personal data does not.
- Preserve the failure shape. Timing, ordering, and payload structure are part of the bug. Sanitising too aggressively erases what you came to debug.
- Documented retention plus storage location. Time-bounded storage with a deletion policy; per-replay the location and access controls live in the runbook.
Replay
The captured inputs run through a pre-production environment. The first run verifies the failure reproduces; the next runs validate the fix.
- Run captured inputs against pre-prod. Safe-environment exercise that confirms the failure reproduces outside production.
- Apply the fix and rerun. Same captured inputs, fixed code. The pass tells you the fix is real, not theoretical.
- Captured replay becomes a regression test. Promote the script into the test suite so the next refactor cannot reintroduce the bug.
- Documented setup script. Per-replay the runnable script lives next to the postmortem. Future incidents reuse the harness.
Limits
Replay does not work for everything. Stateful systems and timing-dependent failures resist clean reproduction; honesty about the gap is what keeps replay credible.
- Stateful systems are hard. Database, cache, and queue state are difficult to capture and restore faithfully. Replay covers the request path, not the full state.
- Timing-dependent failures. Race conditions, lock contention, and network jitter rarely reproduce on demand. Note them explicitly rather than pretending they replayed cleanly.
- Documented “what we could not replay”. Per-incident the explicit gap note. The discipline that prevents false confidence in the fix.
- Canary fallback when replay is incomplete. A small, monitored production canary covers what replay could not. Belt and braces on risky fixes.