Replicating a Production Incident in a Sandbox via Agent
An agent that takes an incident timeline, builds a sandbox, and reproduces the failure: why this is harder than it sounds, and the shortcut that saves 80% of the work.
Input: the incident timeline
The reproducer agent reads the incident timeline first. The timeline plus the system state at impact time is what makes the rest of the work possible.
- Timeline ingest. Alerts that fired, services affected, deploys made, actions taken, recovery achieved. The structured timeline is the input.
- State extraction. From the timeline, extract system state at the moment of the incident: versions, configs, traffic patterns, data shape.
- Hard part. State extraction is where most of the difficulty lies: the agent has to reason about which fields matter and which are noise.
- Citation requirement. Each extracted state field cites the timeline event that established it. The audit trail is what makes the reproducer reviewable.
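The extraction and citation steps above can be sketched as a pair of small data structures. This is a minimal illustration, not a prescribed schema: the names `TimelineEvent`, `StateField`, `extract_state`, and `uncited` are hypothetical, and the sketch only extracts one kind of field (the version each service was running at impact time).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimelineEvent:
    event_id: str   # hypothetical identifier, e.g. "evt-2"
    minute: int     # minutes since the start of the timeline
    kind: str       # "alert", "deploy", "action", or "recovery"
    service: str
    detail: dict


@dataclass(frozen=True)
class StateField:
    name: str         # e.g. "checkout.version"
    value: str
    cited_event: str  # event_id of the timeline event that established the value


def extract_state(timeline, impact_minute):
    """Return the version each service ran at impact time, with citations."""
    latest = {}
    for event in sorted(timeline, key=lambda e: e.minute):
        if event.minute > impact_minute:
            break  # anything after impact did not shape the failing state
        if event.kind == "deploy" and "version" in event.detail:
            name = f"{event.service}.version"
            # A later deploy overrides an earlier one for the same service.
            latest[name] = StateField(name, event.detail["version"], event.event_id)
    return list(latest.values())


def uncited(fields, timeline):
    """Return fields whose citation does not point at a real timeline event."""
    known = {event.event_id for event in timeline}
    return [field for field in fields if field.cited_event not in known]
```

A reviewer can run `uncited` over the agent's output before trusting it: any field returned has no audit trail and should be rejected or re-derived.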
Build the sandbox
Building the sandbox is mostly mechanical once the state is extracted. The four steps below cover almost every reproducible incident shape.
- Re-stand the services. Stand up the affected services at the version they ran during the incident. Use container images or commits when available.
- Replay traffic. Replay the traffic that preceded the incident. Synthetic if needed; recorded from production if you have a tap.
- Inject the failure. Kill a pod, slow a network link, fill a disk, whatever the timeline indicates was the trigger.
- Verify the symptom. Confirm the sandbox produces the same observable signal the on-call saw. If not, the reproducer has missed something.
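The four steps above can be sketched as one ordered run that ends with a symptom check. This is a sketch under stated assumptions: the step callables (`stand_up`, `replay_traffic`, `inject_failure`, `read_metric`) are hypothetical hooks into whatever tooling the team already has, and the symptom is simplified to a single metric crossing a threshold.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Symptom:
    metric: str       # hypothetical metric name, e.g. "checkout.p99_latency_ms"
    threshold: float  # the level the on-call saw the metric exceed


@dataclass(frozen=True)
class ReproResult:
    reproduced: bool
    observed: float


def run_reproduction(
    stand_up: Callable[[], None],
    replay_traffic: Callable[[], None],
    inject_failure: Callable[[], None],
    read_metric: Callable[[str], float],
    symptom: Symptom,
) -> ReproResult:
    """Run the build steps in order, then check for the on-call's signal."""
    stand_up()        # services at the versions from the extracted state
    replay_traffic()  # recorded or synthetic load preceding the incident
    inject_failure()  # the trigger the timeline points at
    observed = read_metric(symptom.metric)
    return ReproResult(reproduced=observed > symptom.threshold, observed=observed)
```

If `reproduced` comes back false, the observed value is still worth keeping: it shows how far the sandbox is from the incident signal and hints at which piece of state was missed.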
Why this is harder than it sounds
Reproducing production is not the same as copying production. Four classes of dependency defeat naive replay almost every time.
- Dimensional explosion. Production state has thousands of dimensions. Reproducing exactly is impossible; reproducing enough is hard.
- Time-of-day dependencies. An incident that hit at 3 AM may not reproduce at 3 PM. Some state is implicit in the time of day.
- Environmental dependencies. Production has things the sandbox does not: specific data, specific traffic, specific bugs in specific dependency versions.
- Coupling to other tenants. Some incidents need a noisy neighbour to fire. Single-tenant sandboxes never see the trigger.
The 80/20 shortcut
Most reproducer value comes from behavioural reproduction, not exact reproduction. The shortcut below produces a useful sandbox in hours instead of weeks.
- Behavioural reproduction. The sandbox does not have to be production; it has to fail like production did.
- Symptom-first injection. Inject failures that produce the same symptoms. The cause can differ; the symptom is what you study.
- 80 percent value, 20 percent effort. The shortcut delivers roughly 80 percent of the value for 20 percent of the effort. The remaining cases require full reproduction.
- Reserved exact path. Keep the full-reproduction path for root-cause analyses where the symptom alone is not enough.
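One way to make "fails like production did" checkable is to compare a symptom signature from the sandbox against the one recorded during the incident. The sketch below is illustrative only: the `SymptomSignature` fields, the `matches` helper, and the 25 percent tolerance are all assumptions, and a real signature would likely carry more signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SymptomSignature:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    alerts: frozenset       # names of the alerts that fired


def _close(expected: float, actual: float, tolerance: float) -> bool:
    """True when actual is within a relative tolerance of expected."""
    if expected == 0:
        return actual == 0
    return abs(actual - expected) / expected <= tolerance


def matches(incident: SymptomSignature, sandbox: SymptomSignature,
            tolerance: float = 0.25) -> bool:
    """Behavioural match: same alerts fire and the metrics are close enough.

    The cause is deliberately not compared; only the observable symptom is.
    """
    return (
        incident.alerts <= sandbox.alerts  # every incident alert also fired here
        and _close(incident.error_rate, sandbox.error_rate, tolerance)
        and _close(incident.p99_latency_ms, sandbox.p99_latency_ms, tolerance)
    )
```

When `matches` returns true, the sandbox is good enough for studying the symptom. When a root-cause analysis needs more than the symptom, this check is not sufficient and the full-reproduction path applies.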
What the reproducer enables
The reproducer is not just a debugging artefact. It becomes a fixture the team uses for fixes, evals, and onboarding.
- Safe fix testing. Apply the proposed fix to the reproducer; verify it actually fixes the symptom before touching production.
- Agent eval case. The reproducer becomes an eval case for the triage and remediation agents. Replays catch regressions cheaply.
- Onboarding tool. New on-calls who did not live through the original incident learn from the reproducer instead of from prose alone.
- Regression guard. Future deploys re-run the reproducer set to confirm a known incident shape cannot recur silently.
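The regression guard above can be sketched as a loop over the saved reproducers. This assumes each reproducer is wrapped as a callable that returns true when the symptom still appears against the candidate build; the `regression_guard` name and the incident names in the usage note are hypothetical.

```python
from typing import Callable, Dict, List

# A reproducer returns True when the incident symptom still appears.
Reproducer = Callable[[], bool]


def regression_guard(reproducers: Dict[str, Reproducer]) -> List[str]:
    """Run every saved reproducer and return the incident shapes that recur."""
    return sorted(name for name, still_fails in reproducers.items() if still_fails())
```

A deploy pipeline could call this with something like `{"checkout-latency": run_checkout_repro, "disk-full": run_disk_repro}` and block the release when the returned list is not empty.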