Agentic SRE Advanced By Samson Tanimawo, PhD Published May 9, 2026 5 min read

Replicating a Production Incident in a Sandbox via Agent

An agent that takes an incident timeline, builds a sandbox, and reproduces the failure. Why this is harder than it sounds, and the shortcut that saves 80% of the work.

Input: the incident timeline

The agent reads the timeline: which alerts fired, which services were affected, which deploys went out, what actions responders took, and when recovery was achieved.

From the timeline, the agent extracts the system state at the moment of the incident: versions, configs, traffic patterns, data shape.

The state extraction is the hard part. The agent has to reason about which fields matter and which are noise.
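As a rough illustration of that filtering, here is a minimal sketch. The `Event` and `Snapshot` shapes, the event kinds, and the rule "keep only state for services that alerted" are all assumptions made for this example, not a prescribed schema. A real agent would reason over far messier input.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    ts: float                 # seconds since some shared epoch (assumed)
    kind: str                 # "deploy", "config", "alert", or "action" (assumed kinds)
    service: str
    detail: dict = field(default_factory=dict)


@dataclass
class Snapshot:
    versions: dict = field(default_factory=dict)   # service -> version
    configs: dict = field(default_factory=dict)    # service -> merged config
    affected: set = field(default_factory=set)     # services that alerted


def extract_state(events, incident_start):
    """Rebuild versions and configs as they stood when the incident began."""
    snap = Snapshot()
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.kind == "alert":
            # Alerts at any time tell us which services were affected.
            snap.affected.add(ev.service)
        elif ev.ts > incident_start:
            # Later deploys and config changes are remediation, not incident state.
            continue
        elif ev.kind == "deploy":
            snap.versions[ev.service] = ev.detail["version"]
        elif ev.kind == "config":
            snap.configs.setdefault(ev.service, {}).update(ev.detail)
    # Crude noise filter: drop state for services that never alerted.
    snap.versions = {s: v for s, v in snap.versions.items() if s in snap.affected}
    snap.configs = {s: c for s, c in snap.configs.items() if s in snap.affected}
    return snap
```

The noise filter here is deliberately naive. An unaffected upstream service can still be the cause, which is exactly why this step needs reasoning rather than a fixed rule.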

Build the sandbox

Stand up the affected services at the versions they were running at the time of the incident. Pin the exact container images or commits if they are available.

Replay the traffic that preceded the incident. Use recorded traffic if you have it; fall back to synthetic traffic if you do not.
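One way to think about replay is as a schedule of offsets from the first recorded request. The sketch below assumes recorded traffic is available as `(timestamp, request)` pairs; the `speedup` knob and the `send` callback are hypothetical names introduced for illustration.

```python
import time


def replay_schedule(recorded, speedup=1.0):
    """Turn (timestamp, request) pairs into (offset_seconds, request) pairs.

    Relative order and spacing are preserved; speedup > 1 compresses time.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not recorded:
        return []
    ordered = sorted(recorded, key=lambda r: r[0])
    t0 = ordered[0][0]
    return [((ts - t0) / speedup, req) for ts, req in ordered]


def replay(schedule, send, sleep=time.sleep):
    """Send each request at its offset. Ignores the time spent inside send()."""
    elapsed = 0.0
    for offset, req in schedule:
        sleep(max(0.0, offset - elapsed))
        elapsed = offset
        send(req)
```

Passing `sleep` as a parameter keeps the replay loop testable without real waiting. A production-grade replayer would also need concurrency and back-pressure handling, which this sketch leaves out.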

Inject the failure: kill a pod, slow a network link, fill a disk, whatever the timeline indicates.
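The mapping from a timeline finding to an injection can be sketched as a small lookup. The example below assumes the sandbox runs on Kubernetes with Linux nodes, which the article does not specify; the fault dictionary keys are hypothetical. It only builds the command and leaves execution to the caller, so the agent's plan can be reviewed before anything runs.

```python
def injection_command(fault):
    """Return the argv list for one fault. Does not execute anything."""
    kind = fault["kind"]
    if kind == "kill_pod":
        # Assumes kubectl is configured to point at the sandbox cluster.
        return ["kubectl", "delete", "pod", fault["pod"], "-n", fault["namespace"]]
    if kind == "slow_link":
        # Assumes Linux traffic control (tc) with the netem qdisc is available.
        return ["tc", "qdisc", "add", "dev", fault["device"],
                "root", "netem", "delay", f"{fault['delay_ms']}ms"]
    if kind == "fill_disk":
        # Assumes fallocate is available and the path is on the target volume.
        return ["fallocate", "-l", fault["size"], fault["path"]]
    raise ValueError(f"unknown fault kind: {kind}")


# Hypothetical fault derived from a timeline entry such as "link latency spiked".
cmd = injection_command({"kind": "slow_link", "device": "eth0", "delay_ms": 200})
```

Teams that already use a chaos-engineering tool would more likely emit that tool's experiment definitions instead of raw commands.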

Why this is harder than it sounds

Production state has thousands of dimensions. Reproducing it exactly is impossible; reproducing enough of it is hard.

Time-of-day dependencies: an incident that happened at 3 AM may not reproduce at 3 PM. Some state is implicit in the clock, such as scheduled jobs, traffic mix, or cache age.
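Where the services under test read time through an injectable clock (an assumption that will not hold for every codebase), the sandbox can pretend it is 3 AM. The `SandboxClock` class below is a hypothetical helper, shown only to make the idea concrete.

```python
from datetime import datetime, timedelta, timezone


class SandboxClock:
    """Clock that starts at the incident time and advances with real elapsed time."""

    def __init__(self, incident_time, real_now=None):
        real_now = real_now or datetime.now(timezone.utc)
        # Fixed offset between sandbox time and wall-clock time.
        self._offset = incident_time - real_now

    def now(self, real_now=None):
        real_now = real_now or datetime.now(timezone.utc)
        return real_now + self._offset


# Illustrative timestamps only; they are not taken from a real incident.
incident = datetime(2026, 5, 1, 3, 0, tzinfo=timezone.utc)
started = datetime(2026, 5, 9, 15, 0, tzinfo=timezone.utc)
clock = SandboxClock(incident, real_now=started)
five_min_in = clock.now(real_now=started + timedelta(minutes=5))
```

This covers code that asks "what time is it?" but not state that accumulated because of the time, such as a nightly batch job that had already half-finished.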

Environmental dependencies: production has things the sandbox does not (specific data, specific traffic, specific bugs in specific dependencies).

The 80/20 shortcut

Skip exact reproduction; aim for behavioural reproduction. The sandbox does not have to BE production; it has to FAIL like production did.

Inject failures that produce the same symptoms. The cause can differ; the symptom is what you study.
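"Fails like production did" needs an operational definition. One possible definition, sketched below, is that every symptom metric seen in production lands within a relative tolerance in the sandbox. The metric names and the 25% tolerance are assumptions chosen for the example.

```python
def symptoms_match(prod, sandbox, rel_tol=0.25):
    """True if every production symptom metric is within rel_tol in the sandbox.

    Extra sandbox metrics are ignored; only production's symptoms are checked.
    """
    for name, expected in prod.items():
        observed = sandbox.get(name)
        if observed is None:
            return False            # symptom was never measured in the sandbox
        if expected == 0:
            if observed != 0:
                return False        # relative error is undefined at zero
            continue
        if abs(observed - expected) / abs(expected) > rel_tol:
            return False
    return True


# Hypothetical symptom signature taken from incident dashboards.
prod = {"error_rate": 0.30, "p99_latency_ms": 2000}
close = {"error_rate": 0.27, "p99_latency_ms": 2300, "cpu": 0.9}
far = {"error_rate": 0.05, "p99_latency_ms": 2100}
```

With these numbers, `symptoms_match(prod, close)` passes and `symptoms_match(prod, far)` does not, because the sandbox error rate is far below what production saw.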

This shortcut saves roughly 80% of the work and still delivers roughly 80% of the value. The remaining 20% (cases where exact reproduction matters for root-cause analysis) still requires the full work.

What the reproducer enables

Test fixes safely: apply the proposed fix to the reproducer and verify that it actually resolves the failure.
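A fix check is only meaningful if the reproducer fails without the fix. A minimal sketch of that before-and-after check follows; `reproduce` and `apply_fix` are hypothetical callables supplied by whoever owns the sandbox, and `reproduce` is assumed to return `True` when the failure occurs.

```python
def verify_fix(reproduce, apply_fix, runs=3):
    """Check the failure reproduces before the fix and disappears after it."""
    # Baseline: the failure must show up on every run, or the test proves nothing.
    if not all(reproduce() for _ in range(runs)):
        return "baseline_not_reproduced"
    apply_fix()
    # After the fix, a single recurrence counts as a failed fix.
    if any(reproduce() for _ in range(runs)):
        return "still_failing"
    return "fixed"


# Toy stand-in for a sandbox: the failure occurs until the patch flag is set.
state = {"patched": False}
result = verify_fix(lambda: not state["patched"],
                    lambda: state.update(patched=True))
```

Running several times on each side is a cheap guard against flaky reproducers; the right number of runs depends on how intermittent the original failure was.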

Train the agent: the reproducer becomes an eval case for the triage and remediation agents.
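Packaged as an eval case, a reproducer might look like the sketch below. The field names, the two-part scoring, and the incident ID are all hypothetical; the point is only that the case carries the symptoms the agent sees and the answers it is graded against.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    incident_id: str
    symptoms: dict                 # what the agent is shown
    expected_service: str          # the service triage should name
    accepted_actions: frozenset    # remediations considered correct


def score(case, diagnosis):
    """Half credit for naming the service, half for an accepted remediation."""
    points = 0
    if diagnosis.get("service") == case.expected_service:
        points += 1
    if diagnosis.get("action") in case.accepted_actions:
        points += 1
    return points / 2


# Hypothetical case built from a reproduced incident.
case = EvalCase(
    incident_id="INC-0001",
    symptoms={"error_rate": 0.30, "p99_latency_ms": 2000},
    expected_service="api",
    accepted_actions=frozenset({"rollback", "scale_up"}),
)
partial = score(case, {"service": "api", "action": "restart"})
```

Accepting a set of remediations rather than one avoids penalising the agent for a different but valid fix.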

Learn: the reproducer is a teaching tool for new on-call engineers who did not live through the original incident.