Simulation Engine runs a runbook plan against a digital twin of your environment. It simulates the effect of each step, predicts how the SLIs will move, and reports the expected outcome. Use it to validate that a remediation will actually fix the SLI, or to catch a runbook that would break something the planner did not consider.
The digital twin is a snapshot of your real environment: service graph, current load, current SLIs, current resource utilization. The simulation engine applies the planned runbook step by step against the twin and recomputes the predicted SLIs after each step. Because the twin is a snapshot, simulating against it does not affect production. The simulation runs in seconds.
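The step-by-step loop described above can be sketched roughly as follows. This is a toy illustration, not the engine's implementation: the `Twin` class, the `scale` step, the `simulate` function, and the fixed per-replica capacity are all hypothetical names and assumptions introduced here. The sketch only shows the shape of the idea: copy the snapshot, apply each step to the copy, and recompute a predicted SLI after every step.

```python
import copy
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumed for illustration only: each replica handles 100 requests/second.
CAPACITY_PER_REPLICA_RPS = 100.0


@dataclass
class Twin:
    """Hypothetical snapshot of the environment (fields are illustrative)."""
    replicas: Dict[str, int]
    load_rps: Dict[str, float]


# A runbook step is modelled as a function that mutates a twin in place.
Step = Callable[[Twin], None]


def scale(service: str, replicas: int) -> Step:
    """Build a hypothetical 'scale to N replicas' runbook step."""
    def apply(twin: Twin) -> None:
        twin.replicas[service] = replicas
    return apply


def predict_saturation(twin: Twin) -> Dict[str, float]:
    """Toy SLI model: saturation = load / total assumed capacity."""
    return {
        svc: twin.load_rps[svc] / (count * CAPACITY_PER_REPLICA_RPS)
        for svc, count in twin.replicas.items()
    }


def simulate(snapshot: Twin, steps: List[Step]) -> List[Dict[str, float]]:
    """Apply steps to a copy of the snapshot, predicting SLIs after each one."""
    twin = copy.deepcopy(snapshot)  # the original snapshot is never mutated
    timeline = []
    for step in steps:
        step(twin)
        timeline.append(predict_saturation(twin))
    return timeline


snapshot = Twin(replicas={"checkout": 2}, load_rps={"checkout": 180.0})
timeline = simulate(snapshot, [scale("checkout", 3), scale("checkout", 4)])
for number, slis in enumerate(timeline, start=1):
    print(f"after step {number}: {slis}")
```

Working on a deep copy is what keeps the simulation side-effect free in this sketch; the real twin presumably achieves the same isolation by being detached from production in the first place.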
After each runbook step, the engine predicts the SLI values: p95 latency, error rate, saturation, and any custom SLIs. The prediction uses the same models that drive Predictive Detection. A simulation that predicts an SLO breach is flagged, and the engine recommends adjusting the plan before proceeding.
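As a rough sketch of the flagging rule, assume each SLO is an upper bound on an SLI and that the predictions arrive as one dictionary per runbook step. The threshold values, the SLI key names, and the `find_breaches` helper are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical SLO limits, treated here as upper bounds on each SLI.
SLO_LIMITS: Dict[str, float] = {
    "p95_latency_ms": 300.0,
    "error_rate": 0.01,
    "saturation": 0.80,
}


def find_breaches(
    predicted_by_step: List[Dict[str, float]],
    limits: Dict[str, float],
) -> List[Tuple[int, str, float]]:
    """Return (step number, SLI name, predicted value) for every predicted breach."""
    breaches = []
    for number, slis in enumerate(predicted_by_step, start=1):
        for name, value in slis.items():
            limit = limits.get(name)
            # SLIs without a configured limit are ignored in this sketch.
            if limit is not None and value > limit:
                breaches.append((number, name, value))
    return breaches


# Made-up predictions for a two-step runbook.
predicted = [
    {"p95_latency_ms": 240.0, "error_rate": 0.002, "saturation": 0.70},
    {"p95_latency_ms": 340.0, "error_rate": 0.004, "saturation": 0.85},
]

breaches = find_breaches(predicted, SLO_LIMITS)
flagged = bool(breaches)  # any predicted breach flags the whole simulation
print("flagged:", flagged)
for number, name, value in breaches:
    print(f"step {number}: {name} predicted at {value}")
```

Reporting the step number alongside the SLI makes it easier to see which part of the plan to adjust, rather than only learning that the plan as a whole was flagged.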
Simulate when the change is big (scale operations, mass restarts, schema migrations), when the change is ambiguous (for example, when multiple agents disagreed in debate), or when the change is a recovery plan during an active incident. For routine work, simulation is overkill. The engine suggests a simulation on its own when one is warranted.
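The three criteria above amount to a simple decision rule. The sketch below encodes them under stated assumptions: the `PlannedChange` fields, the change-kind strings, and `should_simulate` are hypothetical names, and the engine's own suggestion logic may weigh other signals.

```python
from dataclasses import dataclass

# Hypothetical labels for the "big" change types named in the text.
BIG_CHANGE_KINDS = {"scale", "mass_restart", "schema_migration"}


@dataclass
class PlannedChange:
    """Illustrative description of a planned runbook change."""
    kind: str
    agents_disagreed: bool = False      # the change is ambiguous
    is_incident_recovery: bool = False  # recovery plan during an active incident


def should_simulate(change: PlannedChange) -> bool:
    """Suggest simulation for big, ambiguous, or incident-recovery changes."""
    return (
        change.kind in BIG_CHANGE_KINDS
        or change.agents_disagreed
        or change.is_incident_recovery
    )


for change in (
    PlannedChange(kind="schema_migration"),
    PlannedChange(kind="config_reload"),
    PlannedChange(kind="config_reload", agents_disagreed=True),
):
    print(change.kind, "->", should_simulate(change))
```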
For every simulated runbook that actually runs, the engine compares predicted and actual SLI movement and reports calibration. Good calibration (predictions within ±10% of the observed values) builds trust; poor calibration triggers a model review. The engine's prediction accuracy is itself a meta-SLI on the Service Health Matrix.
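A minimal sketch of the calibration check, assuming that "within ±10%" means the relative error between the predicted and the observed SLI value. The exact definition the engine uses is not stated here, so that interpretation, the `calibration_report` helper, and the sample numbers are all assumptions.

```python
from typing import Dict, List, Tuple

TOLERANCE = 0.10  # the ±10% band from the text, read here as relative error


def relative_error(predicted: float, actual: float) -> float:
    """Relative error of a prediction; infinite if actual is 0 but predicted is not."""
    if actual == 0:
        return 0.0 if predicted == 0 else float("inf")
    return abs(predicted - actual) / abs(actual)


def calibration_report(
    pairs: Dict[str, Tuple[float, float]],
    tolerance: float = TOLERANCE,
) -> Dict[str, object]:
    """Compare (predicted, actual) pairs per SLI and list those outside the band."""
    errors = {name: relative_error(p, a) for name, (p, a) in pairs.items()}
    out_of_tolerance: List[str] = [
        name for name, err in errors.items() if err > tolerance
    ]
    return {
        "errors": errors,
        "out_of_tolerance": out_of_tolerance,
        "well_calibrated": not out_of_tolerance,
    }


# Made-up (predicted, actual) values for one executed runbook.
observed = {
    "p95_latency_ms": (220.0, 230.0),
    "error_rate": (0.004, 0.005),
}

report = calibration_report(observed)
print("well calibrated:", report["well_calibrated"])
print("needs review:", report["out_of_tolerance"])
```

In this made-up example the latency prediction lands inside the band while the error-rate prediction misses by about 20%, which is the kind of result that would prompt a model review.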
Simulation is a 10-second check that catches "actually, this would break payments" before the real deploy.