The Synthetic Incident Generator: How to Build One
When you cannot wait for real incidents, generate them. The patterns, the realism dial, and the bias traps to avoid when synthetic data drives agent training.
When synthetic data is justified
Real incidents are slow to accumulate. A new service that has not paged yet has zero real cases. Synthetic incidents fill the gap so the agent can be evaluated before the first real page.
Synthetic data is also useful for rare scenarios: the kind of incident you have seen twice in five years. You cannot wait for a third occurrence; you generate variants and test now.
Synthetic data is not a substitute for real data; it is a complement. The minute real incidents accumulate, prefer them. Synthetic data has structural biases that real data does not.
Build templates, not one-offs
A template parameterises an incident type: "database-latency-spike" with parameters for the database engine, the service, the spike magnitude, the time of day, the contributing factor.
Sample parameters from realistic distributions: time of day skewed toward business hours, database engines weighted by what your fleet actually runs, spike magnitudes following a power law.
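As a concrete sketch (Python, with assumed engine names, weights, and a power-law exponent of 1.5; none of these are prescribed values), sampling for a "database-latency-spike" template might look like this:

```python
import random

def sample_latency_spike_params(rng: random.Random) -> dict:
    # Time of day skewed toward business hours (09:00-18:00 weighted 3x).
    hours = list(range(24))
    weights = [3 if 9 <= h < 18 else 1 for h in hours]
    hour = rng.choices(hours, weights=weights, k=1)[0]

    # Engines weighted by an assumed fleet mix; substitute your own.
    engine = rng.choices(
        ["postgres", "mysql", "redis"], weights=[0.6, 0.3, 0.1], k=1
    )[0]

    # Spike magnitude following a power law: mostly small, occasionally huge.
    latency_multiplier = round(1 + rng.paretovariate(1.5), 1)

    return {
        "template": "database-latency-spike",
        "hour_of_day": hour,
        "engine": engine,
        "latency_multiplier": latency_multiplier,
        "contributing_factor": rng.choice(
            ["connection-pool-exhaustion", "missing-index", "noisy-neighbour"]
        ),
    }
```

Seeding the generator (`random.Random(42)`) keeps the corpus reproducible from run to run, which matters when you want to compare agent versions on the same synthetic set.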
Templates compose. "Latency spike during a deploy" combines two simpler templates. Composition is how you cover the long tail with a manageable number of templates.
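One minimal way to compose two sampled templates is to merge their parameters and stagger them on a timeline; the field names below are illustrative, not a prescribed schema:

```python
import random

def compose(rng: random.Random, base: dict, overlay: dict) -> dict:
    """Merge two sampled templates into one composite incident.

    The overlay (e.g. a deploy) starts shortly before the base symptom
    (e.g. a latency spike), which is what makes the composite plausible.
    """
    return {
        "template": f"{overlay['template']}+{base['template']}",
        "timeline": [
            {"event": overlay["template"], "offset_min": 0},
            {"event": base["template"], "offset_min": rng.randint(2, 20)},
        ],
        "parameters": {**base, **overlay},
    }
```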
Realism dial
Set the dial to match your purpose. Too realistic and you spend generator effort reproducing what real data would give you anyway; too unrealistic and the agent learns patterns that do not exist in production.
The dial controls: noise in metrics, plausibility of causes, presence of misleading signals. Each dimension is configurable.
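One possible shape for that configuration, with assumed field names and default values:

```python
import random
from dataclasses import dataclass

@dataclass
class RealismDial:
    """A sketch of the realism dial; fields map to the three dimensions above."""
    metric_noise: float = 0.2        # 0 = clean curves, 1 = very noisy metrics
    cause_plausibility: float = 0.8  # how often the cause fits the symptoms
    red_herring_rate: float = 0.3    # probability of injecting misleading signals

def apply_dial(incident: dict, dial: RealismDial, rng: random.Random) -> dict:
    # Attach the noise level and, with some probability, a misleading signal.
    incident["metric_noise"] = dial.metric_noise
    if rng.random() < dial.red_herring_rate:
        incident.setdefault("signals", []).append("unrelated-cpu-alert")
    return incident
```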
Validate by sampling. Show generated incidents to a senior on-call; ask them to rate plausibility. Tune the dials until ratings cluster around "realistic enough."
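A small helper for summarising a review round might look like the sketch below; the 1-to-5 rating scale and the target of 4 are assumptions:

```python
def review_summary(ratings: list[int], target: float = 4.0) -> dict:
    """Summarise plausibility ratings from senior on-call reviewers.

    Assumes 1-5 scores where 4 or above means "realistic enough".
    """
    mean = sum(ratings) / len(ratings)
    return {
        "mean": round(mean, 2),
        "fraction_at_target": sum(r >= target for r in ratings) / len(ratings),
        "needs_tuning": mean < target,
    }
```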
Bias traps to avoid
Trap one: every synthetic incident has a clear cause. Real incidents often do not. Generate "unclear cause" cases deliberately; the agent should be able to say "I cannot determine this."
Trap two: every synthetic incident is single-service. Real incidents often span services. Generate cross-service cases; they expose different failure modes in the agent.
Trap three: every synthetic incident has a happy ending. Real incidents sometimes fail despite best effort. Generate cases the agent cannot fix; the right behaviour is escalation.
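Putting the three traps together, the generation mix can name these case types explicitly rather than leaving them to chance. The weights below are illustrative, not recommendations:

```python
import random

# Deliberate case-type mix: unclear-cause, cross-service, and unfixable
# cases are first-class, each with its own expected agent behaviour.
CASE_MIX = [
    ("clear-cause-single-service", 0.55),
    ("unclear-cause", 0.15),   # correct answer: "I cannot determine this"
    ("cross-service", 0.20),   # cause and symptom live in different services
    ("unfixable", 0.10),       # correct behaviour: escalate
]

def sample_case_type(rng: random.Random) -> str:
    types, weights = zip(*CASE_MIX)
    return rng.choices(types, weights=weights, k=1)[0]
```

The point is that the awkward cases are part of the corpus by design, so the agent is scored on saying "I don't know" and on escalating, not just on finding causes.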
Retire synthetic cases as real cases arrive
Every synthetic case that has a real-incident equivalent should be retired. Real data is always preferred. Treat synthetic as a placeholder for the corpus you do not have yet.
Track the synthetic-to-real ratio. Year one might be 80% synthetic. Year three should be under 30%. If the ratio is not declining, the team is not gathering real cases effectively.
Document why each synthetic case exists. "Generated because we have not had a real cross-region failover yet" is a sentence worth preserving until the day you do.
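A hypothetical corpus record that captures the synthetic flag, the documented reason, and retirement once a real equivalent arrives might look like this; field names and the scenario-matching rule are assumptions:

```python
from dataclasses import dataclass

@dataclass
class CorpusCase:
    case_id: str
    scenario: str        # e.g. "cross-region-failover"
    synthetic: bool
    reason: str = ""     # why this synthetic case exists
    retired: bool = False

def retire_covered_synthetics(cases: list[CorpusCase]) -> None:
    """Retire synthetic cases whose scenario now has a real-incident equivalent."""
    real_scenarios = {c.scenario for c in cases if not c.synthetic}
    for c in cases:
        if c.synthetic and c.scenario in real_scenarios:
            c.retired = True

def synthetic_ratio(cases: list[CorpusCase]) -> float:
    """Share of active (non-retired) cases that are synthetic; track this over time."""
    active = [c for c in cases if not c.retired]
    return sum(c.synthetic for c in active) / len(active) if active else 0.0
```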