Back to glossary
GLOSSARY · C

Chaos Engineering

Deliberately injecting failures into production-like systems to discover weaknesses before customers do.

Definition

Chaos engineering is the practice of running controlled experiments that inject failures (kill a pod, drop network packets, fail a region) into a system to verify the system handles them. Coined at Netflix with Chaos Monkey, the discipline has matured into structured 'game days' where teams pick a hypothesis, define safe blast radius, run the experiment, and document what broke. The output is a list of fixes the team didn't know they needed.

Why it matters

Most production failures are not caused by the failure mode, they are caused by the system's response to the failure mode. A retry storm, a thundering herd on cache miss, a cascading timeout. Chaos engineering surfaces those second-order failures in a controlled window so they are fixed during a planned game day, not at 3am during the unplanned one.

How Nova handles it

See the part of the platform that handles chaos engineering in production.

Nova reliability snapshot