Deliberately injecting failures into production-like systems to discover weaknesses before customers do.
Chaos engineering is the practice of running controlled experiments that inject failures (kill a pod, drop network packets, fail a region) into a system to verify the system handles them. Coined at Netflix with Chaos Monkey, the discipline has matured into structured 'game days' where teams pick a hypothesis, define safe blast radius, run the experiment, and document what broke. The output is a list of fixes the team didn't know they needed.
Most production failures are not caused by the failure mode, they are caused by the system's response to the failure mode. A retry storm, a thundering herd on cache miss, a cascading timeout. Chaos engineering surfaces those second-order failures in a controlled window so they are fixed during a planned game day, not at 3am during the unplanned one.
See the part of the platform that handles chaos engineering in production.