Chaos Engineering: When to Start, What to Break First, and Where to Stop
Chaos engineering gets pitched as a Netflix-scale luxury. It isn't. Here is the version that works for a 20-person team.
Signs you're ready
You are ready for chaos engineering when three things are true:
- You have SLOs and an error-budget policy.
- Your runbooks exist and are at least semi-trusted.
- You can measure the blast radius of an experiment in under a minute.
If any of these is missing, fix that first. Chaos is about building confidence in known unknowns; it's useless when the system's baseline behaviour is already murky.
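One way to meet the last criterion is to have the blast-radius query written before the experiment starts, not during it. A minimal sketch, assuming a Prometheus-style metrics backend; the metric and label names (`http_requests_total`, `deployment`) are placeholders for whatever your stack actually exposes:

```python
# Build a PromQL query scoping error rate to the experiment's target,
# so blast radius can be read off a dashboard in one step.
# Metric and label names here are illustrative, not standard.

def blast_radius_query(deployment: str, window: str = "1m") -> str:
    """Error-rate ratio for one deployment over a short window."""
    errors = (f'sum(rate(http_requests_total'
              f'{{deployment="{deployment}",code=~"5.."}}[{window}]))')
    total = (f'sum(rate(http_requests_total'
             f'{{deployment="{deployment}"}}[{window}]))')
    return f"{errors} / {total}"

print(blast_radius_query("checkout"))
```

Keeping the query templated per target means "measure the blast radius in under a minute" is one paste into the query box, not an on-the-spot PromQL exercise.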
Your first experiment
Start with the easiest, least-risky failure mode your system should tolerate: kill a pod. Do it in staging first. Then do it in prod, during business hours, with the team watching dashboards.
The point isn't to break something impressive. The point is to learn whether your monitoring even sees the failure, how long Kubernetes takes to reschedule, and whether your latency SLO stays green.
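The pod-kill itself can be a thin wrapper around `kubectl`. A sketch, assuming cluster access and a hypothetical `checkout` app label; the victim is chosen at random so you are not always testing the same replica:

```python
import random
import subprocess

def pick_victim(namespace: str, app_label: str) -> str:
    """Pick a random pod matching a label (requires kubectl and cluster access)."""
    out = subprocess.run(
        ["kubectl", "-n", namespace, "get", "pods",
         "-l", f"app={app_label}", "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout
    return random.choice(out.split())

def kill_command(namespace: str, pod: str) -> list[str]:
    """Grace period 0 approximates a crash rather than a polite shutdown."""
    return ["kubectl", "-n", namespace, "delete", pod,
            "--grace-period=0", "--force"]

# Dry illustration with a placeholder pod name; pick_victim() needs a live cluster.
print(" ".join(kill_command("staging", "pod/checkout-7f9c-abcde")))
```

The `--grace-period=0 --force` choice is deliberate: a graceful shutdown tests your deploy pipeline, an abrupt one tests the failure mode you actually care about.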
Three stages of escalation
- Single-instance failures: kill one pod, restart one VM, drop one connection pool slot. Low blast radius, high learning.
- Dependency failures: inject latency into your cache, time out calls to an internal service, return 500s from a non-critical dep. Tests circuit breakers, retries, fallback paths.
- Partition failures: sever network routes between zones, drop packets on a subset of traffic, fail an availability zone. High-signal, higher blast radius; run these quarterly.
Don't skip stages. Most teams that jump straight to AZ-failover find out the hard way that stage-1 issues were hidden under the stage-3 noise.
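For stage 2, the cheapest injection point is often the client code itself: wrap the dependency call in a deadline and confirm the fallback path actually fires. A self-contained sketch with simulated latency; `fetch_from_cache`, the 100 ms budget, and the `origin:` fallback are stand-ins for your real client, SLO, and source of truth:

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

INJECTED_LATENCY_S = 0.5   # the "chaos": pretend the cache got slow
CALL_BUDGET_S = 0.1        # stand-in for the real per-call deadline

def fetch_from_cache(key: str) -> str:
    time.sleep(INJECTED_LATENCY_S)   # latency injection point
    return f"cached:{key}"

def fetch_with_fallback(key: str) -> str:
    """Call the cache under a deadline; fall back to origin on timeout."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_from_cache, key)
        try:
            return future.result(timeout=CALL_BUDGET_S)
        except FutureTimeout:
            return f"origin:{key}"   # the path the experiment should exercise

print(fetch_with_fallback("user:42"))
```

If this prints the cached value instead of the fallback, you have learned something before touching production: the deadline isn't enforced where you thought it was.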
When to stop or pause
Two hard rules, one soft one:
- Hard: halt immediately if the SLO burn rate crosses the fast-burn threshold.
- Hard: halt immediately on any user-visible error; investigate after the halt, not before.
- Soft: don't run chaos experiments on Fridays. Real life will pile up over the weekend.
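The fast-burn rule is mechanical enough to automate as an abort switch. A sketch of the arithmetic, using the convention that burn rate is the observed error rate divided by the error budget; the 14.4 threshold is one common choice for a fast-burn alert on a 30-day window, not a universal constant:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent: 1.0 = exactly on budget."""
    budget = 1.0 - slo_target          # e.g. a 99.9% SLO leaves a 0.1% budget
    return error_rate / budget

def should_halt(error_rate: float, slo_target: float,
                fast_burn_threshold: float = 14.4) -> bool:
    """Abort-switch check for the hard rule above."""
    return burn_rate(error_rate, slo_target) >= fast_burn_threshold

# 2% errors against a 99.9% SLO burns the budget 20x too fast: halt.
print(should_halt(0.02, 0.999))
```

Wiring this into the experiment tooling, rather than relying on a human to notice the dashboard, is what makes the hard rule actually hard.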
What the steady state looks like
Once the practice is established, run one stage-1 experiment per week, one stage-2 per month, and one stage-3 per quarter. Write a short postmortem for each, whether anything broke or not; “we learned nothing” is a meaningful result.
The goal is to make chaos experiments so routine they stop feeling like experiments. That's when you know the organisation has internalised the idea.
A first-month plan
Week one: write the hypothesis. “When we kill this pod, latency stays within the SLO and the dashboard shows the failure within 60 seconds.” No action yet.
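Writing the hypothesis as data rather than prose makes pass/fail unambiguous when you run it in weeks two and three. A minimal sketch; the field names and thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hypothesis:
    action: str                  # what we will do
    latency_slo_ms: float        # p99 latency must stay under this
    detection_budget_s: float    # monitoring must see the failure within this

    def verdict(self, observed_p99_ms: float, detected_after_s: float) -> bool:
        """True only if both the latency SLO and the detection budget held."""
        return (observed_p99_ms <= self.latency_slo_ms
                and detected_after_s <= self.detection_budget_s)

h = Hypothesis("kill one checkout pod", latency_slo_ms=300, detection_budget_s=60)
print(h.verdict(observed_p99_ms=240, detected_after_s=35))   # within both budgets
```

The same record, filled in with observed numbers, becomes the body of the week-two and week-three notes.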
Week two: run it in staging. Confirm the hypothesis or fail it and fix the gap. Publish a short note either way.
Week three: run the same experiment in production, business hours, with the team watching. Publish the results. That cadence, sustained for three months, will teach the team more about the system than a year of reading design docs.