Chaos Engineering with LitmusChaos
This tutorial covers enough chaos engineering to run your first experiment, but not enough to design a full program.
Step 1: Install Litmus
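helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm repo update
helm install litmus litmuschaos/litmus -n litmus --create-namespace
Wait until every pod in the litmus namespace is Running and Ready, then reach the UI with kubectl port-forward against the frontend service.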
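The "wait for pods" step can be scripted instead of watched by eye. The following is a minimal sketch, assuming kubectl is on your PATH and already points at the right cluster. The function names are hypothetical and are not part of Litmus.

```python
import json
import subprocess


def all_pods_ready(pod_list: dict) -> bool:
    """Return True when every pod in a `kubectl get pods -o json` document is Ready."""
    pods = pod_list.get("items", [])
    if not pods:
        return False  # nothing scheduled yet, so keep waiting
    for pod in pods:
        status = pod.get("status", {})
        if status.get("phase") == "Succeeded":
            continue  # completed one-shot pods never report Ready
        ready = any(
            c.get("type") == "Ready" and c.get("status") == "True"
            for c in status.get("conditions", [])
        )
        if not ready:
            return False
    return True


def litmus_pods_ready(namespace: str = "litmus") -> bool:
    """Ask the cluster through kubectl (assumed installed) and check readiness."""
    out = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    ).stdout
    return all_pods_ready(json.loads(out))
```

Call litmus_pods_ready() in a loop with a short sleep until it returns True. The frontend service name and port depend on the Helm release name and chart version, so list them with kubectl get svc -n litmus before port-forwarding.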
Step 2: Pick an experiment
- ChaosHub lists ready-made experiments such as pod-delete, pod-network-loss, and pod-cpu-hog.
- Pick pod-delete on a stateless deployment first.
Step 3: Run the experiment
Configure the experiment, point it at a target, and let Litmus orchestrate the failure injection on schedule.
- Target. Pick a deployment or pod selector; start with a stateless workload that has multiple replicas.
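- Scale. Set the percentage of pods affected (the PODS_AFFECTED_PERC variable in pod-delete); 50% is a reasonable first run, and killing every replica comes later.
- Duration. 30 to 60 seconds (TOTAL_CHAOS_DURATION) is long enough to observe recovery and short enough that unrelated noise does not obscure the signal.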
- Run. Apply the ChaosEngine CR; Litmus injects failure on schedule and emits events as it runs.
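The four settings above map onto fields of a ChaosEngine manifest. The following is a sketch that builds one for pod-delete as a plain dictionary. The field names follow the litmuschaos.io/v1alpha1 schema as commonly documented, so verify them against your installed version. The engine name, namespace, label, and service account in the example call are hypothetical placeholders.

```python
import json


def pod_delete_engine(
    name: str,
    namespace: str,
    app_label: str,
    service_account: str,
    pods_affected_perc: int = 50,
    duration_s: int = 30,
    interval_s: int = 10,
) -> dict:
    """Build a ChaosEngine manifest for the pod-delete experiment."""
    env = {
        "TOTAL_CHAOS_DURATION": duration_s,        # Duration, in seconds
        "CHAOS_INTERVAL": interval_s,              # gap between deletions
        "PODS_AFFECTED_PERC": pods_affected_perc,  # Scale
        "FORCE": "false",                          # graceful delete
    }
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Target: a deployment selected by label
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [
                {
                    "name": "pod-delete",
                    "spec": {
                        "components": {
                            # Litmus expects env values as strings
                            "env": [
                                {"name": k, "value": str(v)}
                                for k, v in env.items()
                            ]
                        }
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    # All four arguments are illustrative names, not defaults shipped by Litmus.
    engine = pod_delete_engine(
        "checkout-chaos", "staging", "app=checkout", "pod-delete-sa"
    )
    print(json.dumps(engine, indent=2))
```

kubectl accepts JSON as well as YAML, so the output can be piped into kubectl apply -f -. The pod-delete ChaosExperiment and the service account must already exist in the target namespace for the engine to start.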
Step 4: Read the result
Probes are how chaos becomes a test rather than a stunt. They assert the system kept doing its job during the failure.
- HTTP probe. Verifies the service kept serving traffic during the chaos window.
- Cmd probe. Runs a command and verifies that its output (metrics, logs, or external state) matches expectations.
- Pass. System absorbed the chaos; record the experiment in the catalogue and schedule a re-run.
- Fail. Investigate, fix the root cause, then re-run; a fix without a re-run does not prove the regression is gone.
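The probe and the pass/fail reading can be sketched in a few lines. The first function builds an httpProbe entry. The second reads the verdict from a ChaosResult fetched with kubectl get chaosresult <name> -o json. Field names follow the Litmus probe and ChaosResult schemas as commonly documented; treat the timing values and the example URL as assumptions to adapt.

```python
def http_probe(name: str, url: str, expected_code: int = 200) -> dict:
    """Build an httpProbe entry that is evaluated throughout the chaos window."""
    return {
        "name": name,
        "type": "httpProbe",
        "mode": "Continuous",
        "httpProbe/inputs": {
            "url": url,
            "method": {
                "get": {"criteria": "==", "responseCode": str(expected_code)}
            },
        },
        # Illustrative timings. Newer Litmus releases expect duration
        # strings such as "5s"; older ones take plain integers.
        "runProperties": {"probeTimeout": "5s", "interval": "2s", "retry": 1},
    }


def experiment_passed(chaos_result: dict) -> bool:
    """Return True only when the ChaosResult verdict is exactly 'Pass'."""
    status = chaos_result.get("status", {}).get("experimentStatus", {})
    # A missing verdict is treated as not passed: the run may still be
    # in progress ("Awaited") or may have been stopped.
    return status.get("verdict") == "Pass"
```

The probe dictionary belongs in the probe list under the experiment's spec in the ChaosEngine. Treating anything other than an explicit Pass as a failure is a deliberately conservative choice for a first run.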
Antipatterns
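- Production chaos on day one. Start in staging and move to production only after the experiment passes there.
- No probes. Without probes you cannot verify the outcome.
- One-off chaos. The value comes from repeating experiments, so schedule re-runs.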
What to do this week
Three moves. (1) Run the tutorial end-to-end on your own laptop or in a sandbox cluster. (2) Apply the pattern to one workload that matters in production, starting in its staging environment. (3) Document the variations you needed and share them with the team.