Chaos Engineering with LitmusChaos

This tutorial covers enough chaos engineering to run your first experiment; it does not cover designing a full chaos program.

Step 1: Install Litmus

helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm install litmus litmuschaos/litmus -n litmus --create-namespace

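Wait until every pod in the litmus namespace is Running (kubectl get pods -n litmus), then reach the UI by port-forwarding the frontend service; kubectl get svc -n litmus lists its name and port.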

Step 2: Pick an experiment

ChaosHub ships ready-made experiments such as pod-delete, pod-network-loss, and pod-cpu-hog.
Start with pod-delete on a stateless deployment.

Step 3: Run the experiment

Configure the experiment, point it at a target, and let Litmus orchestrate the failure injection on schedule.

Target. Pick a deployment or pod selector; start with a stateless workload that has multiple replicas.
Scale. Set the percentage of matching pods to affect (PODS_AFFECTED_PERC for pod-delete); 50% is a good first run, full kill comes later.
Duration. 30 to 60 seconds (TOTAL_CHAOS_DURATION) is enough to observe recovery; a longer window adds noise without adding signal.
Run. Apply the ChaosEngine CR; Litmus injects failure on schedule and emits events as it runs.
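
The four settings above map onto fields of a ChaosEngine manifest. The sketch below builds one as a Python dict and prints it as JSON, which kubectl accepts as well as YAML. It assumes a Deployment labelled app=nginx in the default namespace and a service account called pod-delete-sa with the permissions the experiment needs; both names are hypothetical. Field and env names follow the pod-delete experiment as commonly documented, so check them against the Litmus version you installed.

```python
import json


def pod_delete_engine(name, namespace, app_label, service_account,
                      affected_perc=50, duration_s=30, interval_s=10):
    """Build a ChaosEngine manifest (as a dict) for the pod-delete experiment."""
    if not 0 <= affected_perc <= 100:
        raise ValueError("affected_perc must be between 0 and 100")
    env = {
        "TOTAL_CHAOS_DURATION": duration_s,   # Duration: total chaos window, seconds
        "CHAOS_INTERVAL": interval_s,         # pause between successive deletions, seconds
        "PODS_AFFECTED_PERC": affected_perc,  # Scale: share of matching pods to delete
        "FORCE": "false",                     # use graceful deletion
    }
    return {
        "apiVersion": "litmuschaos.io/v1alpha1",
        "kind": "ChaosEngine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "engineState": "active",
            # Target: which workload the experiment selects pods from
            "appinfo": {
                "appns": namespace,
                "applabel": app_label,
                "appkind": "deployment",
            },
            "chaosServiceAccount": service_account,
            "experiments": [{
                "name": "pod-delete",
                "spec": {"components": {"env": [
                    # env values must be strings in the manifest
                    {"name": key, "value": str(value)}
                    for key, value in env.items()
                ]}},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical names; replace with your own workload and service account.
    engine = pod_delete_engine("nginx-chaos", "default", "app=nginx", "pod-delete-sa")
    print(json.dumps(engine, indent=2))
```

Saved as engine.py (a hypothetical filename), the output can be piped straight into the cluster: python engine.py | kubectl apply -f -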

Step 4: Read the result

Probes are how chaos becomes a test rather than a stunt. They assert the system kept doing its job during the failure.

HTTP probe. Verifies the service kept serving traffic during the chaos window.
Cmd probe. Verifies metrics, logs, or external state match expectations.
Pass. System absorbed the chaos; record the experiment in the catalogue and schedule a re-run.
Fail. Investigate, fix the root cause, then re-run; a one-time fix without re-run does not prove the regression is gone.
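
A minimal sketch of both halves of this step: an httpProbe entry to place in the experiment's spec.probe list, and a helper that reads the verdict from a ChaosResult fetched with kubectl get chaosresult <name> -o json. The probe name and URL are hypothetical. The probe layout assumes the Litmus 3.x schema, where timeouts are strings such as "5s"; older releases use plain integers, so verify against your version.

```python
import json


def http_probe(name, url, expected_code=200):
    """Return an httpProbe entry for an experiment's spec.probe list."""
    return {
        "name": name,
        "type": "httpProbe",
        "mode": "Continuous",  # keep polling for the whole chaos window
        "httpProbe/inputs": {
            "url": url,
            "method": {"get": {"criteria": "==",
                               "responseCode": str(expected_code)}},
        },
        # Assumed 3.x-style duration strings; adjust for older schemas.
        "runProperties": {"probeTimeout": "5s", "interval": "2s", "retry": 1},
    }


def verdict(chaos_result):
    """Map a ChaosResult object to 'pass', 'fail', or 'inconclusive'."""
    status = chaos_result.get("status", {}).get("experimentStatus", {})
    value = status.get("verdict", "Awaited")
    if value == "Pass":
        return "pass"
    if value == "Fail":
        return "fail"
    # Awaited, Stopped, or anything unrecognised: do not treat as a pass.
    return "inconclusive"


if __name__ == "__main__":
    # Illustrative input only, trimmed to the fields the helper reads.
    sample = json.loads('{"status": {"experimentStatus": {"verdict": "Fail"}}}')
    print(verdict(sample))
```

Treating every non-Pass, non-Fail verdict as inconclusive is a deliberate choice here: an aborted or still-running experiment should trigger a re-run, not an entry in the catalogue.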

Antipatterns

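Production chaos on day one. Start in staging and move to production only after the experiment passes there.
No probes. Without them you cannot verify the outcome, only that something broke.
One-off chaos. The value comes from repeating experiments so that regressions surface.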

What to do this week

Three moves. (1) Run the tutorial end-to-end on your own laptop or sandbox cluster. (2) Apply the pattern to the staging copy of one production workload. (3) Document the variations you needed and share them with the team.