Availability Zone Isolation Test
Chaos engineering tests how a system handles AZ failures. This article covers the test scenario, the metrics to watch, and the bugs the test has caught.
The scenario
The availability zone isolation test is the chaos engineering exercise that validates whether a multi-AZ deployment actually survives the loss of an AZ. Most teams claim multi-AZ readiness; few have tested it. The test is the discipline that converts the claim into demonstrated fact.
What the scenario looks like:
- Block all traffic to one AZ at the network layer: Network ACL or security group changes block traffic to and from the test AZ. Workloads in the AZ are unreachable from outside; workloads inside the AZ cannot reach out.
- Workloads in that AZ are unreachable: The blocked AZ behaves as if its physical infrastructure failed. Pods cannot reach the network; load balancers cannot reach pods in the AZ; the AZ effectively does not exist for the duration of the test.
- Other AZs continue serving: The remaining AZs handle all traffic. The system should continue operating; users should not see degradation; the on-call should not be paged.
- Plan the test window: The test happens during a planned window. Stakeholders are informed; rollback is prepared; the team is ready to abort if real customer impact appears.
- Reverse the block: When the test ends, the network blocks are removed. The AZ comes back; replication catches up; the system returns to normal multi-AZ operation.
The scenario is straightforward in theory; the execution requires preparation. The team's investment in the test produces high-value validation.
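As a minimal sketch of the network-layer block and its rollback, the code below only builds the deny rules and the keys needed to remove them; it makes no cloud API calls. The `AclRule` type, the function names, and the choice of rule number 1 are assumptions made for illustration. The fields mirror the parameters of the EC2 CreateNetworkAclEntry call, and the sketch assumes rule number 1 is unused in each ACL, so the deny is evaluated before any allow rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclRule:
    """One deny entry, shaped like the arguments to EC2's CreateNetworkAclEntry."""
    network_acl_id: str
    rule_number: int
    egress: bool
    protocol: str = "-1"          # "-1" means all protocols
    rule_action: str = "deny"
    cidr_block: str = "0.0.0.0/0"


def isolation_rules(network_acl_ids, first_rule_number=1):
    """Return one inbound and one outbound deny-all rule per ACL in the test AZ."""
    rules = []
    for acl_id in network_acl_ids:
        # Inbound and outbound rules are numbered separately, so both can
        # share the same (assumed free) rule number.
        for egress in (False, True):
            rules.append(AclRule(acl_id, first_rule_number, egress))
    return rules


def rollback_keys(rules):
    """Return the (ACL id, rule number, egress) triples that identify each rule for deletion."""
    return [(r.network_acl_id, r.rule_number, r.egress) for r in rules]
```

Computing the rollback keys before applying any rule means the reversal step does not depend on state gathered during the test, which matters if the test has to be aborted partway through.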
Metrics
The test is judged by metrics. Specific success criteria are defined in advance, and the test passes or fails against those criteria. Without explicit metrics, the test is an exercise in opinion.
- Service latency p99: The 99th percentile latency should not spike beyond a defined threshold during the test. Some increase is acceptable (the surviving AZs are handling more load); a large spike indicates capacity problems.
- Should not spike beyond N% during the test: The threshold N is defined in advance. 20% is typical; 10% is aggressive; 50% is loose. The threshold reflects the team's tolerance for degradation during AZ failure.
- Error rate should remain at baseline: Errors should not increase materially. If the error rate spikes, the system is not handling the AZ loss cleanly; some requests are failing rather than routing to surviving AZs.
- Capacity: The remaining AZs should absorb the additional load without saturation. CPU, memory, and network utilization in surviving AZs should stay within healthy ranges. Saturation indicates the architecture is not sized for AZ loss.
- Remaining AZs should absorb the load: The capacity check is what validates the multi-AZ sizing. If the remaining AZs cannot absorb the load, the architecture has failed even if other metrics look acceptable; a real AZ failure would overload the survivors and compound the outage.
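The sizing question behind the capacity check is simple arithmetic, sketched here under the assumption that load is spread evenly across AZs and redistributes evenly to the survivors. The function names and the 80% saturation limit are illustrative assumptions, not recommendations from the text above.

```python
def utilization_after_az_loss(current_utilization, az_count):
    """Projected per-AZ utilization after one of az_count evenly loaded AZs is lost."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return current_utilization * az_count / (az_count - 1)


def max_safe_utilization(az_count, saturation_limit=0.8):
    """Highest steady-state utilization that stays under saturation_limit after losing one AZ."""
    if az_count < 2:
        raise ValueError("need at least two AZs to survive losing one")
    return saturation_limit * (az_count - 1) / az_count
```

Under these assumptions, three AZs running at 60% would land at roughly 90% on the two survivors, and staying under an 80% limit would require steady-state utilization of about 53% or less. Real traffic is rarely perfectly even, which is one reason the test measures the result instead of relying on the arithmetic.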
The metrics are what convert the test from a feeling-based exercise to a data-based validation. Without them, the test produces hand-waving conclusions.
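One way to make the pass/fail judgment mechanical is to encode the criteria and compare a baseline snapshot with a snapshot taken during the test. This is a sketch under stated assumptions: the `Snapshot` and `Criteria` types, the field names, the error-rate tolerance, and the utilization limit are all hypothetical, and only the 20% latency default comes from the thresholds discussed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    p99_latency_ms: float
    error_rate: float         # fraction of requests that failed, 0.0 to 1.0
    peak_utilization: float   # busiest surviving-AZ resource, 0.0 to 1.0


@dataclass(frozen=True)
class Criteria:
    max_p99_increase_pct: float = 20.0    # the "N%" threshold
    max_error_rate_delta: float = 0.001   # assumed tolerance for "at baseline"
    max_utilization: float = 0.8          # assumed saturation limit


def evaluate(baseline, during, criteria=Criteria()):
    """Return a dict mapping each criterion name to True (pass) or False (fail)."""
    p99_increase_pct = (
        (during.p99_latency_ms - baseline.p99_latency_ms)
        / baseline.p99_latency_ms * 100
    )
    error_delta = during.error_rate - baseline.error_rate
    return {
        "p99_latency": p99_increase_pct <= criteria.max_p99_increase_pct,
        "error_rate": error_delta <= criteria.max_error_rate_delta,
        "capacity": during.peak_utilization <= criteria.max_utilization,
    }


def passed(results):
    """The test passes only if every criterion passes."""
    return all(results.values())
```

Returning a per-criterion result instead of a single boolean keeps the report useful when the test fails: a latency failure and a capacity failure point at different fixes.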
Bugs found
The test almost always finds bugs. Each test produces specific findings; the value of the test is in the bugs found and fixed before they cause real outages.
- Single-AZ databases that nobody noticed: A common finding is a database that was supposed to be multi-AZ but is actually single-AZ. The Terraform looks right; the configuration drifted; nobody noticed until the test brought down the AZ that contained the database.
- Quorum issues during the partition: Distributed systems with quorum requirements can fail when a partition reduces the available votes. The test surfaces these; the team adjusts replica counts or quorum thresholds.
- Sticky DNS records that did not honour health checks: DNS records that should have dropped the affected AZ did not update quickly enough. Clients continued to reach the dead AZ; the team observes the lag and tightens the TTL or improves health-check propagation.
- Capacity overruns: Surviving AZs sometimes cannot absorb the full load. CPU pegs; latency spikes; some requests fail. The team finds the bottleneck and adds capacity or optimizes the workload.
- Postmortem each finding: Each bug found in the test gets a postmortem. Why was it not caught earlier? What changes prevent recurrence? The discipline produces continuous improvement; the next test starts from a better baseline.
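The quorum finding above can often be predicted before the test runs. The sketch below assumes a simple majority quorum computed over the full configured membership, as in many consensus-based systems; systems with other quorum rules need a different check. The function names and the replica layout format are hypothetical.

```python
def majority(voters):
    """Smallest number of votes that is more than half of voters."""
    return voters // 2 + 1


def survives_az_loss(replicas_per_az, lost_az):
    """True if the replicas outside lost_az still form a majority of all replicas."""
    total = sum(replicas_per_az.values())
    remaining = total - replicas_per_az[lost_az]
    return remaining >= majority(total)


def fragile_azs(replicas_per_az):
    """Return the AZs whose loss would leave the cluster without a majority."""
    return sorted(
        az for az in replicas_per_az
        if not survives_az_loss(replicas_per_az, az)
    )
```

With one replica in each of three AZs, no single AZ loss breaks the majority. With four replicas split two and two across two AZs, losing either AZ leaves two of four votes, which is short of the three needed, so both AZs are fragile.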
The availability zone isolation test is the chaos engineering exercise that distinguishes claimed multi-AZ from actual multi-AZ. Nova AI Ops integrates with chaos engineering tools, runs scheduled AZ isolation tests, and produces the per-test report that drives architectural improvement over time.