Real Outage: A Redis Cluster Split-Brain
A 90-second cross-AZ network blip triggered a failover that produced two primaries serving the same key range. Just over a minute of dual writes. The reconciliation script and the quorum-config rule that came out of it.
Timeline
Anonymised composite of a Redis Cluster split-brain pattern. Times in UTC.
02:14:08, Cross-AZ network experiences a 90-second connectivity issue between AZ-A and AZ-C. Redis cluster has 3 primaries, 3 replicas, distributed across 3 AZs.
02:14:18, Replicas in AZ-C lose contact with primary in AZ-A. After the configured failover timeout (10 seconds), the replica in AZ-C promotes itself to primary for the affected slot range.
02:14:28, Now both nodes are primary for the same slot range. Application clients in AZ-A keep talking to the original primary (still reachable to them); clients in AZ-C talk to the new primary. Writes diverge.
02:15:38, Network heals. The cluster sees two primaries claiming the same slot range. The cluster manager picks one (the one with the higher epoch, which is the new primary in AZ-C) and demotes the other.
02:15:39, The demoted primary discards its divergent, unreplicated writes and resyncs from the new primary. Roughly 80 seconds of writes from the AZ-A clients, everything accepted since the replication link broke at 02:14:18, are silently lost.
02:18, An inconsistency-detection job in a downstream service, which compares Redis state against the durable store, fires. Detection lag: 3 minutes 22 seconds.
02:42, Engineering identifies the timeline. Reconciliation begins from the durable store (the system writes through to a durable backing store; the lost writes are recoverable but not automatic).
03:51, Reconciliation complete. ~84,000 affected user-session keys were rebuilt from the durable store. Total impact: roughly 80 seconds of writes lost, 1 hour 33 minutes of remediation, no permanent data loss because the durable store had it.
The detection lag
3 minutes 22 seconds is acceptable for a data-consistency failure. Most teams do worse. The deeper failure was that nothing alarmed during the 90-second network event itself. The team had cluster-health alerts, but they were tuned for “node unreachable” (which fired briefly during the event) rather than “cluster has multiple primaries claiming the same slot” (which is a different and much worse signal).
What was missing: a cluster-topology consistency check running every few seconds, alarming on any state where two nodes report themselves as primary for an overlapping slot range. CLUSTER NODES exposes exactly this; it just wasn’t being polled.
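A minimal sketch of that probe, assuming redis-cli is available on the monitoring host; the hostnames and alert hook are placeholders, and parsing follows the documented CLUSTER NODES line format. It merges each vantage point’s view of the topology and flags any hash slot claimed by more than one node carrying the master flag.

```python
#!/usr/bin/env python3
"""Cluster-topology consistency probe (sketch, not the team's monitor)."""
import subprocess
from collections import defaultdict

# One entry per node we poll; each reports its own view of the topology.
VANTAGE_POINTS = [("redis-az-a-1", 6379), ("redis-az-b-1", 6379), ("redis-az-c-1", 6379)]


def masters_per_slot(host: str, port: int) -> dict:
    """Parse `redis-cli cluster nodes` output into {slot: {node ids claiming master}}."""
    out = subprocess.run(
        ["redis-cli", "-h", host, "-p", str(port), "cluster", "nodes"],
        capture_output=True, text=True, check=True,
    ).stdout
    claims = defaultdict(set)
    for line in out.splitlines():
        fields = line.split()
        # <id> <addr> <flags> <master> <ping> <pong> <epoch> <link-state> <slot> ...
        if len(fields) < 9 or "master" not in fields[2].split(","):
            continue  # replicas, and masters that own no slots, claim nothing
        for slot_spec in fields[8:]:
            if slot_spec.startswith("["):
                continue  # slot-migration markers are not ownership
            start, _, end = slot_spec.partition("-")
            for slot in range(int(start), int(end or start) + 1):
                claims[slot].add(fields[0])
    return claims


def check() -> list:
    """Merge every vantage point's view and report slots with more than one primary."""
    merged = defaultdict(set)
    for host, port in VANTAGE_POINTS:
        for slot, owners in masters_per_slot(host, port).items():
            merged[slot] |= owners
    return [
        f"slot {slot} claimed by {len(owners)} primaries: {sorted(owners)}"
        for slot, owners in merged.items() if len(owners) > 1
    ]


if __name__ == "__main__":
    for alert in check():
        print("ALERT:", alert)  # stand-in for the real paging hook
```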
The cascade
The cascade started with application-side trust in Redis. The application was using Redis as a session store with a write-through to a durable database. The intent was “Redis is fast, the database is durable”. The implicit assumption was “Redis writes that succeed will eventually be visible from any read”. That assumption broke for every write made during the split-brain window.
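The pattern, roughly. This is a sketch, not the team’s code: the db client and its upsert_session/fetch_session calls are hypothetical and the session TTL is illustrative; only the shape (database first, Redis second, cache miss falls through to the database) is taken from the write-up.

```python
import json
import redis  # redis-py, assumed available

cache = redis.Redis(host="redis-az-a-1", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 30 * 60  # illustrative fixed session lifetime


def save_session(db, session_id: str, session: dict) -> None:
    """Write-through: the durable store first, then Redis."""
    db.upsert_session(session_id, session)  # hypothetical durable-store client call
    cache.set(f"session:{session_id}", json.dumps(session), ex=SESSION_TTL_SECONDS)


def load_session(db, session_id: str):
    """Read from Redis; fall through to the durable store on a cache miss."""
    raw = cache.get(f"session:{session_id}")
    if raw is not None:
        return json.loads(raw)
    session = db.fetch_session(session_id)  # hypothetical durable-store client call
    if session is not None:
        cache.set(f"session:{session_id}", json.dumps(session), ex=SESSION_TTL_SECONDS)
    return session
```

The broken assumption lives in save_session: a successful cache.set against a primary that is about to be demoted looks exactly like one that will survive.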
The user-visible effect: people who logged in or updated a session preference during the split-brain window saw their changes mysteriously revert when their session shifted to a different cluster node. About 84,000 affected sessions, mostly noticeable as “why did my settings change” user reports.
The harder cascade was downstream: the inconsistency-detection job that fired at 02:18 paged a different team, who paged the platform team, who paged the cache team. The cache team realised they had a split-brain on their hands but didn’t initially know which writes were lost: the cluster had silently discarded them at 02:15:39 with no audit trail.
What the runbook said
The Redis runbook had “cluster failover” documented (procedure: trigger manual failover, verify, done). It did not have “cluster recovered from split-brain” as a scenario. The implicit assumption was that Redis Cluster prevents split-brain via its quorum mechanism, and that’s mostly true except when the network partition is longer than the failover timeout and shorter than the application’s connection timeout.
The on-call ran the failover runbook initially because cluster-health was showing “previous primary now replica”. The runbook said “this is normal during failover, no action needed”. They closed the page. The actual problem was that two primaries had existed for 70+ seconds, which is not what the runbook addressed.
What actually fixed it
Two parts. Part one was confirming the data was in the durable store: the team queried the database for all sessions modified in the affected window and confirmed that all 84,000 of them had the correct data there. The Redis state was wrong, but the source of truth was correct.
Part two was rebuilding the Redis state. Rather than try to merge or pick winners, the team flushed the affected slot range and refilled it from the database. The application has explicit cache-miss handling so the brief period where reads returned “cache miss” was tolerable; users saw 50ms of extra latency and nothing else.
The reconciliation script was 60 lines. The hard part was identifying which keys had been affected (anything in the slot range with a TTL that bridged the split-brain window). The team built a script for that scan; it took 8 minutes to run.
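A sketch of that scan, under assumptions the write-up doesn’t spell out: sessions are written with a fixed TTL (so the remaining TTL dates both the last write and the expiry), keys are named session:*, and the slot range, window timestamps, and host below are placeholders. It is not the team’s actual 60-line script; it only shows the shape of the key-selection logic.

```python
#!/usr/bin/env python3
"""Reconciliation scan (sketch): find keys whose lifetime bridges the split-brain window."""
import time
import redis  # redis-py, assumed available

r = redis.Redis(host="redis-az-c-1", port=6379, decode_responses=True)

AFFECTED_SLOTS = range(5461, 10923)  # placeholder: slot range owned by the split primary
SESSION_TTL_MS = 30 * 60 * 1000      # sessions are assumed written with this fixed TTL
WINDOW_START = 1700000048.0          # placeholder epoch seconds for 02:14:08 UTC
WINDOW_END = 1700000139.0            # placeholder epoch seconds for 02:15:39 UTC


def bridges_window(key: str, now: float) -> bool:
    """True if the key's TTL implies it was alive across the split-brain window."""
    remaining_ms = r.pttl(key)
    if remaining_ms < 0:
        return False  # no TTL, or the key is already gone
    expires_at = now + remaining_ms / 1000.0
    written_at = expires_at - SESSION_TTL_MS / 1000.0
    return written_at <= WINDOW_END and expires_at >= WINDOW_START


def main() -> None:
    now = time.time()
    affected = []
    for key in r.scan_iter(match="session:*", count=1000):
        slot = r.execute_command("CLUSTER KEYSLOT", key)  # one round trip per key; fine for a one-off
        if slot in AFFECTED_SLOTS and bridges_window(key, now):
            affected.append(key)
    print(f"{len(affected)} keys to rebuild from the durable store")
    for i in range(0, len(affected), 500):
        r.delete(*affected[i:i + 500])  # the next read is a cache miss and refills from the DB


if __name__ == "__main__":
    main()
```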
Action items
- Cluster-topology consistency probe. New monitor running every 10 seconds: query CLUSTER NODES from each replica and assert exactly one primary per slot range. Detection floor for a future event: under 30 seconds.
- Failover timeout raised to match network-blip 99th percentile. Was 10 seconds; raised to 45. The team accepted that genuine primary failures will now take 45 seconds to recover, a worse RTO, in exchange for not splitting the brain on a normal network blip. Trade-off explicitly documented.
- Application-side write-confirmation. Critical writes now wait for an explicit ack from at least one replica before returning success to the caller (a sketch follows this list). Catches the “wrote to old primary that’s about to be demoted” case. Adds 3-5ms latency on writes; deemed worth it.
- Audit trail on cluster topology changes. Every promotion, demotion, and slot reassignment is now logged to a durable audit stream. The post-incident question “exactly when did this primary change” takes seconds to answer instead of hours.
- Quorum-aware failover config. Failover now requires a majority of cluster nodes to agree before promotion. The previous config allowed promotion based on local view; the new config can’t split-brain in a 3-AZ deployment because no single AZ has a majority.
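The write-confirmation item above doesn’t say which mechanism carries the ack; Redis’s WAIT command is one natural fit, since it blocks until a given number of replicas have acknowledged the connection’s prior writes or a timeout expires. A sketch under that assumption; the host, key, and timeout values are placeholders.

```python
import json
import redis  # redis-py, assumed available

cache = redis.Redis(host="redis-az-a-1", port=6379, decode_responses=True)


class ReplicationNotConfirmed(Exception):
    """Raised when no replica acknowledges a critical write in time."""


def confirmed_set(key: str, value: str, ttl_seconds: int, timeout_ms: int = 50) -> None:
    """SET, then require at least one replica to acknowledge before claiming success."""
    cache.set(key, value, ex=ttl_seconds)
    acked = cache.execute_command("WAIT", 1, timeout_ms)  # number of replicas that acked
    if acked < 1:
        # A primary that has lost its replicas (the about-to-be-demoted case)
        # cannot get this ack; surface that instead of reporting success.
        raise ReplicationNotConfirmed(f"{key}: no replica ack within {timeout_ms}ms")


# Usage: treat the exception as a failed write and retry or fall back to the durable store.
confirmed_set("session:abc123", json.dumps({"theme": "dark"}), ttl_seconds=1800)
```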
The architectural change
The architectural change was tightening the quorum config. Redis Cluster’s default failover relies on the failing-over node being able to contact a majority of cluster nodes; the default config in this team’s deployment was permissive enough that a partitioned node could promote itself based on its local view.
The new config requires the promoting node to confirm its view of the cluster topology with a majority of the cluster, at least N/2+1 nodes counting itself, before declaring itself primary. In a 3-AZ deployment with 6 total nodes that majority is 4, so a single AZ (2 nodes) cannot promote on its own; it has to reach nodes in the other AZs. During a clean AZ partition, promotion doesn’t happen; the cluster correctly sees one AZ as down and waits.
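The rule, reduced to its arithmetic. The node counts come from the deployment described above; the function is an illustration of the decision, not a Redis configuration parameter.

```python
def may_promote(reachable_nodes: int, total_nodes: int = 6) -> bool:
    """A node may declare itself primary only if it confirms the topology
    with a majority of the cluster, itself included."""
    return reachable_nodes >= total_nodes // 2 + 1


assert may_promote(reachable_nodes=2) is False  # a partitioned AZ sees only its own 2 nodes
assert may_promote(reachable_nodes=4) is True   # reaching 4 of 6 means crossing an AZ boundary
```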
The pushback was “this hurts our RTO during real failures”. The response: a 30-second longer RTO is acceptable; a split-brain that silently loses a minute of writes is not. The trade-off is explicit and the team revisits it quarterly.
The wiki line, which now opens every Redis design review: “A cache that loses writes is not a cache. A cache that confidently lies about which writes succeeded is the bug we’re trying to prevent.”