Real Outage: A Thundering-Herd Reconnect
A large messaging platform planned a routine restart of its WebSocket fleet. Within four minutes, the fleet was absorbing 13 million reconnect attempts per minute. The exponential-backoff math fell apart on the second cycle.
Timeline
This is an anonymised composite of the WebSocket-reconnect-storm pattern; the numbers are drawn from internal post-incident reviews of similar events at large messaging platforms. Times are UTC.
04:00, Maintenance window for a platform-wide TLS-library upgrade. Plan: rolling restart of the WebSocket gateway fleet (~480 nodes), 30 nodes at a time, 90 seconds between batches. Estimated window: 24 minutes.
04:00:30, First batch restarts. Affected clients (~800k) reconnect. Reconnect time per client: ~1.2 seconds. Capacity utilisation on remaining nodes: 14% headroom consumed.
04:02, Second batch restarts. The capacity model assumed clients would reconnect uniformly across the surviving fleet. They did not: clients reconnect to whichever DNS record they resolve first, and the load balancer's consistent hashing pulled them toward the nodes furthest from the restart batch. Three nodes hit 100% CPU.
04:03, Those three nodes' clients drop. ~140k clients now have no connection. Their WebSocket libraries do exponential backoff: 250ms base, 30s max, 0-100ms jitter. Standard.
04:03:15, First reconnect attempts land ~250ms after disconnect and are rejected by the still-saturated nodes. Clients double their backoff to ~500ms and retry, still packed into nearly the same instant.
04:04, Synchronised retry waves. The 0-100ms jitter is too small relative to the backoff delay to pull the waves apart. Effective traffic to the gateway fleet reaches 13M reconnect attempts per minute, against a normal steady state of ~80k reconnects per minute.
04:11, On-call halts the restart batch. Adds 200 emergency capacity nodes. Reconnect storm slowly subsides over the next 14 minutes.
04:25, Recovery confirmed. Total customer impact: 22 minutes of degraded message delivery to ~38% of active sessions.
The detection lag
Detection wasn’t the headline failure here: the on-call saw the CPU spike at 04:03 and was paged within 90 seconds. The headline failure was that nobody knew what to do for the first six minutes. The runbook said “continue the restart batch on schedule”; the on-call followed it; the storm got worse.
The deeper signal that should have fired earlier: a counter of total active connections. During a normal rolling restart, that number dips slightly with each batch and recovers within seconds. During this incident the number flatlined and then dropped, because clients were unable to reconnect. There was no alarm on that counter, even though it was prominently displayed on the dashboard.
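A minimal sketch of that missing alarm, assuming hypothetical get_total_connections() and page_oncall() hooks; the 5%-in-60-seconds thresholds are the ones from the action items below.

```python
import collections
import time

def get_total_connections() -> int:
    """Hypothetical metric read; stands in for whatever system
    already renders the counter on the dashboard."""
    raise NotImplementedError

def page_oncall(message: str) -> None:
    """Hypothetical paging hook."""
    raise NotImplementedError

def watch_connection_drop(threshold_pct: float = 5.0,
                          window_s: float = 60.0,
                          poll_s: float = 5.0) -> None:
    """Alarm when total active connections fall more than
    threshold_pct within window_s -- the signal this incident lacked."""
    samples = collections.deque()  # sliding window of (time, value)
    while True:
        now = time.monotonic()
        value = get_total_connections()
        samples.append((now, value))
        # Discard samples older than the window.
        while samples and now - samples[0][0] > window_s:
            samples.popleft()
        peak = max(v for _, v in samples)
        drop_pct = 100.0 * (peak - value) / max(peak, 1)
        if drop_pct > threshold_pct:
            page_oncall(f"active connections down {drop_pct:.1f}% "
                        f"in under {window_s:.0f}s")
        time.sleep(poll_s)
```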
The cascade
The cascade was a math problem. The backoff library used delay = min(base * 2^attempts, max) + random(0, jitter). With base = 250ms, max = 30s, jitter = 100ms, each attempt adds at most 100ms of extra spread, so the first eight retry waves stay tightly clustered. With 140k clients all starting their backoff sequences within 1 second of each other, every retry wave hit at essentially the same time.
The backoff doubled each cycle, so the fleet was hit by waves of 140k synchronised reconnects every 250ms, then every 500ms, then every second, and so on. The waves got further apart but they didn’t get smaller: each wave still saturated the available capacity, the reconnects failed, and the next wave was bigger because more clients had failed in the meantime.
The math shows up clearly: 100ms of jitter on a 250ms base is 40% noise. By the time the base reached 4s, jitter was 2.5% noise. The waves became more synchronised as backoff grew, not less.
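A short simulation makes the wave behaviour concrete. This is a sketch using the parameters above; the 1-second start spread and the 100ms bucketing are illustrative assumptions.

```python
import random

def attempt_times(base=0.25, cap=30.0, jitter=0.1,
                  attempts=8, start_spread=1.0):
    """Absolute times (s) of one client's reconnect attempts under
    delay = min(base * 2**n, cap) + random(0, jitter)."""
    t = random.uniform(0.0, start_spread)  # disconnects spread over ~1s
    times = []
    for n in range(attempts):
        t += min(base * 2 ** n, cap) + random.uniform(0.0, jitter)
        times.append(t)
    return times

def peak_wave(clients=140_000, bucket_s=0.1):
    """Largest number of attempts landing in any single 100ms bucket."""
    buckets = {}
    for _ in range(clients):
        for t in attempt_times():
            key = int(t / bucket_s)
            buckets[key] = buckets.get(key, 0) + 1
    return max(buckets.values())

# Tens of thousands of attempts land inside the same 100ms: the retry
# traffic arrives as waves, not as a smear.
print(peak_wave())
```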
What the runbook said
The rolling-restart runbook had two operational steps: kick off the restart, watch the dashboard. There was no step for “capacity has dropped below the line” or “reconnect storm detected”. The runbook assumed the restart was a benign operation and would always complete in 24 minutes.
What it should have had: an abort criterion. Something like “if CPU on any surviving node exceeds 80% for more than 30 seconds, halt the restart and reassess.” Without a quantitative abort criterion, the on-call did the natural thing and let the runbook run on for too long.
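One way such an abort criterion could look inside the restart automation, as a sketch; node_cpu_pct(), restart_batch(), and halt_and_reassess() are hypothetical hooks, and only the 80%/30s thresholds and the 90-second batch gap come from the text.

```python
import time

def node_cpu_pct(node: str) -> float:
    """Hypothetical per-node CPU metric read."""
    raise NotImplementedError

def restart_batch(batch) -> None: ...        # hypothetical batch restart
def halt_and_reassess() -> None: ...         # hypothetical: page, stop

def any_node_hot(nodes, pct=80.0, hold_s=30.0,
                 poll_s=5.0, watch_s=60.0) -> bool:
    """True if, during a watch_s observation window, any node's CPU
    stays above pct for hold_s continuous seconds."""
    over_since = {}
    start = time.monotonic()
    while time.monotonic() - start <= watch_s:
        now = time.monotonic()
        for node in nodes:
            if node_cpu_pct(node) > pct:
                over_since.setdefault(node, now)
                if now - over_since[node] >= hold_s:
                    return True
            else:
                over_since.pop(node, None)  # load must be sustained
        time.sleep(poll_s)
    return False

def rolling_restart(batches, surviving_nodes, gap_s=90.0, watch_s=60.0):
    """Restart batch by batch; abort the moment the fleet runs hot."""
    for batch in batches:
        restart_batch(batch)
        if any_node_hot(surviving_nodes, watch_s=watch_s):
            halt_and_reassess()
            return
        time.sleep(max(0.0, gap_s - watch_s))  # rest of the batch gap
```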
What actually fixed it
Two things: stopping the restart and adding capacity. Stopping the restart took the wave-generation pressure off the system. Adding 200 nodes (which took 4 minutes via the team’s scale-out automation) absorbed the reconnect storm. The combination drained the queue within 14 minutes.
The fix that didn’t work: telling clients to back off. The team had no realtime control over deployed mobile clients. Once 13M devices are in a synchronised retry pattern, you can’t tell them to stop. The only handle the server has is “accept the connection or refuse it”.
Action items
- Decorrelated jitter in client SDKs. New backoff: delay = random(base, min(max, last_delay * 3)). Each client’s next delay is a random value in a growing range, not a deterministic exponential, which eliminates the wave-synchronisation effect (see the sketch after this list).
- Server-driven backoff hint. The gateway now returns a Retry-After header (or WebSocket-level equivalent) with a randomised value when it’s under load. Clients respect the hint, which gives the server direct control over a storm.
- Restart-pause threshold. The rolling-restart automation now pauses if any surviving node’s CPU exceeds 75%. A restart can’t complete while the fleet is unhealthy.
- Capacity headroom budget. The fleet is now provisioned with 35% headroom (was 20%). The restart pattern depends on this; if utilisation is consistently above 65% there’s no safe restart and the team adds capacity instead of restarting.
- Total-connections alarm. New alarm: total active connections drops more than 5% in 60 seconds. Detection floor for a future event is now 90 seconds.
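A sketch of what the first two action items might look like in a client SDK. Only the backoff formula and the Retry-After behaviour come from the list above; the class shape and names are illustrative.

```python
import random

class ReconnectBackoff:
    """Decorrelated jitter: delay = random(base, min(max, last_delay * 3)).

    Each next delay is drawn from a growing range rather than computed
    deterministically, so two clients that drop in the same instant
    diverge after one retry instead of marching in synchronised waves.
    """

    def __init__(self, base: float = 0.25, cap: float = 30.0):
        self.base = base
        self.cap = cap
        self.last = base

    def next_delay(self, server_hint: float | None = None) -> float:
        if server_hint is not None:
            # Server-driven backoff hint: a randomised Retry-After
            # from the gateway overrides local state under load.
            self.last = min(self.cap, server_hint)
        else:
            self.last = random.uniform(self.base,
                                       min(self.cap, self.last * 3.0))
        return self.last

    def reset(self) -> None:
        """Call after a successful reconnect."""
        self.last = self.base
```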
The architectural change
The deeper architectural lesson was about reconnect economics. At ~13M concurrent connections, even a 1-second reconnect storm is 13M connection attempts. The system is fundamentally not designed to handle that: steady state is ~80k reconnects per minute. Any event that disconnects more than a few percent of clients is an emergency.
The architectural change was to stop using rolling restarts as a normal operation. The team now does “graceful drain” restarts: instead of killing a node and forcing its clients to reconnect, the node sends a WSM_DRAIN message telling each client to migrate to a specific other node when convenient. Clients move over the next 30-60 seconds, each on a schedule it chooses. No reconnect storm, no wave synchronisation.
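A sketch of the client side of that drain, assuming a JSON message shape; WSM_DRAIN is the message named above, but the target_node field and the explicit 30-60 second window are assumptions.

```python
import asyncio
import json
import random

async def on_gateway_message(raw: str, migrate) -> None:
    """Client-side sketch of drain-not-disconnect. `migrate` is a
    coroutine that connects to the target node and swaps traffic
    over before closing the old socket."""
    msg = json.loads(raw)
    if msg.get("type") != "WSM_DRAIN":
        return
    target = msg["target_node"]  # assumed field name
    # Each client picks its own moment inside the drain window, so
    # millions of clients never move in the same instant.
    await asyncio.sleep(random.uniform(30.0, 60.0))
    await migrate(target)
```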
The wiki line: “If reconnect is your control plane for capacity, capacity is one event away from breaking.” Drain-not-disconnect is now the default for any operation that affects more than 5% of the fleet at once.