Real Outage: A Kafka Consumer Rebalance Storm
A rolling restart of a 240-consumer group triggered 9 minutes of continuous rebalancing. The session-timeout math was wrong by a factor of two. Throughput went to zero. Here’s the write-up.
Timeline
Anonymised composite of a Kafka consumer-rebalance-storm pattern. Times in UTC.
15:30, Routine rolling restart of a 240-consumer group reading from a 360-partition topic. Plan: restart 8 consumers at a time, wait 30 seconds between batches.
15:30:12, First batch starts. 8 consumers leave the group. Kafka triggers rebalance. Sticky assignor reassigns the affected partitions; rebalance completes in 4 seconds.
15:30:46, Second batch (8 consumers) leaves. Another rebalance. But the new consumers from batch 1 haven’t finished startup yet (took ~38 seconds for first heartbeat). The session timeout is 30 seconds. Kafka considers them dead and removes them from the group. A third rebalance fires.
15:31:30, The system is now in continuous rebalance: as fast as new consumers join, others time out and get kicked. Group throughput: zero. Lag: climbing at ~80k msg/s per partition.
15:33, Page fires. Detection: 3 minutes. The on-call sees zero consumption and no obvious error.
15:35, On-call halts the restart. Existing consumers stop dying, but the group is now in a state where 90+ consumers are mid-rebalance and the broker logs are screaming about generation IDs.
15:39, Manual intervention: increase session timeout to 90 seconds via dynamic config; force a clean rebalance by stopping the entire group and restarting cold. Rebalance completes in 47 seconds.
15:40, Consumption resumes. Lag begins draining at ~3x normal throughput. Lag fully drained at 16:04. Total customer-impact (delayed events): 34 minutes; total “zero throughput” window: 9 minutes.
The detection lag
3 minutes from rebalance start to page is fine on paper but dreadful for a streaming system. The team had a consumer-lag alarm with a 5-minute floor; the rebalance had been going for 3 minutes when the alarm finally triggered and another minute passed before notification. By that point throughput had been zero for 4+ minutes.
What was missing: a “rebalance frequency” metric. The consumer exposes rebalance metrics in the consumer-coordinator-metrics group (rebalance-rate-per-hour, rebalance-total, failed-rebalance-rate-per-hour). The team didn’t scrape them. A simple alarm on “more than 3 rebalances in 60 seconds for a single group” would have fired at 15:31:30, two minutes earlier.
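A minimal client-side sketch of that detection, assuming you would rather count rebalances in application code than scrape JMX: a ConsumerRebalanceListener that tracks revocations in a sliding window. The threshold and window mirror the proposed alarm; the class name is made up, and it sees a single consumer's view, so a real alarm would aggregate the emitted signal across the group.

```java
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.time.Instant;
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;

/** Counts partition revocations in a sliding window and flags a suspected storm. */
public class RebalanceStormDetector implements ConsumerRebalanceListener {
    private static final int THRESHOLD = 3;                       // "more than 3 rebalances"
    private static final Duration WINDOW = Duration.ofSeconds(60); // "...in 60 seconds"
    private final Deque<Instant> events = new ArrayDeque<>();

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        Instant now = Instant.now();
        events.addLast(now);
        // Drop events that have fallen out of the window.
        while (!events.isEmpty() && events.peekFirst().isBefore(now.minus(WINDOW))) {
            events.removeFirst();
        }
        if (events.size() > THRESHOLD) {
            // In production this would increment a metric scraped by the alerting stack.
            System.err.printf("REBALANCE STORM SUSPECTED: %d rebalances in the last %d s%n",
                    events.size(), WINDOW.toSeconds());
        }
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // No-op: revocations alone are enough to count rebalance cycles.
    }
}
```

It wires in at subscribe time, e.g. consumer.subscribe(List.of("events"), new RebalanceStormDetector()); (topic name hypothetical).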
The cascade
The cascade was a feedback loop. Restart 1 caused rebalance 1 (normal). Rebalance 1 took 4 seconds. Restart 2 caused rebalance 2 (normal). But the new consumers from restart 1 were still in their startup phase, loading their schema cache, warming connections, building local state, and hadn’t sent their first heartbeat yet.
The math: a 30-second session timeout with the default 3-second heartbeat interval. Looks safe on paper. But consumer startup, including JVM warmup, schema-registry fetch, and partition-state reload, took an average of 38 seconds. Every consumer that joined was at risk of timing out before sending its first heartbeat.
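For concreteness, a sketch of the pre-incident client settings with the mismatch annotated; the broker address and group id are placeholders, and the timeout and heartbeat values are the ones described above:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import java.util.Properties;

public class PreIncidentConsumerConfig {
    public static Properties build() {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "events-consumers");     // placeholder

        // The values in play during the incident:
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "30000");   // 30 s
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000"); // 3 s

        // The mismatch: observed startup (JVM warmup, schema-registry fetch,
        // partition-state reload) averaged ~38 s before the first heartbeat,
        // so a restarted consumer could be declared dead at 30 s and evicted,
        // which triggered yet another rebalance.
        return props;
    }
}
```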
Once a few consumers timed out, the rebalance triggered another, more consumers got pushed out (because they were now mid-rebalance and rebalances are slow), and the system locked into a permanent rebalance state. The rate of new consumers joining was lower than the rate at which they were getting kicked.
What the runbook said
The Kafka consumer-group runbook had four pages. Page 1 was the rolling-restart procedure (which is what they were following). Page 2 was “handling consumer lag” (a symptom here, not the cause). Pages 3 and 4 covered broker-side issues (irrelevant here).
What was missing: any guidance on rebalance storms. The runbook mentioned rebalances as a side-effect of normal operations (“expect a brief blip during restart”) but had no abort criteria, no detection guidance, and no recovery procedure for a group that’s stuck rebalancing.
The on-call followed the “handling consumer lag” runbook for the first 4 minutes, which directed them to verify broker health (fine), check disk (fine), and check ISR (fine). The actual problem was on the consumer side, and those broker-focused checks had no signal for it.
What actually fixed it
Three things in sequence. First, halt the rolling restart so no new consumers were leaving. Second, raise the session timeout from 30 to 90 seconds via dynamic config to give startup the headroom it needed. Third, force a clean rebalance: stop the entire group, wait 60 seconds for the broker state to settle, and restart cold.
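For the “wait for the broker state to settle” step, one way to confirm the group has actually drained before the cold restart is to poll its state until it reports Empty. A sketch using the Kafka Java AdminClient; the bootstrap address and group name are placeholders:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ConsumerGroupDescription;
import org.apache.kafka.common.ConsumerGroupState;

import java.util.List;
import java.util.Properties;

public class WaitForEmptyGroup {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        String group = "events-consumers";                                    // placeholder
        try (Admin admin = Admin.create(props)) {
            while (true) {
                ConsumerGroupDescription desc = admin
                        .describeConsumerGroups(List.of(group))
                        .describedGroups()
                        .get(group)
                        .get();
                System.out.println("group state: " + desc.state());
                if (desc.state() == ConsumerGroupState.EMPTY) {
                    break; // all members gone; safe to restart the group cold
                }
                Thread.sleep(5_000); // poll every 5 s
            }
        }
    }
}
```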
The forced clean rebalance was the controversial step. It meant ~90 seconds of zero consumption on top of the 9 minutes already lost. The on-call paused before doing it. The decision was correct: the group was so deep into rebalance churn that gradual recovery wasn’t happening, and the cold start was the only path to a healthy generation ID.
Action items
- Session timeout raised to 90 seconds. The 30-second setting never accounted for real consumer startup time; 90 seconds gives ~2.4x headroom over the observed 38-second startup.
- Rebalance-frequency alarm. New alarm: more than 3 rebalances per 60 seconds for any group. Detection floor for a future event: under 90 seconds.
- Static group membership. Migrated to group.instance.id for static membership. A consumer restart no longer triggers a rebalance: the consumer rejoins within the session timeout and picks up the same partition assignment. Eliminates the entire class of restart-rebalance interaction (see the config sketch after this list).
- Restart batch size cut to 4. With static membership the batch size matters less, but the team kept it conservative to limit blast radius if a different bug shows up.
- Cooperative-sticky assignor. Switched from sticky to cooperative-sticky. Rebalances no longer pause all consumers; only the partitions actually moving between owners are paused. Throughput drops 5-10% during rebalance instead of 100%.
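Pulling the config-level action items together, a sketch of the post-incident consumer settings; the broker address, group id, and the source of the per-instance id are placeholders, and the values are the ones listed above:

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;
import java.util.Properties;

public class PostIncidentConsumerConfig {
    public static Properties build(String instanceId) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "events-consumers");     // placeholder

        // Session timeout sized against observed ~38 s startup (roughly 2.4x headroom).
        props.put(ConsumerConfig.SESSION_TIMEOUT_MS_CONFIG, "90000");
        props.put(ConsumerConfig.HEARTBEAT_INTERVAL_MS_CONFIG, "3000");

        // Static membership: a stable per-instance id lets a restarted consumer
        // rejoin within the session timeout without triggering a rebalance.
        props.put(ConsumerConfig.GROUP_INSTANCE_ID_CONFIG, instanceId);

        // Cooperative-sticky assignor: a rebalance only pauses the partitions
        // that actually move instead of stopping the whole group.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        return props;
    }
}
```

The per-instance id just has to be stable across restarts and unique within the group; a pod or host name is the usual choice.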
The architectural change
Static membership was the architectural answer. Before this incident, every consumer restart was a rebalance event. After: consumers have stable IDs across restarts and the group treats a restart as “the same consumer briefly disappeared” rather than “a new consumer joined”. The rebalance count during a rolling restart of 240 consumers went from 30+ to zero.
The deeper architectural lesson: rebalance is a “stop the world” event for a Kafka consumer group, and stop-the-world events should be rare. Anything that triggers them (deploys, scaling, partition-count changes, consumer crashes) needs to be examined for “does this need to be a rebalance?” In most cases the answer is no.
The wiki line: “Rebalance is the consumer group’s emergency-stop button. If you’re hitting it during normal operations, you’ve picked the wrong tool.”