Kafka Consumer Lag: An Agent's Decision Tree
Raw lag is misleading on its own. This note covers the signals an agent should weigh, the false positives to avoid, and the four remediations it can apply, ordered from most to least reversible.
Read the lag correctly
Lag is measured per partition, per consumer group. "Total lag" hides which partition is falling behind.
The relevant signal is the rate of lag change, not absolute lag. A growing lag means consumers cannot keep up; a flat lag means they are consuming as fast as producers write, whether the backlog is zero or not.
Always pair lag with consumer throughput. High lag with high throughput points to a capacity shortfall; high lag with zero throughput points to a stuck consumer.
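A minimal sketch of the three readings above, assuming the agent already collects periodic samples of the log end offset and the group's committed offset for each partition. The Sample type and the lag_trend function are hypothetical names for illustration, not part of any Kafka client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observation of a single partition for one consumer group (hypothetical shape)."""
    ts: float         # observation time, in seconds
    end_offset: int   # log end offset of the partition
    committed: int    # last offset committed by the group

    @property
    def lag(self) -> int:
        # Lag is the distance between the log end and the committed offset.
        return max(self.end_offset - self.committed, 0)


def lag_trend(older: Sample, newer: Sample) -> dict:
    """Return lag, lag change per second, and throughput between two samples."""
    dt = newer.ts - older.ts
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "lag": newer.lag,
        "lag_rate": (newer.lag - older.lag) / dt,  # positive means falling behind
        "consume_rate": (newer.committed - older.committed) / dt,
        "produce_rate": (newer.end_offset - older.end_offset) / dt,
    }


# Both partitions start with 5,000 messages of lag, but only one is consuming.
keeping_pace = lag_trend(Sample(0, 10_000, 5_000), Sample(10, 11_000, 6_000))
stalled = lag_trend(Sample(0, 10_000, 5_000), Sample(10, 11_000, 5_000))
```

Here keeping_pace has a lag_rate of zero with non-zero throughput, while stalled shows growing lag with a consume_rate of zero. An absolute lag figure alone would not separate the two.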
False positives to avoid
Brief lag spikes during deploys: the consumer was paused. Wait two minutes; the spike usually clears.
Lag during a consumer group rebalance: expected behaviour while group membership changes and partitions are reassigned. The agent should detect rebalance events and not alert.
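Lag on low-traffic partitions: 1k messages of lag on a partition receiving 10 messages per second is roughly 100 seconds' worth of traffic, which is fine unless the latency budget is tighter than that.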
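The three false positives above could be filtered before any alert fires. The following is a sketch under stated assumptions: the LagContext fields, the should_alert function, and every threshold except the two-minute deploy window are hypothetical and would need tuning per topic.

```python
from dataclasses import dataclass

# Assumed thresholds for illustration only.
DEPLOY_GRACE_S = 120.0   # the "wait two minutes" rule for deploy spikes
MAX_BACKLOG_S = 300.0    # hypothetical latency budget, in seconds of traffic
IDLE_LAG_LIMIT = 1_000   # hypothetical fallback when nothing is being produced


@dataclass(frozen=True)
class LagContext:
    lag: int                     # messages behind on this partition
    produce_rate: float          # messages per second written to the partition
    seconds_since_deploy: float  # time since the last consumer deploy
    rebalancing: bool            # True while the group is rebalancing


def should_alert(ctx: LagContext) -> bool:
    """Return True only when lag is not explained by a known false positive."""
    if ctx.rebalancing:
        return False  # expected lag while partitions are reassigned
    if ctx.seconds_since_deploy < DEPLOY_GRACE_S:
        return False  # deploy spike, give it time to clear
    if ctx.produce_rate <= 0:
        return ctx.lag > IDLE_LAG_LIMIT  # no traffic, fall back to a raw count
    # Express lag as seconds of traffic so partitions with different rates compare fairly.
    backlog_seconds = ctx.lag / ctx.produce_rate
    return backlog_seconds > MAX_BACKLOG_S


low_traffic = LagContext(lag=1_000, produce_rate=10.0,
                         seconds_since_deploy=3_600.0, rebalancing=False)
```

With the assumed five-minute budget, should_alert(low_traffic) is False because 1,000 messages at 10 per second is 100 seconds of traffic.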
Four remediations in order
Most reversible first: scale consumers up. Adds capacity, but only up to one consumer per partition, since extra group members sit idle; reverses by scaling back.
Next: rebalance the consumer group. Reversible by triggering another rebalance.
Then: drain in-flight buffers (commit offsets, pause to clear). Loses some throughput but does not lose data.
Last: skip the partition (advance offset without processing). Irreversible; data is lost; requires explicit human approval.
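One way to encode the ordering above is an enumeration the agent walks from most to least reversible, refusing the final step without sign-off. This is a sketch; the Remediation enum and next_remediation function are hypothetical names, and how approval is obtained is left out.

```python
from enum import IntEnum
from typing import Optional, Set


class Remediation(IntEnum):
    """Ordered from most to least reversible, as listed above."""
    SCALE_CONSUMERS = 1
    REBALANCE_GROUP = 2
    DRAIN_BUFFERS = 3
    SKIP_PARTITION = 4


# Only the irreversible step requires explicit human approval.
NEEDS_HUMAN_APPROVAL = {Remediation.SKIP_PARTITION}


def next_remediation(tried: Set[Remediation],
                     human_approved: bool = False) -> Optional[Remediation]:
    """Return the most reversible step not yet tried, or None if the agent must stop."""
    for step in sorted(Remediation):
        if step in tried:
            continue
        if step in NEEDS_HUMAN_APPROVAL and not human_approved:
            return None  # escalate to a human instead of acting
        return step
    return None  # everything has been tried
```

For example, next_remediation(set()) returns Remediation.SCALE_CONSUMERS, and once the first three steps are in tried the function returns None until human_approved is True.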
Decision tree
Lag growing AND consumer healthy AND throughput high: capacity issue. Scale up.
Lag growing AND consumer unhealthy: consumer problem. Restart, then investigate.
Lag flat: consumers are keeping pace and are probably caught up. Verify by comparing the committed offset with the log end offset; do not act.
Lag massive AND consumer healthy AND throughput zero: stuck. Investigate the consumer's processing logic.
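The four branches above can be written as a small function. This is a sketch, not a definitive implementation: the Reading fields, the decide function, the returned labels, and both thresholds are assumptions, and "throughput high" is simplified to any non-zero consume rate.

```python
from dataclasses import dataclass

# Assumed thresholds for illustration only.
GROWTH_EPSILON = 1.0    # messages per second; changes at or below this count as flat
MASSIVE_LAG = 100_000   # messages; what "massive" means here is an assumption


@dataclass(frozen=True)
class Reading:
    lag: int                 # messages behind on the partition
    lag_rate: float          # messages per second; positive means lag is growing
    consume_rate: float      # messages per second processed by the consumer
    consumer_healthy: bool   # result of whatever health check the agent uses


def decide(r: Reading) -> str:
    """Map one reading to the action named by the decision tree."""
    growing = r.lag_rate > GROWTH_EPSILON
    flat = abs(r.lag_rate) <= GROWTH_EPSILON
    if growing and not r.consumer_healthy:
        return "restart_then_investigate"
    # Checked before the flat branch: a stuck consumer on an idle topic also has flat lag.
    if r.consumer_healthy and r.consume_rate == 0 and r.lag >= MASSIVE_LAG:
        return "investigate_processing_logic"
    if growing and r.consumer_healthy and r.consume_rate > 0:
        return "scale_up"
    if flat:
        return "no_action_verify_offsets"
    return "no_match_escalate"  # combinations the tree does not cover
```

For example, decide(Reading(lag=50_000, lag_rate=200.0, consume_rate=800.0, consumer_healthy=True)) returns "scale_up". Any combination the tree does not name falls through to escalation rather than to a guessed action.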
Eval cases
Real lag spike: agent identifies and recommends scaling.
Rebalance event: agent identifies and does not alert.
Stuck consumer (healthy, zero throughput): agent identifies and recommends investigating the processing logic, not scaling.
False alarm (low-traffic partition): agent identifies and dismisses.