Kafka Consumer Lag: An Agent's Decision Tree

Raw lag numbers are misleading on their own. This guide covers the signals an agent should weigh, the false positives to avoid, and the four remediations it can apply, ordered from most to least reversible.

Read the lag correctly

Reading lag correctly is the foundation. Lag is tracked per partition and per consumer group, so a “total lag” figure hides which partition is the problem. The more useful signal is the rate of change of lag rather than its absolute value: growing lag means consumers can’t keep up, while flat lag means they are keeping pace with producers, even if a fixed backlog remains. Always pair lag with consumer throughput, because high lag with high throughput is a different problem from high lag with zero throughput.
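The two signals above can be sketched in a few lines. This is a minimal illustration, assuming the agent already collects the log end offset and the committed offset for each partition; the names `LagSample`, `lag_rate`, and `consume_rate` are hypothetical and not part of any Kafka client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagSample:
    """One observation for a single (group, topic, partition). Hypothetical type."""
    timestamp_s: float      # when the sample was taken
    log_end_offset: int     # newest offset written to the partition
    committed_offset: int   # last offset committed by the consumer group

    @property
    def lag(self) -> int:
        # Lag is the distance between what was produced and what was committed.
        return max(self.log_end_offset - self.committed_offset, 0)


def _elapsed(prev: LagSample, curr: LagSample) -> float:
    dt = curr.timestamp_s - prev.timestamp_s
    if dt <= 0:
        raise ValueError("samples must be in increasing time order")
    return dt


def lag_rate(prev: LagSample, curr: LagSample) -> float:
    """Change in lag per second; positive means the consumer is falling behind."""
    return (curr.lag - prev.lag) / _elapsed(prev, curr)


def consume_rate(prev: LagSample, curr: LagSample) -> float:
    """Messages committed per second, used here as a proxy for throughput."""
    return (curr.committed_offset - prev.committed_offset) / _elapsed(prev, curr)


# Two samples taken 60 seconds apart on the same partition.
prev = LagSample(timestamp_s=0.0, log_end_offset=1000, committed_offset=900)
curr = LagSample(timestamp_s=60.0, log_end_offset=1600, committed_offset=1200)
rate = lag_rate(prev, curr)            # 5.0: lag grows by 5 messages per second
throughput = consume_rate(prev, curr)  # 5.0: the consumer is still making progress
```

In this example the consumer is working but producers outpace it, which is a different situation from the same lag with a consume rate of zero.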

False positives to avoid

Three false-positive patterns deserve recognition. The first is a brief lag spike during a deploy: the consumer was paused, so wait two minutes for the spike to clear before acting. The second is lag during a consumer group rebalance, which is expected behaviour whenever group membership changes; the agent should detect rebalance events and not alert on them. The third is lag on a low-traffic partition: 1k messages of lag on a partition receiving 10 messages per second is about 100 seconds of backlog, which is fine.
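A filter for these three patterns might look like the sketch below. The two-minute deploy window comes from the text. The 300-second time budget is an assumption, as is the choice to judge low-traffic lag by dividing it by the produce rate. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

DEPLOY_GRACE_S = 120.0   # "wait two minutes" from the text
MAX_TIME_LAG_S = 300.0   # hypothetical latency budget, not from the text


@dataclass(frozen=True)
class PartitionObservation:
    lag_messages: int
    produce_rate_per_s: float
    seconds_since_deploy: Optional[float]  # None if no recent deploy is known
    rebalance_in_progress: bool


def false_positive_reason(obs: PartitionObservation) -> Optional[str]:
    """Return why the lag can be ignored, or None if it deserves attention."""
    if obs.rebalance_in_progress:
        return "rebalance"
    if obs.seconds_since_deploy is not None and obs.seconds_since_deploy < DEPLOY_GRACE_S:
        return "deploy"
    # Convert message lag to approximate seconds of backlog at the produce rate.
    # With a zero produce rate the estimate is undefined, so the lag is not dismissed.
    if obs.produce_rate_per_s > 0:
        time_lag_s = obs.lag_messages / obs.produce_rate_per_s
        if time_lag_s <= MAX_TIME_LAG_S:
            return "low_traffic"
    return None


# The example from the text: 1k messages at 10 messages per second.
quiet = PartitionObservation(1000, 10.0, None, False)
reason = false_positive_reason(quiet)  # "low_traffic" (about 100 s of backlog)
```

Returning a reason instead of a boolean lets the agent log why it stayed silent, which helps when a dismissal is later questioned.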

Four remediations in order

The remediations are ordered from most reversible to irreversible. First, scale consumers up: this adds capacity (only up to one consumer per partition within the group) and is reversed by scaling back down. Second, rebalance the consumer group, which is reversible by triggering another rebalance. Third, drain in-flight buffers, which costs some throughput but does not lose data. Last, skip the partition by advancing the offset without processing: this is irreversible, the skipped data is lost, and it requires explicit human approval.
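One way to encode this ordering is an enum whose numeric value reflects reversibility, plus a gate on the irreversible step. This is a sketch under assumptions: the names are hypothetical, and how approval is actually obtained is left out.

```python
from enum import IntEnum
from typing import Optional, Set


class Remediation(IntEnum):
    """Ordered from most reversible (1) to irreversible (4), as in the text."""
    SCALE_UP = 1
    REBALANCE = 2
    DRAIN_BUFFERS = 3
    SKIP_PARTITION = 4


# Only the irreversible step is gated on a human.
REQUIRES_HUMAN_APPROVAL = frozenset({Remediation.SKIP_PARTITION})


def next_remediation(
    already_tried: Set[Remediation], human_approved: bool = False
) -> Optional[Remediation]:
    """Pick the most reversible step not yet tried, or None if nothing is allowed."""
    for step in Remediation:  # iterates in definition order
        if step in already_tried:
            continue
        if step in REQUIRES_HUMAN_APPROVAL and not human_approved:
            return None  # stop and wait for approval instead of skipping data
        return step
    return None


first = next_remediation(set())  # Remediation.SCALE_UP
```

Returning `None` when approval is missing keeps the agent from silently reaching the step that loses data.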

Decision tree

Four branches cover most scenarios. If lag is growing, the consumer is healthy, and throughput is high, it is a capacity issue: scale up. If lag is growing and the consumer is unhealthy, it is a consumer problem: restart, then investigate. If lag is flat, the consumer is probably keeping pace: verify with the offset position and don’t act. If lag is massive, the consumer is healthy, and throughput is zero, the consumer is stuck: investigate its processing logic.
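The four branches translate fairly directly into a function. The sketch below makes several assumptions the text does not state: the thresholds for "high" throughput and "massive" lag are hypothetical, the stuck-consumer check runs first, and any state outside the four branches is returned as `UNCLASSIFIED` rather than guessed at.

```python
from dataclasses import dataclass
from enum import Enum


class Trend(Enum):
    GROWING = "growing"
    FLAT = "flat"
    SHRINKING = "shrinking"


class Action(Enum):
    SCALE_UP = "scale_up"
    RESTART_THEN_INVESTIGATE = "restart_then_investigate"
    VERIFY_OFFSETS_NO_ACTION = "verify_offsets_no_action"
    INVESTIGATE_PROCESSING = "investigate_processing"
    UNCLASSIFIED = "unclassified"  # not in the text: fallback for other states


# Hypothetical thresholds; the text does not define "high" or "massive".
HIGH_THROUGHPUT_PER_S = 100.0
MASSIVE_LAG_MESSAGES = 1_000_000


@dataclass(frozen=True)
class GroupState:
    lag_messages: int
    lag_trend: Trend
    consumer_healthy: bool
    throughput_per_s: float


def decide(state: GroupState) -> Action:
    # Checked first (an ordering assumption) so that a stuck consumer
    # is never mistaken for a capacity problem.
    if (
        state.consumer_healthy
        and state.throughput_per_s == 0
        and state.lag_messages >= MASSIVE_LAG_MESSAGES
    ):
        return Action.INVESTIGATE_PROCESSING
    if state.lag_trend is Trend.GROWING:
        if not state.consumer_healthy:
            return Action.RESTART_THEN_INVESTIGATE
        if state.throughput_per_s >= HIGH_THROUGHPUT_PER_S:
            return Action.SCALE_UP
    if state.lag_trend is Trend.FLAT:
        return Action.VERIFY_OFFSETS_NO_ACTION
    return Action.UNCLASSIFIED


capacity_case = GroupState(50_000, Trend.GROWING, True, 500.0)
action = decide(capacity_case)  # Action.SCALE_UP
```

An explicit fallback makes gaps in the tree visible: a growing lag with a healthy consumer and low but non-zero throughput, for example, matches none of the four branches.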

Eval cases

Four eval cases check that the agent behaves as intended. Real lag spike: the agent identifies it and recommends scaling. Rebalance event: the agent identifies it and does not alert. Stuck consumer: the agent identifies it and recommends a restart. False alarm (low-traffic partition): the agent identifies it and dismisses it.
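The four cases can be written as a small table-driven harness. This is a sketch only: the scenario fields, the expected labels, and `toy_agent` are hypothetical stand-ins, and a real eval would call the actual agent in place of the toy function.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Scenario = Dict[str, object]


@dataclass(frozen=True)
class EvalCase:
    name: str
    scenario: Scenario   # hypothetical input fields
    expected: str        # hypothetical label for the expected outcome


CASES = [
    EvalCase("real lag spike",
             {"lag_trend": "growing", "throughput_per_s": 500.0}, "recommend_scaling"),
    EvalCase("rebalance event",
             {"rebalance_in_progress": True}, "no_alert"),
    EvalCase("stuck consumer",
             {"lag_trend": "growing", "throughput_per_s": 0.0}, "recommend_restart"),
    EvalCase("low-traffic partition",
             {"low_traffic": True, "throughput_per_s": 10.0}, "dismiss"),
]


def run_evals(agent: Callable[[Scenario], str], cases: List[EvalCase]) -> List[str]:
    """Return the names of the cases where the agent's answer differs from expected."""
    return [case.name for case in cases if agent(case.scenario) != case.expected]


def toy_agent(scenario: Scenario) -> str:
    """Stand-in for the real agent, present only so the harness can run."""
    if scenario.get("rebalance_in_progress"):
        return "no_alert"
    if scenario.get("low_traffic"):
        return "dismiss"
    if scenario.get("throughput_per_s") == 0:
        return "recommend_restart"
    return "recommend_scaling"


failures = run_evals(toy_agent, CASES)  # empty list: the toy agent passes all four
```

Reporting failing case names, not just a pass count, points directly at the scenario the agent misread.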