Kafka Consumer Lag: An Agent's Decision Tree
Raw lag is misleading on its own. This guide covers the signals an agent should weigh, the false positives to avoid, and the four remediations it can apply, ordered from most to least reversible.
Read the lag correctly
Reading lag correctly is the foundation. Lag is measured per partition and per consumer group, so “total lag” hides which partition is the problem. The relevant signal is the rate of change of lag, not its absolute value: growing lag means consumers can’t keep up, while flat lag means they are keeping pace with producers. Always pair lag with consumer throughput, because high lag with high throughput is a different problem from high lag with zero throughput.
- Per-partition, per-consumer-group. Total lag hides which partition is problematic.
- Rate-of-change over absolute. Growing lag = falling behind; flat lag = keeping pace with producers.
- Pair with throughput. High lag with high throughput and high lag with zero throughput are different problems.
- Per-signal context. The agent needs all three signals to triage correctly.
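The three signals above can be derived from two offset samples of the same partition. The following is a minimal sketch, not a definitive implementation: the `Sample` type and `lag_signals` function are hypothetical names, and it assumes the agent can already read the log end offset and the group's committed offset for each partition from its monitoring source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One observation of a single partition for one consumer group (hypothetical shape)."""
    timestamp: float       # seconds since epoch
    end_offset: int        # log end offset of the partition
    committed_offset: int  # last offset committed by the group

    @property
    def lag(self) -> int:
        # Lag is the distance between the end of the log and the committed position.
        return max(self.end_offset - self.committed_offset, 0)


def lag_signals(prev: Sample, curr: Sample) -> dict:
    """Derive lag, rate of lag change, and throughput from two samples."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return {
        "lag": curr.lag,
        # Positive: falling behind. Near zero: keeping pace. Negative: catching up.
        "lag_rate": (curr.lag - prev.lag) / elapsed,
        # Messages consumed per second, measured by committed-offset progress.
        "throughput": (curr.committed_offset - prev.committed_offset) / elapsed,
    }
```

For example, with `prev = Sample(0.0, 10_000, 9_000)` and `curr = Sample(60.0, 16_000, 12_000)`, lag grew from 1,000 to 4,000 in 60 seconds while the consumer still committed 3,000 messages: a lag rate of 50 messages per second alongside a throughput of 50 messages per second, which points at capacity rather than a stalled consumer.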
False positives to avoid
Three false-positive patterns deserve recognition. Brief lag spikes during deploys: the consumer was paused, so wait two minutes for the spike to clear. Lag during a consumer group rebalance: this is expected behaviour while group membership changes, so the agent should detect rebalance events and not alert. Lag on low-traffic partitions: 1k messages of lag on a partition receiving 10 messages per second is about 100 seconds of backlog, which is usually fine.
- Deploy-pause spikes. Brief; wait 2 minutes; usually clears.
- Rebalance lag. Expected during consumer group membership change; agent detects and skips.
- Low-traffic partition lag. 1k lag on a 10 msg/s partition is fine; absolute lag misleads.
- Per-pattern detection. Each false-positive pattern has its own signature; the agent recognises before alerting.
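One way to check the three signatures before alerting is sketched below. Everything here is illustrative: the `Context` fields, the function name, and the low-traffic thresholds are assumptions rather than values from any Kafka tooling, and the sketch presumes the agent already knows whether the group is rebalancing and when the last deploy happened. Only the two-minute grace window comes from the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Context:
    """Hypothetical per-partition snapshot handed to the agent."""
    lag: int                               # messages behind on this partition
    produce_rate: float                    # messages/second written to the partition
    seconds_since_deploy: Optional[float]  # None if no recent deploy is known
    rebalancing: bool                      # True while the consumer group is rebalancing


DEPLOY_GRACE_SECONDS = 120.0  # the two-minute wait described above
LOW_TRAFFIC_RATE = 20.0       # assumed cut-off for "low traffic"; tune per topic
LOW_TRAFFIC_MAX_LAG = 2_000   # assumed lag tolerated on a low-traffic partition


def false_positive_reason(ctx: Context) -> Optional[str]:
    """Return the matching false-positive pattern, or None if the lag looks real."""
    if ctx.rebalancing:
        return "rebalance"
    if ctx.seconds_since_deploy is not None and ctx.seconds_since_deploy < DEPLOY_GRACE_SECONDS:
        return "deploy-pause"
    if ctx.produce_rate <= LOW_TRAFFIC_RATE and ctx.lag <= LOW_TRAFFIC_MAX_LAG:
        return "low-traffic"
    return None
```

Returning the reason rather than a bare boolean lets the agent log why it stayed quiet, which makes suppressed alerts easier to audit later.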
Four remediations in order
The remediation order goes from reversible to irreversible. Scale consumers up first: this adds capacity and is reversed by scaling back. Rebalance the consumer group second: this is reversible by triggering another rebalance. Drain in-flight buffers third: this costs some throughput but does not lose data. Skip the partition last: advancing the offset without processing is irreversible, the skipped data is lost, and it requires explicit human approval.
- 1. Scale up. Most reversible; adds capacity; reverses by scaling back.
- 2. Rebalance group. Reversible by triggering another rebalance.
- 3. Drain buffers. Commit offsets, then pause until in-flight buffers clear; loses throughput, not data.
- 4. Skip partition. Irreversible; data lost; explicit human approval required.
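The ordering and the approval gate on the last step can be encoded as data, so the agent cannot reach the irreversible action by accident. This is a sketch under assumptions: the `Remediation` type, the step names, and `next_remediation` are hypothetical, and actually executing each step is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Remediation:
    name: str
    loses_data: bool
    needs_human_approval: bool


# Ordered from most to least reversible, following the list above.
LADDER = (
    Remediation("scale_up", loses_data=False, needs_human_approval=False),
    Remediation("rebalance_group", loses_data=False, needs_human_approval=False),
    Remediation("drain_buffers", loses_data=False, needs_human_approval=False),
    Remediation("skip_partition", loses_data=True, needs_human_approval=True),
)


def next_remediation(already_tried: set, human_approved: bool = False) -> Optional[Remediation]:
    """Return the next untried step, or None when the agent must stop and escalate."""
    for step in LADDER:
        if step.name in already_tried:
            continue
        if step.needs_human_approval and not human_approved:
            return None  # never take the irreversible step without sign-off
        return step
    return None
```

With nothing tried yet, `next_remediation(set())` returns the `scale_up` step. Once the first three steps have been tried it returns `None` until a human approves, at which point `skip_partition` becomes available.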
Decision tree
Four branches cover most scenarios. Lag growing AND consumer healthy AND throughput high: capacity issue, scale up. Lag growing AND consumer unhealthy: consumer problem, restart then investigate. Lag flat: consumers are probably keeping pace, so verify with offset position and don’t act. Lag massive AND consumer healthy AND throughput zero: stuck, investigate the consumer’s processing logic.
- Growing lag, healthy, high throughput. Capacity issue; scale up.
- Growing lag, unhealthy consumer. Consumer problem; restart, then investigate.
- Flat lag. Probably keeping pace; verify with offset position; do not act.
- Massive lag, zero throughput. Stuck consumer; investigate processing logic.
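The four branches map naturally onto a small function. The version below is a sketch, not the agent's actual logic: the thresholds for "growing", "high", and "massive" are hypothetical placeholders, and two behaviours are assumptions that the text does not specify, namely treating shrinking lag the same as flat lag and escalating to a human when no branch matches.

```python
from enum import Enum


class Action(Enum):
    SCALE_UP = "scale up consumers"
    RESTART = "restart consumer, then investigate"
    NO_ACTION = "verify offset position, do not act"
    INVESTIGATE = "investigate processing logic"
    ESCALATE = "no branch matched, hand to a human"


# Hypothetical thresholds; real values depend on the topic and workload.
GROWING_RATE = 1.0       # lag growth in messages/second that counts as "growing"
HIGH_THROUGHPUT = 100.0  # consumed messages/second that counts as "high"
MASSIVE_LAG = 100_000    # messages of lag that counts as "massive"


def triage(lag: int, lag_rate: float, throughput: float, healthy: bool) -> Action:
    # Check the stuck case first: it can occur whether lag is growing or flat.
    if healthy and throughput <= 0 and lag >= MASSIVE_LAG:
        return Action.INVESTIGATE
    if lag_rate > GROWING_RATE:
        if not healthy:
            return Action.RESTART
        if throughput >= HIGH_THROUGHPUT:
            return Action.SCALE_UP
        return Action.ESCALATE  # growing lag, healthy, but low throughput: not covered
    # Flat lag; shrinking lag is treated the same way (an assumption).
    return Action.NO_ACTION
```

The inputs are the same signals discussed under "Read the lag correctly", plus a health flag that the agent would need to obtain from its own health checks.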
Eval cases
Four eval cases check that the agent behaves as intended. Real lag spike: agent identifies it and recommends scaling. Rebalance event: agent identifies it and does not alert. Stuck consumer: agent identifies it and recommends restart. False alarm (low-traffic partition): agent identifies it and dismisses.
- Real lag spike. Agent identifies; recommends scaling.
- Rebalance event. Agent identifies; does not alert.
- Stuck consumer. Agent identifies; recommends restart.
- False alarm. Low-traffic partition; agent identifies and dismisses.
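The four cases can be written as a table-driven check that runs against any agent exposed as a callable. This is a sketch under assumptions: the observation fields, the verdict strings, and every number in the fixtures are invented for illustration, not taken from a real cluster.

```python
from typing import Callable, Dict, List, NamedTuple


class EvalCase(NamedTuple):
    name: str
    observation: Dict[str, object]  # hypothetical input shape for the agent
    expected: str                   # verdict the agent should return


# Illustrative fixtures only; the values are made up.
CASES = [
    EvalCase("real lag spike",
             {"lag": 40_000, "lag_rate": 50.0, "throughput": 500.0, "rebalancing": False},
             "scale"),
    EvalCase("rebalance event",
             {"lag": 8_000, "lag_rate": 30.0, "throughput": 0.0, "rebalancing": True},
             "no_alert"),
    EvalCase("stuck consumer",
             {"lag": 500_000, "lag_rate": 20.0, "throughput": 0.0, "rebalancing": False},
             "restart"),
    EvalCase("false alarm",
             {"lag": 1_000, "lag_rate": 0.0, "throughput": 10.0, "rebalancing": False},
             "dismiss"),
]


def run_evals(agent: Callable[[Dict[str, object]], str]) -> List[str]:
    """Return the names of the cases the agent got wrong."""
    return [case.name for case in CASES if agent(case.observation) != case.expected]


def always_scale(observation: Dict[str, object]) -> str:
    """Deliberately naive baseline: recommends scaling for every observation."""
    return "scale"
```

Running `run_evals(always_scale)` reports the rebalance, stuck-consumer, and false-alarm cases as failures, which shows why a single scaling case is not enough to evaluate the agent.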