Hot-Loop Detection in Production Code
Some loops run far more often than intended, eating CPU and flooding logs. This article covers how to detect them and how to fix them.
Symptoms
Hot loops are code paths that execute thousands or millions of times more often than intended. The cause is usually a logic error: a missing exit condition, a retry without backoff, or a polling loop with too short an interval. The effects are CPU saturation, exploding log volume, and rate-limit errors from downstream APIs. Detecting hot loops early keeps them from becoming production incidents.
What hot loop symptoms look like:
- CPU pegged on a single instance: One instance shows sustained CPU usage near 100% while the others are normal. The pattern points to something specific to that instance: a particular request triggered a hot loop, or a stuck process is consuming the CPU.
- Specific log line repeated thousands of times per minute: A log line that should appear occasionally is appearing constantly. The volume is the signal; legitimate logging is bounded by request rate, while a hot loop is bounded only by CPU.
- Same code path in many traces with high frequency: Distributed tracing shows the same code path appearing in many traces, or repeating many times inside a single trace. The trace data localizes the loop to specific code; the high frequency confirms that the pattern is pathological.
- API rate limits hit: Hot loops that call external APIs trigger rate-limit responses. The rate limit is the canary; the hot loop issues calls faster than any reasonable workload would.
- Memory growth: Some hot loops accumulate state on every iteration. Memory grows continuously until the process runs out of memory and crashes. The crash is the result; the loop is the cause.
The symptoms are recognizable once the team knows to look for them. The challenge is detecting them before they cause production impact.
Detection
Detection of hot loops uses a combination of log-volume analysis and CPU profiling. Each catches a different class of hot loop; the combination produces broad coverage.
- Per-line log frequency dashboard: A dashboard counts how often each unique log line appears and flags lines that repeat more than 1000 times per minute. The volume threshold catches hot loops that produce log output; the dashboard makes them visible at a glance.
- Lines over 1000 per minute are suspect: Most legitimate log lines occur at rates bounded by request rate. A line at 1000 per minute is suspicious; a line at 10,000 per minute is almost certainly pathological. The threshold is a heuristic, not a hard rule.
- CPU profiling: Sample-based CPU profilers such as perf, async-profiler, and py-spy show which functions consume the most CPU. Hot loops surface as functions with disproportionate CPU usage; the profile identifies the loop's location.
- Sample profiling reveals hot functions: The profiler samples the call stack periodically. Hot loops appear in many samples; cold code appears rarely. Aggregating the samples gives a clear picture of where time is spent.
- Continuous profiling: Tools like Pyroscope, Parca, and Amazon CodeGuru Profiler run continuously and capture profile data over time. Continuous capture catches hot loops that appear intermittently; one-shot profiling can miss them.
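The per-line frequency check can be sketched in a few lines of Python. This is a minimal illustration, not a dashboard implementation: the function name `flag_hot_lines` and the input shape (an iterable of `(timestamp, message)` pairs with `datetime` timestamps) are assumptions made for the example. Only the 1000-per-minute threshold comes from the text.

```python
from collections import Counter

THRESHOLD_PER_MINUTE = 1000  # heuristic threshold quoted in the text


def flag_hot_lines(records, threshold=THRESHOLD_PER_MINUTE):
    """Return {(minute, message): count} for lines over the threshold.

    `records` is an iterable of (timestamp, message) pairs, where each
    timestamp is a datetime. That input shape is an assumption of this
    sketch, not a real log pipeline format.
    """
    counts = Counter()
    for timestamp, message in records:
        # Truncate to the start of the minute so each line is bucketed
        # per minute rather than counted over the whole input.
        minute = timestamp.replace(second=0, microsecond=0)
        counts[(minute, message)] += 1
    return {key: n for key, n in counts.items() if n > threshold}
```

Real log lines usually embed variable fields such as request IDs, so a practical version would likely normalize each message to a template before counting. That step is left out here.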
The detection layer is what catches hot loops before they cascade into incidents. Without detection, hot loops are usually noticed only when they cause visible damage.
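The sampling idea behind those profilers can be illustrated with a toy in-process sampler. This is a sketch for intuition only: the names `sample_calling_thread` and `spin` are hypothetical, and the approach relies on the CPython-specific `sys._current_frames()`. Production profilers such as py-spy work from outside the process and do far more.

```python
import collections
import sys
import threading
import time


def sample_calling_thread(target, interval=0.005):
    """Run target() while a background thread samples this thread's stack.

    Returns a Counter mapping function name -> number of samples in which
    that function was the innermost Python frame of the calling thread.
    """
    thread_id = threading.get_ident()
    counts = collections.Counter()
    stop = threading.Event()

    def sampler():
        while not stop.is_set():
            # sys._current_frames() is CPython-specific: it maps each
            # thread id to that thread's current innermost frame.
            frame = sys._current_frames().get(thread_id)
            if frame is not None:
                counts[frame.f_code.co_name] += 1
            time.sleep(interval)

    worker = threading.Thread(target=sampler, daemon=True)
    worker.start()
    try:
        target()
    finally:
        stop.set()
        worker.join()
    return counts


def spin(duration=0.3):
    """Stand-in for a hot loop: burn CPU for `duration` seconds."""
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        pass
```

Calling `sample_calling_thread(spin)` should show `spin` in nearly every sample, which is the same signal a real profile gives for a hot loop: one function dominating the sample counts.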
Fix
The fix for a hot loop depends on the cause. The common patterns produce specific fixes; understanding the pattern guides the remediation.
- Add backoff: Retry loops without backoff become hot loops when the operation keeps failing. Add exponential backoff with jitter so the loop throttles itself. The backoff keeps the loop from saturating CPU even when failures persist.
- Caching: Loops that recompute the same value many times benefit from caching. A cache hit replaces the expensive operation with a fast lookup, so each iteration costs a lookup rather than a recomputation.
- Conditional logging that suppresses repeats: Logs that fire on every iteration should detect repetition and suppress it. Log once, increment a counter, and log again with the count when the condition changes. This keeps log volume from burying other useful logs.
- Test the fix: After fixing, exercise the same workload that triggered the loop. The pattern should not recur, CPU should stay normal, and logs should not explode. Without this test, there is no evidence that the fix addresses the root cause.
- Add detection that catches recurrence: The dashboard or alerting that caught this hot loop should catch future ones too. The investment in detection pays off across many future incidents.
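The backoff and repeat-suppression fixes can be combined in one retry helper. The following is a minimal sketch under stated assumptions: the names `backoff_delays` and `call_with_backoff`, the default delays, and the attempt limit are all illustrative choices, not values from the text. It uses "full jitter" (a random delay between zero and the exponential ceiling), which is one common way to add jitter.

```python
import logging
import random
import time

logger = logging.getLogger("retry")


def backoff_delays(base=0.5, cap=30.0, rng=random.random):
    """Yield exponentially growing delays with full jitter, capped at `cap`."""
    attempt = 0
    while True:
        # Clamp the exponent so the power never grows without bound.
        ceiling = min(cap, base * 2 ** min(attempt, 32))
        yield rng() * ceiling
        attempt += 1


def call_with_backoff(operation, max_attempts=8, sleep=time.sleep, **backoff_kwargs):
    """Call operation() until it succeeds or max_attempts is reached.

    Logs the first failure, counts later failures silently, and logs one
    summary line when the outcome changes (success or giving up).
    """
    delays = backoff_delays(**backoff_kwargs)
    failures = 0
    for attempt in range(1, max_attempts + 1):
        try:
            result = operation()
        except Exception as exc:
            failures += 1
            if failures == 1:
                logger.warning("operation failed, retrying: %s", exc)
            if attempt == max_attempts:
                # Explicit exit condition: never retry forever.
                logger.error("giving up after %d failed attempts", failures)
                raise
            sleep(next(delays))
        else:
            if failures:
                logger.info("operation succeeded after %d failed attempts", failures)
            return result
```

The `sleep` and `rng` parameters exist so the helper can be exercised in tests without real waiting. The bounded `max_attempts` supplies the exit condition whose absence causes many hot loops in the first place.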
Hot loop detection is one of those reliability disciplines that catches a class of issues that traditional monitoring misses. Nova AI Ops integrates with log volume data and CPU profiling, surfaces hot loop candidates, and produces the queue that engineers work from to remediate before production impact.