Livelock Detector watches for oscillation patterns: scale up, scale down, scale up, scale down. Open a feature flag, close it, open it again. Agents fighting agents. The detector recognizes the pattern, halts both sides, pages an operator, and writes a runbook with the loop reproduction. No more "we paid for autoscaling that flapped 800 times overnight."
The detector keeps a rolling window per resource (per service, per flag, per IAM role) of the last few state-change actions. When the window shows three identical reversals (state A → B → A → B → A → B), it declares a livelock. Three cycles is enough to rule out coincidence and few enough to halt before damage is done. At that point both sides are paused, and the loop reproduction is written to a runbook.
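The rolling-window check can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the product's implementation: the class name `LivelockDetector`, the `record` method, and the reading of "three reversals" as six strictly alternating states in the window are all hypothetical.

```python
from collections import defaultdict, deque


class LivelockDetector:
    """Hypothetical sketch: one rolling window of recent states per resource."""

    def __init__(self, cycles=3):
        self.cycles = cycles
        # Assumption: A -> B -> A -> B -> A -> B means 2 * cycles states.
        self.windows = defaultdict(lambda: deque(maxlen=2 * cycles))

    def record(self, resource, state):
        """Record a state change; return True if the window shows a livelock."""
        window = self.windows[resource]
        if window and window[-1] == state:
            return False  # same state again, so nothing actually changed
        window.append(state)
        return self.is_livelocked(resource)

    def is_livelocked(self, resource):
        states = list(self.windows[resource])
        if len(states) < 2 * self.cycles:
            return False  # not enough history yet
        a, b = states[0], states[1]
        # Livelock: the window strictly alternates between two distinct states.
        return a != b and all(
            s == (a if i % 2 == 0 else b) for i, s in enumerate(states)
        )


detector = LivelockDetector()
for state in ["up", "down", "up", "down", "up", "down"]:
    tripped = detector.record("checkout-service", state)
print("livelock declared:", tripped)
```

Because each window is keyed by resource, a flapping service does not affect the history kept for an unrelated flag or IAM role.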
When the detector triggers, both agents are paused on the contested resource only. They keep working on other resources. The contested resource is locked from automated change until an operator reviews it. This keeps disruption to a minimum while stopping the loop.
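A minimal sketch of that scoping, assuming a simple in-memory registry. The names `ContestedResourceLocks`, `halt`, `may_change`, and `release` are hypothetical and chosen for illustration only.

```python
class ContestedResourceLocks:
    """Hypothetical registry of resources locked after a livelock."""

    def __init__(self):
        # resource name -> agents that were fighting over it
        self._paused = {}

    def halt(self, resource, agents):
        """Lock one contested resource and remember which agents were paused."""
        self._paused[resource] = frozenset(agents)

    def may_change(self, agent, resource):
        """Automated change is blocked only on locked resources."""
        return resource not in self._paused

    def paused_agents(self, resource):
        return self._paused.get(resource, frozenset())

    def release(self, resource, reviewed_by):
        """Unlock a resource once a named operator has reviewed the loop."""
        if not reviewed_by:
            raise ValueError("an operator must review before release")
        self._paused.pop(resource, None)


locks = ContestedResourceLocks()
locks.halt("checkout-service", ["autoscaler", "cost-optimizer"])
print(locks.may_change("autoscaler", "checkout-service"))
print(locks.may_change("autoscaler", "search-service"))
```

The lock is keyed by resource rather than by agent, which is what lets both agents keep working everywhere else.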
The detector writes a runbook capturing the loop's state machine: which agent reverses what, at which threshold, citing what evidence. The runbook is the artifact you read to understand the conflict. Most loops are resolved either by tightening one agent's trigger threshold or by adding a hysteresis band so the two agents stop contradicting each other.
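To see why a hysteresis band breaks a loop, consider a sketch with two hypothetical agents: one scales up when load rises above `up_at`, the other scales down when load falls below `down_at`. The thresholds and load samples are made up for illustration.

```python
def next_state(load, current, up_at, down_at):
    """Apply both agents' rules; inside the band neither agent acts."""
    if load > up_at:
        return "scaled_up"      # scale-up agent fires
    if load < down_at:
        return "scaled_down"    # scale-down agent fires
    return current              # within the band: leave the resource alone


def count_changes(loads, up_at, down_at, start="scaled_down"):
    """Count how many times the resource changes state over a load series."""
    state, changes = start, 0
    for load in loads:
        new_state = next_state(load, state, up_at, down_at)
        if new_state != state:
            changes += 1
            state = new_state
    return changes


loads = [58, 62, 59, 61, 58, 63]  # load hovering around 60

# Both agents share one threshold: every small wobble triggers a reversal.
print("shared threshold:", count_changes(loads, up_at=60, down_at=60))

# A band between 50 and 70: the same wobble triggers nothing.
print("with hysteresis band:", count_changes(loads, up_at=70, down_at=50))
```

The band trades a little responsiveness for stability: the agents react only to load that clearly leaves the band, not to noise around a single cut-off.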
A loop a week is normal in a busy fleet. A loop a day is a tuning problem. The weekly report tracks loop count, top contested resources, top contributing agents, and the resolution rate. Use it to spot configuration mistakes before they cause customer-visible incidents.
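The four report figures are straightforward to derive from a list of detected loops. The sketch below assumes a hypothetical `LoopRecord` shape and sample data; the real report's fields and storage are not described here.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class LoopRecord:
    """Hypothetical record of one detected loop."""
    resource: str
    agents: tuple
    resolved: bool


def weekly_report(loops, top_n=3):
    """Summarize a week of loops into the four figures named above."""
    resources = Counter(loop.resource for loop in loops)
    agents = Counter(agent for loop in loops for agent in loop.agents)
    resolved = sum(1 for loop in loops if loop.resolved)
    return {
        "loop_count": len(loops),
        "top_contested_resources": resources.most_common(top_n),
        "top_contributing_agents": agents.most_common(top_n),
        # No loops means there is no rate to report.
        "resolution_rate": resolved / len(loops) if loops else None,
    }


sample = [
    LoopRecord("checkout-service", ("autoscaler", "cost-optimizer"), True),
    LoopRecord("checkout-service", ("autoscaler", "cost-optimizer"), False),
    LoopRecord("beta-flag", ("rollout", "rollback"), True),
]
report = weekly_report(sample)
print(report["loop_count"], report["top_contested_resources"][0])
```

A resource or agent that tops the list week after week is the likely place to look for a misconfigured threshold.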
Catching oscillation early is the difference between a learning system and a system that burns cloud credits all night.