AI Safety & Governance

Two agents fighting in a loop is not progress.
The detector catches it in seconds, not days.

Livelock Detector watches for oscillation patterns: scale up, scale down, scale up, scale down. Open a feature flag, close it, open it again. Agents fighting agents. The detector recognizes the pattern, halts both sides, pages an operator, and writes a runbook with the loop reproduction. No more "we paid for autoscaling that flapped 800 times overnight."

app.novaaiops.com / livelock

● LIVE

Detected · last 7d

payments · scale ping-pong · 12 cycles · halted
agents: cost-trimmer ↔ scale-out-on-load

cart · flag flapping · 5 cycles · paged
agents: experiment-arbiter ↔ rollout-watcher

How It Detects

Three cycles, then halt

The detector keeps a rolling window per resource (per service, per flag, per IAM role) of the last few state-change actions. When the window shows the same swing three times, with each swing undone in between (state A → B → A → B → A → B), it declares a livelock. Three cycles are enough to rule out a one-off correction and few enough to halt before damage accumulates. Both sides are paused, and the loop reproduction is written to a runbook.

✓ Rolling window per resource: state changes are scoped to the resource, not global, so unrelated agents do not falsely trigger
✓ Three reversals = halt: two could be a normal correction; three is a pattern
✓ Both sides paused: both contributing agents go read-only on the contested resource until an operator releases them

app.novaaiops.com / livelock · detection

Window · payments-deploy replicas

14:02 · scale up · 4 → 8 · scale-out-on-load
14:08 · scale down · 8 → 4 · cost-trimmer
14:14 · scale up · 4 → 8 · scale-out-on-load
14:20 · scale down · 8 → 4 · cost-trimmer
14:26 · scale up · 4 → 8 · scale-out-on-load · livelock detected · halt
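
The rolling-window check can be sketched in a few lines of Python. This is a minimal illustration, not the product's implementation: the names `Change`, `LivelockDetector`, and `observe` are hypothetical, and it assumes that "three cycles" means five strictly alternating changes, which matches the example window above, where the halt fires on the third scale-up.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Change:
    """One state-change action on one resource (hypothetical shape)."""
    resource: str  # e.g. "k8s-deploy/payments/replicas"
    agent: str     # agent that made the change
    before: int    # value before the change
    after: int     # value after the change


class LivelockDetector:
    """Flags a resource once its window is a pure A -> B -> A -> B ... pattern."""

    def __init__(self, cycles: int = 3):
        # Assumption: N cycles = N identical swings with N - 1 reversals between.
        self.needed = 2 * cycles - 1
        # One bounded window per resource, so unrelated resources never mix.
        self.windows = defaultdict(lambda: deque(maxlen=self.needed))

    def observe(self, change: Change) -> Optional[Set[str]]:
        """Record a change; return the contributing agents on livelock, else None."""
        window = self.windows[change.resource]
        window.append(change)
        if len(window) < self.needed:
            return None
        changes = list(window)
        first, second = changes[0], changes[1]
        # The second change must exactly undo the first one.
        if first.before == first.after:
            return None
        if (second.before, second.after) != (first.after, first.before):
            return None
        # Every later change must repeat the change two steps before it.
        for i in range(2, len(changes)):
            earlier, current = changes[i - 2], changes[i]
            if (current.before, current.after) != (earlier.before, earlier.after):
                return None
        return {c.agent for c in changes}
```

Feeding the five changes from the window above into `observe` returns `None` for the first four and the set of both agent names on the fifth. Because each window is keyed by resource, changes to other resources never count toward that pattern.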

Halt Behavior

Both agents go read-only on the resource

When the detector triggers, both agents are paused on the contested resource only. They keep working on other resources. The contested resource is locked from automated change until an operator reviews. This minimizes the disruption while preventing the loop from continuing.

✓ Resource-scoped pause: agents are not killed; they cannot touch the contested resource
✓ Other work continues: the same agents keep handling unrelated incidents on other services
✓ Lock cleared by operator: release requires a human acknowledgment that the conflict is resolved
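
One way to picture the resource-scoped pause is a small lock registry keyed by resource. The sketch below is an assumption-laden illustration: `ResourceLocks` and its methods are hypothetical names, and the real release flow is not described beyond requiring a human acknowledgment.

```python
class ResourceLocks:
    """Blocks automated writes to contested resources only (hypothetical sketch)."""

    def __init__(self):
        # resource -> set of agents that were paused on it
        self._locked = {}

    def halt(self, resource, agents):
        """Lock the contested resource and remember which agents were fighting."""
        self._locked[resource] = set(agents)

    def may_write(self, resource):
        """Automated changes are allowed everywhere except locked resources."""
        return resource not in self._locked

    def paused_agents(self, resource):
        """Agents paused on this resource; empty if it is not locked."""
        return set(self._locked.get(resource, ()))

    def release(self, resource, operator, acknowledged):
        """Clear the lock only after a named human confirms the conflict is resolved."""
        if not operator or not acknowledged:
            raise PermissionError("release requires an operator acknowledgment")
        self._locked.pop(resource, None)
```

The lock is keyed by resource rather than by agent, so the same agents can keep writing to every other resource while the contested one waits for review.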

app.novaaiops.com / livelock · halt

Halt state · payments-deploy replicas

contested resource: k8s-deploy/payments/replicas
scale-out-on-load: paused on resource
cost-trimmer: paused on resource
other resources: unaffected

Auto-Generated Runbook

A reproduction the operator can review

The detector writes a runbook capturing the loop's state machine: which agent reverses what, at which threshold, citing what evidence. The runbook is the artifact you read to understand the conflict. Most loops are resolved by either tightening one agent's trigger threshold or adding a hysteresis band so the two agents stop contradicting each other.

✓ State machine captured: the runbook shows the trigger conditions and reversals as a small diagram
✓ Suggested resolutions: common fixes include hysteresis bands, ownership boundaries, and escalation thresholds
✓ One-click apply: recommended config changes can be applied from the runbook with operator approval

app.novaaiops.com / livelock · runbook

Runbook · payments scale-loop

scale-out trigger: cpu > 70% for 60s
scale-down trigger: cpu < 65% for 60s
diagnosis: overlapping bands
fix: scale-down threshold to < 50% for 5m
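
To see why widening the band helps, the two triggers from the runbook example can be compared before and after the fix. This is a simplified sketch: `Trigger`, `hysteresis_gap`, and `decide` are hypothetical names, and `decide` looks only at the CPU thresholds, ignoring the hold durations (60s and 5m) that also damp flapping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    threshold_pct: float  # CPU percentage the trigger compares against
    hold_seconds: int     # how long the condition must hold (not modeled below)


def hysteresis_gap(scale_out: Trigger, scale_down: Trigger) -> float:
    """Width of the dead band in which neither agent acts."""
    return scale_out.threshold_pct - scale_down.threshold_pct


def decide(cpu_pct: float, scale_out: Trigger, scale_down: Trigger) -> str:
    """Which action the pair of triggers would take at a given CPU level."""
    if cpu_pct > scale_out.threshold_pct:
        return "scale up"
    if cpu_pct < scale_down.threshold_pct:
        return "scale down"
    return "hold"  # inside the dead band, nobody moves


scale_out = Trigger(threshold_pct=70, hold_seconds=60)
before_fix = Trigger(threshold_pct=65, hold_seconds=60)   # original scale-down
after_fix = Trigger(threshold_pct=50, hold_seconds=300)   # runbook's suggested fix
```

With the original triggers the dead band is 5 points wide, so a CPU reading of 60% right after a scale-up already qualifies for a scale-down. With the suggested fix the band is 20 points wide and the same reading falls inside it, so `decide` returns "hold".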

Reporting

Loops are tuning signal, not noise

A loop a week is normal in a busy fleet. A loop a day is a tuning problem. The weekly report tracks loop count, top contested resources, top contributing agents, and the resolution rate. Use it to spot configuration mistakes before they cause customer-visible incidents.

✓ Weekly loop report: count, top resources, top agents, and resolution rate, emailed on Mondays
✓ Contested-resource ranking: helps you see which areas of the platform have ambiguous ownership between agents
✓ Per-agent contribution: agents that show up in many loops may need a tighter scope or a longer trigger window
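
The report fields listed above amount to a few counts over the week's loop records. The sketch below shows one possible aggregation; the `Loop` record and `weekly_report` function are hypothetical, and the actual report format is not specified here.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Loop:
    """One detected livelock (hypothetical record shape)."""
    resource: str
    agents: Tuple[str, ...]
    resolved: bool


def weekly_report(loops, top_n=3):
    """Summarize loop count, top resources, top agents, and resolution rate."""
    by_resource = Counter(loop.resource for loop in loops)
    by_agent = Counter(agent for loop in loops for agent in loop.agents)
    resolved = sum(1 for loop in loops if loop.resolved)
    return {
        "loop_count": len(loops),
        "top_resources": by_resource.most_common(top_n),
        "top_agents": by_agent.most_common(top_n),
        # No loops means there is no rate to report.
        "resolution_rate": resolved / len(loops) if loops else None,
    }
```

An agent that ranks high in `top_agents` across several different resources is a candidate for a tighter scope, while a resource that ranks high in `top_resources` points to ambiguous ownership.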

app.novaaiops.com / livelock · report

Loops · this month

loops · resolved · recurring · avg dwell: 42m

A loop is not effort, it is waste

Catching oscillation early is the difference between a learning system and a system that burns cloud credits all night.
