Agent Fitness gives every agent its own eval set: a curated list of synthetic incidents and the expected agent behavior. Before any prompt change ships, the eval runs. Pass means ship. Fail means investigate. The eval grows over time from real incidents that taught the agent something.
Each eval is a synthetic incident (signal payload, context bundle, expected tool calls, expected outcome). The runner injects the synthetic into the agent in a sandbox, captures what the agent does, and compares against the expected behavior. Pass requires the right tool calls and the right outcome class. Wrong tool, wrong call shape, or wrong outcome = fail.
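The shape of an eval case and its pass/fail check can be sketched as follows. This is an illustrative sketch only; the field names (`signal_payload`, `context_bundle`, and so on) and the `grade` helper are assumptions, not the actual Fitness schema:

```python
from dataclasses import dataclass

# Hypothetical eval-case shape; field names are illustrative,
# not the real Fitness schema.
@dataclass
class EvalCase:
    signal_payload: dict       # the synthetic incident signal
    context_bundle: dict       # context injected alongside the signal
    expected_tool_calls: list  # ordered (tool_name, args) pairs
    expected_outcome: str      # outcome class, e.g. "remediate"

def grade(case: EvalCase, actual_tool_calls: list, actual_outcome: str) -> bool:
    """Pass requires the right tool calls AND the right outcome class;
    wrong tool, wrong call shape, or wrong outcome fails."""
    return (actual_tool_calls == case.expected_tool_calls
            and actual_outcome == case.expected_outcome)

case = EvalCase(
    signal_payload={"alert": "disk_full", "host": "db-1"},
    context_bundle={"runbook": "disk_full.md"},
    expected_tool_calls=[("run_command", {"cmd": "df -h"})],
    expected_outcome="remediate",
)
print(grade(case, [("run_command", {"cmd": "df -h"})], "remediate"))  # True
print(grade(case, [("page_oncall", {})], "escalate"))                 # False
```

An exact-match comparison like this is the strictest grading policy; a real runner might instead match on call shape rather than literal arguments.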
When you open a PR that changes an agent's prompt or schema, CI runs that agent's eval set against the new version. Failing evals block the merge. Passing means safe to ship. The gate has caught hundreds of "this prompt looks better but breaks 3 cases" PRs across our customer base.
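The gate itself reduces to a simple contract with CI: run every case in the agent's eval set and exit non-zero if any case fails. A minimal sketch, assuming a hypothetical `run_eval_case` stand-in for the sandboxed runner:

```python
# Hypothetical CI gate sketch. run_eval_case is a stand-in for the
# sandboxed runner; nothing here is the actual Fitness API.
def run_eval_case(case_id: str) -> bool:
    # Stand-in: would inject the synthetic incident into a sandboxed
    # agent and grade the captured tool calls and outcome.
    return case_id != "case-broken"

def gate(case_ids: list) -> int:
    """Return the CI exit code: 0 lets the merge through, 1 blocks it."""
    failures = [c for c in case_ids if not run_eval_case(c)]
    for c in failures:
        print(f"FAIL {c}")
    return 1 if failures else 0

print(gate(["case-1", "case-2"]))       # 0: all pass, merge allowed
print(gate(["case-1", "case-broken"]))  # 1: a regression blocks the merge
```

Wiring this into CI is then just marking the eval job as a required status check on the PR.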
Even when prompts do not change, the eval set runs daily. Provider model updates, infrastructure changes, and library bumps can all silently shift agent behavior. Daily runs catch these silent regressions before customers do, and each failure fires a notification to the agent owner.
When a real incident reveals a gap in the agent's capability, the postmortem builder offers to harvest the incident as a new eval case. One click captures the signal, the context, the agent's actual behavior, and the corrected expected behavior. Future regressions on that class are caught.
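Harvesting amounts to copying the signal and context from the incident record as-is, while swapping the agent's actual (flawed) behavior for the corrected expected behavior. A sketch under assumed field names; `harvest` and the incident record layout are hypothetical:

```python
# Hypothetical one-click harvest: turn a real incident record into a
# new eval case. Field names are illustrative, not the real schema.
def harvest(incident: dict, corrected_outcome: str, corrected_tool_calls: list) -> dict:
    """Keep the signal and context verbatim; replace the agent's actual
    behavior with the corrected expected behavior."""
    return {
        "signal_payload": incident["signal"],
        "context_bundle": incident["context"],
        "expected_tool_calls": corrected_tool_calls,
        "expected_outcome": corrected_outcome,
        "origin": incident["id"],  # traceability back to the postmortem
    }

incident = {
    "id": "INC-2041",
    "signal": {"alert": "latency_spike", "service": "checkout"},
    "context": {"deploys_last_hour": 1},
    "actual_outcome": "ignored",  # the gap the postmortem revealed
}
case = harvest(incident, "rollback", [("rollback_deploy", {"service": "checkout"})])
print(case["expected_outcome"])  # rollback
```

Keeping an `origin` pointer back to the source incident makes it easy to audit why a given eval case exists.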
You would not ship code without tests. Fitness gives prompts the same protection.