AI Agent Operations

Every agent has a test suite, because production is no place to discover regressions.

Agent Fitness gives every agent its own eval set: a curated list of synthetic incidents and the expected agent behavior. Before any prompt change ships, the eval runs. Pass means ship. Fail means investigate. The eval grows over time from real incidents that taught the agent something.

Get Started · Talk to Sales
  • Per-agent eval suite
  • Pre-merge gate on prompt PRs
  • Real-world cases harvested from incidents
  • Daily baseline runs
The Eval Set

Synthetic incidents with expected behavior

Each eval is a synthetic incident (signal payload, context bundle, expected tool calls, expected outcome). The runner injects the synthetic incident into the agent in a sandbox, captures what the agent does, and compares it against the expected behavior. Pass requires the right tool calls and the right outcome class; a wrong tool, wrong call shape, or wrong outcome is a fail. A minimal sketch of these shapes follows the list below.

  • Signal + context bundle: each eval ships a full incident-shaped payload so the agent runs as it would in prod
  • Expected tool calls: pass requires that the agent call the right tools; a wrong tool, even if "successful," is a fail
  • Expected outcome class: pass also requires the post-action verifier to mark the outcome in the expected class
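
As a rough sketch, an eval case and its pass check could look like the following. Every field and function name here is illustrative, not Agent Fitness's actual schema:

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    # Illustrative shape only; not the real Agent Fitness schema.
    signal: dict                 # incident-shaped payload injected into the agent
    context: dict                # context bundle the agent would see in prod
    expected_tools: list[dict]   # ordered calls: {"tool": name, "args": {...}}
    expected_outcome: str        # outcome class the post-action verifier must report

def passes(case: EvalCase, actual_tools: list[dict], actual_outcome: str) -> bool:
    # A wrong tool or a wrong call shape fails, even if the call "succeeded".
    if len(actual_tools) != len(case.expected_tools):
        return False
    for want, got in zip(case.expected_tools, actual_tools):
        if got["tool"] != want["tool"] or got["args"] != want["args"]:
            return False
    # The post-action verifier must also place the outcome in the expected class.
    return actual_outcome == case.expected_outcome
```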
Pre-Merge Gate

No prompt PR ships without a green eval

When you open a PR that changes an agent's prompt or schema, CI runs that agent's eval set against the new version. Failing evals block the merge. Passing means safe to ship. The gate has caught hundreds of "this prompt looks better but breaks 3 cases" PRs across our customer base.

  • CI integration: works with GitHub, GitLab, Buildkite, and Bitbucket via the same plugin
  • Required check: a failing eval blocks the merge, a passing eval allows it; configurable per repo (see the sketch after this list)
  • Diff-aware reporting: the PR comment shows which eval cases passed before and now fail, or vice versa
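
As a minimal sketch of the required-check logic, assume the CI step gets per-case pass/fail results for both the PR branch and the base branch. All names below are hypothetical, not the actual plugin's API:

```python
import sys

def gate(pr_results: dict[str, bool], base_results: dict[str, bool]) -> int:
    """Hypothetical gate step: a nonzero exit fails the required check."""
    # Diff-aware report: cases that flipped between base and PR.
    for case, ok in pr_results.items():
        was_ok = base_results.get(case)
        if was_ok and not ok:
            print(f"REGRESSION: {case} passed on base, fails on this PR")
        elif was_ok is False and ok:
            print(f"FIXED: {case} failed on base, passes on this PR")
    # Any failing eval blocks the merge.
    return 0 if all(pr_results.values()) else 1

if __name__ == "__main__":
    base = {"disk-full-restart": True, "oom-kill-scale": True}
    pr = {"disk-full-restart": True, "oom-kill-scale": False}  # one regression
    sys.exit(gate(pr, base))  # exits 1, so CI marks the check as failed
```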
Daily Baseline

Same prompt, same outcome, every day

Even when prompts do not change, the eval runs daily. Provider model changes, infrastructure changes, library bumps: any of these can silently shift agent behavior. Daily runs catch the silent regressions before customers do. Failures fire a notification to the agent owner.

  • Daily run, no prompt change required: the goal is catching upstream drift, not just tracking your own changes
  • Notification on regression: agent owner gets pinged when an eval that was passing yesterday fails today
  • Auto-bisect: when a regression occurs, the runner bisects across model, library, and data fixture to narrow the cause (a sketch follows this list)
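
One way to picture the bisect is one-dimension-at-a-time isolation: start from yesterday's known-good environment and flip each dimension to today's value. This is a sketch under that assumption; the actual runner may be more sophisticated:

```python
from typing import Callable

def isolate_regression(run_eval: Callable[[dict], bool],
                       yesterday: dict[str, str],
                       today: dict[str, str]) -> list[str]:
    """Sketch: flip one dimension at a time from yesterday's passing
    environment toward today's, and report which flips break the evals."""
    culprits = []
    for dim in yesterday:
        trial = dict(yesterday)
        trial[dim] = today[dim]   # change only this dimension
        if not run_eval(trial):
            culprits.append(dim)  # this change alone reproduces the failure
    return culprits

# Toy example: a runner that fails whenever the provider model is "v42".
blame = isolate_regression(
    lambda env: env["model"] != "v42",
    yesterday={"model": "v41", "library": "2.3.0", "fixture": "f-118"},
    today={"model": "v42", "library": "2.3.0", "fixture": "f-118"},
)
print(blame)  # ['model']
```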
Harvest from Incidents

Lessons learned become test cases

When a real incident reveals a gap in the agent's capability, the postmortem builder offers to harvest the incident as a new eval case. One click captures the signal, the context, the agent's actual behavior, and the corrected expected behavior. Future regressions on that class of incident are caught.

  • One-click harvest: from postmortem to eval case in one click; a small editor lets you tweak the expectation (sketched after this list)
  • Curated, not auto-added: humans review every harvest; the eval set stays high-signal
  • Annotated: every harvested case carries a link to the source incident, so context never gets lost
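
As a sketch of what a harvested case could carry, assuming the reviewer supplies the corrected expectation during the editing step (field names illustrative):

```python
from dataclasses import dataclass

@dataclass
class Incident:
    id: str
    signal: dict
    context: dict

def harvest(incident: Incident,
            corrected_tools: list[dict],
            corrected_outcome: str) -> dict:
    """Sketch: turn a reviewed incident into an eval case. The corrected
    expectation comes from the human review; nothing is auto-added."""
    return {
        "signal": incident.signal,            # captured as-is from the incident
        "context": incident.context,
        "expected_tools": corrected_tools,    # what the agent should have done
        "expected_outcome": corrected_outcome,
        "source_incident": incident.id,       # annotation linking back to the source
    }
```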
Video walkthrough coming soon

Subscribe to Nova AI Ops on YouTube for demos, tutorials, and feature deep-dives.

A test suite for prompts

You would not ship code without tests. Agent Fitness gives prompts the same protection.

Get Started · Request a Demo