Agent Fitness gives every agent its own eval set: a curated list of synthetic incidents and the expected agent behavior. Before any prompt change ships, the eval runs. Pass means ship. Fail means investigate. The eval grows over time from real incidents that taught the agent something.
Each eval is a synthetic incident (signal payload, context bundle, expected tool calls, expected outcome). The runner injects the synthetic into the agent in a sandbox, captures what the agent does, and compares against the expected behavior. Pass requires the right tool calls and the right outcome class. Wrong tool, wrong call shape, or wrong outcome = fail.
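The shape of an eval case and its pass/fail check can be sketched as follows. This is an illustrative sketch only; the field names (`signal_payload`, `context_bundle`, and so on) and the `grade` helper are assumptions, not the actual Fitness schema:

```python
from dataclasses import dataclass

# Hypothetical eval-case shape; field names are illustrative,
# not the real Fitness schema.
@dataclass
class EvalCase:
    signal_payload: dict       # the synthetic incident signal
    context_bundle: dict       # context injected alongside the signal
    expected_tool_calls: list  # ordered (tool_name, args) pairs
    expected_outcome: str      # outcome class, e.g. "remediate"

def grade(case: EvalCase, actual_tool_calls: list, actual_outcome: str) -> bool:
    """Pass requires the right tool calls AND the right outcome class;
    wrong tool, wrong call shape, or wrong outcome fails."""
    return (actual_tool_calls == case.expected_tool_calls
            and actual_outcome == case.expected_outcome)

case = EvalCase(
    signal_payload={"alert": "disk_full", "host": "db-1"},
    context_bundle={"runbook": "disk_full.md"},
    expected_tool_calls=[("run_command", {"cmd": "df -h"})],
    expected_outcome="remediate",
)
print(grade(case, [("run_command", {"cmd": "df -h"})], "remediate"))  # True
print(grade(case, [("page_oncall", {})], "escalate"))                 # False
```

An exact-match comparison like this is the strictest grading policy; a real runner might instead match on call shape rather than literal arguments.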
When you open a PR that changes an agent's prompt or schema, CI runs that agent's eval set against the new version. Failing evals block the merge. Passing means safe to ship. The gate has caught hundreds of "this prompt looks better but breaks 3 cases" PRs across our customer base.
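The gate itself reduces to a simple contract with CI: run every case in the agent's eval set and exit non-zero if any case fails. A minimal sketch, assuming a hypothetical `run_eval_case` stand-in for the sandboxed runner:

```python
# Hypothetical CI gate sketch. run_eval_case is a stand-in for the
# sandboxed runner; nothing here is the actual Fitness API.
def run_eval_case(case_id: str) -> bool:
    # Stand-in: would inject the synthetic incident into a sandboxed
    # agent and grade the captured tool calls and outcome.
    return case_id != "case-broken"

def gate(case_ids: list) -> int:
    """Return the CI exit code: 0 lets the merge through, 1 blocks it."""
    failures = [c for c in case_ids if not run_eval_case(c)]
    for c in failures:
        print(f"FAIL {c}")
    return 1 if failures else 0

print(gate(["case-1", "case-2"]))       # 0: all pass, merge allowed
print(gate(["case-1", "case-broken"]))  # 1: a regression blocks the merge
```

Wiring this into CI is then just marking the eval job as a required status check on the PR.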
Even when prompts do not change, the eval set runs daily. Provider model updates, infrastructure changes, and library bumps can all silently shift agent behavior. Daily runs catch these silent regressions before customers do, and each failure fires a notification to the agent owner.
When a real incident reveals a gap in the agent's capability, the postmortem builder offers to harvest the incident as a new eval case. One click captures the signal, the context, the agent's actual behavior, and the corrected expected behavior. Future regressions on that class are caught.
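Harvesting amounts to copying the signal and context from the incident record as-is, while swapping the agent's actual (flawed) behavior for the corrected expected behavior. A sketch under assumed field names; `harvest` and the incident record layout are hypothetical:

```python
# Hypothetical one-click harvest: turn a real incident record into a
# new eval case. Field names are illustrative, not the real schema.
def harvest(incident: dict, corrected_outcome: str, corrected_tool_calls: list) -> dict:
    """Keep the signal and context verbatim; replace the agent's actual
    behavior with the corrected expected behavior."""
    return {
        "signal_payload": incident["signal"],
        "context_bundle": incident["context"],
        "expected_tool_calls": corrected_tool_calls,
        "expected_outcome": corrected_outcome,
        "origin": incident["id"],  # traceability back to the postmortem
    }

incident = {
    "id": "INC-2041",
    "signal": {"alert": "latency_spike", "service": "checkout"},
    "context": {"deploys_last_hour": 1},
    "actual_outcome": "ignored",  # the gap the postmortem revealed
}
case = harvest(incident, "rollback", [("rollback_deploy", {"service": "checkout"})])
print(case["expected_outcome"])  # rollback
```

Keeping an `origin` pointer back to the source incident makes it easy to audit why a given eval case exists.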
You would not ship code without tests. Fitness gives prompts the same protection.