Feature: Eval Harness
Testing framework.
Overview
The Nova eval harness is the testing framework that gates AI model changes. A single benchmark measures capability at one point in time; the harness provides continuous quality assurance by gating every model change against per-task suites that match the actual workload.
- Testing framework. Evaluation runs per task and per model, and catches regressions before promotion.
- Per-task suites. Triage, correlation, and summarisation each have their own eval suite, matching the actual use cases.
- Pass/fail thresholds. Each task has its own quality bar; the resulting gates block under-performing models.
- CI integration plus regression detection. Eval runs in CI before model promotion; per-version comparison catches quality drops.
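A minimal sketch of per-task threshold gating, to make the pass/fail idea concrete. The task names come from the list above; the threshold values and the names THRESHOLDS, GateResult, gate, and can_promote are hypothetical illustrations, not Nova's actual configuration or API.

```python
from dataclasses import dataclass

# Hypothetical per-task quality bars. The numbers are invented for illustration.
THRESHOLDS = {"triage": 0.90, "correlation": 0.85, "summarisation": 0.80}


@dataclass(frozen=True)
class GateResult:
    task: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # A task passes when its score meets or exceeds its bar.
        return self.score >= self.threshold


def gate(scores, thresholds=None):
    """Compare each task's score to its bar.

    A task with no reported score is treated as scoring 0.0, so a
    missing eval run fails the gate rather than slipping through.
    """
    if thresholds is None:
        thresholds = THRESHOLDS
    return [
        GateResult(task, scores.get(task, 0.0), bar)
        for task, bar in thresholds.items()
    ]


def can_promote(results):
    """A model promotes only if every task clears its bar."""
    return all(r.passed for r in results)
```

With these assumed bars, a candidate scoring 0.93 on triage, 0.88 on correlation, and 0.75 on summarisation would be blocked, because summarisation falls below its 0.80 bar even though the other two tasks pass.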
The approach
The practical approach has five parts: per-task suites match each capability, threshold gating blocks under-performing promotions, CI integration runs eval on every change, regression detection compares versions, and per-task eval criteria are documented. This discipline produces predictable AI quality rather than vibes-based promotion.
- Per-task suites. Each AI capability has its own eval; the suite matches the workload.
- Threshold gating. Models below threshold do not promote; the gate enforces quality.
- CI integration. Eval runs on every model change and catches regressions before deploy.
- Regression detection plus documented suite. Per-version comparison reveals drift; the per-task eval criteria are committed for operations.
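The per-version comparison could look roughly like the sketch below, which a CI job might run before promotion. The tolerance MAX_DROP and the function names find_regressions and ci_exit_code are assumptions made for illustration; the source does not specify how Nova measures drift or what tolerance it allows.

```python
# Hypothetical tolerance: a score drop larger than this counts as a regression.
MAX_DROP = 0.02


def find_regressions(baseline, candidate, max_drop=MAX_DROP):
    """Return {task: drop} for each task whose score fell by more than max_drop.

    baseline and candidate map task name to score for the previous and
    the new model version. A task missing from candidate is scored 0.0.
    """
    regressions = {}
    for task, old_score in baseline.items():
        drop = old_score - candidate.get(task, 0.0)
        if drop > max_drop:
            regressions[task] = round(drop, 4)
    return regressions


def ci_exit_code(baseline, candidate):
    """Print each regression and return a process exit code (0 = no regressions)."""
    regressions = find_regressions(baseline, candidate)
    for task, drop in sorted(regressions.items()):
        print(f"REGRESSION {task}: score dropped by {drop}")
    return 1 if regressions else 0
```

A CI step could pass the return value of ci_exit_code to sys.exit so that a non-zero code fails the pipeline. Comparing against the previous version, in addition to the fixed thresholds, catches a model that still clears its bar but is measurably worse than the one it replaces.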
Why this compounds
Eval harness discipline compounds across model changes. Each eval-gated model change preserves quality, the team’s AI engineering skill grows, and new capabilities inherit the eval framework.
- Better quality. Eval-gated upgrades keep user-visible quality consistent.
- Better release safety. Regression detection catches quality drops before release, reducing the incidents they cause.
- Better investment targeting. Eval results reveal where to focus; the data drives the next round of model work.
- Institutional knowledge. Each eval teaches the team how models behave; the team’s AI engineering muscle grows.
Eval harness discipline pays off across years. Nova AI Ops invests in model quality as a first-class engineering surface.