Feature: Eval Harness

Testing framework.

Overview

The Nova eval harness is the testing framework that gates AI model changes. A single benchmark measures capability at one point in time; the harness instead provides continuous quality assurance by gating every model change against per-task suites that match the actual workload.

Testing framework. Evaluation runs per task and per model, and catches regressions before promotion.
Per-task suites. Triage, correlation, and summarisation each have their own eval suite, matching the actual use cases.
Pass/fail thresholds. Each task has its own quality bar, and these bars act as gates that block under-performing models.
CI integration plus regression detection. Eval runs in CI before model promotion, and per-version comparison catches quality drops.
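
As a minimal sketch of how per-task thresholds could gate a promotion: the task names below mirror the suites listed above, but the threshold values, the TaskResult type, and the gate function are illustrative assumptions rather than the Nova harness's actual interface.

```python
from dataclasses import dataclass

# Hypothetical per-task quality bars; the values are illustrative only.
THRESHOLDS = {"triage": 0.90, "correlation": 0.85, "summarisation": 0.80}


@dataclass(frozen=True)
class TaskResult:
    task: str
    score: float  # fraction of eval cases passed, in [0, 1]


def gate(results: list[TaskResult], thresholds: dict[str, float]) -> list[str]:
    """Return the tasks that block promotion (an empty list means promote)."""
    scores = {r.task: r.score for r in results}
    failures = []
    for task, bar in thresholds.items():
        # A missing suite result counts as a failure: no eval, no promotion.
        if task not in scores or scores[task] < bar:
            failures.append(task)
    return failures
```

Treating a missing result as a failure is a design choice in this sketch: it keeps a model from being promoted simply because one of its suites never ran.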

The approach

The practical approach has five parts: per-task suites that match each capability, threshold gating that blocks under-performing promotions, CI integration that runs eval on every model change, per-version regression detection, and documented per-task eval criteria. This discipline replaces vibes-based promotion with predictable AI quality.

Per-task suites. Each AI capability has its own eval suite, built to match its workload.
Threshold gating. A model that scores below a task’s threshold is not promoted; the gate enforces the quality bar.
CI integration. Eval runs on every model change, catching regressions before deploy.
Regression detection plus documented suite. Per-version comparison reveals drift, and the per-task eval criteria are committed for operations to reference.
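
Per-version comparison in a CI step could look roughly like the following. This is a sketch under stated assumptions: the MAX_DROP tolerance, the function names, and the score dictionaries are hypothetical and not taken from the Nova harness.

```python
import sys

# Hypothetical tolerance: a drop larger than this counts as a regression.
MAX_DROP = 0.02


def find_regressions(baseline, candidate, max_drop=MAX_DROP):
    """Compare per-task scores between two model versions.

    Returns {task: (baseline_score, candidate_score)} for each task whose
    score fell by more than max_drop, or which the candidate did not run.
    """
    regressions = {}
    for task, old in baseline.items():
        new = candidate.get(task)
        if new is None or old - new > max_drop:
            regressions[task] = (old, new)
    return regressions


def ci_exit_code(baseline, candidate):
    """Return 0 when the candidate may promote, 1 when CI should fail."""
    regressions = find_regressions(baseline, candidate)
    for task, (old, new) in sorted(regressions.items()):
        print(f"REGRESSION {task}: {old} -> {new}", file=sys.stderr)
    return 1 if regressions else 0
```

A CI job could end with sys.exit(ci_exit_code(baseline, candidate)), so that a non-zero exit status fails the pipeline and blocks the promotion.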

Why this compounds

Eval harness discipline compounds across model changes. Every model that passes evaluation preserves quality, the team’s AI engineering skill grows, and new capabilities inherit the existing eval framework.

Better quality. Eval-gated upgrades keep user-visible quality consistent.
Better release safety. Regression detection catches quality drops before release, reducing the incidents they would cause.
Better investment targeting. Eval results reveal where to focus, so data drives the next round of model work.
Institutional knowledge. Each eval run teaches the team how models behave, building its AI engineering muscle.

Eval harness discipline pays off over years. Nova AI Ops invests in model quality as a first-class engineering surface.