The Pre-Merge Eval Gate (For Code That Touches AI)
Code that touches AI features should not merge without an eval pass. This piece covers the gate itself, its latency budget, and the team behaviours it changes.
The gate
The gate runs on every AI-touching PR. The eval suite must pass, and a failure blocks the merge by default; an override is allowed but logged so the team can audit it. A latency budget of under 5 minutes per suite keeps the gate from becoming the thing engineers route around. A named owner per suite prevents stale or noisy evals from accumulating.
- Eval pass required on AI-touching PRs. Gate condition per PR. Triggers when the PR touches prompts, LLM calls, or related code paths.
- Failure blocks merge. Default block on eval failure. Override allowed but logged for audit.
- Latency budget under 5 minutes. Runtime cap per suite. Longer runs slow the team and erode the gate's credibility.
- Named owner per suite. Maintaining team per eval suite. Catches stale or noisy evals before they become CI pollution.
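The four rules above can be sketched as one decision function. This is a minimal illustration, not a reference implementation: `SuiteResult`, `Override`, `gate_decision`, and the audit-log shape are hypothetical names, and treating an over-budget suite as a flag for its owner (rather than a hard failure) is an assumption the text does not settle.

```python
from dataclasses import dataclass
from typing import Optional

# "Under 5 minutes per suite" from the text, expressed in seconds.
LATENCY_BUDGET_SECONDS = 5 * 60


@dataclass
class SuiteResult:
    name: str
    owner: str              # named owner of the suite (hypothetical field)
    passed: bool
    runtime_seconds: float


@dataclass
class Override:
    approver: str
    reason: str


def gate_decision(results: list[SuiteResult], audit_log: list[dict],
                  override: Optional[Override] = None) -> tuple[bool, list[str]]:
    """Return (allow_merge, notes) for one AI-touching PR."""
    notes = []
    for r in results:
        # Assumption: an over-budget suite is flagged to its owner, not failed.
        if r.runtime_seconds >= LATENCY_BUDGET_SECONDS:
            notes.append(f"{r.name}: over latency budget, notify {r.owner}")

    failed = [r for r in results if not r.passed]
    for r in failed:
        notes.append(f"{r.name}: eval failed, owner {r.owner}")

    if not failed:
        return True, notes
    if override is None:
        return False, notes          # default: failure blocks the merge

    # An override lets the merge through, but always leaves an audit record.
    audit_log.append({
        "approver": override.approver,
        "reason": override.reason,
        "failed_suites": [r.name for r in failed],
    })
    notes.append(f"override by {override.approver}")
    return True, notes
```

Making `audit_log` a required argument is deliberate in this sketch: an override cannot be applied without somewhere to record it.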
Scope
The gate is for AI behaviour changes, not pure infrastructure refactors. Prompts trigger it because they directly shape model output; files that call LLM APIs or parse model output trigger it because a behaviour change can land in either the call or the parsing. The reviewer enforces scope on each PR, and disagreements escalate to the suite owner.
- All prompt files. Gate trigger per prompt change. Prompts are the surface that most directly affects model output.
- LLM-call and output-parsing files. Both the call surface and the parsing surface trigger the gate. Behaviour changes can land in either.
- Pure infrastructure refactors excluded. No trigger for refactors that do not change AI behaviour. Gate stays focused.
- Reviewer enforces scope. Explicit scope check per PR. Disagreements escalate to the suite owner.
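A first-pass scope check can be automated from the list of changed paths, leaving the judgment calls to the reviewer. The sketch below assumes a hypothetical repository layout (`prompts/`, `src/llm/`, `src/parsers/`) and a hypothetical `scope_check` helper; the patterns would need to match the real repository. Path matching cannot tell whether a refactor changes AI behaviour, so a hit here only flags the PR for the reviewer's explicit scope check.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns; adjust to the actual repository layout.
# Note that "*" in fnmatch patterns also matches "/" characters.
TRIGGER_PATTERNS = {
    "prompt": ["prompts/*", "*.prompt"],
    "llm_call": ["src/llm/*"],
    "output_parsing": ["src/parsers/*"],
}


def scope_check(changed_paths: list[str]) -> dict[str, str]:
    """Map each gate-triggering path to the reason it triggers the gate."""
    hits = {}
    for path in changed_paths:
        for reason, patterns in TRIGGER_PATTERNS.items():
            if any(fnmatchcase(path, pattern) for pattern in patterns):
                hits[path] = reason
                break  # the first matching reason is enough to trigger
    return hits


changed = [
    "prompts/summarise.txt",
    "src/llm/client.py",
    "src/parsers/json_out.py",
    "infra/deploy.yaml",
]
hits = scope_check(changed)
gate_required = bool(hits)  # an empty dict means no trigger from paths alone
```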
What it changes
The gate changes engineering culture more than it changes any single PR. Evals stop being an afterthought because they are now the thing standing between the PR and merge. Regressions get caught pre-merge rather than in production. Quality compounds as the suite grows alongside features.
- Eval-aware design from the start. Engineers think about evals at design time, not at PR time. Evals stop being an afterthought.
- Regressions caught pre-merge. Eval-driven block stops regressions before production. The failure surfaces on the PR, not in front of users.
- Quality compounds. Eval-score movement is visible per PR. Suite grows alongside features; system gets more reliable.
- Quarterly eval-coverage audit. Eval-vs-feature coverage check per quarter. Catches blind spots in the suite.
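The quarterly audit reduces to a set difference between the features that exist and the features at least one suite exercises. A minimal sketch follows, assuming the team keeps a feature list and a suite-to-features mapping; `coverage_audit` and both data shapes are hypothetical.

```python
def coverage_audit(features: list[str],
                   suite_coverage: dict[str, set[str]]) -> tuple[list[str], float]:
    """Return (uncovered_features, coverage_ratio) for an eval-vs-feature audit."""
    feature_set = set(features)
    covered = set()
    for covered_features in suite_coverage.values():
        covered.update(covered_features)

    # Blind spots: features that no eval suite claims to exercise.
    uncovered = sorted(feature_set - covered)
    if not feature_set:
        return uncovered, 1.0
    ratio = 1 - len(uncovered) / len(feature_set)
    return uncovered, ratio


# Illustrative data only; feature and suite names are made up.
features = ["summarise", "extract", "classify", "translate"]
suite_coverage = {
    "summarisation_suite": {"summarise"},
    "extraction_suite": {"extract", "classify"},
}
uncovered, ratio = coverage_audit(features, suite_coverage)
```

With the illustrative data, `translate` is the only feature without a suite, which is the kind of blind spot the audit is meant to surface.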