Model Promotion: A Canary Ramp That Works in Production
5%, 25%, 50%, 100%. The ramp that catches regressions before they hit everyone, with the metric thresholds that gate each step.
The ramp
The model promotion ramp has four stages. 5% for 24 hours catches loud regressions; 25% for 48 hours catches subtler ones, with sample sizes large enough for statistically significant comparisons; 50% for 48 hours is the final validation; 100% promotes the new model, and the old one stays warm for 7 days for fast rollback.
- Stage 1: 5% for 24 hours. Catches loud regressions (latency, errors) and obvious quality drops.
- Stage 2: 25% for 48 hours. Catches subtler regressions; the sample size is large enough for statistically significant comparisons.
- Stage 3: 50% for 48 hours. Final validation; if metrics hold, promote to 100%.
- Stage 4: 100% with warm rollback. The new model is live; old model stays warm for 7 days for rollback.
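The schedule above can be written down as data, so the ramp controller and the documentation cannot drift apart. The following is a minimal sketch only: the names `RampStage`, `RAMP`, `next_stage`, and `hours_before_full_promotion` are hypothetical, and only the percentages and durations come from the stages listed above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RampStage:
    name: str
    traffic_percent: int
    hold_hours: Optional[int]  # None means no fixed hold (the final stage)
    purpose: str


# Percentages and hold times are taken from the ramp described above.
RAMP = (
    RampStage("stage-1", 5, 24, "loud regressions: latency, errors, obvious quality drops"),
    RampStage("stage-2", 25, 48, "subtler regressions at larger sample sizes"),
    RampStage("stage-3", 50, 48, "final validation"),
    RampStage("stage-4", 100, None, "new model live; previous model kept warm"),
)

# How long the previous model stays warm after promotion to 100%.
WARM_ROLLBACK_DAYS = 7


def next_stage(current: Optional[RampStage]) -> Optional[RampStage]:
    """Return the stage after `current`, or None once the ramp is complete."""
    if current is None:
        return RAMP[0]
    idx = RAMP.index(current)
    return RAMP[idx + 1] if idx + 1 < len(RAMP) else None


def hours_before_full_promotion() -> int:
    """Minimum hours spent in partial-traffic stages before reaching 100%."""
    return sum(s.hold_hours for s in RAMP if s.hold_hours is not None)
```

With these numbers, a ramp that never aborts spends 120 hours (5 days) at partial traffic before the new model takes 100%.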
Metric gates per stage
Each stage gates on four metrics. Latency p99 cannot regress more than 10% vs the previous model; error rate cannot regress at all; quality (eval score) cannot regress more than 2 percentage points; cost can grow up to 15% without explicit approval.
- Latency p99 gate. No more than 10% regression vs the previous model.
- Error rate gate. Zero regression allowed; the strictest gate.
- Quality gate. Eval score within 2 percentage points of previous model on the standard suite.
- Cost gate. Up to 15% growth allowed; beyond that requires explicit approval.
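As an illustration, the four gates can be expressed as one comparison between the previous model's metrics and the canary's. This is a sketch under assumptions: `ModelMetrics`, `failed_gates`, the field names, and the cost unit are hypothetical, and only the thresholds (10%, zero, 2 points, 15%) come from the gates above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ModelMetrics:
    latency_p99_ms: float
    error_rate: float    # fraction of requests, e.g. 0.002
    eval_score: float    # percent on the standard suite, 0-100
    cost_per_1k: float   # hypothetical unit: cost per 1,000 requests


MAX_LATENCY_REGRESSION = 0.10   # p99 may be at most 10% worse
MAX_QUALITY_DROP_POINTS = 2.0   # eval score may drop at most 2 points
MAX_COST_GROWTH = 0.15          # cost may grow 15% without approval


def failed_gates(baseline: ModelMetrics, canary: ModelMetrics,
                 cost_approved: bool = False) -> List[str]:
    """Return the names of the gates the canary fails; empty means proceed."""
    failures = []
    if canary.latency_p99_ms > baseline.latency_p99_ms * (1 + MAX_LATENCY_REGRESSION):
        failures.append("latency_p99")
    # Strictest gate: any increase in error rate fails.
    if canary.error_rate > baseline.error_rate:
        failures.append("error_rate")
    if baseline.eval_score - canary.eval_score > MAX_QUALITY_DROP_POINTS:
        failures.append("quality")
    # Cost beyond the allowance fails unless someone has explicitly approved it.
    if (canary.cost_per_1k > baseline.cost_per_1k * (1 + MAX_COST_GROWTH)
            and not cost_approved):
        failures.append("cost")
    return failures
```

Note that the error-rate check here compares raw rates. On small canary samples a raw comparison is sensitive to noise, so a production gate would likely need some statistical treatment, which this sketch leaves out.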
Aborting the ramp
Any gate failure halts the ramp and fires alerts. On-call then rolls back with a single command, and the warm previous model takes over instantly. Aborts are deliberately loud, and each one gets a postmortem that documents which gate failed, what the data showed, and what the fix is. Most aborts come from latency or cost regressions rather than quality: latency and cost regressions are immediately visible, while quality regressions are subtle.
- Gate failure aborts. Ramp halts, alerts fire, on-call rolls back; the rollback is one command.
- Warm previous model. Takes over instantly during rollback; the safety net.
- Loud aborts plus postmortem. Which gate, what data, what fix; the abort feeds learning.
- Latency and cost dominate. Most aborts are latency or cost regressions, not quality; quality regressions are subtle.
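The abort path can be sketched as two small functions: one that halts, alerts, and captures what the postmortem needs, and one that is the single rollback command. Everything named here (`AbortRecord`, `abort_ramp`, `rollback`, and the injected callables `halt_ramp`, `send_alert`, `set_canary_percent`) is hypothetical; a real deployment would wire these to its own traffic router and paging system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List


@dataclass
class AbortRecord:
    """What the postmortem starts from: which gate, what data, what fix."""
    stage: str
    failed_gates: List[str]
    metrics_snapshot: Dict[str, float]
    aborted_at: str
    fix: str = ""  # filled in during the postmortem


def abort_ramp(stage: str,
               failed_gates: List[str],
               metrics_snapshot: Dict[str, float],
               halt_ramp: Callable[[], None],
               send_alert: Callable[[str], None]) -> AbortRecord:
    """Halt the ramp, alert loudly, and record the evidence."""
    if not failed_gates:
        raise ValueError("abort_ramp called without a failed gate")
    halt_ramp()
    send_alert(f"Ramp aborted at {stage}: failed gates = {', '.join(failed_gates)}")
    return AbortRecord(
        stage=stage,
        failed_gates=list(failed_gates),
        metrics_snapshot=dict(metrics_snapshot),
        aborted_at=datetime.now(timezone.utc).isoformat(),
    )


def rollback(set_canary_percent: Callable[[int], None]) -> None:
    """The one-command rollback: all traffic returns to the warm previous model."""
    set_canary_percent(0)
```

Keeping `rollback` separate from `abort_ramp` mirrors the flow above, where the halt and alert are automatic and the rollback is a single step taken by on-call.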
Eval coverage during ramp
Eval coverage spans the whole ramp. Before the ramp, the full eval suite must pass with no exceptions. During the ramp, a subset of evals runs hourly on canary traffic to confirm that live behavior matches the offline eval. After the ramp, the full suite runs again at 100%, and the release is documented with eval results from before and after.
- Pre-ramp full eval. Full eval suite passes; no exceptions.
- During-ramp hourly subset. Subset of evals on canary traffic; confirms ramp matches offline eval.
- Post-ramp full eval. Full eval suite at 100%; release documented with evals before and after.
- Per-stage eval artifact. Each stage produces a stored eval result; supports investigation and audit.
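One way to picture the per-stage artifact and the hourly check is a small record written to disk, plus a comparison against the pre-ramp result. This is a sketch under assumptions: `EvalArtifact`, `store_artifact`, `drift_from_offline`, the JSON layout, and the 2-point tolerance used to decide whether canary and offline results "match" are all hypothetical and not specified above.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Dict


@dataclass(frozen=True)
class EvalArtifact:
    model_version: str
    stage: str               # e.g. "pre-ramp", "stage-1", "post-ramp"
    suite: str               # "full" or "hourly-subset"
    scores: Dict[str, float]  # eval name -> score in percent
    recorded_at: str         # filesystem-safe timestamp, e.g. "20240101T010000Z"


def store_artifact(artifact: EvalArtifact, root: Path) -> Path:
    """Write one artifact as JSON so it can be found later for audits."""
    path = root / artifact.model_version / f"{artifact.stage}-{artifact.recorded_at}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(asdict(artifact), indent=2, sort_keys=True))
    return path


def drift_from_offline(offline: EvalArtifact, canary: EvalArtifact,
                       tolerance_points: float = 2.0) -> Dict[str, float]:
    """Return evals present in both artifacts whose canary score differs
    from the offline score by more than `tolerance_points` (assumed value)."""
    drift = {}
    for name, canary_score in canary.scores.items():
        if name in offline.scores:
            delta = canary_score - offline.scores[name]
            if abs(delta) > tolerance_points:
                drift[name] = delta
    return drift
```

An empty result from `drift_from_offline` means the hourly subset agrees with the offline eval within the chosen tolerance; a non-empty result names the evals to investigate, with the stored artifacts as evidence.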