Model Promotion: A Canary Ramp That Works in Production
5%, 25%, 50%, 100%. The ramp that catches regressions before they hit everyone, with the metric thresholds that gate each step.
The ramp
Stage 1: 5% of traffic for 24 hours. Catches loud regressions (latency, errors) and obvious quality drops.
Stage 2: 25% for 48 hours. Catches subtler regressions; sample size large enough for stat-sig comparisons.
Stage 3: 50% for 48 hours. Final validation; if metrics hold, promote to 100%.
Stage 4: 100%. The new model is live; old model stays warm for 7 days for rollback.
Metric gates per stage
Latency p99 cannot regress more than 10% vs the previous model.
Error rate cannot regress at all.
Quality (eval score) cannot regress more than 2 percentage points vs the previous model on the standard suite.
Cost can grow up to 15%; beyond that requires explicit approval.
Aborting the ramp
Any gate fails: ramp halts, alerts fire, on-call rolls back. The rollback is one command; the warm previous model takes over.
Aborts are loud. Postmortem follows: which gate, what data, what fix.
Most aborts come from latency or cost regressions, not quality. Quality regressions are subtle; latency and cost are visible.
Eval coverage during ramp
Pre-ramp: full eval suite passes. No exceptions.
During ramp: subset of evals run hourly on canary traffic. Confirms the ramp matches the offline eval.
Post-ramp: full eval suite at 100%. The release is documented with evals before and after.