MLOps: 12 Things You'll Wish You Built Earlier
Every team that ships ML in production hits the same operational gaps within 12 months. Building these now saves your future self a quarter of fire-fighting.
The 12 things
- Experiment tracking: every training run logged with hyperparameters, metrics, and artifacts.
- Data versioning: every dataset has a version pinned to training runs.
- Model registry: every model has a version, lineage, and stage (dev / staging / prod).
- Reproducible training: any past run can be rerun bit-exact with at most a week of effort.
- Eval harness: standardised eval scripts run against every candidate model.
- Pre-deploy gates: model cannot promote without passing eval thresholds.
- Canary deployment: new model serves a fraction of traffic first.
- Production monitoring: live metrics on prediction quality, not just latency.
- Drift detection: alarms when input distribution shifts from training distribution.
- Rollback path: previous model version one click away.
- Audit trail: who deployed what when, with what reasoning.
- Cost dashboard: $ per prediction visible to the team that owns the model.
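Several items on this list (registry, lineage, pre-deploy gates, audit trail) reduce to keeping one structured record per model version. A minimal sketch in Python; the field names are hypothetical, not a standard:

```python
# Hypothetical registry record: the field names are illustrative.
record = {
    "model_id": "churn-clf",          # illustrative model name
    "version": "1.4.0",
    "stage": "staging",               # dev / staging / prod
    "dataset_version": "2024-06-01",  # data versioning: pin the training data
    "training_run": "run-8f3c",       # lineage back to the experiment tracker
    "eval": {"auc": 0.91, "passed": True},
    "deployed_by": "alice",           # audit trail: who shipped it
}

def can_promote(rec: dict) -> bool:
    """Pre-deploy gate: refuse promotion unless evals passed."""
    return rec["eval"]["passed"]

print(can_promote(record))  # True
```

The point is not the schema; it's that one record per version, written at training time, gives you lineage and gating almost for free.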
In what order
Most teams build them in roughly this order:
- Experiment tracking + model registry. Without these, nothing else works.
- Eval harness + pre-deploy gates. Stops obvious regressions.
- Production monitoring + canary deployment. Catches real-world failures.
- Rollback + audit trail. For when something goes wrong.
- Drift detection + data versioning. For long-running stability.
- Cost dashboard + reproducibility. For mature optimisation.
Build incrementally. Each step pays for itself before the next is needed.
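Drift detection, when its turn comes, can start very small. A plain-Python sketch of a Population Stability Index check; the 0.2 alarm threshold is a common rule of thumb, not a law, so tune it for your data:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and live inputs.
    Rule of thumb (an assumption to tune): PSI > 0.2 suggests drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0
    def fractions(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / width), bins - 1)
            counts[max(i, 0)] += 1
        # Smooth empty bins so the log is defined.
        return [(c or 0.5) / len(xs) for c in counts]
    p, q = fractions(expected), fractions(actual)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))

train = [float(x) for x in range(100)]
print(psi(train, train))                        # 0.0: no drift
print(psi(train, [x + 50.0 for x in train]))    # large: input shifted
```

Run it per feature on a daily sample of production inputs and alarm on the threshold; libraries like Evidently do the same thing with more statistics behind it.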
The lite version (small team)
If you have one ML engineer and a quarter, here’s a minimum viable MLOps:
- MLflow or Weights & Biases for experiment tracking.
- S3 + JSON metadata for the model registry.
- A Python script that runs evals and writes pass/fail to the registry.
- A deploy script that refuses to promote a model without eval=pass.
- Datadog (or your APM) tracking model_id alongside latency.
That covers five of the 12. It’s a week of work, and it shrinks your incident surface area dramatically.
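The deploy-script gate from the list above can be a few lines. A sketch, assuming the eval script writes an `eval` field into the registry metadata (the field names are illustrative):

```python
import json
import sys

def guard_promotion(metadata_json: str) -> None:
    """Deploy-script gate: abort unless the registry metadata says eval=pass.
    The metadata shape is an assumption -- match whatever your eval script writes."""
    meta = json.loads(metadata_json)
    if meta.get("eval") != "pass":
        # Nonzero exit stops the deploy pipeline cold.
        sys.exit(f"refusing to promote {meta.get('model_id')}: eval={meta.get('eval')!r}")

guard_promotion('{"model_id": "churn-clf-1.4.0", "eval": "pass"}')  # proceeds silently
```

Call this at the top of the deploy script, before anything touches serving infrastructure.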
Anti-patterns
Three patterns to avoid:
- One-off experiment notebooks. Untracked, unreproducible. Delete.
- Model files in git. Bloats the repo, no real versioning. Use a registry.
- Manual deploys. The path from “merged the change” to “serving the new model” should be automated. Manual deploys mean inconsistent state and slow rollbacks.
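The rollback and audit-trail points can be made concrete: treat “prod” as a pointer into the registry, and log every move. A toy in-memory sketch (in practice this state lives in your registry store, and the names are illustrative):

```python
# Rollback as a pointer swap: "prod" is just a reference to a model version.
registry = {
    "prod": "v1.4.0",
    "previous": "v1.3.2",
    "history": [],  # audit trail: (actor, action, version)
}

def deploy(version: str, actor: str) -> None:
    registry["previous"] = registry["prod"]
    registry["prod"] = version
    registry["history"].append((actor, "deploy", version))

def rollback(actor: str) -> None:
    """One-click rollback: swap prod back to the previous version."""
    registry["prod"], registry["previous"] = registry["previous"], registry["prod"]
    registry["history"].append((actor, "rollback", registry["prod"]))

deploy("v1.5.0", "alice")
rollback("alice")
print(registry["prod"])  # v1.4.0 -- back on the known-good version
```

Because every mutation goes through these two functions, the audit trail comes for free.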
What to build first this week
If you’ve done none of this: experiment tracking. MLflow takes an afternoon to set up. Suddenly every run is logged. The next week, build the model registry on top.
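If even MLflow feels like too much ceremony on day one, the core of experiment tracking is tiny: append one record per run. A stdlib sketch of the idea (MLflow’s `start_run`/`log_param`/`log_metric` give you the same thing plus a UI and storage backends):

```python
import json
import time
import uuid

def log_run(params: dict, metrics: dict, artifacts: list, path="runs.jsonl") -> str:
    """Minimal experiment tracker: one JSON line per training run."""
    run = {
        "run_id": uuid.uuid4().hex[:8],
        "timestamp": time.time(),
        "params": params,
        "metrics": metrics,
        "artifacts": artifacts,  # URIs of saved weights, plots, etc.
    }
    with open(path, "a") as f:
        f.write(json.dumps(run) + "\n")
    return run["run_id"]

# Illustrative values -- the artifact URI is a placeholder, not a real bucket.
run_id = log_run({"lr": 3e-4, "epochs": 10}, {"val_auc": 0.91},
                 ["s3://your-bucket/model.pt"])
```

Once every run hits this file, “which hyperparameters produced the model in prod?” becomes a `grep`, not an archaeology project.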
If you have tracking but no eval: write the eval script. Make it a CI step on the model repo. Block merges that fail evals.
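The eval gate can start as a script like this; the metric names and thresholds are stand-ins for whatever your harness actually measures:

```python
# Hypothetical thresholds -- pin them to your current prod model's numbers.
THRESHOLDS = {"auc": 0.85, "recall": 0.70}

def evals_pass(metrics: dict) -> bool:
    """True iff every tracked metric clears its threshold.
    In CI, exit nonzero on False to block the merge."""
    ok = True
    for name, floor in THRESHOLDS.items():
        value = metrics.get(name, 0.0)
        if value < floor:
            print(f"FAIL {name}: {value:.3f} < {floor}")
            ok = False
    return ok

print(evals_pass({"auc": 0.91, "recall": 0.74}))  # True: candidate clears both gates
```

Wire the boolean into the script’s exit code and make the CI job required; from then on, a regression cannot merge quietly.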
The rest grows from there. Each piece feeds the next.