AI & ML · Intermediate · By Samson Tanimawo, PhD · Published Sep 30, 2025

Model Versioning at Scale

Once you have more than three deployed models, naming conventions and folder layouts stop scaling. A real model registry becomes table stakes.

The four parts of a model registry

A registry needs four pieces:

  - Artifact storage for the immutable model files themselves (object storage such as S3).
  - A metadata store with structured records about each version (a database such as Postgres).
  - A stage and promotion API: named stages (dev, staging, prod, archived) and an auditable way to move versions between them.
  - A UI for browsing versions, comparing metrics, and seeing what is deployed where.

MLflow Model Registry, W&B Models, SageMaker Model Registry, and the Hugging Face Hub all provide these. Self-rolling on top of S3 + Postgres is fine for small teams; rebuilding your own UI is rarely worth it.
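For a team self-rolling as described, the storage and metadata pieces might be sketched as follows. SQLite stands in for Postgres, the `artifact_uri` stands in for an S3 key, and all table and function names are illustrative, not a specific product's API:

```python
import sqlite3

# In-memory SQLite as a stand-in for Postgres; artifact_uri would be an S3 key.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE model_versions (
        name         TEXT NOT NULL,
        version      TEXT NOT NULL,
        stage        TEXT NOT NULL DEFAULT 'none',
        artifact_uri TEXT NOT NULL,
        PRIMARY KEY (name, version)
    )
""")

def register(name: str, version: str, artifact_uri: str) -> None:
    """Record a new, immutable model version in the registry."""
    conn.execute(
        "INSERT INTO model_versions (name, version, artifact_uri) VALUES (?, ?, ?)",
        (name, version, artifact_uri),
    )

def set_stage(name: str, version: str, stage: str) -> None:
    """Move a version to a named stage (dev / staging / prod / archived)."""
    conn.execute(
        "UPDATE model_versions SET stage = ? WHERE name = ? AND version = ?",
        (stage, name, version),
    )

register("ranker", "v1.4-rc1", "s3://models/ranker/v1.4-rc1")
set_stage("ranker", "v1.4-rc1", "dev")
```

The promotion API and UI are the parts that grow beyond a sketch like this, which is exactly why rebuilding the UI is rarely worth it.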

Metadata that matters

Per model version, capture:

  - the dataset version (or content hash) it was trained on
  - the git commit of the training code
  - hyperparameters and training configuration
  - evaluation metrics, and which eval set produced them
  - the training environment: library versions, hardware
  - who (or which pipeline) trained it, and when

This metadata is what lets you answer “why is the production model behaving differently than last month’s?” six months later when nobody remembers.
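A minimal sketch of such a record, with illustrative field names to adapt to whatever your registry actually stores:

```python
from dataclasses import dataclass

@dataclass(frozen=True)  # frozen: the record is immutable, like the artifact
class ModelVersionMeta:
    name: str
    version: str
    dataset_version: str    # e.g. a date-stamped dataset snapshot
    git_commit: str         # the training code that produced the artifact
    hyperparameters: dict
    eval_metrics: dict      # e.g. {"auc": 0.91}
    trained_by: str         # person or pipeline identity
    trained_at: str         # ISO timestamp

meta = ModelVersionMeta(
    name="ranker",
    version="v1.4-rc1",
    dataset_version="2025-09-08",
    git_commit="9f3c2ab",
    hyperparameters={"lr": 3e-4, "epochs": 10},
    eval_metrics={"auc": 0.91},
    trained_by="ci-bot",
    trained_at="2025-09-12T08:00:00Z",
)
```

Six months later, `meta.git_commit` and `meta.dataset_version` are usually the two fields that answer the question.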

The promotion workflow

A typical promotion path:

  1. Train run completes. Outputs version v1.4-rc1 with eval scores.
  2. Auto-promote to dev stage if eval scores beat current dev.
  3. Human review: a team member promotes to staging after spot-checking.
  4. Canary: staging serves 5% of production traffic. If metrics hold, promote to prod.
  5. Prod serves 100%. Old prod version moves to archived.

Each transition is a record in the registry. You can rewind at any time: pull the version that was in prod three weeks ago and serve it again.

Model vs dataset versioning

They’re different problems. Models are immutable artefacts; datasets are usually large, partially-overlapping, and append-only.

For datasets, three approaches:

  - Full snapshots: copy the whole dataset at each version. Simple and unambiguous, but expensive once datasets are large.
  - Content-addressed versioning: tools such as DVC or lakeFS store each file once and version lightweight pointers, which suits partially-overlapping datasets.
  - Append-only partitions: never mutate existing data, only add date-stamped partitions; a dataset version is then just a manifest of partitions.

The model registry should always link back to the dataset version used. “Model v1.4 was trained on dataset 2025-09-08” is the answer to most production-vs-staging discrepancy questions.
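That linkage makes the discrepancy question mechanical. A toy sketch (the registry dict and helper are hypothetical stand-ins for a real registry lookup):

```python
# Hypothetical registry records linking model versions to dataset versions.
registry = {
    ("ranker", "v1.3"): {"dataset_version": "2025-08-11"},
    ("ranker", "v1.4"): {"dataset_version": "2025-09-08"},
}

def explain_discrepancy(name: str, prod_ver: str, staging_ver: str) -> str:
    """First question when prod and staging disagree: same training data?"""
    d_prod = registry[(name, prod_ver)]["dataset_version"]
    d_stg = registry[(name, staging_ver)]["dataset_version"]
    if d_prod == d_stg:
        return "same dataset; look at code, hyperparameters, or serving config"
    return f"different datasets: prod={d_prod}, staging={d_stg}"
```

Here `explain_discrepancy("ranker", "v1.3", "v1.4")` immediately surfaces that the two versions saw different data.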

Three pitfalls