Model Versioning at Scale
Once you have more than three deployed models, naming conventions and folder layouts stop scaling. A real model registry becomes table stakes.
The four parts of a model registry
- Storage: model weights and tokenisers. Usually S3, GCS, or the registry's own blob storage.
- Metadata: structured info about each version (creator, date, lineage, eval scores, status).
- API: programmatic access for upload, download, query, transition.
- UI: human view for review, comparison, promotion.
MLflow Model Registry, W&B Models, SageMaker Model Registry, and Hugging Face Hub all provide these. Rolling your own on top of S3 + Postgres is fine for small teams; rebuilding your own UI is rarely worth it.
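To make the storage/metadata split concrete, here is a minimal sketch of the self-rolled approach: weights live at a blob URI, and a single relational table holds everything else. The table layout and the `s3://models/...` path are illustrative, not any particular registry's schema (sqlite stands in for Postgres to keep the example self-contained):

```python
import sqlite3

# Blob storage holds the weights; a relational table holds the metadata.
# The "API" part of the registry is then a thin query layer over this table.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE model_versions (
        name        TEXT NOT NULL,
        version     TEXT NOT NULL,
        weights_uri TEXT NOT NULL,            -- e.g. s3://models/sentiment/v1.4/
        stage       TEXT NOT NULL DEFAULT 'dev',
        created_at  TEXT NOT NULL,
        PRIMARY KEY (name, version)
    )
""")
conn.execute(
    "INSERT INTO model_versions VALUES (?, ?, ?, ?, ?)",
    ("sentiment", "v1.4", "s3://models/sentiment/v1.4/", "dev",
     "2025-09-08T12:00:00Z"),
)
# "Which weights should the dev environment load?" is one indexed lookup.
row = conn.execute(
    "SELECT weights_uri FROM model_versions WHERE name = ? AND stage = ?",
    ("sentiment", "dev"),
).fetchone()
print(row[0])
```

The UI is the piece this sketch deliberately omits, which is exactly why the hosted options earn their keep.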
Metadata that matters
Per model version, capture:
- Version identifier (semantic versioning preferred over hashes for human use).
- Parent: which version was this fine-tuned from?
- Training dataset hash + version.
- Hyperparameters used.
- Eval scores from your standard eval set.
- Stage: dev / staging / prod / archived.
- Promoter and timestamp.
- Promotion notes (why this version, what changed).
This metadata is what lets you answer “why is the production model behaving differently than last month’s?” six months later when nobody remembers.
The promotion workflow
A typical promotion path:
- Train run completes. Outputs version v1.4-rc1 with eval scores.
- Auto-promote to dev stage if eval scores beat current dev.
- Human review: a team member promotes to staging after spot-checking.
- Canary: staging serves 5% of production traffic. If metrics hold, promote to prod.
- Prod serves 100%. Old prod version moves to archived.
Each transition is a record. You can rewind any time: pull the version that was prod 3 weeks ago and serve it.
Model vs dataset versioning
They’re different problems. Models are immutable artefacts; datasets are usually large, partially-overlapping, and append-only.
For datasets, three approaches:
- DVC: git-like versioning of large files. Open-source, free, works on any storage backend.
- Lakehouse table versioning (Iceberg, Delta Lake): native table versions in your data warehouse. Best for SQL-native workflows.
- Hash + metadata: roll your own. Hash the dataset on creation; record (model, dataset hash, training timestamp). Smallest infrastructure footprint.
The model registry should always link back to the dataset version used. “Model v1.4 was trained on dataset 2025-09-08” is the answer to most production-vs-staging discrepancy questions.
Three pitfalls
- Manual stage transitions. Engineers click buttons, forget to update the registry, prod is on something the registry says is staging. Automate.
- Silent re-training. A scheduled job retrains and re-deploys without anyone reviewing. The model drifts; nobody knows why. Always require human approval for prod transitions.
- Storage bloat. Models accumulate. After two years you’re paying for hundreds of versions you’ll never use. Set retention: keep last 10 prod versions, archive the rest to cold storage.
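The retention rule in the last pitfall reduces to a one-line split. A sketch, assuming `versions` is ordered newest-first and the keep-10 policy from the text; the function name is hypothetical:

```python
def retention_plan(versions: list[str], keep: int = 10) -> tuple[list[str], list[str]]:
    """Split past prod versions into (keep warm, move to cold storage)."""
    return versions[:keep], versions[keep:]

# Hypothetical history: v1.30 is newest, v1.1 oldest.
past_prod = [f"v1.{i}" for i in range(30, 0, -1)]
warm, cold = retention_plan(past_prod, keep=10)
```

Run it from a scheduled job against the registry's archived stage and the bloat never accumulates in the first place.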