Experiment Tracking with MLflow and Weights & Biases
Two tools dominate. Both work. The right choice depends on your team size, your collaboration model, and how much you trust SaaS with your training data.
What experiment tracking does
Per training run, log: hyperparameters, dataset version, metrics over time, output artefacts (the model, plots, predictions), git commit, and environment. Make all of this searchable, comparable, and shareable.
Without tracking, you don’t know which run produced your best model. Or how to reproduce it. Or which hyperparameter change caused last week’s regression. Tracking is the substrate everything else builds on.
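The fields above need nothing fancier than one record per run; a minimal, tool-agnostic sketch (the file name and `log_run` helper are illustrative, not from any library):

```python
import json
import subprocess
import sys
import time


def log_run(path, params, metrics, artifacts):
    """Append one run record, with enough context to reproduce it, to a JSONL file."""
    record = {
        "timestamp": time.time(),
        "params": params,                  # hyperparameters
        "metrics": metrics,                # final or per-step metric values
        "artifacts": artifacts,            # paths to the model, plots, predictions
        "python": sys.version.split()[0],  # minimal environment snapshot
    }
    try:
        # Record the git commit so the run can be reproduced later.
        record["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        record["git_commit"] = None
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record


rec = log_run("runs.jsonl", {"lr": 3e-4}, {"val_loss": 0.21}, ["model.pt"])
```

This is exactly what the dedicated tools do, plus a UI, search, and comparison views on top.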
MLflow
Open-source, self-hostable, lightweight. Runs can be stored in local files or SQLite to start, or backed by Postgres with artefacts in S3. Web UI for browsing runs.
Strengths: free, no SaaS dependency, easy to start (pip install mlflow and a one-liner per run), strong sklearn / PyTorch / TensorFlow integration.
Weaknesses: UI is functional but dated. Team collaboration features are basic. No live training-curve dashboards out of the box (you watch via UI refresh).
Weights & Biases (W&B)
Commercial SaaS (with a generous free tier and academic plan). Hosted by default; self-hosting is available on the enterprise plan.
Strengths: best-in-class UI, real-time training dashboards, hyperparameter sweeps built in, strong team features (sharing, comments, reports), model registry and table views.
Weaknesses: cost at large scale, SaaS dependency, and more lock-in than the open-source options if you later want to migrate.
Other options worth knowing
- Neptune: similar to W&B, often more flexible for unusual logging shapes.
- Comet: similar to W&B, strong on enterprise compliance.
- Aim: open-source W&B-style UI on top of your own storage.
- TensorBoard: still works for simple cases, especially TensorFlow-heavy. Doesn’t cover artefacts or non-metric logging well.
How to pick
If you’re a single person or 2-3 person team starting out: MLflow. Self-hosted, no cost, easy to migrate later.
If you’re a 5-50 person team that values dashboards and team workflows: W&B. The collaboration features pay back quickly.
If you have data-residency or compliance requirements: MLflow self-hosted, or Neptune/Comet enterprise.
If you already use Databricks: MLflow integrates natively. Stay there.
The wrong move is to pick neither and try to track in spreadsheets. Within three months that breaks down. Pick one, instrument it, move on. The choice is reversible later; not having tracking at all is the expensive mistake.