Active Learning at Scale
Active learning means the model picks which examples to label next. Done well, it can cut labelling cost 5-10x with little or no accuracy loss.
The idea
You have 1M unlabelled examples. You can afford to label 10K. Random sampling is wasteful: many examples are easy and add little. Active learning lets the model pick the 10K examples that will most improve it.
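The simplest version of "the model picks" is: score every pool example by the model's predictive uncertainty and label the top of the ranking. A minimal sketch, assuming we already have class probabilities for the pool (the names `select_most_uncertain`, `pool_probs`, and `budget` are illustrative, not from the text):

```python
import numpy as np

def select_most_uncertain(probs: np.ndarray, budget: int) -> np.ndarray:
    """Return indices of the `budget` highest-entropy predictions."""
    # Predictive entropy: high when probability mass is spread across classes.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Take the top-`budget` entropies (order within the batch is arbitrary).
    return np.argsort(entropy)[-budget:]

# Toy pool: 1000 unlabelled examples, 3 classes, random probabilities.
rng = np.random.default_rng(0)
pool_probs = rng.dirichlet([1.0, 1.0, 1.0], size=1000)
picked = select_most_uncertain(pool_probs, budget=10)
```

The same ranking idea scales to the 1M/10K setting in the text: score the whole pool, send the top 10K to labellers.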
Query strategies
- Uncertainty sampling: pick the examples the current model is most uncertain about.
- Diversity sampling: pick examples that cover the input space evenly.
- Expected gradient length: pick examples that would most change the model if labelled.
- Hybrid: combine uncertainty with diversity to avoid querying clusters of similar hard examples.
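The hybrid strategy above can be sketched as a two-stage filter: shortlist by uncertainty, then enforce diversity within the shortlist so near-duplicate hard examples are not all labelled. This is one plausible implementation under assumptions, using greedy farthest-point selection on embeddings (the names `hybrid_query`, `pool_factor`, and the uncertainty/embedding inputs are hypothetical):

```python
import numpy as np

def hybrid_query(embeddings: np.ndarray, uncertainty: np.ndarray,
                 budget: int, pool_factor: int = 5) -> np.ndarray:
    """Pick `budget` examples that are both uncertain and spread out."""
    # Stage 1 (uncertainty): keep the pool_factor * budget most uncertain.
    candidates = np.argsort(uncertainty)[-pool_factor * budget:]
    # Stage 2 (diversity): greedy farthest-point selection among candidates,
    # starting from the single most uncertain one.
    chosen = [candidates[np.argmax(uncertainty[candidates])]]
    while len(chosen) < budget:
        # Distance from each candidate to its nearest already-chosen point.
        diffs = embeddings[candidates][:, None] - embeddings[np.array(chosen)][None]
        nearest = np.min(np.linalg.norm(diffs, axis=-1), axis=1)
        # Add the candidate farthest from everything chosen so far.
        chosen.append(candidates[np.argmax(nearest)])
    return np.array(chosen)

rng = np.random.default_rng(1)
emb = rng.normal(size=(200, 4))   # toy embeddings
unc = rng.random(200)             # toy uncertainty scores
q = hybrid_query(emb, unc, budget=10)
```

Already-chosen points have distance zero to themselves, so the farthest-point step never re-selects them.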
In practice
The 5-10x reduction is real on text and image classification. Active learning is underused because it requires a labelling pipeline that supports incremental queries: humans label a batch, the model retrains, and it requests the next batch. Most teams batch-label everything up front and pay the full cost. If labelling is on your critical path, this is worth fixing.
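The incremental loop described above (label, retrain, request the next batch) can be sketched end to end on synthetic data. This is a toy sketch under assumptions, not a production pipeline: a nearest-centroid classifier stands in for the model, margin-based uncertainty stands in for the query strategy, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic pool: two Gaussian blobs, 500 points each.
X = np.vstack([rng.normal(-2, 1, (500, 2)), rng.normal(2, 1, (500, 2))])
y = np.array([0] * 500 + [1] * 500)

# Small seed set with both classes represented.
labelled = list(rng.choice(500, 5, replace=False)) + \
           list(500 + rng.choice(500, 5, replace=False))
pool = [i for i in range(1000) if i not in set(labelled)]

def fit_centroids(idx):
    """'Retrain': one centroid per class from the labelled set."""
    return np.stack([X[[i for i in idx if y[i] == c]].mean(axis=0)
                     for c in (0, 1)])

for _ in range(5):  # five query rounds
    centroids = fit_centroids(labelled)
    # Margin uncertainty: small gap between distances to the two centroids.
    d = np.linalg.norm(X[pool][:, None] - centroids[None], axis=-1)
    margin = np.abs(d[:, 0] - d[:, 1])
    query = [pool[i] for i in np.argsort(margin)[:20]]  # "label" 20 per round
    labelled += query
    pool = [i for i in pool if i not in query]

centroids = fit_centroids(labelled)
pred = np.argmin(np.linalg.norm(X[:, None] - centroids[None], axis=-1), axis=1)
accuracy = (pred == y).mean()
```

After five rounds only 110 of 1000 points are labelled; the rest of the pool never needs human attention, which is the cost saving the section describes.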