Fine-Tuning Llama and Mistral for Domain Tasks
The open-weight base models are great. They’re also generic. A few thousand domain examples and a weekend of fine-tuning turn them into your tool, not Meta’s.
When to fine-tune
Three signals you need fine-tuning, not just prompting:
- You can’t fit the necessary examples in a prompt.
- Domain vocabulary or formatting is highly specific (medical, legal, internal jargon).
- Volume is high enough that prompt overhead dominates per-call cost.
If those don’t apply, prompting + RAG is cheaper and faster.
Dataset preparation
Aim for 1,000-10,000 high-quality (input, ideal-output) pairs. Quality dominates quantity.
- Format pairs in the chat template the base model expects (Llama 3 instruct format, Mistral instruct format, etc.).
- Hold out 10% as eval. Never train on the eval set.
- Deduplicate. Near-duplicates that leak across the train/eval split inflate eval scores artificially.
- Spot-check labels. If 5% are wrong, the model learns 5% wrong things.
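The prep steps above can be sketched in plain Python. This is a minimal illustration, not a pipeline: the example pairs are invented, and the hand-written Llama 3 template string is an assumption — in practice, prefer the tokenizer's `apply_chat_template` so you get the authoritative format for your exact model.

```python
import json
import random

# Hypothetical raw pairs; in practice, load these from your labeling pipeline.
pairs = [
    {"input": "Summarise clause 4.2.", "output": "Clause 4.2 limits liability to fees paid."},
    {"input": "Summarise clause 4.2.", "output": "Clause 4.2 limits liability to fees paid."},  # duplicate
    {"input": "Define 'force majeure'.", "output": "An unforeseeable event excusing performance."},
]

def to_llama3_chat(example):
    # Llama 3 instruct template, written out by hand for illustration.
    # Check the model card / tokenizer.apply_chat_template for the real thing.
    return (
        "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
        f"{example['input']}<|eot_id|>"
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
        f"{example['output']}<|eot_id|>"
    )

# Deduplicate on normalised (input, output) text.
seen, deduped = set(), []
for ex in pairs:
    key = (ex["input"].strip().lower(), ex["output"].strip().lower())
    if key not in seen:
        seen.add(key)
        deduped.append(ex)

# Shuffle, then hold out 10% (at least one example) as eval. Never train on it.
random.seed(0)
random.shuffle(deduped)
cut = max(1, len(deduped) // 10)
eval_set, train_set = deduped[:cut], deduped[cut:]

with open("train.jsonl", "w") as f:
    for ex in train_set:
        f.write(json.dumps({"text": to_llama3_chat(ex)}) + "\n")
```

The JSONL-with-a-`text`-field layout is a common convention that SFT trainers accept, but any format your training library reads works.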
LoRA recipe
Reasonable defaults:
- Library: peft + trl from Hugging Face.
- Rank: 16. Alpha: 32. Target modules: q_proj, v_proj.
- Learning rate: 2e-4. Schedule: cosine with 10% warmup.
- Epochs: 2-3. Watch for overfit.
- Batch size: as large as memory allows; build a larger effective batch with gradient accumulation.
- Quantisation: 4-bit (QLoRA) for tight VRAM budgets.
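Wired together, the recipe looks roughly like this. A sketch only: it assumes recent peft/trl APIs (argument names shift between releases, so check the current docs), and the model id, output dir, and batch numbers are illustrative, not recommendations beyond the defaults listed above.

```python
# QLoRA fine-tune sketch: 4-bit base model + rank-16 LoRA on q_proj/v_proj.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # illustrative choice

# 4-bit quantisation (QLoRA) to fit tight VRAM budgets.
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb)

lora = LoraConfig(
    r=16, lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05, task_type="CAUSAL_LM",
)

# train.jsonl: {"text": "<chat-formatted pair>"} per line, from the prep step.
train_ds = load_dataset("json", data_files="train.jsonl")["train"]

args = SFTConfig(
    output_dir="out",
    dataset_text_field="text",
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    num_train_epochs=3,
    per_device_train_batch_size=4,   # raise until you run out of memory
    gradient_accumulation_steps=8,   # effective batch = 4 * 8 = 32
)

trainer = SFTTrainer(model=model, args=args, train_dataset=train_ds, peft_config=lora)
trainer.train()
```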
A 7B Llama LoRA fine-tune fits on a single 24 GB consumer GPU; a 70B QLoRA fine-tune fits on a single 80 GB H100.
Eval
Run the same benchmark before and after. If accuracy didn’t go up, the fine-tune is doing nothing useful (or worse). Common eval setups:
- Held-out test set, exact-match or BLEU.
- LLM-as-judge against a reference model.
- Human spot-check on 50 samples.
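The first setup — exact match on the held-out set, run before and after the fine-tune — fits in a few lines. The predictions below are invented to show the comparison shape; light normalisation (case, whitespace) is one choice among many and should match what your outputs actually vary on.

```python
def exact_match(predictions, references):
    """Fraction of predictions matching the reference after light normalisation."""
    def norm(s):
        return " ".join(s.strip().lower().split())
    hits = sum(norm(p) == norm(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical outputs on the held-out set, before and after fine-tuning.
refs        = ["clause 4.2 limits liability", "net 30 payment terms"]
base_preds  = ["the clause is about liability", "net 30 payment terms"]
tuned_preds = ["Clause 4.2 limits liability",  "net 30 payment terms"]

print(f"base:  {exact_match(base_preds, refs):.2f}")
print(f"tuned: {exact_match(tuned_preds, refs):.2f}")
```

If the tuned score isn't clearly above the base score on this same set, stop and investigate before deploying.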
Deploy
LoRA adapters are tiny (50-500 MB). Serve via vLLM with multi-LoRA support, or merge into the base weights for slightly faster inference. Per-tenant LoRAs let you serve customised models without N base-model copies.
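Multi-LoRA serving in vLLM looks roughly like this. A deployment sketch, not a tested config: it assumes vLLM's offline LoRA API (`enable_lora`, `LoRARequest`), and the adapter names and paths are invented — check the vLLM docs for your version.

```python
# One base model in GPU memory; per-tenant adapters attached per request.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # illustrative base
    enable_lora=True,
    max_loras=4,  # adapters resident at once
)
params = SamplingParams(max_tokens=256)

outputs = llm.generate(
    ["Summarise clause 4.2."],
    params,
    # (name, int id, path) -- tenant-a's adapter, hypothetical path.
    lora_request=LoRARequest("tenant-a", 1, "/adapters/tenant-a"),
)
print(outputs[0].outputs[0].text)
```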
Four mistakes
- Fine-tuning on too little data. Below 500 examples, you usually overfit. Get more data.
- Fine-tuning before pushing prompting hard. Fine-tuning is only worth it if a well-engineered prompt can't close the gap.
- Catastrophic forgetting. Aggressive fine-tuning damages general capability. Mix in a small fraction of generic instruction-following data.
- Not versioning the dataset. Three months later you can’t reproduce or compare. Treat training data with the same discipline as production code.
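The forgetting mitigation above — mixing a small fraction of generic instruction data into the domain set — can be sketched as a simple pre-training step. The 10% fraction is illustrative; tune it against your own eval of general capability.

```python
import random

def mix_datasets(domain, generic, generic_fraction=0.1, seed=0):
    """Blend generic instruction data into the domain set so it makes up
    roughly `generic_fraction` of the final mix, to hedge against
    catastrophic forgetting."""
    rng = random.Random(seed)
    # Solve n_generic / (len(domain) + n_generic) = generic_fraction.
    n_generic = int(len(domain) * generic_fraction / (1 - generic_fraction))
    mixed = domain + rng.sample(generic, min(n_generic, len(generic)))
    rng.shuffle(mixed)
    return mixed

# 900 domain examples + ~100 generic ones -> generic is ~10% of the mix.
domain_data = list(range(900))          # stand-ins for real examples
generic_pool = list(range(1000, 3000))
mixed = mix_datasets(domain_data, generic_pool)
```

Fixing the seed keeps the mix reproducible, which matters for the dataset-versioning point above.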