AI & ML Advanced By Samson Tanimawo, PhD Published Aug 19, 2026 7 min read

Fine-Tuning Llama and Mistral for Domain Tasks

The open-weight base models are great. They’re also generic. A few thousand domain examples and a weekend of fine-tuning turn them into your tool, not Meta’s.

When to fine-tune

Fine-tune when prompting + retrieval cannot reliably produce the behavior you need. Common triggers: very specific output formats (medical coding, structured invoice extraction), specialised vocabulary (legal contracts, scientific papers), tone consistency (your brand voice), or low-latency single-pass inference (no time for chain-of-thought).

The "can prompting solve it" check. Spend a week trying to make a base model + prompt + retrieval do the task. Iterate prompts; add few-shot examples; test retrieval. If you can get to 90%+ accuracy with prompting alone, prompting wins, it's faster to iterate, easier to maintain, and doesn't require a model artifact.

The fine-tune signals. Prompting plateaus at 70-80% on accuracy. Long prompts (5K+ tokens) needed for adequate behaviour. Specific output format that prompting can't reliably enforce. These are the cases where fine-tune makes the cost-benefit math work.

The economics. Fine-tuning a 7B-13B model on LoRA costs $100-500 in compute. Inference savings at production volume (fewer tokens per call from a smaller, behavior-aligned model) typically pay back the training cost within weeks for high-volume use cases.

The timing. Fine-tune AFTER you've validated the use case with prompting. Fine-tune optimises a known-good behavior; it doesn't discover what behavior to want. Fine-tuning before you know the desired behavior just bakes in early-prototype mistakes.

Dataset preparation

500-5000 high-quality examples beat 50,000 mediocre ones. Curate aggressively. Examples should cover the long tail of cases you actually see in production, not just the easy cases. Include adversarial examples (cases the model gets wrong before fine-tune) so the model specifically learns to handle them.

The curation principle. Each example should be one you'd happily show as "here's how the model should behave." Examples that confuse you confuse the model. Examples with errors teach errors. Curation is the bulk of fine-tune work; it pays back many-fold over training-time.

The diversity principle. Cover the use case's distribution. Common cases AND edge cases. Easy cases AND hard ones. Multiple phrasings of similar requests. The model learns generalisation from diversity; uniform datasets produce narrow models.

The format consistency. If outputs need a specific format (JSON schema, structured response), every example should follow it exactly. Inconsistent formats teach the model "format doesn't matter"; consistent formats teach the format. Format is what fine-tune is good at; don't waste the opportunity.

The held-out test set. Reserve 10-20% as held-out, never train on these. Use them to evaluate. Without a held-out set, you're flying blind: training metrics tell you about training data, not generalisation.

LoRA recipe

LoRA (Low-Rank Adaptation) is the workhorse. Train small adapter matrices instead of full weights. 100-1000x cheaper than full fine-tune; quality is competitive for most tasks. Defaults: rank 16, alpha 32, dropout 0.1, learning rate 1e-4, 3-5 epochs. Tune from there based on eval. Use QLoRA (quantised LoRA) for tight GPU budgets, same recipe, 4-bit quantisation cuts memory by 4x.

The rank choice. Rank 4-8 for narrow tasks (classification, simple extraction). Rank 16-32 for medium tasks (domain-specific generation). Rank 64+ for broad behavior changes. Higher rank = more capacity but more compute and overfit risk.

The learning rate choice. 1e-4 is the safe default for most LoRA fine-tuning. Too high (1e-3) overfits fast; too low (1e-5) underfits. Sweep 5e-5, 1e-4, 5e-4 if defaults don't produce good results; pick the one with best held-out performance.

The epoch choice. 3-5 epochs is typical. More epochs risk overfitting on small datasets. Watch the held-out metric, if it plateaus or drops while training metric improves, stop. The "early stopping" discipline is core to good fine-tuning.

The QLoRA reality. 4-bit base model + LoRA adapters lets you fine-tune 70B models on a single A100. Quality loss vs full-precision LoRA is small (1-2 percentage points typically). For most teams, QLoRA is the right default.

Eval

You need an automated eval BEFORE you start training. The eval is what tells you whether each training run improved or regressed. Without an eval, fine-tuning is gambling. Eval should test the actual production behavior on held-out examples, not training-set similarity, not BLEU scores, the actual task.

The eval design. Define the task; define the metric; collect held-out examples; build a script that scores model outputs. The script becomes the iteration tool, every time you train, you run eval, you compare to baseline, you decide if it's worth keeping.

The metric choice. Match the metric to the task. Classification: accuracy or F1. Extraction: field-by-field exact match. Generation: LLM-judge or human review on a sample. Don't use BLEU/ROUGE for tasks where they're known-noisy; pick metrics that correlate with downstream success.

The baseline comparison. Always compare fine-tuned vs base model + best prompt. Sometimes the fine-tune doesn't actually beat a well-prompted base; if so, ship the prompt and skip the fine-tune. The comparison is what prevents shipping fine-tunes that don't add value.

The regression check. After fine-tuning, the model should still be good at general tasks (not just the fine-tune target). Run a small general-capability eval; verify the fine-tune didn't catastrophically degrade base ability. Catastrophic forgetting is a known fine-tune risk.

Deploy

Quantise to int8 or 4-bit for inference. Run on smallest GPU that fits (often a single L4 or A10 for 7B, A100 for 13B). Serve via vLLM or similar, naive HuggingFace transformers inference is 5-10x slower. Load test before production; predict your throughput; provision capacity accordingly.

The quantisation choice. Int8 is the safe default, 2x throughput improvement, negligible quality loss. Int4 is 3-4x throughput improvement with measurable quality loss (1-3 percentage points typical). For latency-sensitive use, int8; for cost-sensitive, int4.

The serving stack. vLLM for high-throughput batched serving. TensorRT-LLM for lowest-latency serving on NVIDIA hardware. Triton for multi-model serving with mixed traffic. Choose based on your traffic profile; don't over-engineer if traffic is modest.

The capacity planning. Measure tokens-per-second on a representative load test. Production traffic should run at 60-70% utilisation peak; provision accordingly. Headroom matters because spikes happen and saturated servers have terrible latency.

The rollback path. Keep the previous model version deployable. Fine-tune updates are not always improvements; the ability to roll back in minutes is what makes iteration safe. Without a rollback path, every deploy is a high-risk operation.

Four mistakes

  1. Fine-tuning before exhausting prompt engineering. Prompting is faster to iterate; fine-tune only after prompting plateaus.
  2. Mixing training and eval data. Inflated metrics; production performance disappoints. Strict held-out is non-negotiable.
  3. Tuning hyperparameters without proper eval. "It seemed better" isn't proof. Eval first, tune second.
  4. Skipping the regression check. Catastrophic forgetting silently degrades general ability. Test general capability after every fine-tune.

Common antipatterns

One-shot fine-tune as a panacea. Fine-tune is a tool, not magic. If prompting + retrieval is at 60%, fine-tune likely gets to 75%, not 95%. Match expectations to capability.

Fine-tuning on raw user data. User data has errors, PII, and inconsistency. Curate first; never train on raw production data.

No version control for training data. Datasets evolve; reproducibility requires git-like versioning. DVC or similar tools are the right primitive.

Training repeatedly without eval. Without eval-driven iteration, you're optimising in the dark. Eval is the steering wheel; without it, training runs go anywhere.

What to do this week

Three moves. (1) For a use case where prompting plateaus, build the held-out eval first. The eval will tell you whether fine-tune is worth pursuing. (2) Run a LoRA pilot on 500 curated examples; compare against best-prompted baseline; let the numbers decide. (3) If the pilot wins, plan the deployment infrastructure (quantisation, serving stack, rollback) before scaling the training. Production deployment is half the work; don't underestimate it.