LoRA and PEFT: Fine-Tuning at 1/1000th the Cost
Fully fine-tuning a 70B model costs more than many companies’ entire ML budget. Parameter-efficient fine-tuning gets roughly 90% of the result for roughly 0.1% of the cost. Here are the math and the mechanics.
Why full fine-tuning is unaffordable for most teams
A 70-billion-parameter model has 70 billion weights. Updating all of them during fine-tuning requires gradients for all of them, optimiser state for all of them, and forward-pass activations held until the backward pass. The memory math with Adam in mixed precision: bf16 weights (2 bytes per parameter), bf16 gradients (2), fp32 master weights (4), and two fp32 Adam moments (8) come to 16 bytes per parameter, and activations push the total to roughly 16-20x the parameter count in bytes during training.
For Llama-70B, that’s ~1.4 TB of GPU memory. You need a multi-node H100 cluster. Fine-tuning runs into the tens of thousands of dollars per experiment.
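The arithmetic behind those numbers is worth making explicit. A minimal sketch, assuming the 16-20 bytes-per-parameter breakdown above (the 20x figure budgets a few extra bytes per parameter for activations):

```python
# Back-of-the-envelope memory for full fine-tuning with Adam in mixed precision.
# Assumed per-parameter costs: bf16 weights (2 B) + bf16 gradients (2 B)
# + fp32 master weights (4 B) + two fp32 Adam moments (8 B) = 16 B,
# plus activation overhead -> the 16-20x rule of thumb.

def training_bytes(params, bytes_per_param=20):
    return params * bytes_per_param

total = training_bytes(70e9)  # Llama-70B
print(f"{total / 1e12:.1f} TB")  # -> 1.4 TB
```

At ~80 GB per H100, 1.4 TB of training state is an 18+ GPU job before you fit a single activation checkpointing trick.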
Parameter-efficient fine-tuning (PEFT) cuts this by 100-1000x by updating a tiny subset of parameters while keeping the base model frozen.
How LoRA actually works
LoRA (Low-Rank Adaptation) freezes the original weights and adds a small trainable update on top. For each weight matrix W of shape (d, k), it adds a rank-r decomposition: W_new = W + (α/r)·B·A, where B is (d, r), A is (r, k), and α is a scaling hyperparameter. B is initialised to zero, so training starts exactly at the base model. With r=8 or r=16, the factors are a tiny fraction of the original parameters.
For a 70B model with LoRA rank 16 applied to attention projections, you train ~50M parameters instead of 70B. Memory drops from 1.4 TB to ~50 GB. A single A100 or H100 fits the job.
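The decomposition can be sketched in a few lines of NumPy. The shapes below are illustrative (a Llama-style 4096-wide projection), not tied to any specific model:

```python
import numpy as np

# Minimal LoRA sketch: freeze W, train only the low-rank factors B and A.
d, k, r = 4096, 4096, 16
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen base weight
A = rng.standard_normal((r, k)) * 0.01   # trainable, small random init
B = np.zeros((d, r))                     # trainable, zero init -> B @ A = 0 at start
alpha = 32                               # scaling, typically 2x rank

def lora_forward(x):
    # Equivalent to x @ (W + (alpha / r) * B @ A).T, but never materialises W_new
    return x @ W.T + (alpha / r) * (x @ A.T) @ B.T

x = rng.standard_normal((1, k))
assert np.allclose(lora_forward(x), x @ W.T)  # B is zero, so output matches base

trainable = A.size + B.size
print(trainable, W.size)  # 131072 trainable vs 16777216 frozen: under 1%
```

Note the forward pass applies A then B to the activations rather than forming B·A, which keeps the extra compute proportional to r rather than d·k.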
The insight behind it: empirically, the difference between a base model and its fine-tuned version lives in a low-rank subspace of weight space. You don’t need to perturb every weight; you need to perturb the right small subspace.
QLoRA: cheaper still
QLoRA combines LoRA with 4-bit quantisation of the base model. The frozen base weights are stored in 4 bits instead of 16, cutting memory another 4x. The trainable LoRA adapters stay in higher precision (16-bit) so optimisation is stable.
QLoRA fits a 70B fine-tune on a single 48 GB GPU. The accuracy hit relative to full-precision LoRA is small (~1-2% on most benchmarks). For most production needs, this is the sweet spot.
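The QLoRA budget also reduces to simple arithmetic. A sketch, assuming the ~50M-adapter estimate above and ~16 bytes per trainable parameter for its training state:

```python
# Rough memory budget for a QLoRA fine-tune of a 70B model (illustrative numbers).
base_params = 70e9
frozen_4bit = base_params * 0.5      # 4 bits = 0.5 bytes per frozen base weight
adapter_params = 50e6                # LoRA adapters, trained in 16-bit
adapter_state = adapter_params * 16  # weights + grads + Adam state, ~16 B/param
total_gb = (frozen_4bit + adapter_state) / 1e9
print(f"{total_gb:.0f} GB")  # -> 36 GB of weights and state, before activations
```

That leaves headroom on a 48 GB card for activations and the KV cache during evaluation, which is why 70B QLoRA runs are feasible on a single GPU.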
DoRA and IA3 are newer variants. DoRA decomposes weight updates into magnitude + direction; IA3 multiplies activations by learned vectors. Both edge out vanilla LoRA on some benchmarks; both add complexity. LoRA + QLoRA is still the workhorse for 95% of cases.
Practical recipes
For a typical instruction-tuning fine-tune on a domain corpus:
- Rank: 16 for most tasks, 32 for harder transformations, 64 if you need quality approaching full fine-tuning.
- Alpha (scaling): typically 2x rank.
- Target modules: q_proj and v_proj at minimum; add k_proj, o_proj, and the MLP projections for more capacity.
- Learning rate: 1e-4 to 3e-4. Higher than full fine-tuning because you’re training fewer parameters.
- Epochs: 1-3. LoRAs overfit fast.
- Batch size: as large as memory allows; use gradient accumulation for effective batch size 64-128.
The PEFT library (Hugging Face) implements all of this with sane defaults. Don’t hand-roll.
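As a sketch, the recipe above maps onto PEFT’s LoraConfig roughly as follows. The checkpoint name is illustrative, the dropout value is an assumption (a common default, not from the recipe), and target_modules assumes Llama-style layer names:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative checkpoint; any causal LM with q_proj/v_proj layers works.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

config = LoraConfig(
    r=16,                                 # rank: 16 for most tasks
    lora_alpha=32,                        # alpha = 2x rank
    target_modules=["q_proj", "v_proj"],  # minimum; add k_proj, o_proj, MLP for capacity
    lora_dropout=0.05,                    # assumed value, not part of the recipe above
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # trainable params: a fraction of a percent
```

Training then proceeds with a standard Trainer loop at the learning rates above; only the adapter weights receive gradients.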
When LoRA isn’t enough
LoRA works for tasks adjacent to the base model’s capabilities: tone adjustment, format compliance, domain vocabulary. It struggles when:
- You need the model to learn fundamentally new behaviour (e.g., a new programming language not in pretraining).
- You need very deep changes (e.g., teaching an English-only model fluent French from scratch).
- The dataset is huge (millions of examples) and quality matters more than cost.
For these, full fine-tuning beats LoRA. The tradeoff is whether the absolute quality gap (often 1-3%) is worth the 100-1000x cost increase.
Shipping LoRAs in production
The big operational win: LoRAs are tiny (50-500 MB). You can serve dozens of LoRAs from a single base model in memory, loading the right one per request based on user, tenant, or task.
vLLM, TGI, and llama.cpp all support multi-LoRA serving. Latency overhead is minimal: switching LoRAs is a few hundred milliseconds compared to the seconds-to-minutes of model loading.
This pattern (one base model + many LoRAs) is how multi-tenant LLM platforms scale economically. Each customer gets their own fine-tune; only one expensive base model is in memory.
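The economics of the pattern can be sketched with toy matrices. The tenant names and shapes below are hypothetical; the point is one shared base weight plus per-tenant low-rank factors selected at request time:

```python
import numpy as np

# Toy multi-tenant serving sketch: one frozen base matrix in memory,
# one small (B, A) adapter pair per tenant, chosen per request.
d, k, r = 512, 512, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d, k))  # shared base weights, loaded once

adapters = {  # per-tenant LoRA factors (hypothetical tenants)
    tenant: (rng.standard_normal((d, r)) * 0.01, rng.standard_normal((r, k)) * 0.01)
    for tenant in ("acme", "globex")
}

def forward(x, tenant):
    B, A = adapters[tenant]  # swap adapters per request; no model reload
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((1, k))
assert forward(x, "acme").shape == (1, d)
adapter_size = sum(B.size + A.size for B, A in adapters.values())
print(adapter_size, W.size)  # both tenants' adapters together are ~6% of one W
```

Adding a tenant costs one small (B, A) pair, not another copy of the base model, which is exactly why per-customer fine-tunes stay affordable at scale.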