AI & ML · Intermediate · By Samson Tanimawo, PhD · Published Jun 3, 2025 · 9 min read

LoRA and PEFT: Fine-Tuning at 1/1000th the Cost

Full fine-tuning a 70B model costs more than most companies’ entire ML budget. Parameter-efficient fine-tuning gets 90% of the result for 0.1% of the cost. Here is the math and the mechanics.

Why full fine-tuning is unaffordable for most teams

A 70-billion-parameter model has 70 billion weights. Updating all of them during fine-tuning requires gradients for all of them, optimiser state for all of them, and activations for all of them at backward-pass time. The memory math: roughly 16-20x the parameter count in bytes during training: 2 bytes for bf16 weights, 2 for bf16 gradients, 8 for Adam's fp32 moment estimates, and 4 for an fp32 master copy of the weights, before counting activations.
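The 16x floor falls out of simple addition. A back-of-envelope check (activation memory excluded, since it depends on batch size and sequence length; exact numbers vary by framework):

```python
# Rough per-parameter memory for full fine-tuning with AdamW in
# mixed precision (a common setup; exact accounting varies by framework).
params = 70e9  # 70B parameters

bytes_per_param = (
    2    # bf16 weights
    + 2  # bf16 gradients
    + 8  # AdamW first/second moments in fp32 (4 + 4)
    + 4  # fp32 master copy of the weights
)        # = 16 bytes/param, before activations

total_tb = params * bytes_per_param / 1e12
print(f"{bytes_per_param} bytes/param -> {total_tb:.2f} TB")  # 16 bytes/param -> 1.12 TB
```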

For Llama-70B, that’s ~1.4 TB of GPU memory. You need a multi-node H100 cluster. Fine-tuning runs into the tens of thousands of dollars per experiment.

Parameter-efficient fine-tuning (PEFT) cuts this by 100-1000x by updating a tiny subset of parameters while keeping the base model frozen.

How LoRA actually works

LoRA (Low-Rank Adaptation) freezes the original weights and adds a small trainable update on top. For each weight matrix W of shape (d, k), it adds a rank-r decomposition: W_new = W + B·A, where B is (d, r) and A is (r, k). B is initialised to zero, so training starts from the base model's exact behaviour. With r=8 or r=16, that's a tiny fraction of the original parameters.
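A minimal numpy sketch of the mechanics (shapes only; real implementations also scale B·A by alpha/r, which is omitted here):

```python
import numpy as np

d, k, r = 4096, 4096, 16                # weight matrix shape and LoRA rank
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))         # frozen base weight
A = rng.standard_normal((r, k)) * 0.01  # trainable, small random init
B = np.zeros((d, r))                    # trainable, zero init: W_new == W at step 0

def lora_forward(x):
    # Computes x @ (W + B @ A).T without materialising the (d, k) update
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((2, k))
assert np.allclose(lora_forward(x), x @ W.T)  # B is zero, so output matches base

full_params = d * k
lora_params = r * (d + k)
print(f"trainable fraction: {lora_params / full_params:.4%}")  # 0.7813%
```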

For a 70B model with LoRA rank 16 applied to attention projections, you train ~50M parameters instead of 70B. Gradient and optimiser memory collapses from over a terabyte to roughly 1 GB; what remains is the frozen 16-bit base (~140 GB) plus activations, so the job fits on two 80 GB A100s or H100s instead of a multi-node cluster.

The math behind it: empirically, the difference between a base model and a fine-tuned model lives in a low-rank subspace of weight space. You don’t need to perturb every weight; you need to perturb the right small subspace.
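The claim is easy to illustrate: if the fine-tuning delta ΔW really is (approximately) rank-r, then a rank-r SVD truncation recovers it, which is exactly the family of updates B·A can represent. A toy demonstration where the delta is constructed to be low-rank (an assumption for illustration; empirically, real deltas are only approximately so):

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, r = 512, 512, 8

# Simulate a fine-tuning delta that lives in a rank-r subspace
delta = rng.standard_normal((d, r)) @ rng.standard_normal((r, k))

U, s, Vt = np.linalg.svd(delta, full_matrices=False)
rank_r = (U[:, :r] * s[:r]) @ Vt[:r]   # best rank-r approximation (Eckart-Young)

rel_err = np.linalg.norm(delta - rank_r) / np.linalg.norm(delta)
print(f"relative error of rank-{r} reconstruction: {rel_err:.2e}")
assert rel_err < 1e-10  # exact up to floating point, since delta is rank-r
```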

QLoRA: cheaper still

QLoRA combines LoRA with 4-bit quantisation of the base model. The frozen base weights are stored in 4 bits instead of 16, cutting memory another 4x. The trainable LoRA adapters stay in higher precision (16-bit) so optimisation is stable.

QLoRA fits a 70B fine-tune on a single 48 GB GPU. The accuracy hit relative to full-precision LoRA is small (~1-2% on most benchmarks). For most production needs, this is the sweet spot.
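The 48 GB figure checks out with rough arithmetic (hedged: this ignores quantisation block metadata, activation memory, and CPU paging, all of which QLoRA's implementation manages in the remaining headroom):

```python
params = 70e9
base_4bit_gb = params * 0.5 / 1e9  # 4-bit frozen base: 0.5 bytes/param -> 35 GB

lora_params = 50e6
# Trainable adapters: bf16 weights + bf16 grads + fp32 AdamW state + fp32 master
adapter_gb = lora_params * (2 + 2 + 8 + 4) / 1e9  # ~0.8 GB

print(f"base {base_4bit_gb:.0f} GB + adapters {adapter_gb:.1f} GB")  # base 35 GB + adapters 0.8 GB
```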

DoRA and IA3 are newer variants. DoRA decomposes weight updates into magnitude + direction; IA3 multiplies activations by learned vectors. Both edge out vanilla LoRA on some benchmarks; both add complexity. LoRA + QLoRA is still the workhorse for 95% of cases.

Practical recipes

For a typical instruction-tuning fine-tune on a domain corpus, a common starting point: rank 8-16 with lora_alpha set to twice the rank, adapters on the attention projections (q, k, v, o), a learning rate around 1e-4 to 2e-4, dropout of 0.05, and 1-3 epochs. Raise the rank only if the task sits far from the base model's distribution.

The PEFT library (Hugging Face) implements all of this with sane defaults. Don’t hand-roll.
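A hedged sketch of what that setup looks like with transformers + peft; the model name and hyperparameters below are illustrative starting points, not prescriptions:

```python
# Sketch of a QLoRA instruction-tuning setup with Hugging Face
# transformers + peft. Requires a GPU with enough memory; model name
# is illustrative, any causal LM works.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 quantisation from the QLoRA paper
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-70B",          # illustrative choice
    quantization_config=bnb,
    device_map="auto",
)

lora = LoraConfig(
    r=16,
    lora_alpha=32,                       # common convention: alpha = 2 * r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora)
model.print_trainable_parameters()       # reports the tiny trainable fraction
```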

When LoRA isn’t enough

LoRA works for tasks adjacent to the base model's capabilities: tone adjustment, format compliance, domain vocabulary. It struggles when the task demands genuinely new knowledge or capabilities: continued pretraining on a large domain corpus, a new language, or reasoning patterns the base model never exhibits.

For these, full fine-tuning beats LoRA. The tradeoff is whether the absolute quality gap (often 1-3%) is worth the 100-1000x cost increase.

Shipping LoRAs in production

The big operational win: LoRAs are tiny (50-500 MB). You can serve dozens of LoRAs from a single base model in memory, loading the right one per request based on user, tenant, or task.
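The mechanics are simple to sketch: one frozen base weight, a dictionary of per-tenant adapters, and the adapter selected per request. A toy numpy illustration of the pattern (not a real serving stack; tenant names are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, r = 256, 256, 8

W = rng.standard_normal((d, k))  # one expensive base model, loaded once

# Per-tenant adapters: tiny (B, A) pairs, cheap to keep resident in memory
adapters = {
    tenant: (rng.standard_normal((d, r)) * 0.01,
             rng.standard_normal((r, k)) * 0.01)
    for tenant in ("acme", "globex")
}

def serve(x, tenant):
    B, A = adapters[tenant]           # pick the adapter per request
    return x @ W.T + (x @ A.T) @ B.T  # shared base + tenant-specific delta

x = rng.standard_normal((1, k))
out_a = serve(x, "acme")
out_b = serve(x, "globex")
assert not np.allclose(out_a, out_b)  # same base, different behaviour per tenant
```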

vLLM, TGI, and llama.cpp all support multi-LoRA serving. Latency overhead is minimal: switching LoRAs is a few hundred milliseconds compared to the seconds-to-minutes of model loading.

This pattern (one base model + many LoRAs) is how multi-tenant LLM platforms scale economically. Each customer gets their own fine-tune; only one expensive base model is in memory.