Fine-Tuning vs Prompt Engineering vs RAG: When to Use Each
Three tools, three very different tradeoffs. Most LLM projects pick the wrong one first, then migrate. Here is how to pick correctly from day one.
Three tools, one page
You have an LLM and you want it to do something specific. You have three tools:
- Prompt engineering: change the instructions you give the model.
- Retrieval-augmented generation (RAG): fetch relevant documents at runtime and include them in the prompt.
- Fine-tuning: change the model’s weights on your data.
They get more powerful and more expensive in that order. Most teams should try them in that order too.
Prompt engineering: always start here
Prompt engineering is the practice of structuring the input to the model so you get the output you want. It’s the cheapest and fastest lever you have.
Three techniques worth knowing as a starting toolkit:
- Clear instructions with examples. “Extract names from this text” plus 2-3 input/output pairs dramatically outperforms the instruction alone.
- Chain-of-thought prompting. Adding “Let’s think step by step” or asking for reasoning before the answer improves accuracy on math and logic tasks.
- System prompts with role and constraints. “You are a cautious financial analyst. You always cite sources. You refuse to speculate when the data isn’t sufficient.”
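These three techniques compose naturally in a single chat-style prompt. A minimal sketch, assuming a common chat-API message format (nothing here calls a real model; the task and examples are illustrative):

```python
# Combine a system prompt (role + constraints), few-shot examples, and
# the user's question into one chat-style message list.
def build_prompt(task, examples, question):
    """Assemble system prompt, few-shot pairs, and the final question."""
    messages = [{
        "role": "system",
        "content": (
            "You are a cautious financial analyst. You always cite sources. "
            "You refuse to speculate when the data isn't sufficient. "
            f"Task: {task} Think step by step before answering."
        ),
    }]
    # Few-shot pairs: each example shows the model the exact output shape.
    for inp, out in examples:
        messages.append({"role": "user", "content": inp})
        messages.append({"role": "assistant", "content": out})
    messages.append({"role": "user", "content": question})
    return messages

prompt = build_prompt(
    "Extract person names from the text as a comma-separated list.",
    [("Alice met Bob at the conference.", "Alice, Bob"),
     ("The quarterly report was filed on time.", "(none)")],
    "Carol emailed Dan about the merger.",
)
```

Passing `messages` like these to any chat-completion endpoint exercises all three techniques at once: role constraints, reasoning instruction, and few-shot examples.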
Prompt engineering gets you to a minimum-viable result fast. If it gets you to good-enough, ship and iterate. Don’t fine-tune until you’ve exhausted prompt variations.
Retrieval-augmented generation (RAG): add external knowledge
When the model needs information that isn’t in its training data (your company’s docs, a fresh API response, the latest news), you use RAG.
The core pipeline:
- Precompute embeddings for every document in your knowledge base. Store in a vector database.
- At query time, embed the user’s question. Find the 5-20 nearest documents.
- Stuff those documents into the prompt along with the question.
- The model answers using the retrieved content as grounding.
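A toy end-to-end sketch of that pipeline. Real systems use a learned embedding model and a vector database; here a bag-of-words vector and a plain Python list stand in for both, just to show the data flow:

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedder: a bag-of-words Counter (real systems use a model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# 1. Precompute embeddings for the knowledge base.
docs = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Premium support responds within 4 hours.",
]
index = [(embed(d), d) for d in docs]

def retrieve(question, k=2):
    # 2. Embed the question; take the k nearest documents.
    q = embed(question)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[0]), reverse=True)
    return [doc for _, doc in ranked[:k]]

def build_grounded_prompt(question):
    # 3. Stuff the retrieved documents into the prompt as grounding context.
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_grounded_prompt("How long do refunds take?")
```

Swapping `embed` for a real embedding model and `index` for a vector database changes nothing about the shape of the pipeline.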
RAG scales well. You can add a new document to the knowledge base and the model can use it immediately; no retraining needed. You can cite sources (because you know which documents were used). You can control freshness.
RAG’s weaknesses are in the retrieval step: if you fetch the wrong chunks, the model answers from the wrong context. Reranking (a second model that re-scores retrieved results), chunk-size tuning, and hybrid search (embedding + keyword) are the common levers for fixing retrieval.
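One of those levers, hybrid search, amounts to blending a semantic score with a lexical one. A sketch, assuming you already have an embedding similarity per document (the weighting and function names are illustrative, not from any library):

```python
def keyword_score(query, doc):
    """Fraction of query terms that appear verbatim in the document."""
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(embedding_sim, query, doc, alpha=0.7):
    """Blend semantic and lexical relevance; alpha weights the embedding side."""
    return alpha * embedding_sim + (1 - alpha) * keyword_score(query, doc)
```

The lexical term rescues queries where exact tokens matter (error codes, part numbers, names) that embeddings tend to blur together.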
Fine-tuning: change the model itself
Fine-tuning continues the model’s training on your data. Weights update. The model comes out with new behaviour baked in.
Three varieties, in order of cost:
- Parameter-efficient fine-tuning (LoRA, QLoRA): train a small adapter on top of a frozen base model. Cheap, fast, only a few GB to store.
- Full fine-tuning: update every weight. More capacity, more cost, more risk of forgetting general capabilities.
- From-scratch pretraining: only relevant if you’re a well-funded lab. Don’t.
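The LoRA idea in the first bullet can be shown numerically. A sketch for a single 64x64 linear layer, using NumPy (sizes, names, and the rank are illustrative; the frozen weight plus low-rank update follows the LoRA formulation):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 64, 4              # rank-4 adapter on a 64x64 layer

W = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                # trainable up-projection, zero-init
alpha = 8                               # scaling hyperparameter

def lora_forward(x):
    """y = Wx + (alpha / r) * B A x: base output plus a low-rank correction."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Because B starts at zero, the adapter is a no-op at initialisation:
# training only gradually moves the output away from the frozen model.

# Parameter counts: the adapter is a small fraction of the full layer.
full, adapter = W.size, A.size + B.size
```

Only `A` and `B` receive gradients; here that is 512 parameters against 4,096 in the frozen layer, and the gap widens as layers get bigger and ranks stay small.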
Fine-tune when:
- You need a specific style the model can’t easily follow via instructions (a very particular voice, a company-specific format).
- You have a narrow, high-volume task where the prompt is redundant overhead.
- You’ve optimised prompting and RAG and still aren’t hitting accuracy targets.
Don’t fine-tune when:
- Your data changes frequently (you’ll be constantly retraining).
- Your task is genuinely knowledge-heavy (RAG handles this better and is cheaper to update).
- You haven’t tried prompting hard yet.
Cost and complexity, side by side
| Dimension | Prompt | RAG | Fine-tune |
|---|---|---|---|
| Time to first result | minutes | days | weeks |
| Dollar cost (setup) | $0 | $100-1k | $1k-100k |
| Per-request cost | baseline | +20% (longer context) | baseline or less |
| Freshness | static | live (new docs show up immediately) | stale (retrain to update) |
| Auditability | medium | high (cite sources) | low |
The decision framework
Work in this order. Don’t skip steps.
- Try prompting alone. Write a clear system prompt. Add 3 examples. Test on 50 real inputs. If accuracy is acceptable, ship and iterate.
- If the model is missing knowledge, add RAG. Index the relevant documents. Wire up retrieval. Test again.
- If the model is missing style or format adherence, add few-shot examples or system-prompt constraints. Test again.
- If you’ve exhausted those and still need more, consider fine-tuning. Start with LoRA.
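The checklist above, condensed into an illustrative helper. The predicate names and return strings are placeholders, not a real API; the point is the strict ordering of the checks:

```python
def next_step(accuracy_ok, missing_knowledge, missing_style, prompt_exhausted):
    """Walk the decision framework in order; earlier checks win."""
    if accuracy_ok:
        return "ship and iterate"
    if missing_knowledge:
        return "add RAG"
    if missing_style:
        return "add few-shot examples / system-prompt constraints"
    if prompt_exhausted:
        return "consider fine-tuning (start with LoRA)"
    return "keep iterating on the prompt"
```

Note that fine-tuning is only reachable after every cheaper branch has been ruled out, which is exactly the discipline the framework asks for.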
Most teams that fine-tune could have gotten the same result with better prompting and a cleaner retrieval pipeline. Fine-tuning looks sophisticated on a resume; it’s usually the wrong first move.
Why real systems use all three
Production LLM systems rarely rely on a single technique. A typical shape:
- A carefully-engineered system prompt and tool list (prompt engineering).
- RAG over company-specific docs (retrieval).
- A LoRA-fine-tuned small model for high-volume classification or routing (fine-tuning).
The three complement each other. Prompting shapes behaviour. RAG provides knowledge. Fine-tuning makes the common path cheap. Mature teams use all three, applied to the parts of the problem where each is strongest.