RAG Architecture: The Complete Pipeline
A production RAG system has six moving parts. Get any one wrong and the whole thing produces mediocre answers. Here is what each part actually does, and where teams typically slip.
The six-stage pipeline
A production RAG system runs two pipelines: an offline indexing pipeline that prepares your knowledge base, and an online query pipeline that answers user questions. The split matters because the offline path can be slow and expensive while the online path must be fast and cheap.
Six stages, in order: ingest, chunk, embed, and index (offline); then retrieve, rerank, and generate (online). Each stage has its own quality lever and its own failure mode. The biggest mistake teams make is treating RAG as “throw documents into a vector DB and pray.”
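The split can be sketched end to end with toy stand-ins for every stage. All names here are illustrative, not a real library API, and the bag-of-words "embedding" is a deliberate simplification so the wiring stays visible:

```python
def ingest(raw_docs):
    """Stage 1 (offline): normalize raw sources to (doc_id, text)."""
    return [(doc_id, " ".join(text.split())) for doc_id, text in raw_docs]

def chunk(docs, size=8):
    """Stage 2 (offline): split each doc into fixed-size word windows."""
    out = []
    for doc_id, text in docs:
        words = text.split()
        for i in range(0, len(words), size):
            out.append((f"{doc_id}#{i // size}", " ".join(words[i:i + size])))
    return out

def embed(text):
    """Stage 3: stand-in 'embedding' -- a bag-of-words set, for the sketch."""
    return set(text.lower().split())

def index(chunks):
    """Stage 4 (offline): store (chunk_id, vector, text) triples."""
    return [(cid, embed(text), text) for cid, text in chunks]

def retrieve(idx, query, k=3):
    """Stage 5 (online): score every chunk cheaply, keep the top k."""
    q = embed(query)
    return sorted(idx, key=lambda row: len(q & row[1]), reverse=True)[:k]

def rerank(candidates, query, k=1):
    """Stage 6a (online): a second, more careful scorer reorders candidates."""
    q = embed(query)
    return sorted(candidates,
                  key=lambda row: len(q & row[1]) / (len(row[1]) or 1),
                  reverse=True)[:k]

def generate(candidates, query):
    """Stage 6b (online): assemble the context a real LLM call would see."""
    context = "\n".join(f"[{cid}] {text}" for cid, _, text in candidates)
    return f"Context:\n{context}\n\nQuestion: {query}"
```

A production system replaces each stub with a real component (a parser, a tokenizer-aware chunker, an embedding model, a vector index, a cross-encoder, an LLM), but the stage boundaries stay the same.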
Ingestion and chunking
Ingestion converts source documents (PDFs, web pages, Notion, Confluence, code) into clean text. Loaders like Unstructured, LlamaParse, or commercial offerings handle most formats. The output is plain text plus structural metadata (headings, page numbers, section IDs).
Chunking splits long documents into retrievable pieces. Three strategies, in order of sophistication:
- Fixed-size chunking: every 500 tokens with 50-token overlap. Simple, often good enough.
- Recursive splitting: split by markdown headings first, then paragraphs, then sentences, until chunks fit a size budget. Preserves structure better.
- Semantic chunking: use an embedding model to find natural topic breaks. Highest quality, slowest.
Chunk size is the single biggest knob. Small chunks (200-400 tokens) are precise but lack context. Large chunks (1500-2000 tokens) carry context but dilute signal in retrieval. Default to 500-800 tokens with overlap, then tune against your eval set.
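The simplest strategy, fixed-size chunking with overlap, fits in a few lines. This sketch approximates tokens by whitespace words; a real system would count with the embedding model's own tokenizer:

```python
def fixed_size_chunks(text, size=500, overlap=50):
    """Split text into windows of `size` tokens, with `overlap` tokens
    shared between consecutive chunks so sentences cut at a boundary
    still appear whole in one of the two neighbors.

    Tokens are approximated by whitespace words here.
    """
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + size]
        chunks.append(" ".join(window))
        if start + size >= len(words):
            break
    return chunks
```

The overlap is what makes tuning safe: increasing `size` without overlap tends to strand boundary sentences in chunks where they lack context.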
Embedding and indexing
Each chunk gets embedded into a vector and stored in a vector index alongside its metadata (source URL, chunk position, tags). The embedding model and the index choice are decoupled but related.
For embedding, use a model that matches your content domain (general English, code, multilingual, medical, etc.). Higher-dimensional embeddings (1536+ dimensions) give marginal quality gains at 2-4x the storage cost.
For indexing, ANN structures like HNSW or IVF give millisecond search at the cost of approximate (95-99%) recall. For under 100K chunks, exact search is fine and simpler.
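For the small-corpus case, exact search really is just a brute-force scan. A minimal sketch using pure-Python cosine similarity (vectors as plain lists; any real deployment would use numpy or the vector store's own search):

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def exact_search(index, query_vec, k=5):
    """Brute-force top-k over (chunk_id, vector) pairs.

    O(N * dim) per query with 100% recall -- entirely adequate below
    ~100K chunks. Beyond that, switch to an ANN index (HNSW/IVF) and
    accept the approximate recall trade-off.
    """
    return heapq.nlargest(k, index,
                          key=lambda item: cosine(item[1], query_vec))
```

The point of starting exact is that it gives you a ground-truth baseline: when you later move to HNSW, you can measure its recall against this scan.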
Retrieval and reranking
At query time, embed the user’s question and find the top-K nearest chunks in the index. A second-stage reranker then scores each (query, chunk) pair more accurately and reorders the candidates.
The two-stage pattern matters: retrieval is cheap and approximate, reranking is expensive and accurate. Retrieve 50-100 candidates, rerank to top 5-10. Skipping the reranker leaves 10-20% recall on the table.
Hybrid retrieval (semantic + keyword) is increasingly the default. Reciprocal Rank Fusion combines them with no tuning required. Codes, names, and rare jargon are why pure semantic retrieval underperforms in practice.
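Reciprocal Rank Fusion is short enough to show in full. It only looks at ranks, never raw scores, which is why the semantic and keyword lists need no score normalization before fusing:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc IDs.

    Each doc scores sum(1 / (k + rank)) across every list it appears
    in, so items ranked well by multiple retrievers float to the top.
    k=60 is the constant from the original RRF paper and rarely needs
    tuning.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A document that is rank 2 in both lists beats one that is rank 1 in only one of them, which is exactly the behavior you want from a hybrid retriever.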
Generation and citation
The final stage assembles a prompt: system instructions, retrieved chunks, the user question, and a directive to answer using only the retrieved context with citations.
Three production tricks:
- Cite by chunk ID. The model emits inline citations like [doc-3] that the UI resolves to source URLs.
- Refuse on insufficient context. The system prompt instructs the model to say “not in the docs” when retrieval is weak. Reduces hallucination dramatically.
- Show retrieved sources to the user. Transparency builds trust and lets users self-correct when retrieval missed.
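The prompt assembly step with the first two tricks baked in might look like this. The exact instruction wording is illustrative and should be tuned against your own eval set:

```python
def build_prompt(question, chunks):
    """Assemble the generation prompt: system instructions, retrieved
    chunks tagged with citable IDs, then the user question.

    `chunks` is a list of (chunk_id, text) pairs from the reranker.
    """
    context = "\n\n".join(f"[{cid}]\n{text}" for cid, text in chunks)
    system = (
        "Answer ONLY from the documents below. "
        "Cite each claim with its chunk ID, e.g. [doc-3]. "
        "If the documents do not contain the answer, reply exactly: "
        "\"Not in the docs.\""
    )
    return f"{system}\n\nDocuments:\n{context}\n\nQuestion: {question}"
```

Because the chunk IDs survive into the model's output, the UI can resolve each `[doc-N]` citation back to a source URL from the chunk metadata.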
Latency and cost budget
For a 95th-percentile target of under 3 seconds end to end:
- Embed query: 50-200ms
- Vector retrieve top 100: 20-100ms
- Rerank: 100-500ms
- Generate (input + output, streamed first token): 500ms-2s
The dominant cost is the LLM call. Caching identical queries, deduplicating retrieved chunks, and using smaller models for simple questions cut spend by 50-80% in mature systems.
Three common failure modes
- Bad chunks dominate retrieval. A chunk that mixes the header of one section with the content of another scores well for queries about either topic but answers neither. Cleaner chunking fixes most of this.
- Stale index. Documents update; embeddings don’t. Set up automated reindex pipelines and TTL on cached responses.
- The model ignores the context. Retrieval is fine but the model leans on its training. Strengthen the system prompt: “answer ONLY from the documents below; refuse otherwise.”
RAG systems that work in production are built around eval sets that catch all three. Without measurement, you cannot tell which of the six stages is actually broken.