Document Chunking Strategies That Actually Work
Bad chunking dooms a RAG system before retrieval ever runs. Good chunking is the highest-leverage thing you can tune for retrieval quality. Here is what works.
The four strategies
Fixed-size: split every N tokens with M-token overlap. Brutally simple. Works well enough for many use cases. Default starting point.
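A minimal sketch of fixed-size splitting over a pre-tokenized document. The `size` and `overlap` defaults are illustrative, not tuned values, and `tokens` can be any list (word-level splitting is fine for a demo):

```python
def chunk_fixed(tokens, size=600, overlap=60):
    """Split a token list into `size`-token chunks, with adjacent
    chunks sharing `overlap` tokens. Defaults are illustrative."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last chunk already reaches the end of the document
    return chunks
```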
Recursive: split on the largest structural boundary first (e.g., a chapter heading), then recursively split any chunk that is still too large. Preserves document structure. Best for well-structured docs (markdown, reports).
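A character-length sketch of the recursive idea. Assumptions: the separator list is a placeholder, lengths are counted in characters (a real system counts tokens), separators are dropped on split, and undersized pieces are not merged back together:

```python
def chunk_recursive(text, max_len=800, seps=("\n\n", "\n", ". ")):
    """Split on the largest boundary first; only split a piece further
    if it is still too long. str.split drops the separator itself."""
    if len(text) <= max_len or not seps:
        return [text]  # small enough, or no boundaries left to try
    head, *rest = seps
    chunks = []
    for piece in text.split(head):
        if len(piece) <= max_len:
            chunks.append(piece)
        else:
            chunks.extend(chunk_recursive(piece, max_len, tuple(rest)))
    return chunks
```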
Semantic: embed sentences, find local discontinuities in embedding space, split there. Highest quality. 5-10x slower indexing. Worth it for content where topical coherence matters more than parsing structure.
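The core mechanism can be sketched with any sentence-embedding function. Here `embed` and the similarity `threshold` are placeholders you would supply and tune; production implementations often use a rolling window and a percentile-based cutoff rather than a fixed threshold:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def chunk_semantic(sentences, embed, threshold=0.6):
    """Start a new chunk wherever similarity between adjacent sentence
    embeddings drops below `threshold` (a tunable assumption)."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sent in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, vec) < threshold:
            chunks.append(" ".join(current))  # discontinuity: close chunk
            current = []
        current.append(sent)
    chunks.append(" ".join(current))
    return chunks
```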
Hierarchical: store both small and large chunks. Retrieve at the small level for precision, then expand to the parent chunk for context. Best when you need precise matching but the LLM needs rich surrounding context to answer.
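A small-to-parent sketch. Everything here is simplified: one parent per document, substring matching standing in for vector search, and an illustrative child size:

```python
def build_hierarchy(paragraphs, child_size=2):
    """Index small child chunks, each keeping a pointer to the larger
    parent so retrieval can expand. One parent per doc, for brevity."""
    parent_text = " ".join(paragraphs)
    children = []
    for i in range(0, len(paragraphs), child_size):
        child = " ".join(paragraphs[i:i + child_size])
        children.append({"text": child, "parent": parent_text})
    return children

def retrieve_expanded(children, query_match):
    """Match at child granularity (stand-in for vector search),
    but hand the parent chunk to the LLM."""
    for c in children:
        if query_match in c["text"]:
            return c["parent"]
    return None
```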
Choosing chunk size
The fundamental trade-off:
- Small chunks (200-400 tokens): precise retrieval, but the chunk may lack context the LLM needs to answer.
- Large chunks (1500-2000 tokens): rich context, but the signal dilutes: the relevant sentence gets buried in semi-relevant material that drags down embedding similarity.
For most RAG: 500-800 tokens. For dense technical content: 300-500. For prose: 800-1200. Tune against an eval set; don’t pick a number from a blog post (including this one).
Overlap that helps
Overlap (tokens shared between adjacent chunks) ensures content near a chunk boundary is retrievable from either side. Without overlap, a fact stated at the boundary risks being missed.
Practical: 10-20% overlap, so 600-token chunks with 60-120 tokens of overlap. More overlap means more storage and more redundant retrievals; less means more boundary misses. The 10-20% range is empirically robust.
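The boundary guarantee is easy to see at word level with toy sizes: a word at a chunk boundary ends up indexed in both neighbors. This is a didactic sketch, not a production splitter:

```python
def overlapping_chunks(words, size, overlap):
    """Toy word-level splitter: adjacent chunks share `overlap` words,
    so content at a chunk boundary is retrievable from either side."""
    step = size - overlap
    return [words[i:i + size]
            for i in range(0, max(len(words) - overlap, 1), step)]
```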
Metadata you should always attach
Storing the embedding alongside a metadata payload is standard practice. Attach to every chunk:
- Source URL or document ID.
- Position within document (chunk N of M).
- Section heading or path.
- Last-modified timestamp.
- Tags / category / language.
You’ll wish you had this metadata the moment a user asks “where is this from?” or you need to filter by recency. Adding it post hoc means re-embedding the corpus.
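The list above maps naturally onto a per-chunk record. Field names here are illustrative; match them to whatever schema your vector store expects:

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """Per-chunk payload: text plus the metadata listed above."""
    text: str
    doc_id: str            # source URL or document ID
    position: int          # chunk N ...
    total: int             # ... of M
    section: str           # heading path, e.g. "Guide > Setup"
    modified: str          # ISO-8601 last-modified timestamp
    tags: list = field(default_factory=list)
```

With this in place, filtering by recency or answering "where is this from?" becomes a metadata lookup instead of a corpus re-embed.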
Measuring chunking quality
Build an eval set of (query, ideal-chunk-from-doc) pairs. Measure recall@10: the fraction of queries whose ideal chunk appears in the top 10 retrieved results.
Run the same eval with different chunking strategies. The strategy that maximises recall@10 wins. There’s no other reliable way to compare; intuition about “chunking should be semantic” rarely matches measured results.
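The metric itself is a few lines once you have an eval set and a retrieval function; both `eval_set` and `retrieve` are placeholders for your own data and retriever:

```python
def recall_at_k(eval_set, retrieve, k=10):
    """eval_set: list of (query, ideal_chunk_id) pairs.
    retrieve(query) -> ranked list of chunk ids.
    Returns the fraction of queries whose ideal chunk is in the top k."""
    hits = sum(1 for query, ideal in eval_set if ideal in retrieve(query)[:k])
    return hits / len(eval_set)
```

Re-chunk, re-index, and rerun this against the same eval set for each candidate strategy; the comparison is then apples to apples.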
Most teams that A/B chunking strategies discover that fixed-size with 10% overlap beats more sophisticated methods on their actual data. The simpler strategy wins more often than blog posts suggest.