Retrieval Quality Failure Modes (and How to Spot Them)
Bad chunks. Wrong embeddings. Stale index. Drift. Cardinality blowup. Five failure modes for RAG retrieval, with detection patterns and fixes for each.
The five failure modes
Retrieval quality fails in five recognisable ways. Bad chunks (documents split at wrong boundaries; chunk has no context); wrong embeddings (model poorly suited to the domain; semantic similarity misleading); stale index (lags source of truth; recent docs missing); drift (data distribution shifts but retrieval pipeline does not); cardinality blowup (top-k dominated by near-duplicates).
- Bad chunks. Wrong split boundaries; chunk has no context; model cannot use it.
- Wrong embeddings. Model poorly suited to domain; semantic similarity is misleading.
- Stale index. Index lags source of truth; recent docs missing; outdated answers.
- Drift and cardinality blowup. The data distribution shifts while the pipeline stays fixed; near-duplicates dominate the top-k, crowding out diverse evidence.
Detection patterns
Each failure mode has a detection pattern. Bad chunks: human review of 50 random retrieved chunks per week (bad ones obvious); wrong embeddings: A/B compare two embedding models on the same task; stale index: track index lag (alert on lag > 1 hour for hot data); drift: track retrieval result diversity over time; cardinality: track diversity of top-k results.
- Bad chunks: human review. 50 random chunks per week; bad ones are obvious.
- Wrong embeddings: A/B test. Two embedding models on the same task; a large quality gap shows the current model is a poor fit for the domain.
- Stale index: lag tracking. Time since last update vs source; alert on > 1 hour for hot data.
- Drift and cardinality: diversity metric. Sudden drops indicate distribution shift or near-duplicates.
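The lag and diversity checks above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the function names (`index_is_stale`, `topk_diversity`) are hypothetical, diversity is assumed here to mean the average pairwise cosine distance between the embeddings of the top-k results, and the 1-hour threshold is the one quoted above for hot data.

```python
import math
from datetime import datetime, timedelta, timezone
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def topk_diversity(embeddings):
    """Mean pairwise cosine distance (1 - similarity) across the top-k results.

    Values near 0 mean the top-k is dominated by near-duplicates.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine(a, b) for a, b in pairs) / len(pairs)


def index_is_stale(source_updated_at, index_updated_at, max_lag=timedelta(hours=1)):
    """True when the index trails the source of truth by more than max_lag."""
    return source_updated_at - index_updated_at > max_lag


# Example: two identical results have zero diversity; orthogonal ones have 1.0.
duplicates = topk_diversity([[1.0, 0.0], [1.0, 0.0]])
orthogonal = topk_diversity([[1.0, 0.0], [0.0, 1.0]])

source_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = index_is_stale(source_time, source_time - timedelta(minutes=90))
fresh = index_is_stale(source_time, source_time - timedelta(minutes=30))
```

Logging `topk_diversity` per query and watching its moving average is one way to see the sudden drops mentioned above; what counts as a "drop" depends on the corpus and has to be calibrated against a baseline.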
Fixes per mode
The fixes are mode-specific. Bad chunks: re-chunk with semantic boundaries (sentence or paragraph), not fixed-size. Wrong embeddings: switch to a domain-specialised embedding model. Stale index: shorten the index update cadence; streaming updates if hot. Drift: re-train embeddings periodically. Cardinality: dedupe in retrieval; use MMR (Maximal Marginal Relevance) to enforce diversity.
- Bad chunks: semantic boundaries. Sentence or paragraph; not fixed-size.
- Wrong embeddings: domain specialisation. Switch to a domain-specialised embedding model.
- Stale index: shorten cadence. Use streaming updates for hot data.
- Drift and cardinality: re-train and MMR. Periodic re-train for drift; MMR for diversity.
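The MMR step can be sketched as a greedy re-ranker over already-retrieved candidates. This is a minimal sketch under stated assumptions: the function name `mmr` and the `lambda_` parameter are illustrative, similarity is assumed to be cosine, and vectors are plain Python lists rather than whatever the vector store returns. At each step it picks the candidate maximising `lambda_ * relevance - (1 - lambda_) * redundancy`, where redundancy is the highest similarity to anything already selected.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k, lambda_=0.7):
    """Return indices of up to k documents chosen by Maximal Marginal Relevance.

    lambda_ = 1.0 ranks purely by relevance; lower values penalise
    candidates that resemble documents already selected.
    """
    relevance = [cosine(query_vec, d) for d in doc_vecs]
    candidates = list(range(len(doc_vecs)))
    selected = []

    while candidates and len(selected) < k:
        def score(i):
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance[i] - (1.0 - lambda_) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Example: documents 0 and 1 are exact duplicates; document 2 is related but distinct.
query = [1.0, 0.0]
docs = [[1.0, 0.0], [1.0, 0.0], [0.8, 0.6]]

diverse = mmr(query, docs, k=2, lambda_=0.3)        # favours diversity
relevance_only = mmr(query, docs, k=2, lambda_=1.0)  # ignores redundancy
```

With `lambda_=1.0` the duplicate pair fills the top-2; lowering `lambda_` swaps the second duplicate for the distinct document. The right value is a tuning decision and is not specified here.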
RAG-specific eval
Two metrics drive RAG quality. Recall@k (of the documents that should have been retrieved, how many appeared in the top-k; below 80% is a problem); precision@k (of the top-k retrieved, how many were actually relevant; below 60% is noisy). The two often pull in opposite directions (a larger k tends to raise recall and lower precision), so track both: a high-recall, low-precision system buries the user in noise.
- Recall@k. Share of the documents that should have been retrieved that appear in the top-k; below 80% is a problem.
- Precision@k. Share of the top-k retrieved that were actually relevant; below 60% is noisy.
- Track both together. They often trade off; one without the other is misleading.
- High-recall, low-precision. Buries the user in noise; both metrics must hold.
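Both metrics are simple to compute once each evaluation query has a labelled set of relevant document IDs. The sketch below assumes retrieved IDs are unique and ordered by rank, divides precision by k (one common convention), and uses hypothetical names (`recall_at_k`, `precision_at_k`, `flag_retrieval`); the 80% and 60% thresholds are the ones quoted above.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    if k <= 0:
        return 0.0
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def flag_retrieval(recall, precision, min_recall=0.8, min_precision=0.6):
    """Return the list of thresholds a query (or an averaged run) falls below."""
    problems = []
    if recall < min_recall:
        problems.append("low recall")
    if precision < min_precision:
        problems.append("low precision")
    return problems


# Example: 5 results retrieved, 3 documents labelled relevant, 2 of them found.
retrieved_ids = ["d1", "d7", "d3", "d9", "d4"]
relevant_ids = {"d1", "d3", "d5"}

recall = recall_at_k(retrieved_ids, relevant_ids, k=5)        # 2 of 3 relevant found
precision = precision_at_k(retrieved_ids, relevant_ids, k=5)  # 2 of 5 results relevant
problems = flag_retrieval(recall, precision)
```

In practice these would be averaged over a fixed evaluation set and tracked per release, so that a change that lifts one metric at the expense of the other is visible.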