Retrieval Quality Failure Modes (and How to Spot Them)

Bad chunks. Wrong embeddings. Stale index. Drift. Cardinality blowup. Five failure modes for RAG retrieval, with a detection pattern for each.

The five failure modes

Retrieval quality fails in five recognisable ways. Bad chunks: documents are split at the wrong boundaries, so a chunk arrives without the context needed to interpret it. Wrong embeddings: the model is poorly suited to the domain, so semantic similarity is misleading. Stale index: the index lags the source of truth, so recent documents are missing. Drift: the data distribution shifts but the retrieval pipeline does not. Cardinality blowup: the top-k is dominated by near-duplicates.

Detection patterns

Each failure mode has a detection pattern. Bad chunks: have a human review 50 random retrieved chunks per week; the bad ones are usually obvious on sight. Wrong embeddings: A/B compare two embedding models on the same task. Stale index: track index lag and alert when it exceeds 1 hour for hot data. Drift: track retrieval result diversity over time. Cardinality blowup: track the diversity of the top-k results.
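Two of these checks are easy to automate. The sketch below is illustrative only: the function names are hypothetical, the lag is assumed to be measured as the gap between the newest source update and the newest indexed update, and diversity is assumed to mean the average pairwise cosine distance between top-k embeddings. Other definitions are equally reasonable.

```python
from datetime import datetime, timedelta, timezone
from itertools import combinations
import math


def index_lag(latest_source_update, latest_indexed_update):
    """Gap between the newest source document and the newest indexed one.

    Clamped at zero so clock skew never produces a negative lag.
    """
    return max(latest_source_update - latest_indexed_update, timedelta(0))


def lag_alert(lag, threshold=timedelta(hours=1)):
    """True when the lag exceeds the threshold (1 hour by default)."""
    return lag > threshold


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def topk_diversity(embeddings):
    """Mean pairwise cosine distance (1 - similarity) across the top-k results.

    Values near 0 mean the results are near-duplicates of each other.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    return sum(1.0 - cosine_similarity(a, b) for a, b in pairs) / len(pairs)


# Example: the index is 90 minutes behind the source, so the alert fires.
source_ts = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
indexed_ts = datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc)
should_alert = lag_alert(index_lag(source_ts, indexed_ts))
```

Logging `topk_diversity` per query and plotting it over time covers both the drift and the cardinality checks: a sudden drop points at near-duplicates, a slow trend at drift.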

Fixes per mode

The fixes are mode-specific. Bad chunks: re-chunk on semantic boundaries (sentence or paragraph) rather than at a fixed size. Wrong embeddings: switch to a domain-specialised embedding model. Stale index: shorten the index update cadence, and use streaming updates for hot data. Drift: re-train embeddings periodically. Cardinality blowup: dedupe at retrieval time and use MMR (Maximal Marginal Relevance) to enforce diversity.
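To make the MMR fix concrete, here is a minimal sketch of greedy MMR re-ranking. It assumes candidates arrive as (doc_id, vector) pairs, that cosine similarity is the similarity measure, and that a weight (called `lambda_` here, a hypothetical name) trades relevance against redundancy. Production vector stores usually ship their own implementation, so treat this as an illustration of the idea rather than a drop-in replacement.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, candidates, k, lambda_=0.5):
    """Greedy Maximal Marginal Relevance.

    candidates: list of (doc_id, vector) pairs.
    lambda_=1.0 ranks purely by relevance; lower values penalise
    similarity to documents that have already been selected.
    Returns the selected doc_ids in selection order.
    """
    remaining = list(candidates)
    selected = []
    while remaining and len(selected) < k:

        def score(item):
            _, vec = item
            relevance = cosine(query_vec, vec)
            # Redundancy is the similarity to the closest already-selected doc.
            redundancy = max(
                (cosine(vec, chosen_vec) for _, chosen_vec in selected),
                default=0.0,
            )
            return lambda_ * relevance - (1.0 - lambda_) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _ in selected]


# "a" and "b" are exact duplicates; "c" is less relevant but different.
query = [1.0, 0.0]
docs = [("a", [0.9, 0.1]), ("b", [0.9, 0.1]), ("c", [0.7, -0.7])]
diverse = mmr(query, docs, k=2, lambda_=0.5)        # skips the duplicate
relevance_only = mmr(query, docs, k=2, lambda_=1.0)  # keeps the duplicate
```

With `lambda_=0.5` the duplicate is skipped in favour of the less similar document; with `lambda_=1.0` the ranking falls back to pure relevance and the duplicate returns.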

RAG-specific eval

Two metrics measure RAG retrieval quality. Recall@k: of the documents that should have been retrieved, how many appeared in the top-k; below 80% is a problem. Precision@k: of the top-k retrieved, how many were actually relevant; below 60% is noisy. The two do not necessarily move together, so track both: a high-recall, low-precision system buries the user in noise.
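Both metrics can be computed from a ranked list of retrieved document IDs and a labelled set of relevant IDs. The sketch below assumes document IDs are unique within a result list and that precision@k divides by k (some teams divide by the number of results actually returned instead). The function names and sample IDs are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k slots filled by relevant documents (denominator is k)."""
    if k <= 0:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / k


# One query: five results returned, three documents labelled relevant.
retrieved_ids = ["d1", "d7", "d3", "d9", "d4"]
relevant_ids = {"d1", "d3", "d5"}

recall = recall_at_k(retrieved_ids, relevant_ids, k=5)        # 2 of 3 relevant found
precision = precision_at_k(retrieved_ids, relevant_ids, k=5)  # 2 of 5 results relevant
```

In this example recall is 2/3 and precision is 2/5, which fall below the 80% and 60% thresholds respectively. In practice both would be averaged over a labelled query set rather than read off a single query.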