AI & ML Practical | By Samson Tanimawo, PhD | Published Jul 28, 2026 | 4 min read

Retrieval Quality Failure Modes (and How to Spot Them)

Bad chunks. Wrong embeddings. Stale index. Drift. Cardinality blowup. Five failure modes for RAG retrieval, with detection patterns for each.

The five failure modes

Bad chunks: documents are split at the wrong boundaries. The chunk has no context; retrieval surfaces it but the model cannot use it.

Wrong embeddings: the embedding model is poorly suited to the domain. Semantic similarity is misleading.

Stale index: the index lags the source of truth. Recently-published docs are missing; users get outdated answers.

Drift: the data distribution shifts but the retrieval pipeline does not. Results that were good yesterday are mediocre today.

Cardinality blowup: too many similar chunks; the top-k is dominated by near-duplicates that crowd out diverse evidence.

Detection patterns

Bad chunks: human review of 50 random retrieved chunks per week. Bad ones are usually obvious.

Wrong embeddings: A/B compare two embedding models on the same queries and corpus; a large gap in retrieval quality shows the weaker model is a poor fit for the domain.

Stale index: track index lag (time since last update vs source). Alert on lag > 1 hour for hot data.

Drift: track retrieval result diversity over time; sudden drops indicate distribution shift.

Cardinality: track the diversity of top-k results; near-duplicates indicate the issue.
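One way to put a number on top-k diversity is the share of result pairs that are near-duplicates. The sketch below is illustrative only: it assumes you can get the embedding vector of each retrieved chunk, and the function names and the 0.95 cosine threshold are assumptions, not values from any particular library. Logged per query over time, a rising rate points at cardinality blowup, and a sudden change in the trend is worth checking for drift.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def near_duplicate_rate(embeddings, threshold=0.95):
    """Fraction of result pairs whose cosine similarity is >= threshold.

    The 0.95 default is an assumed starting point; tune it on your own data.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    dupes = sum(1 for a, b in pairs if cosine(a, b) >= threshold)
    return dupes / len(pairs)


# Toy top-3 result set: the first two vectors point almost the same way.
top_k = [
    [1.0, 0.0],
    [0.99, 0.05],
    [0.0, 1.0],
]
rate = near_duplicate_rate(top_k)  # one near-duplicate pair out of three
```

A rate near 0 means the top-k is varied; a rate near 1 means the results are mostly copies of one another.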

Fixes per mode

Bad chunks: re-chunk with semantic boundaries (sentence or paragraph), not fixed-size.

Wrong embeddings: switch to a domain-specialised embedding model.

Stale index: shorten the index update cadence; consider streaming updates if hot.

Drift: re-train embeddings on a regular schedule; the longer the pipeline goes without a refresh, the further it falls behind the data.

Cardinality: dedupe in retrieval; use MMR (Maximal Marginal Relevance) to enforce diversity.
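The MMR fix can be sketched in a few lines. This is a minimal greedy version, assuming cosine similarity over plain Python lists; the names mmr, lam, and unit are hypothetical, and a production system would more likely use the re-ranking option built into its vector store. Each step picks the candidate with the best balance of relevance to the query and dissimilarity to what has already been selected.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query, candidates, k, lam=0.7):
    """Return indices of up to k candidates chosen greedily by MMR.

    lam weights relevance to the query; (1 - lam) weights the penalty for
    similarity to the closest already-selected candidate.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def unit(deg):
    """2-D unit vector at the given angle, used only to build toy data."""
    r = math.radians(deg)
    return [math.cos(r), math.sin(r)]


query = unit(0)
candidates = [unit(10), unit(12), unit(-40)]  # indices 0 and 1 are near-duplicates

diverse = mmr(query, candidates, k=2, lam=0.5)         # [0, 2]
relevance_only = mmr(query, candidates, k=2, lam=1.0)  # [0, 1]
```

With lam=1.0 the ranking is pure similarity and both near-duplicates are returned; lowering lam to 0.5 swaps the second one for the less similar but non-redundant candidate. The right value of lam depends on the corpus and is an assumption here.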

RAG-specific eval

Recall@k: of the documents that should have been retrieved, how many appeared in the top-k? Below 80% is a problem.

Precision@k: of the top-k retrieved, how many were actually relevant? Below 60% is noisy.

The two usually pull against each other as k grows (recall rises, precision falls), so track both. A high-recall, low-precision system buries the user in noise.
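Both metrics are simple to compute once you have a labeled set of queries with known relevant documents. The sketch below assumes documents are identified by string IDs and that retrieved is already ordered by rank; the function names and the toy IDs are illustrative.

```python
def recall_at_k(retrieved, relevant, k):
    """Share of the relevant documents that appear in the top-k results."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def precision_at_k(retrieved, relevant, k):
    """Share of the top-k results that are relevant.

    Divides by the number of results actually returned, which equals k
    whenever the retriever returns at least k results.
    """
    relevant = set(relevant)
    top = retrieved[:k]
    if not top:
        return 0.0
    hits = sum(1 for doc_id in top if doc_id in relevant)
    return hits / len(top)


# Toy example: ranked results for one query, plus its labeled relevant set.
retrieved = ["d1", "d7", "d3", "d9", "d4"]
relevant = {"d1", "d3", "d5"}

r2, p2 = recall_at_k(retrieved, relevant, 2), precision_at_k(retrieved, relevant, 2)
r5, p5 = recall_at_k(retrieved, relevant, 5), precision_at_k(retrieved, relevant, 5)
```

On this toy data, k=2 gives recall 1/3 and precision 1/2, while k=5 gives recall 2/3 and precision 2/5. Averaging each metric over the full query set gives the figures to compare against the 80% and 60% thresholds.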