Semantic Search vs Keyword Search
Keyword search finds the exact words. Semantic search finds the meaning. Each fails the other’s easy cases. The right answer in production is almost always to use both.
The two strategies, side by side
Keyword search matches exact terms (or stems). The classical implementation is BM25, an information-retrieval algorithm from the mid-1990s that scores documents by term frequency (how often the query terms appear in a doc) weighted by inverse document frequency (how rare those terms are in the corpus).
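The scoring formula is compact enough to sketch in full. A minimal pure-Python BM25 with the common k1/b defaults (real deployments use an engine like Elasticsearch, but the math is the same):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each doc (a list of tokens) against query_terms with BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Inverse document frequency: rarer terms contribute more.
    df = Counter(t for d in docs for t in set(d))
    idf = {t: math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5)) for t in df}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            # Term frequency saturates (k1) and is length-normalised (b).
            s += idf[t] * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores
```

A doc containing a rare query term outranks one containing only a common term, which is exactly the behaviour that makes BM25 strong on identifiers and error codes.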
Semantic search matches meaning. You embed the query into a vector and find documents whose vectors are nearest. “Pod won’t start” matches a doc titled “CrashLoopBackOff troubleshooting” even though the words are different.
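The retrieval step reduces to nearest-neighbour search over vectors. A sketch using cosine similarity (the toy vectors here stand in for the output of a real embedding model, and production systems use an approximate-nearest-neighbour index rather than a linear scan):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return indices of the top_k docs nearest to the query vector."""
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]),
                    reverse=True)
    return ranked[:top_k]
```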
Where semantic search wins
- Paraphrasing: queries that use different words for the same concept. “Logging in is broken” finds “Authentication failure recovery.”
- Intent capture: a long natural-language question matches a short relevant snippet even when no words overlap.
- Multilingual: a Spanish query matches an English document if the embedding model is multilingual.
- Concept-level search: “tools for finding bugs” matches “static analysis,” “linters,” and “fuzzers” without naming any of them.
Semantic search shines when your users phrase queries differently from how your documents are written.
Where keyword search wins
- Exact terms: identifiers, error codes, product SKUs, person names. “K8sSchedulingError” should retrieve the exact ticket, not vaguely similar ones.
- Negation and operators: “sre minus aiops” is trivial in keyword search and unreliable in semantic search.
- Rare technical jargon: terms the embedding model rarely saw in training will have weak vectors. Keyword recall is more robust.
- Out-of-corpus codes: a brand-new product SKU that didn’t exist when the embeddings were generated. Keyword search picks it up the moment it’s indexed.
The classic failure mode of pure semantic search: the user types an exact identifier they remember and gets back conceptually-similar documents instead of the exact match they wanted.
Hybrid search: best of both
The pragmatic answer is to run both and combine. Two combination strategies:
- Reciprocal rank fusion (RRF): each search returns a ranked list. Each document gets a score of 1 / (k + rank) from each list, where k is a smoothing constant (60 in the original formulation). Sum the scores to get a hybrid ranking. Robust, simple, and effectively parameter-free.
- Weighted blending: normalise both scores to [0, 1] and combine as alpha × semantic + (1 - alpha) × keyword. Tune alpha on a held-out evaluation set.
RRF is the default starting point. Weighted blending squeezes out a few more points of accuracy if you have the eval data to tune it.
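Both combination strategies fit in a few lines. A sketch of each (RRF with the conventional k = 60 constant; the score dicts for blending map doc id to raw retriever score):

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def weighted_blend(sem_scores, kw_scores, alpha=0.5):
    """Min-max normalise each score dict to [0, 1], then blend with alpha."""
    def norm(d):
        lo, hi = min(d.values()), max(d.values())
        span = (hi - lo) or 1.0
        return {doc: (v - lo) / span for doc, v in d.items()}
    s, k = norm(sem_scores), norm(kw_scores)
    ids = set(s) | set(k)
    blended = {i: alpha * s.get(i, 0.0) + (1 - alpha) * k.get(i, 0.0)
               for i in ids}
    return sorted(ids, key=blended.get, reverse=True)
```

Note that RRF ignores the raw scores entirely, which is what makes it robust: the two retrievers' score scales never need to be reconciled.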
The reranking step
For the highest accuracy, add a cross-encoder reranker after retrieval. Here’s the shape of the pipeline:
- Hybrid retrieval returns the top 50-100 candidates (cheap and approximate).
- A cross-encoder model scores every (query, candidate) pair to produce a more accurate relevance score (slower, but far more precise than vector similarity).
- The top 5-10 from the reranker go to the LLM as context.
Cross-encoders are dedicated reranking models like BGE-reranker, Cohere Rerank, or Voyage-rerank. They’re much smaller than LLMs (typically 100M-1B parameters) and much more accurate than embedding-based retrieval, at a per-query cost of 10-100ms for 50 candidates.
Adding a reranker is often the single biggest accuracy improvement a RAG system can make; skipping it commonly leaves 10-20% relevance gains on the table.
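The reranking stage has a simple shape regardless of which model you plug in. A sketch where `score_fn` is a stand-in for a real cross-encoder call (e.g. a BGE-reranker or Cohere Rerank invocation scoring each pair):

```python
def rerank(query, candidates, score_fn, top_k=10):
    """Rerank candidate docs with a cross-encoder-style scorer.

    score_fn(query, doc) -> relevance float. In production this would
    invoke a reranking model on each (query, candidate) pair; any
    callable with that shape works here.
    """
    ranked = sorted(candidates,
                    key=lambda doc: score_fn(query, doc),
                    reverse=True)
    return ranked[:top_k]
```

The key structural point: unlike an embedding model, the scorer sees the query and the candidate together, which is why it can be much more accurate per pair but must be restricted to a short candidate list.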
Putting it together
A solid 2025 search stack:
- BM25 keyword index (Elasticsearch / OpenSearch / Postgres full-text).
- Vector index (pgvector / Pinecone / Weaviate / Chroma) using a 768- or 1536-dim embedding model.
- Run both for every query, retrieve top 50 from each, fuse with RRF.
- Rerank the top 100 candidates with a cross-encoder.
- Send the top 5-10 to the LLM.
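The whole stack wires together as one query function. A sketch where the three callables are stand-ins for real index clients and a cross-encoder (names and signatures here are illustrative, not any particular library's API):

```python
def hybrid_query(query, bm25_search, vector_search, rerank_fn,
                 k_retrieve=50, k_final=10):
    """Retrieve from both indexes, fuse with RRF, rerank, return top slice.

    bm25_search / vector_search: (query, k) -> ranked list of doc ids.
    rerank_fn: (query, candidates) -> reranked list of doc ids.
    """
    kw_hits = bm25_search(query, k_retrieve)
    sem_hits = vector_search(query, k_retrieve)
    # Reciprocal rank fusion with the conventional k = 60 constant.
    scores = {}
    for ranking in (kw_hits, sem_hits):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (60 + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    # Rerank up to 100 fused candidates, keep the slice for the LLM prompt.
    return rerank_fn(query, fused[:100])[:k_final]
```

Because each stage is behind a plain callable, the components can be swapped or scaled independently, which is the practical argument for the hybrid shape.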
This shape works at small scale (a thousand documents) and large scale (tens of millions). The components scale independently. Pure semantic or pure keyword setups eventually hit ceilings the hybrid stack doesn’t.