Reranking in RAG: The Step Most Pipelines Skip
Embedding-based retrieval is fast and approximate. A reranker is slow and accurate. Putting them in series gives you both. Skipping the reranker is the most common reason RAG systems underperform.
The two-stage pattern
Vector retrieval finds 100 plausibly relevant chunks in milliseconds. A reranker scores each candidate more carefully and reorders them, returning the top 5-10. The first stage is recall-oriented; the second is precision-oriented.
The math: a bi-encoder (the embedding model used for retrieval) computes embeddings for query and document independently, then compares them with cosine similarity. Fast (cacheable), approximate. A cross-encoder takes (query, document) jointly as input to a single model and outputs a relevance score. Slow (per-pair), accurate.
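To make the contrast concrete, here is a minimal sketch of the two scoring paths. `embed` and `cross_encoder_score` are toy stand-ins for real models (word hashing and word overlap, respectively), not real APIs; the point is the shape of each computation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy bi-encoder: hash words into a fixed-size unit vector."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def bi_encoder_score(query: str, doc: str) -> float:
    # Query and document are embedded independently; only the vectors
    # meet, via cosine similarity. Document vectors are cacheable.
    return float(embed(query) @ embed(doc))

def cross_encoder_score(query: str, doc: str) -> float:
    # Toy cross-encoder: the "model" sees the pair jointly, so it can
    # score query-document interactions directly. Here: word overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

query = "how do I rotate an API key"
docs = ["rotating API keys step by step", "the history of cryptography"]
# One cross-encoder forward pass per (query, document) pair.
print(sorted(docs, key=lambda d: cross_encoder_score(query, d), reverse=True)[0])
```

The bi-encoder's document vectors can be precomputed once per corpus; the cross-encoder score cannot, which is exactly the speed/accuracy trade the two-stage pattern exploits.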
Cross-encoder vs bi-encoder
Cross-encoders see the query and document as one input. The model can reason about specific overlaps, contradictions, and nuances. Bi-encoders never see the pair together, so they can’t catch query-document interactions.
The accuracy gap is significant. On standard retrieval benchmarks, adding a cross-encoder reranker to a strong bi-encoder retriever improves NDCG@10 by 5-15 points. That’s the difference between “our search is mediocre” and “our search just works.”
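For reference, NDCG@10 is computable in a few lines. The relevance labels below are hypothetical, just to show how reordering the same documents moves the metric:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevance labels (higher is better)."""
    def dcg(rels):
        # Discounted cumulative gain: relevance discounted by log2 of rank.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical labels for the same 10 docs: bi-encoder order vs reranked order.
before = [0, 1, 0, 2, 0, 1, 0, 0, 2, 1]
after  = [2, 2, 1, 1, 1, 0, 0, 0, 0, 0]
print(ndcg_at_k(after) - ndcg_at_k(before))
```

A perfectly ordered list scores 1.0; the reranker's job is to close the gap between the retriever's order and that ideal.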
Reranker options in 2025
- Cohere Rerank: API-based. Best accuracy of the commercial options. ~$2 per 1000 searches.
- Voyage Rerank: API-based. Domain-tuned variants for code, law, medical.
- BGE-reranker-v2: open-weight. Multiple sizes (base/large/m3). Strong English and multilingual.
- jina-reranker-v2: open-weight. Tight, fast, multilingual.
For most teams: start with Cohere Rerank for ergonomics, and migrate to self-hosted BGE-reranker-v2 once API costs exceed self-hosting costs (typically above 10M reranks/month).
Latency math
A cross-encoder evaluates one (query, document) pair at a time. For 100 candidates: 100 forward passes. With batching on a GPU, this is ~50-150ms total. On CPU it’s 500ms-2s.
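The batching arithmetic behind those numbers, with assumed per-batch latencies (the ~25ms GPU and ~250ms CPU figures are illustrative, not benchmarks):

```python
import math

def rerank_latency_ms(n_candidates: int, batch_size: int, ms_per_batch: float) -> float:
    """Candidates are scored in ceil(n / batch_size) batched forward passes."""
    return math.ceil(n_candidates / batch_size) * ms_per_batch

# Assumed: a batch of 32 pairs takes ~25ms on GPU, ~250ms on CPU.
print(rerank_latency_ms(100, 32, 25))   # 4 batches on GPU -> 100ms
print(rerank_latency_ms(100, 32, 250))  # same 4 batches on CPU -> 1000ms
```

Note the ceiling: 100 candidates at batch size 32 costs the same as 128, which is one reason "rerank 50 instead of 100" below can halve latency outright.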
API providers batch internally and typically return in 100-300ms: faster than self-hosted CPU, comparable to self-hosted GPU.
Three latency tricks:
- Rerank fewer candidates: 50 instead of 100 loses almost no recall.
- Use a smaller reranker model for early rerank, larger for final top-10.
- Cache reranks for repeated query+doc pairs (rare in chat, common in batch workloads).
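The caching trick is a one-decorator change. A sketch, where `score_pair` stands in for the real reranker call and the cache key is `(query, doc_id)` rather than the full document text:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_rerank_score(query: str, doc_id: str) -> float:
    # Keyed on (query, doc_id) so the key stays small; assumes doc_id -> text
    # is stable. score_pair is a stand-in for the real reranker call.
    return score_pair(query, doc_id)

calls = 0
def score_pair(query: str, doc_id: str) -> float:
    global calls
    calls += 1                               # count actual "model" invocations
    return float(len(query) + len(doc_id))   # toy score

cached_rerank_score("q1", "docA")
cached_rerank_score("q1", "docA")  # cache hit: no second model call
print(calls)
```

As the list notes, hit rates are poor for free-form chat queries but high for batch workloads that rescore the same corpus repeatedly.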
How to add one without breaking the pipeline
- Get top 50-100 from your existing retriever.
- Send to the reranker as a single batch.
- Take the top 5-10 by reranker score.
- Pass to the LLM as context.
That’s it. Three lines of code beyond what you already have. Run an A/B comparing the rerank-on vs rerank-off versions. The improvement is usually obvious in the first 50 queries.
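The four steps above can be sketched end to end. `vector_search` and `rerank` are placeholders for your existing retriever and whatever reranker client you pick; the toy stand-ins below exist only so the sketch runs:

```python
def rerank_pipeline(query, vector_search, rerank, n_candidates=50, top_k=5):
    """Two-stage retrieval: recall-oriented search, precision-oriented rerank."""
    candidates = vector_search(query, limit=n_candidates)    # 1. top 50-100
    scores = rerank(query, [c["text"] for c in candidates])  # 2. one batch
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]                    # 3-4. top-k to LLM

# Toy stand-ins so the sketch runs without a vector DB or reranker.
corpus = [{"text": t} for t in
          ["reset a password", "rotate an api key", "api key rotation guide"]]
fake_search = lambda q, limit: corpus[:limit]
fake_rerank = lambda q, texts: [len(set(q.split()) & set(t.split())) for t in texts]

print(rerank_pipeline("rotate api key", fake_search, fake_rerank, top_k=2)[0]["text"])
```

Swapping `fake_rerank` for a real client is the "three lines of code": call it, sort by its scores, truncate.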
If you skip reranking, you’re leaving easy quality on the table. The component is well-understood, fast enough, and trivially integrated. There’s no good reason for a production RAG system in 2025 to ship without it.