Reranking in RAG: The Step Most Pipelines Skip
Embedding-based retrieval is fast and approximate. A reranker is slow and accurate. Putting them in series gives you both. Skipping the reranker is the most common reason RAG systems underperform.
The two-stage pattern
Vector retrieval finds 100 plausibly relevant chunks in milliseconds. A reranker scores each candidate more carefully and reorders them, returning the top 5-10. The first stage is recall-oriented; the second is precision-oriented.
The math: a bi-encoder (the embedding model used for retrieval) computes embeddings for query and document independently, then compares them with cosine similarity. Fast (cacheable), approximate. A cross-encoder takes (query, document) jointly as input to a single model and outputs a relevance score. Slow (per-pair), accurate.
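To make the contrast concrete, here is a minimal sketch of the two scoring paths. `embed` and `cross_encoder_score` are toy stand-ins for real models (word hashing and word overlap, respectively), not real APIs; the point is the shape of each computation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy bi-encoder: hash words into a fixed-size unit vector."""
    vec = np.zeros(64)
    for word in text.lower().split():
        vec[hash(word) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

def bi_encoder_score(query: str, doc: str) -> float:
    # Query and document are embedded independently; only the vectors
    # meet, via cosine similarity. Document vectors are cacheable.
    return float(embed(query) @ embed(doc))

def cross_encoder_score(query: str, doc: str) -> float:
    # Toy cross-encoder: the "model" sees the pair jointly, so it can
    # score query-document interactions directly. Here: word overlap.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

query = "how do I rotate an API key"
docs = ["rotating API keys step by step", "the history of cryptography"]
# One cross-encoder forward pass per (query, document) pair.
print(sorted(docs, key=lambda d: cross_encoder_score(query, d), reverse=True)[0])
```

The bi-encoder's document vectors can be precomputed once per corpus; the cross-encoder score cannot, which is exactly the speed/accuracy trade the two-stage pattern exploits.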
Cross-encoder vs bi-encoder
Cross-encoders see the query and document as one input. The model can reason about specific overlaps, contradictions, and nuances. Bi-encoders never see the pair together, so they can’t catch query-document interactions.
The accuracy gap is significant. On standard retrieval benchmarks, adding a cross-encoder reranker to a strong bi-encoder retriever improves NDCG@10 by 5-15 points. That’s the difference between “our search is mediocre” and “our search just works.”
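For reference, NDCG@10 is computable in a few lines. The relevance labels below are hypothetical, just to show how reordering the same documents moves the metric:

```python
import math

def ndcg_at_k(relevances, k=10):
    """NDCG@k for a ranked list of graded relevance labels (higher is better)."""
    def dcg(rels):
        # Discounted cumulative gain: relevance discounted by log2 of rank.
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical labels for the same 10 docs: bi-encoder order vs reranked order.
before = [0, 1, 0, 2, 0, 1, 0, 0, 2, 1]
after  = [2, 2, 1, 1, 1, 0, 0, 0, 0, 0]
print(ndcg_at_k(after) - ndcg_at_k(before))
```

A perfectly ordered list scores 1.0; the reranker's job is to close the gap between the retriever's order and that ideal.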
Reranker options in 2025
- Cohere Rerank: API-based. Best accuracy of the commercial options. ~$2 per 1000 searches.
- Voyage Rerank: API-based. Domain-tuned variants for code, law, medical.
- BGE-reranker-v2: open-weight. Multiple sizes (base/large/m3). Strong English and multilingual.
- jina-reranker-v2: open-weight. Tight, fast, multilingual.
For most teams: start with Cohere Rerank for ergonomics, and migrate to self-hosted BGE-reranker-v2 once API costs exceed self-hosting costs (typically above 10M reranks/month).
Latency math
A cross-encoder evaluates one (query, document) pair at a time. For 100 candidates: 100 forward passes. With batching on a GPU, this is ~50-150ms total. On CPU it’s 500ms-2s.
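The batching arithmetic behind those numbers, with assumed per-batch latencies (the ~25ms GPU and ~250ms CPU figures are illustrative, not benchmarks):

```python
import math

def rerank_latency_ms(n_candidates: int, batch_size: int, ms_per_batch: float) -> float:
    """Candidates are scored in ceil(n / batch_size) batched forward passes."""
    return math.ceil(n_candidates / batch_size) * ms_per_batch

# Assumed: a batch of 32 pairs takes ~25ms on GPU, ~250ms on CPU.
print(rerank_latency_ms(100, 32, 25))   # 4 batches on GPU -> 100ms
print(rerank_latency_ms(100, 32, 250))  # same 4 batches on CPU -> 1000ms
```

Note the ceiling: 100 candidates at batch size 32 costs the same as 128, which is one reason "rerank 50 instead of 100" below can halve latency outright.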
API providers batch internally and typically return in 100-300ms: faster than self-hosted CPU, comparable to self-hosted GPU.
Three latency tricks:
- Rerank fewer candidates: 50 instead of 100 loses almost no recall.
- Use a smaller reranker model for early rerank, larger for final top-10.
- Cache reranks for repeated query+doc pairs (rare in chat, common in batch workloads).
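The caching trick is a one-decorator change. A sketch, where `score_pair` stands in for the real reranker call and the cache key is `(query, doc_id)` rather than the full document text:

```python
from functools import lru_cache

@lru_cache(maxsize=100_000)
def cached_rerank_score(query: str, doc_id: str) -> float:
    # Keyed on (query, doc_id) so the key stays small; assumes doc_id -> text
    # is stable. score_pair is a stand-in for the real reranker call.
    return score_pair(query, doc_id)

calls = 0
def score_pair(query: str, doc_id: str) -> float:
    global calls
    calls += 1                               # count actual "model" invocations
    return float(len(query) + len(doc_id))   # toy score

cached_rerank_score("q1", "docA")
cached_rerank_score("q1", "docA")  # cache hit: no second model call
print(calls)
```

As the list notes, hit rates are poor for free-form chat queries but high for batch workloads that rescore the same corpus repeatedly.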
How to add one without breaking the pipeline
- Get top 50-100 from your existing retriever.
- Send to the reranker as a single batch.
- Take the top 5-10 by reranker score.
- Pass to the LLM as context.
That’s it. Three lines of code beyond what you already have. Run an A/B comparing the rerank-on vs rerank-off versions. The improvement is usually obvious in the first 50 queries.
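The four steps above can be sketched end to end. `vector_search` and `rerank` are placeholders for your existing retriever and whatever reranker client you pick; the toy stand-ins below exist only so the sketch runs:

```python
def rerank_pipeline(query, vector_search, rerank, n_candidates=50, top_k=5):
    """Two-stage retrieval: recall-oriented search, precision-oriented rerank."""
    candidates = vector_search(query, limit=n_candidates)    # 1. top 50-100
    scores = rerank(query, [c["text"] for c in candidates])  # 2. one batch
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]                    # 3-4. top-k to LLM

# Toy stand-ins so the sketch runs without a vector DB or reranker.
corpus = [{"text": t} for t in
          ["reset a password", "rotate an api key", "api key rotation guide"]]
fake_search = lambda q, limit: corpus[:limit]
fake_rerank = lambda q, texts: [len(set(q.split()) & set(t.split())) for t in texts]

print(rerank_pipeline("rotate api key", fake_search, fake_rerank, top_k=2)[0]["text"])
```

Swapping `fake_rerank` for a real client is the "three lines of code": call it, sort by its scores, truncate.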
If you skip reranking, you’re leaving easy quality on the table. The component is well-understood, fast enough, and trivially integrated. There’s no good reason for a production RAG system in 2025 to ship without it.