Embedding Models: Choosing One for Your First Project
Pick the wrong embedding model and your whole RAG pipeline produces mediocre results, regardless of how clever the rest of it is. Pick a sane default and you can ignore this for a year.
Embeddings, in one paragraph
An embedding model takes a piece of text (or image, or audio) and outputs a fixed-length vector of floating-point numbers. Texts with similar meaning produce similar vectors. The model itself is a neural network with a single concrete output: this 768-dimensional (or 1536, or whatever) vector.
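To make "similar vectors produce similar meaning" concrete, here is a minimal cosine-similarity sketch in plain Python. The vectors are toy 4-dimensional stand-ins for real model output (real models emit 384-3072 dimensions); the values are invented for illustration:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" — in practice these come from the model
cat = [0.9, 0.1, 0.05, 0.0]
kitten = [0.85, 0.15, 0.1, 0.0]
invoice = [0.0, 0.1, 0.9, 0.4]

print(cosine_similarity(cat, kitten))   # high: related concepts
print(cosine_similarity(cat, invoice))  # low: unrelated concepts
```

Every comparison in a RAG pipeline ultimately reduces to something like this: rank stored vectors by their similarity to the query vector.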
The quality of the embedding model upper-bounds the quality of your semantic search. A weak embedding model can’t be rescued by a fancier vector database or a smarter prompt.
API vs open-weight
You have two paths.
API embeddings: send text to an OpenAI, Cohere, Voyage, or similar endpoint and receive a vector. Pricing is per token, typically $0.02-0.13 per million tokens. Latency is 100-500ms depending on the provider and load. The vendor manages the model and updates it.
Open-weight embeddings: download a model (BGE, sentence-transformers, gte, e5) and run it yourself on CPU or GPU. No per-call cost, but you pay for the hardware and operational overhead. The leading open-weight models in 2025 are competitive with the best APIs on most benchmarks.
For under 100K embeddings/day, APIs are usually cheaper than running your own GPU. Above that, self-hosting moves toward break-even and, at sustained high volume, wins.
Dimensions: 384 vs 1536 vs 3072
Higher dimensions can encode more nuance but cost more. Practical tradeoffs:
- 384 dimensions (e.g., bge-small, all-MiniLM-L6-v2): tiny, fast, surprisingly good for most semantic tasks. Default for tight budgets.
- 768 dimensions (e.g., bge-base, instructor-base): a sweet spot for quality vs cost. Most production systems use this range.
- 1024-1536 dimensions (e.g., text-embedding-3-small, bge-large): noticeable quality bump on hard tasks like long-document search and cross-lingual matching.
- 3072 dimensions (text-embedding-3-large, voyage-3-large): the top of the line. Useful when accuracy matters more than speed/cost.
Storage and query time scale roughly linearly with dimension. A 3072-dim index is 4× the storage of a 768-dim index. For 10M float32 vectors that’s roughly 123GB vs 31GB. Not a deal-breaker, but real.
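The storage figures above fall out of a one-line calculation: vectors × dimensions × bytes per value (4 for float32; halved if your store supports float16). A sketch, ignoring index overhead such as HNSW graph links:

```python
def index_size_gb(num_vectors, dims, bytes_per_value=4):
    """Raw vector storage for a flat index; real indexes add some overhead."""
    return num_vectors * dims * bytes_per_value / 1e9

for dims in (384, 768, 1536, 3072):
    print(f"{dims:>5} dims: {index_size_gb(10_000_000, dims):6.1f} GB")
```

At 10M vectors this prints roughly 15.4, 30.7, 61.4, and 122.9 GB, which is why dimension choice matters more for storage budgeting than for any single query.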
English-only vs multilingual
If your content or your users span multiple languages, use a multilingual model. Otherwise, English-only models are typically slightly better on English-only tasks, because their training data was concentrated on English.
Most modern API embeddings (OpenAI, Cohere) are multilingual by default. Open-weight options for multilingual: bge-m3, multilingual-e5-large, paraphrase-multilingual-mpnet.
Watch for cross-lingual quality. A multilingual model that handles English, Spanish, and French well might fall apart on Korean or Vietnamese. Test on actual data from your top languages before committing.
Domain-specific models
For specialised content, generic embeddings underperform. Three domains where specialised models often win:
- Code: StarEncoder, codet5p-embedding, Voyage-code-3. Trained on programming languages, much better at code search than general text models.
- Medical / scientific: BioBERT, MedCPT, sci-bge. Better at PubMed-style content.
- Legal: Voyage-law-2, legal-bert. Better at case-law and contract retrieval.
The benefit is usually 10-25% better recall on domain tasks. Not huge, but free if you’re already in that domain.
The starter default
If you’re starting today and your content is English-language general text, default to OpenAI’s text-embedding-3-small. It’s 1536-dim, $0.02 per million tokens, multilingual-capable, and fast. It will not be the absolute best on any benchmark, but it will be solidly above average everywhere.
Migrate later if you need to. The decision is reversible: re-embed your corpus with the new model and update the vector store. The migration is a weekend, not a quarter.
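The whole migration is one batched loop. A minimal sketch, where `embed_batch` and `upsert` are placeholders for your new model's encode call and your vector store's write call (names assumed here, not from any particular library):

```python
def reembed_corpus(docs, embed_batch, upsert, batch_size=64):
    """Re-embed every document with the new model and overwrite the old vectors.

    `embed_batch` and `upsert` are caller-supplied: the new model's batch
    encode function and the vector store's write function.
    """
    for i in range(0, len(docs), batch_size):
        batch = docs[i : i + batch_size]
        vectors = embed_batch([d["text"] for d in batch])
        upsert([(d["id"], vec) for d, vec in zip(batch, vectors)])

# Stub demo: a fake "model" and an in-memory "store" to show the shape
store = {}
docs = [{"id": f"doc{i}", "text": f"text {i}"} for i in range(5)]
reembed_corpus(docs,
               embed_batch=lambda texts: [[float(len(t))] for t in texts],
               upsert=lambda pairs: store.update(pairs))
print(len(store))  # → 5
```

The one thing you cannot do is mix vectors from two models in one index: the old and new embedding spaces are incompatible, so re-embed everything before switching queries over.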
Avoid the trap of researching for a month to pick the optimal model before you’ve seen any real query patterns. Ship with the default. Measure. If retrieval quality is the bottleneck, then evaluate alternatives. Most teams discover that retrieval quality wasn’t the bottleneck after all.