Vector Search at Scale: Beyond pgvector
pgvector handles a million vectors fine. At ten million it slows. At a hundred million it stops. Here is what changes architecturally past the pgvector ceiling.
The pgvector ceiling
pgvector is excellent for the first million vectors. Above that, three things break:
- Query latency: HNSW indexes in pgvector slow noticeably above ~5M vectors. p95 latencies climb from 5ms to 50ms+.
- Memory pressure: vector indexes are memory-heavy. A 10M-vector 1536-dim HNSW index is ~120 GB resident. Postgres wasn’t designed for this footprint.
- Concurrent writes: HNSW inserts are incremental but expensive, and sustained high-rate insertion degrades the graph enough that periodic rebuilds become necessary. Writes block reads more than you’d expect.
If you’re still under 10M vectors and not adding fast, pgvector is fine. Above that, dedicated infrastructure usually wins.
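The memory numbers above come from simple arithmetic worth making explicit. A minimal sketch — the 2-3x overhead multiplier is a common rule of thumb for HNSW graph links and allocator slack, not a pgvector-specific constant:

```python
def hnsw_memory_estimate_gb(n_vectors, dims, overhead=2.0):
    """Rough HNSW resident-size estimate: raw float32 vectors times an
    overhead factor covering graph links and allocator slack (2-3x raw
    is a common rule of thumb)."""
    raw_bytes = n_vectors * dims * 4  # 4 bytes per float32 dimension
    return raw_bytes * overhead / 1e9

# 10M vectors at 1536 dims: ~61 GB raw, ~123 GB with 2x index overhead
print(round(hnsw_memory_estimate_gb(10_000_000, 1536), 1))  # → 122.9
```

Plug in your own corpus size and dimensionality before committing to a deployment; the overhead factor is the number to validate empirically.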
Index types beyond flat
The three index families you’ll see at scale:
- HNSW (Hierarchical Navigable Small World): graph-based, excellent recall (95-99%) at low latency. Memory-hungry. Default in Pinecone, Milvus, Qdrant, pgvector.
- IVF + PQ (Inverted File + Product Quantisation): partition vectors into clusters, search the nearest few clusters. Compresses vectors to fit more in memory. Slightly lower recall, much smaller footprint. Used in FAISS, Milvus.
- DiskANN: disk-based ANN that targets billion-scale collections. Index lives on SSD, not RAM. Microsoft’s research; in production at Bing scale.
For 1M-100M vectors with strict latency targets, HNSW is usually right. For 100M-1B+ with cost pressure, IVF+PQ or DiskANN.
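The IVF idea — partition vectors into clusters, then search only the nearest few clusters — is simple enough to sketch without a library. This is a toy illustration in plain NumPy; real systems like FAISS and Milvus add product quantisation on top and use far better k-means, but the partitioning logic is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

def build_ivf(vectors, n_clusters=8, iters=5):
    """Toy IVF index: k-means centroids plus inverted lists mapping each
    cluster to the ids of its member vectors."""
    centroids = vectors[rng.choice(len(vectors), n_clusters, replace=False)].copy()
    for _ in range(iters):  # plain Lloyd's k-means
        assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
        for c in range(n_clusters):
            members = vectors[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    # final assignment against the final centroids
    assign = np.argmin(np.linalg.norm(vectors[:, None] - centroids[None], axis=2), axis=1)
    lists = {c: np.where(assign == c)[0] for c in range(n_clusters)}
    return centroids, lists

def ivf_search(query, vectors, centroids, lists, nprobe=2, k=3):
    """Search only the `nprobe` clusters whose centroids are nearest."""
    near = np.argsort(np.linalg.norm(centroids - query, axis=1))[:nprobe]
    cand = np.concatenate([lists[c] for c in near])
    order = np.argsort(np.linalg.norm(vectors[cand] - query, axis=1))[:k]
    return cand[order]

vecs = rng.standard_normal((1000, 32)).astype(np.float32)
centroids, lists = build_ivf(vecs)
print(ivf_search(vecs[0], vecs, centroids, lists))  # vecs[0] itself ranks first (distance 0)
```

`nprobe` is the recall/latency dial: probing more clusters recovers recall lost to partitioning at the cost of scanning more candidates.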
Sharding strategies
Beyond a single node’s memory, you shard. Two strategies:
Random sharding: distribute vectors uniformly across N shards. Each query fans out to all shards and the partial results are merged. Simple to operate, and capacity scales linearly — but every query pays for all N shards, so query cost grows with N.
Filtered sharding: shard by a metadata key (tenant ID, language, document type). Each query routes to the relevant shard only. More efficient if your queries are predictably scoped. Trickier to balance.
Most production systems use random sharding for vectors and rely on metadata filters within each shard for query scoping. Filtered sharding shows up at very large multi-tenant SaaS scale where one tenant’s search shouldn’t pay for another’s.
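The random-sharding query path is a scatter-gather: fan out to every shard, take each shard's top-k, merge globally. A minimal sketch — the `Shard` class here is a brute-force stand-in for a per-node ANN index, and its `search(query, k)` interface is a hypothetical API, not any particular product's:

```python
import heapq

class Shard:
    """Toy shard: brute-force squared-distance scan over its own vectors
    (stand-in for an HNSW index on one node)."""
    def __init__(self, vectors):
        self.vectors = vectors  # {doc_id: vector as tuple}

    def search(self, query, k):
        scored = [(sum((a - b) ** 2 for a, b in zip(v, query)), doc_id)
                  for doc_id, v in self.vectors.items()]
        return heapq.nsmallest(k, scored)

def fanout_search(query, shards, k=3):
    """Random-sharding query path: ask every shard for its top-k, then
    merge the partial results into a global top-k."""
    partials = [hit for shard in shards for hit in shard.search(query, k)]
    return heapq.nsmallest(k, partials)

# two shards holding disjoint random halves of the corpus
s1 = Shard({1: (0.0, 0.0), 2: (5.0, 5.0)})
s2 = Shard({3: (0.1, 0.0), 4: (9.0, 9.0)})
print(fanout_search((0.0, 0.0), [s1, s2]))
```

Note that each shard must return a full top-k, not top-k/N: the global winners could all live on one shard.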
Production options at scale
The shortlist for 10M+ vectors in 2025:
- Pinecone: managed, serverless, scales to billions. Easy. Expensive at the high end.
- Milvus: open-source, distributed, GPU-accelerated. Strong feature set; operational complexity is real.
- Qdrant: open-source, fast, good developer ergonomics. Single-binary self-host or managed.
- Weaviate: open-source, strong hybrid search and metadata filtering. Modular.
- Vespa: built for very large scale (originally Yahoo’s search). Steep learning curve, unmatched at billions of vectors.
For most teams under 100M vectors: Pinecone if you have the budget, Qdrant if you'd rather self-host. For 100M+ or multi-modal workloads: Milvus or Vespa.
Cost dynamics at scale
Three line items dominate vector-search cost:
- Memory: HNSW indexes are typically 2-3x the size of raw vectors. 100M vectors × 1536 dims × 4 bytes ≈ 600 GB raw; 1.2-1.8 TB indexed.
- Embedding cost: re-embedding 100M docs to test a new model is non-trivial. ~$1k-3k at OpenAI rates. Self-hosted embedding GPUs amortise across rebuilds.
- Query throughput: vector search is CPU-and-memory bound, not network-bound. Throughput scales with shards.
Quantising vectors (PQ to 32-64 bytes per vector) is the single biggest memory lever. Recall drops 1-2%, footprint drops 10-20x.
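The arithmetic behind those numbers, as a sketch (the 2.5x index overhead is an assumed midpoint of the 2-3x range above):

```python
def footprint_gb(n, dims=1536, bytes_per_dim=4, index_overhead=2.5):
    """Raw and indexed sizes for float32 vectors; overhead covers HNSW
    graph links (assumed 2.5x, the midpoint of the usual 2-3x range)."""
    raw = n * dims * bytes_per_dim / 1e9
    return raw, raw * index_overhead

def pq_footprint_gb(n, code_bytes=64):
    """Product-quantised codes only; codebooks are negligible by comparison."""
    return n * code_bytes / 1e9

raw, indexed = footprint_gb(100_000_000)
print(round(raw), round(indexed))            # → 614 1536
print(round(pq_footprint_gb(100_000_000)))   # → 6
```

The PQ codes alone compress ~100x, but a real IVF+PQ index keeps extra structures (coarse centroids, inverted lists, often raw vectors for re-ranking), which is why realized end-to-end savings land closer to the 10-20x quoted above.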
Migrating without downtime
The standard pattern when moving from pgvector to a dedicated store:
- Set up the new store. Backfill embeddings via a batch job.
- Add a feature flag: each query goes to both stores. Compare results in logs.
- Once you trust the new store, switch the read path. Keep dual-writing for a week.
- Cut over writes. Decommission pgvector.
The dual-write window is the critical step. Skipping it means you discover the new store's edge cases in production with no rollback path. A week or two of dual-write costs a few hundred dollars and saves you from a six-figure outage.
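The comparison step in the flagged read path can be sketched as follows. The store objects and their `search(query, k) -> list of doc ids` interface are hypothetical, and the 0.9 overlap threshold is an assumed starting point to tune against your recall target:

```python
import logging

log = logging.getLogger("vector-migration")

def dual_read(query, old_store, new_store, k=10, serve_from_new=False):
    """Feature-flagged dual read: query both stores, log divergence,
    serve from whichever side the flag selects."""
    old_ids = old_store.search(query, k)
    new_ids = new_store.search(query, k)
    overlap = len(set(old_ids) & set(new_ids)) / max(k, 1)
    if overlap < 0.9:  # assumed threshold; tune to your recall target
        log.warning("stores diverge: overlap=%.2f query=%r", overlap, query)
    return new_ids if serve_from_new else old_ids

class _Stub:
    """Minimal stand-in store for illustration."""
    def __init__(self, ids): self.ids = ids
    def search(self, query, k): return self.ids[:k]

old = _Stub([1, 2, 3, 4])
new = _Stub([1, 2, 3, 5])
print(dual_read("q", old, new, k=4))  # still serving from the old store
```

Flipping `serve_from_new` is the read-path cutover from the steps above; the shadow queries and divergence logs run unchanged on both sides of the flag.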