AI & ML · Intermediate · By Samson Tanimawo, PhD · Published Jun 17, 2025 · 9 min read

Vector Search at Scale: Beyond pgvector

pgvector handles a million vectors fine. At ten million it slows. At a hundred million it stops. Here is what changes architecturally past the pgvector ceiling.

The pgvector ceiling

pgvector is excellent for the first million vectors. Above that, three things break:

  - Index builds: rebuilding an HNSW index over tens of millions of vectors takes hours, and the build competes with your transactional workload on the same machine.
  - Memory: the HNSW graph wants to live in RAM, and at 10M+ high-dimensional vectors that is tens of gigabytes sitting on your primary Postgres box.
  - Query latency: ANN queries share buffers, CPU, and vacuum pressure with everything else in the database, so p99 latency degrades under concurrent load.

If you’re still under 10M vectors and not adding fast, pgvector is fine. Above that, dedicated infrastructure usually wins.
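The memory ceiling is easy to estimate with back-of-the-envelope arithmetic. A minimal sketch (the 2*M-links-per-node rule of thumb is an approximation; real indexes add per-node overhead on top):

```python
def hnsw_ram_gb(n_vectors: int, dim: int, m: int = 16) -> float:
    """Rough RAM estimate for an HNSW index over float32 vectors.

    Each vector stores its raw floats (4 bytes each) plus roughly
    2 * m graph links of 4 bytes each.
    """
    per_vector = dim * 4 + 2 * m * 4
    return n_vectors * per_vector / 1e9

# 1M 1536-dim vectors: ~6 GB -- fits comfortably on one node.
# 100M of the same:   ~630 GB -- far past a single Postgres box.
```

The same arithmetic explains why quantisation (discussed below) matters: it attacks the `dim * 4` term, which dominates at high dimensions.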

Index types beyond flat

The three index families you’ll see at scale:

  - HNSW: a navigable graph over the vectors, searched greedily. Best recall/latency trade-off, but the graph lives in RAM.
  - IVF (+PQ): vectors are clustered into inverted lists; a query probes only the few closest lists. Adding product quantisation compresses each vector to a handful of bytes.
  - DiskANN: a graph index laid out for SSDs, keeping only a compressed sketch in RAM. Trades some latency for a much smaller memory bill.

For 1M-100M vectors with strict latency targets, HNSW is usually right. For 100M-1B+ with cost pressure, IVF+PQ or DiskANN.
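The IVF idea fits in a few lines of plain Python. This is an illustrative toy, not a production index: real systems train 256+ centroids with optimised k-means and SIMD distance kernels, but the probe-only-a-few-lists structure is the same.

```python
import heapq
import random

def l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iters=8, seed=0):
    # Tiny k-means to learn the coarse quantizer (cluster centroids).
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in vectors:
            buckets[min(range(k), key=lambda i: l2(v, centroids[i]))].append(v)
        for i, b in enumerate(buckets):
            if b:
                dim = len(b[0])
                centroids[i] = [sum(v[d] for v in b) / len(b) for d in range(dim)]
    return centroids

def build_ivf(vectors, n_lists):
    # Assign every vector to its nearest centroid's inverted list.
    centroids = kmeans(vectors, n_lists)
    lists = [[] for _ in range(n_lists)]
    for idx, v in enumerate(vectors):
        lists[min(range(n_lists), key=lambda i: l2(v, centroids[i]))].append((idx, v))
    return centroids, lists

def ivf_search(query, centroids, lists, k=5, nprobe=2):
    # Probe only the nprobe closest lists instead of scanning everything.
    probe = heapq.nsmallest(nprobe, range(len(centroids)),
                            key=lambda i: l2(query, centroids[i]))
    candidates = [(l2(query, v), idx) for i in probe for idx, v in lists[i]]
    return [idx for _, idx in heapq.nsmallest(k, candidates)]
```

The `nprobe` knob is the recall/cost dial: probe every list and you get exact brute-force results; probe two of fifty and each query touches ~4% of the data.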

Sharding strategies

Beyond a single node’s memory, you shard. Two strategies:

Random sharding: distribute vectors uniformly across N shards. Each query fans out to all shards and the per-shard results are merged. Simple, and it scales linearly, but every query pays the full fan-out, so query cost grows with N.

Filtered sharding: shard by a metadata key (tenant ID, language, document type). Each query routes to the relevant shard only. More efficient if your queries are predictably scoped. Trickier to balance.

Most production systems use random sharding for vectors and rely on metadata filters within each shard for query scoping. Filtered sharding shows up at very large multi-tenant SaaS scale where one tenant’s search shouldn’t pay for another’s.
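The random-sharding fan-out/merge described above can be sketched in a few lines. The brute-force `search_shard` stands in for whatever ANN index each shard actually runs; the merge step is the part that matters, since the global top-k is always contained in the union of per-shard top-k results.

```python
import heapq

def shard_for(vec_id: int, n_shards: int) -> int:
    # Random sharding: hash the vector id to pick a shard.
    return hash(vec_id) % n_shards

def search_shard(shard, query, k):
    # Stand-in for the shard's ANN index: brute-force L2 top-k.
    scored = [(sum((a - b) ** 2 for a, b in zip(vec, query)), vid)
              for vid, vec in shard]
    return heapq.nsmallest(k, scored)

def fanout_search(shards, query, k):
    # Query every shard, then merge per-shard top-k into a global top-k.
    partials = [r for shard in shards for r in search_shard(shard, query, k)]
    return [vid for _, vid in heapq.nsmallest(k, partials)]
```

Filtered sharding replaces `shard_for` with a lookup on a metadata key (tenant ID, language) and sends the query to one shard instead of all of them.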

Production options at scale

The shortlist for 10M+ vectors in 2025:

  - Pinecone: fully managed, serverless pricing, least operational work.
  - Qdrant: open-source, written in Rust, easy to self-host, strong metadata filtering.
  - Milvus: open-source, built for horizontal scale-out to billion-vector collections.
  - Vespa: a full search engine combining vectors, text ranking, and multi-stage retrieval.

For most teams <100M vectors: Pinecone if budget, Qdrant if self-host. For 100M+ or multi-modal: Milvus or Vespa.

Cost dynamics at scale

Three line items dominate vector-search cost:

  - RAM: in-memory indexes like HNSW bill you for every byte of every vector, all the time.
  - Compute: fan-out queries mean every shard burns CPU on every search.
  - Re-indexing: ingest and index rebuilds are bursty and expensive at high write rates.

Quantising vectors (PQ to 32-64 bytes per vector) is the single biggest memory lever. Recall drops 1-2%; the footprint shrinks 10-20x.
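The mechanics of PQ are simple even if the training isn't: split each vector into m sub-vectors and replace each sub-vector with the index of its nearest codebook entry, so the vector becomes m bytes. A minimal sketch (real PQ trains 256 centroids per sub-space with k-means; this toy samples centroids from the data to stay short):

```python
import random

def train_pq_codebooks(vectors, m, n_centroids=16, seed=0):
    # One tiny codebook per sub-space; centroids sampled, not trained.
    rng = random.Random(seed)
    d = len(vectors[0]) // m
    subs = [[v[i * d:(i + 1) * d] for v in vectors] for i in range(m)]
    return [rng.sample(s, n_centroids) for s in subs]

def pq_encode(v, codebooks):
    # Replace each sub-vector with the id of its nearest centroid.
    d = len(v) // len(codebooks)
    codes = []
    for i, book in enumerate(codebooks):
        sub = v[i * d:(i + 1) * d]
        codes.append(min(range(len(book)),
                         key=lambda c: sum((a - b) ** 2
                                           for a, b in zip(sub, book[c]))))
    return bytes(codes)  # m bytes instead of len(v) * 4

def pq_decode(codes, codebooks):
    # Approximate reconstruction: concatenate the chosen centroids.
    out = []
    for c, book in zip(codes, codebooks):
        out.extend(book[c])
    return out
```

A 1536-dim float32 vector is 6,144 bytes; with m=64 sub-quantizers it encodes to 64 bytes. The recall hit comes from searching against these lossy reconstructions instead of the originals.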

Migrating without downtime

The standard pattern when moving from pgvector to a dedicated store:

  1. Set up the new store. Start dual-writing new embeddings to both stores, then backfill historical embeddings via a batch job.
  2. Add a feature flag: each query goes to both stores. Compare results in logs.
  3. Once you trust the new store, switch the read path. Keep dual-writing for a week or two.
  4. Cut over writes. Decommission pgvector.
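Steps 2 and 3 can be wrapped in one small shim in the application layer. A sketch, where `old`, `new`, and their `search`/`upsert` methods are hypothetical interfaces standing in for your real store clients:

```python
import logging

class MigratingVectorStore:
    """Dual-read/dual-write wrapper for a zero-downtime store migration.

    `old` and `new` are any objects exposing search(query, k) and
    upsert(vid, vec) -- hypothetical interfaces for this sketch.
    """

    def __init__(self, old, new, read_from_new=False):
        self.old, self.new = old, new
        self.read_from_new = read_from_new  # the feature flag

    def upsert(self, vid, vec):
        # Dual-write: every write lands in both stores.
        self.old.upsert(vid, vec)
        self.new.upsert(vid, vec)

    def search(self, query, k):
        # Dual-read: query both, log disagreements, serve from one.
        old_ids = self.old.search(query, k)
        new_ids = self.new.search(query, k)
        if old_ids != new_ids:
            logging.warning("store mismatch: old=%s new=%s", old_ids, new_ids)
        return new_ids if self.read_from_new else old_ids
```

Flipping `read_from_new` is step 3's cutover; until then the mismatch logs are your evidence that the new store is (or isn't) ready.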

The dual-write window is the critical step. Skipping it means you discover the new store’s edge cases in production with no rollback path. A week or two of dual-writes costs a few hundred dollars and saves you from a six-figure outage.