Sean's Blog

Vector Databases - a benchmark

There is a flood of vector databases - which ones are actually useful? IMO, extending a relational DBMS that already gives you ACID compliance and holds your existing datasets is for most use cases the ideal choice. Using a dedicated vector DB (like Chroma, Turbopuffer, LanceDB, etc.) only makes sense for narrow use cases where no complicated metadata filters are needed (e.g. just simple RAG).

So let's have a look at how we can store and search vectors using Postgres. There are three extensions for this: pgvector, pgvectorscale, and VectorChord.

PGVector

The standard extension for storing vectors, with the common medium-scale ANN indices (HNSW & IVFFlat).

Store single-precision (32-bit float) vectors:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector(1024));

Store half-precision (16-bit float) vectors:

CREATE TABLE items (id bigserial PRIMARY KEY, embedding halfvec(1024));
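
Searching is done with pgvector's distance operators in ORDER BY (<-> for L2, <=> for cosine, <#> for negative inner product). A minimal nearest-neighbor query against the float32 table above, with the 1024-dimensional query vector passed as a parameter, could look like this:

-- 10 nearest neighbors by cosine distance ($1 is the query vector)
SELECT id, embedding <=> $1::vector AS distance
FROM items
ORDER BY embedding <=> $1::vector
LIMIT 10;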

HNSW

The most popular ANN index - delivering good retrieval and QPS performance, but using a lot of RAM for it (it crashed for 1 million vectors on my 16GB RAM machine).
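
For reference, building and tuning the HNSW index looks roughly like this (m and ef_construction shown at pgvector's defaults; hnsw.ef_search trades recall for query speed):

-- HNSW index using cosine distance
-- m = max connections per graph node, ef_construction = build-time candidate list size
CREATE INDEX ON items USING hnsw (embedding vector_cosine_ops) WITH (m = 16, ef_construction = 64);

-- Larger values give better recall but slower queries (default: 40)
SET hnsw.ef_search = 100;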

IVFFlat

Less RAM-hungry than HNSW, but with a caveat: IVFFlat result quality degrades as vectors are deleted and added.

"As data gets inserted or deleted from the index, if the index is not rebuilt, the IVFFlat index in pgvector can return incorrect approximate nearest neighbors due to clustering centroids no longer fitting the data well"

PGVectorScale

DiskANN

A promising ANN index that uses both RAM and disk (it needs a fast disk - SSD / NVMe) to scale to billions of vectors, keeping RAM usage low while still providing decent QPS. The problem is that pgvectorscale's index building implementation is single-core right now, leading to very long index build times.
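
Creating the index itself is simple (shown here with pgvectorscale's default graph parameters and cosine distance):

-- StreamingDiskANN index from pgvectorscale
CREATE INDEX ON items USING diskann (embedding vector_cosine_ops);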

PGVectorScale supports pre-filtering using bitfields, but it requires manually setting up a metadata table (complicated / bad dev UX).
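
Roughly, the label-based filtering works by adding a smallint[] column that is included in the index and matched with the array-overlap operator; you then have to map your actual metadata to these numeric labels yourself, which is where the awkward extra table comes in. A sketch (column names and label values are illustrative):

-- labels is a smallint[] column included in the diskann index
ALTER TABLE items ADD COLUMN labels smallint[];
CREATE INDEX ON items USING diskann (embedding vector_cosine_ops, labels);

-- pre-filtered search: && keeps only rows whose labels overlap the given set
SELECT id
FROM items
WHERE labels && ARRAY[1, 3]::smallint[]
ORDER BY embedding <=> $1::vector
LIMIT 10;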

VectorChord

VChordRQ

A custom ANN index with superior performance, combining an IVF index with RaBitQ quantization. It supports pre-filtering and is easy to use.
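
Index creation takes its build parameters as a TOML-style options block; roughly like this for L2 distance (the lists value should scale with the row count, and option names may differ between VectorChord versions):

-- IVF + RaBitQ index from VectorChord
CREATE INDEX ON items USING vchordrq (embedding vector_l2_ops) WITH (options = $$
residual_quantization = true
[build.internal]
lists = [1000]
$$);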

VChordG (DiskANN)

A novel addition (not production ready yet): a custom implementation of the DiskANN index combined with RaBitQ quantization.

Benchmark

The benchmark should reflect realistic usage - right now it only measures index build and query times.

In the future I want to extend it:

  • measure insertion performance after the initial index is built
  • use embedding vectors from real data -> simulate data distribution shift
  • simulate realistic complex SQL queries involving categorical and range filtering (see the sketch after this list)
  • benchmark vector scales: 100K, 1M, 10M, 100M, 1B
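
As a sketch of the kind of query I mean (category and price are hypothetical metadata columns):

-- ANN search combined with categorical and range filters
SELECT id
FROM items
WHERE category = 'shoes'
  AND price BETWEEN 20 AND 100
ORDER BY embedding <=> $1::vector
LIMIT 100;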

ANN Benchmark Results

For 450K text embeddings (1024-dimensional float32 vectors), I measure recall@100.

| Method                  | Query Latency (ms) | Retrieval Recall | Speedup vs Baseline | Index Build Time (s) | Index Size (MB) |
|-------------------------|--------------------|------------------|---------------------|----------------------|-----------------|
| Baseline (Brute Force)  | 1400.93            | 100.00%          | 1.00x               | -                    | -               |
| VectorChord (vchordrq)  | 468.64             | 100.00%          | 2.99x               | 1383.62              | 2229            |
| pgvectorscale (DiskANN) | 6.39               | 2.00%            | 219.22x             | 550.29               | 254             |
| pgvector (HNSW)         | 611.54             | 100.00%          | 2.29x               | 1235.13              | 3555            |
| pgvector (IVFFlat)      | 411.62             | 100.00%          | 3.40x               | 968.53               | 3561            |

Note: At 450K vectors, all approximate indices show strong speedups. HNSW, IVFFlat, and VectorChord achieve ~100% recall with 2-3.5x speedups. DiskANN has the fastest build time and the best speedup (~219x), but with significantly lower recall (2%).

Show me the code

Check out the code here

Conclusion

VectorChord is the clear winner - providing superior performance and a better developer experience (pre-filtering and better default settings). The vchordrq index is for most use cases the ideal choice, as it delivers great performance and handles data distribution drift better than DiskANN indices. Using an ANN index only starts to make sense for huge numbers of vectors (over 10 million).

#programming #machine-learning