How Vector Databases Compress Your Embeddings

HubAI Asia

Vector databases rarely search your vectors in the form you sent them. Most of the work happens on compressed stand-ins: compact codes that let approximate nearest-neighbor search run at very high throughput while giving up only a small, tunable amount of recall. If you’ve ever wondered how Pinecone, Qdrant, or Milvus can search through billions of embeddings in milliseconds, the answer isn’t bigger servers. It’s compression layered on top of graph algorithms that grew out of small-world network research.

Key Facts Most People Don’t Know

  • Pinecone’s proprietary index is reported to use product quantization to compress 768-dimensional embeddings down to just 96 bytes at about 95% recall. Against the 3,072-byte float32 original that is 32x compression (16x against a float16 copy).
  • HNSW (Hierarchical Navigable Small World) graphs, used by Weaviate and Qdrant, cap each node at M connections per layer (typically 16-32). Each node’s top layer is drawn as floor(-ln(U) · mL) with mL = 1/ln(M), so the share of nodes reaching each higher layer shrinks by a factor of about 1/M.
  • IVF (Inverted File Index), popularized by Facebook’s FAISS library (open-sourced in 2017), partitions vector space into Voronoi cells, commonly 4,096 to 16,384 of them. Probing a handful of cells skips roughly 99% or more of the vectors, but the coarse quantizer must first be trained with k-means; FAISS by default samples at most 256 training points per centroid and warns below 39.

Let’s trace exactly what happens to an embedding from the moment it enters a vector database to the moment it’s stored on disk — because the journey is far more intricate than “insert into index.”

Step 1: The Raw Embedding Arrives

When your model generates an embedding — whether it’s OpenAI’s text-embedding-3-small at 1536 dimensions or a BERT variant at 768 — it arrives at the vector database as a raw float32 array. That 1536-dimension vector weighs in at 6,144 bytes (1,536 × 4 bytes per float). Store a billion of those uncompressed and you’re looking at 6 terabytes of floating-point data. Nobody does that in production.
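The storage arithmetic above is easy to check directly. This is a minimal sketch; `raw_size_bytes` is a helper name made up for illustration, and it assumes float32 values and decimal terabytes.

```python
def raw_size_bytes(dims: int, bytes_per_value: int = 4) -> int:
    """Size of one uncompressed embedding (float32 by default)."""
    return dims * bytes_per_value

per_vector = raw_size_bytes(1536)                 # bytes for one vector
total_tb = per_vector * 1_000_000_000 / 1e12      # one billion vectors, in TB
print(per_vector, round(total_tb, 3))
```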

The first thing the database does is assess what it’s working with: dimensionality, value distribution, and the existing index structure. This metadata determines which compression path the vector takes.

Step 2: Product Quantization — Splitting and Replacing

This is where the real magic happens. Product quantization (PQ) is the workhorse compression technique behind nearly every production vector database. Here’s how it works internally:

The algorithm splits your 1536-dimensional vector into m subspaces, typically contiguous chunks of equal length. With m between 96 and 192, each chunk holds 16 to 8 dimensions. For each subspace, a pre-trained codebook of 256 centroids (learned via k-means on a sample of your data) is consulted. Each subspace’s chunk of floats is replaced with the single-byte index of its nearest centroid.

That’s it. Your 6,144-byte vector just became 96 to 192 bytes, a 64x to 32x compression ratio. Pinecone’s proprietary index is reported to use this technique to compress 768-dimensional embeddings down to just 96 bytes while maintaining about 95% recall. That is 32x smaller than the 3,072-byte float32 original and 16x smaller than a float16 copy.

But wait: doesn’t replacing 8 to 16 floats with a single byte lose information? Absolutely. The key insight is that the relative distances between vectors survive compression well enough for approximate search. You don’t need exact distances; you need the ordering of distances to stay roughly correct.
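To make the split-and-replace step concrete, here is a toy product quantizer in plain Python. It is a sketch under stated assumptions, not any database’s implementation: the sizes (16 dimensions, 4 subspaces, 8 centroids) are deliberately tiny, the k-means is a bare-bones Lloyd’s loop, and all function names are invented for this example. A production system would use a vectorized library instead.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=10, seed=0):
    """Tiny Lloyd's k-means; returns k centroids (lists of floats)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, members in enumerate(buckets):
            if members:  # keep the old centroid if a cluster went empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def split(vec, m):
    """Cut a vector into m contiguous, equal-length chunks."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def train_pq(train, m, k):
    """One codebook of k centroids per subspace."""
    return [kmeans([split(v, m)[j] for v in train], k, seed=j) for j in range(m)]

def encode(vec, codebooks):
    """Replace each chunk with the index of its nearest centroid (one byte each)."""
    return bytes(
        min(range(len(cb)), key=lambda c: sq_dist(chunk, cb[c]))
        for chunk, cb in zip(split(vec, len(codebooks)), codebooks)
    )

rng = random.Random(42)
DIMS, M, K = 16, 4, 8   # toy sizes; the text's case would be 1536, 96-192, 256
train = [[rng.gauss(0, 1) for _ in range(DIMS)] for _ in range(200)]
codebooks = train_pq(train, M, K)
code = encode(train[0], codebooks)
print(len(code), "bytes instead of", DIMS * 4)
```

The code length equals the number of subspaces, which is why the subspace count, not the dimensionality, sets the compressed size.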

Step 3: Asymmetric Distance Computation

Here’s a detail most tutorials skip: when you search, the query vector is not compressed. The database computes distances using asymmetric distance computation (ADC). Conceptually, each stored byte index is looked up in the codebook to recover a centroid, and the distance is computed between the uncompressed query and that reconstructed vector, not between two compressed vectors. In practice the database precomputes, once per query, a table of distances from each query chunk to all 256 centroids of that subspace, so scoring a stored vector is just one table lookup per byte plus a sum.

ADC is critical because it’s significantly more accurate than symmetric distance computation (where both vectors are quantized). The query pays zero quantization error; only the stored vectors carry approximation noise. This is why a 96-byte compressed vector can still return 95% of the same results as a full-precision scan.
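A small sketch of ADC, assuming squared Euclidean distance and hand-written codebooks (2 subspaces, 3 centroids each) that stand in for trained ones. The names `build_table` and `adc_distance` are illustrative only. The per-query table holds the distance from each query chunk to every centroid, so scoring a stored code needs no floating-point reconstruction.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Hypothetical hand-written codebooks: 2 subspaces of 2 dims, 3 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]],
    [[0.0, 1.0], [2.0, 2.0], [5.0, 0.0]],
]

def build_table(query, codebooks):
    """table[j][c] = squared distance from query chunk j to centroid c."""
    step = len(query) // len(codebooks)
    return [
        [sq_dist(query[j * step:(j + 1) * step], centroid) for centroid in cb]
        for j, cb in enumerate(codebooks)
    ]

def adc_distance(table, code):
    """Approximate squared distance: one table lookup per code byte."""
    return sum(table[j][c] for j, c in enumerate(code))

query = [0.9, 1.2, 2.1, 1.8]           # stays at full precision
stored = {"a": bytes([1, 1]), "b": bytes([2, 2]), "c": bytes([0, 0])}

table = build_table(query, codebooks)  # built once per query
ranked = sorted(stored, key=lambda vid: adc_distance(table, stored[vid]))
print(ranked)  # ['a', 'c', 'b']
```

Summing table entries gives the same value as decoding the stored code back to centroids and measuring the distance to the query, but at a fraction of the cost per vector.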

Google’s ScaNN library takes this even further with anisotropic vector quantization. Instead of treating all quantization error equally, it penalizes error parallel to the original vector more heavily than error orthogonal to it, because parallel error distorts inner-product scores more. On 1 billion 128-dimensional vectors, ScaNN is reported to achieve 20-30% better recall than standard PQ at the same 32-byte compressed size, though the gain depends on the dataset and the similarity metric.
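The idea behind an anisotropic loss can be sketched in a few lines: split the quantization error into the part along the original vector and the part perpendicular to it, then weight them differently. The weights below (4.0 and 1.0) are arbitrary assumptions for illustration, and `anisotropic_loss` is not a ScaNN API.

```python
def anisotropic_loss(x, x_quantized, parallel_weight=4.0, orthogonal_weight=1.0):
    """Weighted squared error, split relative to the direction of x."""
    residual = [a - b for a, b in zip(x, x_quantized)]
    norm_sq = sum(a * a for a in x)
    # Component of the residual that points along x.
    scale = sum(r * a for r, a in zip(residual, x)) / norm_sq
    parallel = [scale * a for a in x]
    orthogonal = [r - p for r, p in zip(residual, parallel)]
    par_err = sum(p * p for p in parallel)
    orth_err = sum(o * o for o in orthogonal)
    return parallel_weight * par_err + orthogonal_weight * orth_err

x = [1.0, 0.0]
along = anisotropic_loss(x, [0.8, 0.0])     # error along x
sideways = anisotropic_loss(x, [1.0, 0.2])  # same-size error, perpendicular to x
print(along > sideways)
```

Both candidate codes are equally far from `x` in plain Euclidean terms, yet the one whose error lies along `x` scores worse, which is the behavior the technique is after.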

Step 4: Insertion Into the HNSW Graph

After compression, the vector needs to be connected to the index’s navigation structure. Nearly every modern vector database — Qdrant, Weaviate, Milvus — uses HNSW (Hierarchical Navigable Small World) graphs for this. The algorithm is deceptively simple:

Every vector is present at layer 0 (the bottom layer, containing all vectors). Its top layer is a random level l = floor(-ln(U) · mL), where U is uniform on (0, 1), mL = 1/ln(M), and M is the maximum number of connections per node (typically 16-32). With that choice, each layer holds roughly 1/M of the nodes in the layer below. Most vectors stay at layer 0. A few climb to layer 1. Fewer still reach layer 2. The top layer is sparse: it contains a small random sample of vectors that serve as express lanes for search, chosen by the draw rather than by how well-connected they are.
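The level draw is short enough to simulate. This sketch assumes the standard HNSW rule, level = floor(-ln(U) / ln(M)), and uses M = 16 with a fixed seed; `sample_level` is a name chosen for this example.

```python
import math
import random

def sample_level(M: int, rng: random.Random) -> int:
    """Draw an insertion layer: floor(-ln(U) * mL) with mL = 1 / ln(M)."""
    u = 1.0 - rng.random()            # in (0, 1], avoids log(0)
    return int(-math.log(u) / math.log(M))

rng = random.Random(0)
M = 16
levels = [sample_level(M, rng) for _ in range(100_000)]
for layer in range(max(levels) + 1):
    share = sum(l >= layer for l in levels) / len(levels)
    print(f"layer {layer}: {share:.4%} of nodes present")
```

With M = 16, about 1 in 16 nodes should appear at layer 1 and about 1 in 256 at layer 2, which is the exponential thinning the text describes.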

This exponential decay is what makes HNSW fast: you start search at the top (sparse, long hops), then descend layer by layer (denser, shorter hops), rapidly narrowing the neighborhood around your query.

Step 5: Greedy Search and Edge Creation

At each layer, the algorithm performs a greedy search from the layer’s entry point. It compares the new vector against neighbors using the compressed representations (via ADC), moves toward the closest neighbor, and repeats until no neighbor is closer than the current position.


Once the M nearest neighbors are identified (typically M=16), bidirectional edges are created between the new node and each neighbor. These edges are the “small world” connections that make navigation efficient. Bidirectionality lets search flow in both directions: if node A connects to node B, then B also connects back to A. One caveat: a neighbor that ends up with more than its allowed number of connections is pruned back, so a few edges can become one-way over time.
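Edge creation on a single layer might look like the following sketch. It assumes the simplest possible policy (keep the M closest, drop the farthest when a neighbor overflows); real HNSW implementations use a more careful neighbor-selection heuristic, and `connect` is a hypothetical helper.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def connect(graph, vectors, new_id, candidates, M):
    """Link new_id to its M closest candidates and add the reverse edges."""
    nearest = sorted(candidates, key=lambda n: sq_dist(vectors[new_id], vectors[n]))[:M]
    graph[new_id] = set(nearest)
    for n in nearest:
        graph[n].add(new_id)                 # reverse edge
        if len(graph[n]) > M:                # neighbor is over its cap
            farthest = max(graph[n], key=lambda x: sq_dist(vectors[n], vectors[x]))
            graph[n].discard(farthest)       # this can leave a one-way edge
    return nearest

vectors = {0: [0.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 2.0], 3: [0.1, 0.1]}
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}    # existing layer, M = 2
connect(graph, vectors, new_id=3, candidates=[0, 1, 2], M=2)
print(graph)
```

After the insert, node 2 still points at nodes 0 and 1, but both of them dropped their edge back to it, which shows how pruning can break strict bidirectionality.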

Step 6: Layer Descent and Connection Completion

After connecting neighbors at the current layer, the algorithm descends to the next layer down, using the closest node found at the current layer as the new entry point. This process repeats — greedy search, neighbor identification, edge creation — all the way down to layer 0.
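The descent itself can be sketched as a loop over layers, top to bottom, where the best node found on one layer becomes the entry point for the next. The example below shows the search-time version on a hypothetical one-dimensional index with two layers; insertion follows the same path and adds edges along the way. It uses a single-candidate greedy walk, whereas real implementations keep a wider candidate list.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def greedy_step(layer, vectors, query, start):
    """Walk to whichever neighbor is closer until none improves."""
    current = start
    while True:
        best = min(layer[current], key=lambda n: sq_dist(query, vectors[n]), default=current)
        if sq_dist(query, vectors[best]) >= sq_dist(query, vectors[current]):
            return current
        current = best

def descend(layers, vectors, query, entry):
    """layers[0] is the bottom; start at the top and reuse each result as the next entry."""
    current = entry
    for layer in reversed(layers):
        current = greedy_step(layer, vectors, query, current)
    return current

# Hypothetical 1-D toy index: ids 0..7 sit at positions 0.0..7.0.
vectors = {i: [float(i)] for i in range(8)}
layers = [
    {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)},  # layer 0: all nodes
    {0: [4], 4: [0]},                                                  # layer 1: sparse
]
print(descend(layers, vectors, query=[5.2], entry=0))  # 5
```

The top layer covers the distance from node 0 to node 4 in one hop; layer 0 then needs only a single step to reach node 5.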

The result is a multi-layered graph where:
  • Top layers have few nodes with long-range connections (express highways)
  • Bottom layers have all nodes with short-range connections (local streets)
  • Search cost grows roughly as O(log N) in practice, instead of O(N) for brute force

For a billion-vector index, this means finding the 10 nearest neighbors takes on the order of hundreds to a few thousand distance computations, depending on the search width (ef), instead of a billion.

Step 7: Segment-Based Storage and Persistence

The final piece is how vectors actually land on disk. Milvus, for example, stores vectors in segments with a configurable maximum size (512MB is a common default). A growing segment is sealed once it approaches that limit, and an index such as HNSW is then built for it independently of other segments. How many vectors fit depends on dimensionality and compression: 512MB holds about 87,000 raw 1536-dimensional float32 vectors, but several million 96-byte codes.

This segment architecture serves two purposes. First, it enables incremental indexing — new vectors go into a growing segment, and the HNSW graph is built when the segment seals. No need to rebuild the entire index. Second, it allows parallel search — queries fan out across segments, and results merge at the end.
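The fan-out-and-merge pattern can be sketched with the standard library. Here each segment is just a dictionary searched by brute force, standing in for a per-segment index; the segment contents and function names are made up for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

def search_segment(segment, query, k):
    """Brute-force stand-in for a per-segment index search."""
    scored = ((sum((q - x) ** 2 for q, x in zip(query, vec)), vid)
              for vid, vec in segment.items())
    return heapq.nsmallest(k, scored)

def search_all(segments, query, k):
    """Fan the query out to every segment, then merge the partial top-k lists."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda seg: search_segment(seg, query, k), segments))
    return heapq.nsmallest(k, (hit for part in partials for hit in part))

segments = [
    {"a": [0.0, 0.0], "b": [5.0, 5.0]},   # sealed segment
    {"c": [1.0, 1.0], "d": [9.0, 9.0]},   # sealed segment
    {"e": [0.5, 0.5]},                    # growing segment
]
top = [vid for _, vid in search_all(segments, query=[0.4, 0.4], k=2)]
print(top)  # ['e', 'a']
```

Each segment only has to return its own top k, so the merge step handles at most k results per segment regardless of how large the segments are.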

The vector ID and compressed representation are written to memory-mapped files within each segment. A separate metadata index maintains the mapping between vector IDs and their segment offset positions, so individual vectors can be updated or deleted without touching the entire graph.

Step 8: The IVF Alternative — Partitioning Before Searching

Before HNSW became dominant, Facebook’s FAISS library popularized a different approach: IVF (Inverted File Index). Instead of a graph, IVF partitions vector space into 4,096 to 16,384 Voronoi cells using a coarse quantizer trained via k-means. At search time, the query is compared against cell centroids (not all vectors), and only the top nprobe cells (typically 8-32) are searched in detail.
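A toy IVF index shows both the speedup and the main failure mode. The centroids below are hand-picked rather than trained, and `build_ivf` and `ivf_search` are illustrative names, not FAISS calls.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_ivf(vectors, centroids):
    """Assign every vector id to the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        cell = min(lists, key=lambda c: sq_dist(vec, centroids[c]))
        lists[cell].append(vid)
    return lists

def ivf_search(query, vectors, centroids, lists, nprobe, k):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    cells = sorted(lists, key=lambda c: sq_dist(query, centroids[c]))[:nprobe]
    candidates = [vid for c in cells for vid in lists[c]]
    ranked = sorted(candidates, key=lambda vid: sq_dist(query, vectors[vid]))
    return ranked[:k], len(candidates)

# Hypothetical pre-trained centroids (in practice these come from k-means).
centroids = [[0.0, 0.0], [10.0, 0.0], [0.0, 10.0], [10.0, 10.0]]
vectors = {
    "a": [1.0, 1.0], "b": [0.5, 2.0], "c": [9.0, 1.0],
    "d": [1.0, 9.0], "e": [9.5, 9.5], "f": [5.1, 1.0],
}
lists = build_ivf(vectors, centroids)
print(ivf_search([4.5, 1.0], vectors, centroids, lists, nprobe=1, k=2))  # (['a', 'b'], 2)
print(ivf_search([4.5, 1.0], vectors, centroids, lists, nprobe=2, k=2))  # (['f', 'a'], 4)
```

The true nearest neighbor, `f`, sits just across a cell boundary, so probing one cell misses it and probing two finds it. That is the recall-versus-work trade-off nprobe controls.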

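With one probed cell, IVF cuts a billion-vector scan to roughly 60,000-250,000 vectors, depending on cell count. At nprobe of 8-32 that becomes about 0.5 million to 8 million, which still skips more than 99% of the data. The catch? The coarse quantizer has to be trained with k-means on a representative sample. FAISS by default uses up to 256 points per centroid, which works out to about 1 million training vectors for 4,096 cells and about 4.2 million for 16,384. The partitioning is also fixed: when the data distribution shifts, the quantizer needs retraining.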
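The numbers are worth working through, because the scanned fraction depends on both the cell count and nprobe. This sketch assumes cells of equal size, which real data only approximates, and takes 256 training points per centroid as the sampling budget.

```python
N = 1_000_000_000

def scanned(n_vectors: int, nlist: int, nprobe: int) -> float:
    """Vectors examined if every cell holds the same share of the data."""
    return n_vectors / nlist * nprobe

for nlist in (4096, 16384):
    for nprobe in (1, 8, 32):
        s = scanned(N, nlist, nprobe)
        print(f"nlist={nlist:>5} nprobe={nprobe:>2}: "
              f"~{s:,.0f} vectors scanned ({1 - s / N:.3%} skipped)")
    print(f"nlist={nlist:>5}: 256 points per centroid = {256 * nlist:,} training vectors")
```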

Modern systems often combine these pieces: IVF for coarse partitioning with PQ codes inside each cell, sometimes with an HNSW graph over the cell centroids to speed up cell selection. This gives you the space reduction of IVF and PQ with the logarithmic search of HNSW.

Why Compression Quality Determines Everything

The entire pipeline — quantization, graph construction, segment storage — exists because raw vectors are too expensive to search at scale. But compression isn’t free. Every quantization method introduces recall loss, and the codebook you train determines how much.

A poorly trained codebook on a narrow dataset will cluster centroids in dense regions while leaving sparse regions underserved. Vectors in those sparse regions get mapped to distant centroids, and recall plummets for queries that fall in the gaps. This is why most production systems retrain their codebooks periodically as data distributions shift.
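One simple way to notice a stale codebook is to track the average quantization error on fresh data against the error measured at training time. The sketch below uses a made-up four-centroid codebook and an assumed retraining threshold of 2x; both are placeholders, and a production check would be tied to measured recall.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_quantization_error(points, centroids):
    """Average squared distance from each point to its nearest centroid."""
    return sum(min(sq_dist(p, c) for c in centroids) for p in points) / len(points)

# Hypothetical codebook trained when all data sat near the origin.
centroids = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
original_data = [[0.1, 0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.9]]
shifted_data = [[4.0, 4.2], [5.1, 3.9], [4.5, 4.5], [3.8, 5.0]]

baseline = mean_quantization_error(original_data, centroids)
current = mean_quantization_error(shifted_data, centroids)
RETRAIN_RATIO = 2.0   # assumed threshold; tune against measured recall
needs_retrain = current > RETRAIN_RATIO * baseline
print(f"baseline={baseline:.3f} current={current:.3f} retrain={needs_retrain}")
```

When the data moves away from the region the centroids cover, every vector lands far from its nearest centroid, and the error jumps well before users report missing results.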

But what happens when your quantization codebook becomes the bottleneck?
