Product Quantization (PQ) reduces the memory footprint of a vector index by compressing high-dimensional vectors into compact codes. Instead of storing full-precision floating-point values (e.g., 32-bit or 64-bit per dimension), PQ splits a vector into subvectors and replaces each subvector with a smaller integer code. For example, a 128-dimensional vector might be divided into 8 subvectors of 16 dimensions each. Each subvector is then mapped to the closest centroid in a pre-trained codebook (e.g., 256 centroids per subvector), allowing each centroid index to be stored as an 8-bit integer. This reduces storage from 128×4 bytes (512 bytes for 32-bit floats) to 8×1 byte (8 bytes), achieving a 64x compression. The compressed codes are used during search to approximate distances between vectors efficiently.
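To make the mechanics concrete, here is a minimal sketch of PQ training, encoding, and decoding in Python using NumPy and scikit-learn's KMeans. The dimensions, subvector count, codebook size, and the random training data are illustrative assumptions matching the example above, not a production implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

# Illustrative parameters: 128-dim vectors, m=8 subvectors of 16 dims each,
# k=256 centroids per subquantizer -> each subvector code fits in one byte.
d, m, k = 128, 8, 256
sub_dim = d // m

rng = np.random.default_rng(0)
train = rng.standard_normal((10_000, d)).astype(np.float32)  # stand-in training set

# Train one codebook (k centroids of sub_dim dimensions) per subvector position.
codebooks = []
for i in range(m):
    sub = train[:, i * sub_dim:(i + 1) * sub_dim]
    km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(sub)
    codebooks.append(km.cluster_centers_.astype(np.float32))

def encode(x):
    """Map each subvector to the index of its nearest centroid (one uint8 each)."""
    codes = np.empty(m, dtype=np.uint8)
    for i in range(m):
        sub = x[i * sub_dim:(i + 1) * sub_dim]
        codes[i] = np.argmin(np.linalg.norm(codebooks[i] - sub, axis=1))
    return codes

def decode(codes):
    """Reconstruct an approximate vector by concatenating the selected centroids."""
    return np.concatenate([codebooks[i][codes[i]] for i in range(m)])

x = train[0]
codes = encode(x)                      # 8 bytes of codes instead of 512 bytes of floats
error = np.linalg.norm(x - decode(codes))  # quantization error discussed below
print(codes.nbytes, error)
```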
The compression impacts search recall and precision due to information loss. PQ introduces quantization error—the difference between the original vector and its compressed reconstruction. This error can cause approximate nearest neighbor searches to miss some true matches (lower recall) or return less relevant results (lower precision). However, PQ limits the damage by quantizing each subvector with its own codebook, which preserves more of the vector's structure than a single global quantizer. For instance, splitting vectors into more subvectors (higher m) reduces per-subvector error, but it increases the code size per vector and the number of codebooks that must be trained and stored. In practice, PQ often achieves acceptable recall-precision trade-offs for large-scale applications. For example, in image retrieval, PQ might retain 90% recall while using 1/50th of the original memory, but exact results depend on dataset characteristics and parameter tuning.
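The following fragment continues the hypothetical sketch above (reusing codebooks, m, k, and sub_dim) to show how compressed codes are actually used at query time: asymmetric distance computation (ADC), where the query stays full-precision and approximate distances become a sum of table lookups. The ranking errors this approximation introduces are exactly the recall/precision loss described above.

```python
def adc_table(query):
    """Precompute squared distances from each query subvector to every centroid."""
    table = np.empty((m, k), dtype=np.float32)
    for i in range(m):
        sub = query[i * sub_dim:(i + 1) * sub_dim]
        table[i] = np.sum((codebooks[i] - sub) ** 2, axis=1)
    return table

def adc_distance(table, codes):
    """Approximate squared distance to an encoded vector: m lookups plus a sum,
    instead of a full 128-dimensional distance computation."""
    return sum(table[i, codes[i]] for i in range(m))

# Because these distances are approximate, the top-ranked codes may miss some
# true neighbors (lower recall) or admit less relevant ones (lower precision).
table = adc_table(train[1])
print(adc_distance(table, codes))
```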
The key trade-off lies in codebook design and search strategy. Larger codebooks (more centroids per subvector) reduce quantization error, improving recall and precision at the cost of codebook memory and training time. Hierarchical search methods, such as inverted file systems combined with PQ (IVF-PQ), mitigate accuracy loss by using coarse quantization to filter candidates and refined PQ-based distance calculations to rank them. For example, Meta's FAISS library combines IVF with PQ to scale to billion-scale datasets while maintaining practical accuracy. Developers must tune parameters like the number of subvectors (m) and centroids per subquantizer (k) based on their memory constraints and accuracy requirements.
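As a rough illustration of the IVF-PQ combination mentioned above, here is a short FAISS usage sketch. The parameter values (nlist, nprobe, dataset size) are arbitrary placeholders; real settings depend on the dataset and the recall target.

```python
import numpy as np
import faiss

d, nlist, m, nbits = 128, 256, 8, 8      # 8-bit codes -> 256 centroids per subquantizer
quantizer = faiss.IndexFlatL2(d)          # coarse quantizer for the inverted file
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

xb = np.random.random((100_000, d)).astype('float32')  # placeholder database vectors
index.train(xb)                           # learns coarse centroids and PQ codebooks
index.add(xb)

index.nprobe = 16                         # coarse cells scanned per query: recall vs. speed
D, I = index.search(xb[:5], 10)           # approximate top-10 neighbors for 5 queries
print(I)
```

Raising nprobe or nlist trades query speed for recall, while m and nbits set the per-vector code size; tuning these together is how the memory/accuracy balance described above is realized in practice.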
