Vector quantization (VQ) compresses a set of high-dimensional embedding vectors into a small codebook of representative vectors, called centroids, to reduce storage and speed up computation. The vector space is partitioned into clusters, typically with an algorithm such as k-means, and each cluster is represented by its centroid. Each embedding is then approximated by the centroid of the cluster it is assigned to.
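The clustering and assignment steps can be sketched as follows. This is a minimal illustration with numpy (the dataset sizes, iteration count, and function name are chosen for the example, not prescribed by any library):

```python
import numpy as np

def kmeans(vectors, k, iters=10, seed=0):
    """Minimal k-means sketch: returns centroids and each vector's cluster index."""
    rng = np.random.default_rng(seed)
    # Initialize centroids from k randomly chosen vectors.
    centroids = vectors[rng.choice(len(vectors), size=k, replace=False)].copy()
    for _ in range(iters):
        # Assign each vector to its nearest centroid (squared L2 distance).
        dists = ((vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        codes = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned vectors.
        for j in range(k):
            if (codes == j).any():
                centroids[j] = vectors[codes == j].mean(axis=0)
    return centroids, codes

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(1000, 64)).astype(np.float32)
centroids, codes = kmeans(embeddings, k=16)
# Each embedding is approximated by the centroid of its assigned cluster.
reconstructed = centroids[codes]
```

After quantization, `codes` holds one small integer per embedding, and `centroids[codes]` recovers the lossy approximation of the original vectors.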
Instead of the original embeddings, only the centroid index of each vector is stored, which cuts memory usage dramatically: with 256 centroids, a single byte replaces an entire floating-point vector. In Approximate Nearest Neighbor (ANN) search, for example, this compression is what makes it practical to hold very large embedding collections in memory.
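A back-of-the-envelope calculation shows the savings. The corpus size, dimensionality, and centroid count below are illustrative assumptions, not figures from the text:

```python
# Illustrative sizes: 1M embeddings, 128 dims, 256 centroids.
n, d, k = 1_000_000, 128, 256

raw_bytes = n * d * 4        # float32 embeddings: 512 MB
codebook_bytes = k * d * 4   # the k centroids themselves: ~128 KB
code_bytes = n * 1           # one uint8 index per embedding (valid since k <= 256)

ratio = raw_bytes // code_bytes
print(ratio)  # → 512
```

The codebook itself is a fixed, tiny overhead, so the effective compression ratio is governed almost entirely by how many bytes each code index needs.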
However, vector quantization introduces approximation error that can slightly reduce accuracy in downstream tasks, so the trade-off between compression and precision must be balanced against the application's requirements. Methods such as Product Quantization (PQ) extend the idea by splitting each vector into sub-vectors and quantizing each sub-space with its own codebook, which yields a combinatorially large number of effective centroids at low cost.
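A toy sketch of Product Quantization makes the sub-space idea concrete. The function name, parameters, and the plain k-means inner loop are assumptions for illustration; production systems use optimized implementations:

```python
import numpy as np

def pq_encode(vectors, m, k, iters=10, seed=0):
    """Toy PQ: split each vector into m sub-vectors and learn an
    independent k-means codebook for each sub-space."""
    n, d = vectors.shape
    sub = d // m  # assumes d is divisible by m
    rng = np.random.default_rng(seed)
    codebooks = np.empty((m, k, sub), dtype=vectors.dtype)
    codes = np.empty((n, m), dtype=np.uint8)  # one byte per sub-space (k <= 256)
    for i in range(m):
        part = vectors[:, i * sub:(i + 1) * sub]
        cents = part[rng.choice(n, size=k, replace=False)].copy()
        for _ in range(iters):
            d2 = ((part[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)
            for j in range(k):
                if (assign == j).any():
                    cents[j] = part[assign == j].mean(axis=0)
        codebooks[i], codes[:, i] = cents, assign
    return codebooks, codes

rng = np.random.default_rng(2)
x = rng.normal(size=(500, 32)).astype(np.float32)
codebooks, codes = pq_encode(x, m=4, k=16)
# Decode: concatenate each sub-space's chosen centroid.
recon = np.concatenate([codebooks[i][codes[:, i]] for i in range(4)], axis=1)
```

With m sub-spaces of k centroids each, the scheme represents k^m distinct reconstructions while storing only m small codebooks, which is the source of PQ's scalability.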