Learn
Vector Database 101: Everything You Need to Know

Scalar Quantization and Product Quantization

Mar 29, 202310 min read

A hands-on dive into scalar quantization (integer quantization) and product quantization with Python.

Read the entire series

In the previous tutorial, we took a look at two of the most fundamental indexing algorithms - flat indexing and inverted file (IVF). Flat indexing is the simplest and most naive vector search strategy, but can surprisingly work quite well when the total dataset size is small (a couple thousand vectors) and/or if you're using a GPU for querying. IVF, on the other hand, is highly extensible to be highly extensible while working well with other indexing strategies too.

In this tutorial, we'll build on top of that knowledge by diving deeper into quantization techniques - specifically scalar quantization (also called integer quantization) and product quantization. We'll implement our own scalar and product quantization algorithms in Python.

Quantization vs. Dimensionality Reduction

As mentioned in a previous tutorial on vector similarity search, quantization is a technique for reducing the total size of the database by reducing the overall precision of the vectors. Note that this is very different from dimensionality reduction (PCA, LDA, etc), which attempts to reduce the length of the vectors:

>>> vector.size  # length of our original vector
128
>>> quantized_vector.size  # length of our quantized vector
128
>>> reduced_vector.size  # length of our reduced vector
16

Dimensionality reduction methods such as PCA use linear algebra to project the input data into a lower dimensional space. Without getting too deep into the math here, just know that these methods generally aren't used as the primary indexing strategy because they tend to have limitations on the distribution of the data. PCA, for example, works best on data that can be separated into independent, Gaussian distributed components.

Quantization, on the other hand, makes no assumption about the distribution of the data - rather, it looks at each dimension (or group of dimensions) separately and attempts to "bin" each value into one of many discrete buckets. Instead of performing a flat search over all of the original vectors, we can instead perform flat search over the quantized vectors - this can result in reduced memory consumption as well as significant speedup.

Scalar quantization

Scalar quantization is an important data compression technique that turns floating point values into low-dimensional integers. Let's take a look at an example:

>>> vector.dtype  # data type of our original vector
dtype('float64')
>>> quantized_vector.dtype
dtype('int8')
>>> reduced_vector.dtype
dtype('float64')

From the above example, we can see that scalar quantization has reduced the total size of our vectors (and vector database) by a whopping 8x. Nice.

How does scalar quantization work?

How exactly does scalar quantization work? Let's first take a look at the indexing process, i.e. turning floating point vectors into integer vectors. For each vector dimension, scalar quantization takes the maximum and minimum value of that particular dimension as seen across the entire database, and uniformly splits that dimension into bins across its entire range.

Let's try writing that in code. We'll first generate a dataset of a thousand 128D floating point vectors sampled from a multivariate distribution. Since this is a toy example, I'll be sampling from a Gaussian distribution; in practice, actual embeddings are rarely Gaussian distributed unless added as a constraint when training the model (such as in variational autoencoders):

>>> import numpy as np
>>> dataset = np.random.normal(size=(1000, 128))

This dataset serves as our dummy data for use in this scalar quantization implementation. Now, let's determine the maximum and minimum values of each dimension of the vector and store it in a matrix called ranges:

>>> ranges = np.vstack((np.min(dataset, axis=0), np.max(dataset, axis=0)))

You'll notice that the mins and maxes here are fairly uniform across all dimensions for the this toy example since the input data was sampled from zero-mean unit-variance Gaussians - don't worry about that for now, as all the code here translates to real data as well. We now have the minimum and maximum value of each vector dimension in the entire dataset. With this, we can now determine the start value and step size for each dimension. The start value is simply the minimum value, and the step size is determined by the number of discrete bins in the integer type that we'll be using. In this case, we'll be using 8-bit unsigned integers uint8_t for a total of 256 bins:

>>> starts = ranges[0,:]
>>> steps = (ranges[1,:] - ranges[0,:]) / 255

That's all the setup that's needed. The actual quantization is then done by subtracting all starting values for each dimension (starts) and dividing the resulting value by the step size (steps):

>>> dataset_quantized = np.uint8((dataset - starts) / steps)
>>> dataset_quantized
array([[136,  58, 156, ..., 153, 182,  30],
       [210,  66, 175, ...,  68, 146,  33],
       [100, 136, 148, ..., 142,  86, 108],
       ...,
       [133, 146, 146, ..., 137, 209, 144],
       [ 63, 131,  96, ..., 174, 174, 105],
       [159,  78, 204, ...,  95,  87, 146]], dtype=uint8)

We can also check the mins and maxes of the quantized dataset:

>>> np.min(dataset_quantized, axis=0)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=uint8)
>>> np.max(dataset_quantized, axis=1)
array([255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 254, 254, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 254, 255, 255, 255, 255, 254, 255, 255, 254,
       254, 255, 255, 254, 255, 255, 255, 254, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 254, 255, 255, 255, 255, 255, 254, 255, 255,
       255, 254, 255, 255, 255, 255, 255, 254, 254, 255, 255, 255, 255,
       255, 255, 255, 254, 255, 255, 255, 255, 254, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
       255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255], dtype=uint8)

Note how we've used the full uint8_t range of [0,255] for each dimension (some of the maximum values are 254 instead of 255 due to floating point imprecisions).

Now let's put it all together in a ScalarQuantizer class:

import numpy as np


class ScalarQuantizer:

    def __init__(self):
        self._dataset = None

    def create(self):
        """Calculates and stores SQ parameters based on the input dataset."""
        self._dtype = dataset.dtype  # original dtype
        self._starts = np.min(dataset, axis=1)
        self._steps = (np.max(dataset, axis=1) - self._starts) / 255

        # the internal dataset uses `uint8_t` quantization
        self._dataset = np.uint8((dataset - self._starts) / self._steps)

    def quantize(self, vector):
        """Quantizes the input vector based on SQ parameters"""
        return np.uint8((vector - self._starts) / self._steps)

    def restore(self, vector):
        """Restores the original vector using SQ parameters."""
        return (vector * self._steps) + self._starts

    @property
    def dataset(self):
        if self._dataset:
            return self._dataset
        raise ValueError("Call ScalarQuantizer.create() first")


>>> dataset = np.random.normal(size=(1000, 128))
>>> quantizer = ScalarQuantizer(dataset)

And that's it for scalar quantization! The conversion function for scalar quantization can be changed to a quadratic or exponential function instead of a linear function as we have in the example above, but the general idea remains the same - split the entire space of each dimension into discrete bins in order to reduce the overall memory footprint of the vector data.

Product quantization

Product quantization (PQ) is another data compression technique that is much more powerful and flexible to quantize vectors when compared with scalar quantization. A major disadvantage of scalar quantization is that it does not take into consideration the distribution of values in each dimension. For example, imagine we have a dataset with the following 2-dimensional vectors:

array([[ 9.19,  1.55],
       [ 0.12,  1.55],
       [ 0.40,  0.78],
       [-0.04,  0.31],
       [ 0.81, -2.07],
       [ 0.29,  0.82],
       [ 0.05,  0.96],
       [ 0.12, -1.10]])

If we decide to quantize these vectors in to 3-bit integers (range [0,7]), 6 of the bins for the 0th dimension will be completely unused! Clearly, there must be a better way to perform quantization, especially if any of the dimensions have a non-uniform distribution. That's where product quantization can come to the rescure.

How does product quantization work?

The primary idea behind product quantization is to algorithmically split a high-dimensional vector into a lower dimensional subspace, with dimension of the subspace corresponding to multiple dimensions in the original high-dimensional vector. This reduction process is typically done using a special algorithm called the Lloyd's algorithm, a quantizer which is effectively equivalent to k-means clustering. Like scalar quantization, each original vector results in a vector of integers post-quantization, and each integer corresponds with a particular centroid.

This might sound complex, but it becomes much easier to understand if we break it down from an algorithmic perspective. Let's first go over indexing:

Given a dataset of N total vectors, we'll first divide each vector into M subvectors (also known as a subspace). These subvectors don't necessarily have to be the same length, but in practice they almost always are.
We'll then use k-means (or some other clustering algorithm) for all subvectors in the dataset. This will give us a collection of K centroids for each subspace, each of which will be assigned its own unique ID.
With all centroids computed, we'll replace all subvectors in the original dataset with the ID of its closest centroid.

_{Product quantization, visualized.}

PQ can both reduce memory usage and signficantly speed up nearest neighbor search speeds at the cost of a bit of accuracy. The trade-off is dependent on the parameters used - using more centroids and subvectors will improve search accuracy but will not result in as much compression nor speedup.

Let's work through a very simple PQ implementation. As with the previous scalar quantization example, we'll be reducing the vectors into 8-bit unsigned integers (uint8_t) using M = 16 and K = 256, i.e. each 128D vector will be split into 16 subvectors of size 8, with each subvector then being quantized into one of 256 buckets.

>>> (M, K) = (16, 256)

We'll start with generating a toy dataset, as we did for the scalar quantization example:

>>> import numpy as np
>>> dataset = np.random.normal(size=(1000, 128))

With the dataset in hand, let's split off the first set of subvectors:

>>> sublen = dataset.shape[1] // M
>>> subspace = dataset[:,0:sublen]  # this is the 0th subspace
>>> subspace.shape
(1000, 8)

As with IVF, we'll use scipy's kmeans2 implementation to determine centroids:

>>> from scipy.cluster.vq import kmeans2
>>> (centroids, assignments) = kmeans2(subspace, K, iter=32)

The scipy k-means implementation returns the centroid indices in int32_t format, so we'll do a quick conversion to uint8_t to wrap things up:

>>> quantized = np.uint8(assignments)

This process gets repeated for each subspace until we've quantized all of the vectors.

As we've done for scalar quantization, let's also compile everything here into a class:

import numpy as np
from scipy.cluster.vq import kmeans2


class ProductQuantizer:

    def __init__(self, M=16, K=256):
        self.M = 16
        self.K = 256
        self._dataset = None

    def create(self, dataset):
        """Fits PQ model based on the input dataset."""
        sublen = dataset.shape[1] // self.M
        self._centroids = np.empty((self.M, self.K, sublen), dtype=np.float64)
        self._dataset = np.empty((dataset.shape[0], self.M), dtype=np.uint8)
        for m in range(self.M):
            subspace = dataset[:,m*sublen:(m+1)*sublen]
            (centroids, assignments) = kmeans2(subspace, self.K, iter=32)
            self._centroids[m,:,:] = centroids
            self._dataset[:,m] = np.uint8(assignments)

    def quantize(self, vector):
        """Quantizes the input vector based on PQ parameters"""
        quantized = np.empty((self.M,), dtype=np.uint8)
        for m in range(self.M):
            centroids = self._centroids[m,:,:]
            distances = np.linalg.norm(vector - centroids, axis=1)
            quantized[m] = np.argmin(distances)
        return quantized

    def restore(self, vector):
        """Restores the original vector using PQ parameters."""
        return np.hstack([self._centroids[m,vector[m],:] for m in range(M)])

    @property
    def dataset(self):
        if self._dataset:
            return self._dataset
        raise ValueError("Call ProductQuantizer.create() first")


>>> dataset = np.random.normal(size=(1000, 128))
>>> quantizer = ProductQuantizer()
>>> quantizer.create(dataset)

And that's it for product quantization! To really speed things up during the search process, we can expend a bit of extra memory to compute distance tables for all centroids in each subspace, but I'll leave that as a exercise for you.

Wrapping up

In this tutorial, we did a deep dive into scalar quantization and product quantization, creating our own simple implementations along the way. Scalar quantization is a good tool, but product quantization is much more powerful and can be used regardless of the distribution of our vector data. Please keep in mind that, while PQ can help significantly speed up query times while also reducing memory footprint, it is generally not that great for recall. We'll benchmark PQ along with several other indexing strategies in a future tutorial.

In the next tutorial, we'll continue our deep dive into indexing strategies with Hierarchical Navigable Small Worlds (HNSW) - a graph-based indexing algorithm that is arguably the most popular way to index vectors today (although it does come with its own downsides as well). See you in the next tutorial!

All code for this tutorial is freely available on Github: https://github.com/fzliu/vector-search.

Take another look at the Vector Database 101 courses

Updated on May 28, 2026

Zilliz

Next: Hierarchical Navigable Small Worlds (HNSW)

Content

Start Free, Scale Easily

Try the fully-managed vector database built for your GenAI applications.

Try Zilliz Cloud for Free

Keep Reading

Approximate Nearest Neighbors Oh Yeah (Annoy)

Discover the capabilities of Annoy, an innovative algorithm revolutionizing approximate nearest neighbor searches for enhanced efficiency and precision.

Harnessing Product Quantization for Memory Efficiency in Vector Databases

Exploring product quantization's intricacies and practical implementation through hands-on examples.

Ensuring High Availability of Vector Databases

Ensuring high availability is crucial for the operation of vector databases, especially in applications where downtime translates directly into lost productivity and revenue.