When dealing with vector search on datasets that exceed available RAM, three primary approaches can be employed: disk-based indexes, streaming data in chunks, and hierarchical indexing. Each method addresses memory limitations by optimizing how data is stored, accessed, or structured, but with trade-offs in speed, complexity, or accuracy.
1. Disk-Based Indexes
Disk-based indexing stores parts of the vector index on disk instead of loading it entirely into RAM. Libraries like FAISS (via its OnDiskInvertedLists API) and Annoy (which memory-maps its index files) support this by keeping only a small in-memory structure while the bulk of the vector data stays on disk. For example, FAISS uses inverted file (IVF) structures to split the dataset into clusters, and only the cluster metadata (e.g., centroids) is kept in RAM. During a search, the system identifies relevant clusters in memory and fetches their full vector data from disk. This reduces RAM usage but introduces latency due to disk I/O. To mitigate this, SSDs are preferred over HDDs, since their much faster random reads keep query times acceptable. The approach also requires careful tuning of cluster sizes and compression parameters to balance speed and accuracy.
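As a rough illustration, the sketch below builds a FAISS IVF index and moves its inverted lists into a single on-disk file using the merge_ondisk helper shipped with FAISS. The dimensions, cluster count, and file names are placeholders, and exact flags may differ between FAISS versions.

```python
import numpy as np
import faiss
from faiss.contrib.ondisk import merge_ondisk  # helper bundled with FAISS

d, nlist = 64, 256                              # vector dim / cluster count (illustrative)
rng = np.random.default_rng(0)

# 1. Train the IVF structure; only the centroids (coarse quantizer) live in RAM.
quantizer = faiss.IndexFlatL2(d)
trained = faiss.IndexIVFFlat(quantizer, d, nlist)
trained.train(rng.random((20_000, d), dtype=np.float32))
faiss.write_index(trained, "trained.index")

# 2. Populate a shard from the trained index and write it to disk.
shard = faiss.read_index("trained.index")
shard.add_with_ids(rng.random((100_000, d), dtype=np.float32),
                   np.arange(100_000, dtype=np.int64))
faiss.write_index(shard, "shard_0.index")

# 3. Merge the shard(s) into one on-disk inverted-list file; the resulting
#    index keeps centroids in memory and reads list contents from merged.ivfdata.
index = faiss.read_index("trained.index")
merge_ondisk(index, ["shard_0.index"], "merged.ivfdata")
faiss.write_index(index, "populated.index")

# 4. Search: centroid comparison happens in RAM, vectors are fetched from disk.
index = faiss.read_index("populated.index", faiss.IO_FLAG_ONDISK_SAME_DIR)
index.nprobe = 8                                # clusters scanned per query
distances, ids = index.search(rng.random((1, d), dtype=np.float32), 10)
```

Raising nprobe fetches more clusters from disk per query, trading latency for recall, which is exactly the tuning knob mentioned above.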
2. Streaming Data in Chunks
Streaming involves processing the dataset incrementally by loading subsets (chunks) into RAM as needed. For instance, a dataset partitioned into shards can be searched sequentially: each shard is loaded, queried, and the results are aggregated. Memory-mapped files (e.g., Python's numpy.memmap) allow efficient access to on-disk data without explicit chunking, letting the operating system handle paging. Another option is a database such as SQLite with a vector extension, which can stream results from disk. While streaming avoids high RAM usage, it may increase query latency due to repeated disk reads, so it works best for batch processing or when latency isn't critical, such as offline recommendation systems. Optimizations include pre-filtering data (e.g., metadata-based partitioning) to reduce the number of chunks scanned.
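A minimal sketch of the chunked approach, assuming the vectors are stored row-major as float32 in a flat binary file and scored by inner product (the file name, sizes, and function name are illustrative):

```python
import numpy as np

def search_chunked(path, n_vectors, dim, query, k=10, chunk_size=100_000):
    """Stream an on-disk float32 matrix chunk by chunk and keep a running top-k."""
    data = np.memmap(path, dtype=np.float32, mode="r", shape=(n_vectors, dim))
    best_scores = np.full(k, -np.inf, dtype=np.float32)
    best_ids = np.full(k, -1, dtype=np.int64)
    for start in range(0, n_vectors, chunk_size):
        chunk = np.asarray(data[start:start + chunk_size])   # page this slice into RAM
        scores = chunk @ query                                # inner-product similarity
        ids = np.arange(start, start + len(chunk), dtype=np.int64)
        merged_scores = np.concatenate([best_scores, scores])
        merged_ids = np.concatenate([best_ids, ids])
        top = np.argsort(merged_scores)[::-1][:k]             # keep the best k seen so far
        best_scores, best_ids = merged_scores[top], merged_ids[top]
    return best_ids, best_scores

# Example: write 200k random 64-d vectors to disk, then search them in chunks.
n, d = 200_000, 64
store = np.memmap("vectors.f32", dtype=np.float32, mode="w+", shape=(n, d))
store[:] = np.random.rand(n, d).astype(np.float32)           # in practice written incrementally
store.flush()
query = np.random.rand(d).astype(np.float32)
ids, scores = search_chunked("vectors.f32", n, d, query, k=5)
```

Only one chunk is resident at a time, so peak memory is roughly chunk_size × dim × 4 bytes regardless of the total dataset size.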
3. Hierarchical Indexing
Hierarchical methods structure the index into layers to narrow the search scope efficiently. For example, HNSW (Hierarchical Navigable Small World) builds a multi-layered graph in which the top layers hold coarse-grained references to the layers below. A search starts at the top layer to find approximate candidates and refines the results in deeper layers. For large datasets, the upper layers can reside in RAM while the lower layers (containing the detailed data) are stored on disk. Similarly, tree-based indexes like Annoy partition the data into a hierarchy of nodes, and only the nodes visited during traversal need to be loaded. This minimizes disk access but requires balancing tree depth against node size. Hybrid approaches, such as combining IVF with HNSW, can improve performance further by using a coarse in-memory index to guide disk-based searches.
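For the hybrid idea just mentioned, FAISS's index_factory can build an IVF index whose coarse quantizer is itself an HNSW graph over the centroids. The factory string and parameters below are illustrative, and the inverted lists could additionally be moved to disk as in the first example.

```python
import numpy as np
import faiss

d = 64
# "IVF1024_HNSW32,PQ16": 1024 IVF clusters whose centroids are navigated with an
# HNSW graph (M=32), and vectors stored as 16-byte product-quantized codes.
index = faiss.index_factory(d, "IVF1024_HNSW32,PQ16")

train = np.random.rand(50_000, d).astype("float32")  # placeholder training data
index.train(train)   # learns centroids, the HNSW graph over them, and the PQ codebooks
index.add(train)

index.nprobe = 16    # clusters visited per query; higher = better recall, slower
query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 10)
```

The HNSW layer makes the centroid lookup fast even when there are millions of clusters, so the expensive per-cluster data can stay compressed or on disk.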
In practice, these methods are often combined, for example using a hierarchical index to reduce disk I/O and streaming to handle individual partitions. The choice depends on latency requirements, dataset size, and hardware constraints (e.g., SSD vs. HDD). Tools like Milvus or Vespa abstract many of these optimizations, letting developers configure the trade-offs without implementing them at a low level.
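As one example of that kind of configuration, a Milvus collection can be switched to a disk-resident index type with a few lines of pymilvus. This is a sketch assuming a pymilvus 2.x client and a Milvus server running locally; the collection name, field names, and parameters are illustrative.

```python
from pymilvus import connections, Collection, CollectionSchema, FieldSchema, DataType

connections.connect(host="localhost", port="19530")   # assumes a running Milvus instance

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=128),
]
collection = Collection("docs", CollectionSchema(fields))

# DISKANN keeps compressed vectors in memory and the full-precision data on disk
# (NVMe recommended); swapping index_type (e.g., to IVF_FLAT or HNSW) changes the
# speed/accuracy/memory trade-off without touching application code.
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "DISKANN", "metric_type": "L2", "params": {}},
)
```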