Using disk-based approximate nearest neighbor (ANN) methods, where part of the index resides on SSDs or HDDs, typically increases query latency compared to fully in-memory indices. This is because disk access is far slower than RAM: reading from RAM takes on the order of nanoseconds, even fast SSDs introduce microsecond-level delays per random read, and HDDs can add milliseconds due to mechanical seek times. When a query must load portions of the index from disk, this I/O overhead becomes the bottleneck, and total latency grows with the number of disk accesses per query, particularly for algorithms that follow random access patterns, since every random read pays the device's full access cost.
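To make the scaling concrete, here is a rough back-of-envelope sketch of how storage-bound latency grows with the number of index accesses per query; the per-access latencies and the hop count are illustrative assumptions, not measurements of any particular system.

```python
# Back-of-envelope model: per-query latency as a function of random accesses.
# The device latencies below are illustrative assumptions, not benchmarks.

ACCESS_LATENCY_S = {
    "ram": 100e-9,   # ~100 ns per access (assumed)
    "ssd": 100e-6,   # ~100 us per random read (assumed)
    "hdd": 10e-3,    # ~10 ms per seek + read (assumed)
}

def estimated_query_latency(medium: str, accesses_per_query: int) -> float:
    """Estimate storage-bound query latency in seconds for a given medium."""
    return accesses_per_query * ACCESS_LATENCY_S[medium]

if __name__ == "__main__":
    hops = 200  # node visits per query; illustrative
    for medium in ("ram", "ssd", "hdd"):
        latency_ms = estimated_query_latency(medium, hops) * 1e3
        print(f"{medium}: ~{latency_ms:.3f} ms for {hops} accesses")
```

Under these assumed numbers, the same 200-hop search costs microseconds from RAM, tens of milliseconds from SSD, and seconds from HDD, which is why reducing the number of disk reads per query matters so much.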
The impact depends on the algorithm's design and how it balances disk usage. Methods that lay data out for locality (e.g., grouping related data blocks) can mitigate latency by reducing random I/O. For instance, DiskANN stores each vector together with its adjacency list so that a single SSD read serves one hop of the graph search, keeps compressed vectors in RAM to guide candidate selection, and batches reads to cut round trips per query. Even with such optimizations, disk-based ANN systems often rely on larger batch sizes or pre-fetching strategies to amortize I/O costs. In contrast, in-memory indices like HNSW (Hierarchical Navigable Small World) avoid disk entirely, enabling low-latency, single-vector queries without I/O stalls. The gap widens for queries involving many hops or large candidate pools, where in-memory traversal is significantly faster.
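As a point of comparison, a purely in-memory HNSW index has no I/O on the query path. The sketch below uses the hnswlib package (assuming it is installed); the dataset size and the parameters M, ef_construction, and ef are illustrative, not tuned recommendations.

```python
# Minimal in-memory HNSW example with hnswlib (pip install hnswlib).
import numpy as np
import hnswlib

dim, num_elements = 128, 100_000
data = np.random.rand(num_elements, dim).astype(np.float32)

# Build the index entirely in RAM; M and ef_construction control graph density.
index = hnswlib.Index(space="l2", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))

# ef sets the search-time candidate pool; larger values trade latency for recall.
index.set_ef(64)
labels, distances = index.knn_query(data[:10], k=10)
print(labels.shape, distances.shape)  # (10, 10) (10, 10)
```

Every hop of the search here touches RAM only, so raising ef enlarges the candidate pool and improves recall at the cost of more in-memory work rather than additional disk reads.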
The trade-off is capacity versus speed. Disk-based ANN allows handling datasets that exceed RAM limits—critical for applications like billion-scale image retrieval—but at the cost of higher and less predictable latency. For example, a fully in-memory index might answer queries in 1-2 milliseconds, while a disk-optimized system could take 10-50 milliseconds, depending on SSD speed and data layout. Developers must choose based on their workload: in-memory is preferable for real-time applications (e.g., recommender systems), while disk-based methods suit scenarios where scaling to massive data justifies slower responses. Hybrid approaches, like caching frequently accessed index portions in RAM, can partially bridge this gap but add complexity.
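One way to implement the hybrid idea is to keep a bounded RAM cache in front of the disk-resident portion of the index. The sketch below is a minimal illustration, assuming a hypothetical flat file of float32 vectors laid out in fixed-size blocks; real systems typically cache graph neighborhoods or quantized data rather than raw vectors.

```python
# Sketch of a RAM cache in front of disk-resident index blocks (hybrid approach).
# File name, block layout, and sizes are hypothetical assumptions for illustration.
from functools import lru_cache
import numpy as np

DIM = 128
VECTORS_PER_BLOCK = 256
BLOCK_BYTES = VECTORS_PER_BLOCK * DIM * 4   # float32 vectors
INDEX_FILE = "vectors.f32"                  # hypothetical on-disk layout

@lru_cache(maxsize=1024)                    # hot blocks stay resident in RAM
def load_block(block_id: int) -> np.ndarray:
    """Read one contiguous block of vectors from disk (one sequential read)."""
    with open(INDEX_FILE, "rb") as f:
        f.seek(block_id * BLOCK_BYTES)
        buf = f.read(BLOCK_BYTES)
    return np.frombuffer(buf, dtype=np.float32).reshape(-1, DIM)

def get_vector(vector_id: int) -> np.ndarray:
    """Fetch one vector; repeated hits on a cached block avoid the disk read."""
    block_id, offset = divmod(vector_id, VECTORS_PER_BLOCK)
    return load_block(block_id)[offset]
```

Lookups that land in an already-cached block are served from RAM, so frequently traversed regions of the index approach in-memory latency while cold regions still pay the full disk cost, which is exactly the added complexity and partial benefit the hybrid approach trades on.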
