Libraries like FAISS optimize vector search on CPUs through three main strategies: efficient indexing, SIMD parallelism, and memory hierarchy optimizations. First, FAISS uses indexing structures like the Inverted File (IVF) to partition vectors into clusters, reducing search scope by focusing on relevant subsets. For example, with 1,024 clusters and a probe count (nprobe) of around 100, each query scans only about 10% of the dataset. Second, Single Instruction Multiple Data (SIMD) instructions (e.g., AVX2) accelerate distance computations by processing multiple vector elements in parallel. This is critical for brute-force or compressed-domain comparisons (e.g., using Product Quantization (PQ)). Third, FAISS minimizes cache misses by organizing data in contiguous blocks and precomputing lookup tables for quantized vectors, reducing redundant calculations during searches. These CPU optimizations prioritize latency reduction and memory efficiency, as RAM bandwidth and cache locality are the key bottlenecks.
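As a rough illustration of the IVF-plus-PQ setup described above, the sketch below builds an index with 1,024 clusters over random data and probes roughly 10% of them per query. The dataset sizes and PQ parameters (16 subquantizers, 8 bits each) are illustrative assumptions, not tuned recommendations.

```python
import numpy as np
import faiss  # assumes the faiss-cpu package is installed

d, nb, nq = 128, 100_000, 10               # dimensions, database size, query count (illustrative)
xb = np.random.rand(nb, d).astype("float32")
xq = np.random.rand(nq, d).astype("float32")

nlist = 1024                               # number of IVF clusters, as in the text
quantizer = faiss.IndexFlatL2(d)           # coarse quantizer used for cluster assignment
index = faiss.IndexIVFPQ(quantizer, d, nlist, 16, 8)  # PQ: 16 subquantizers, 8 bits each

index.train(xb)                            # learn cluster centroids and PQ codebooks
index.add(xb)                              # encode and store the database vectors

index.nprobe = 100                         # probe ~100/1,024 clusters, i.e. about 10% of the data
distances, ids = index.search(xq, 5)       # top-5 neighbors per query
```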
When using GPUs, FAISS shifts focus to maximizing parallelism and leveraging high memory bandwidth. GPUs excel at executing thousands of threads concurrently, so FAISS-GPU uses batched processing to compute distances across many vectors simultaneously. For example, a single query can be compared against a million database vectors in parallel across GPU cores, whereas a CPU would process them sequentially or in smaller SIMD-width batches. Memory access patterns also differ: GPU kernels optimize for coalesced memory reads (e.g., laying out vectors in GPU memory to avoid fragmented access) and use on-chip shared memory for frequently accessed data such as query vectors. Quantization (e.g., encoding residuals with PQ) further shrinks the memory footprint and the data that must be transferred between CPU and GPU. Unlike CPU-based precomputed tables, GPU kernels often recompute distance terms on the fly to avoid saturating limited GPU memory.
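A minimal sketch of moving a brute-force index onto the GPU and issuing a large query batch, assuming a faiss-gpu build and an available CUDA device; the sizes are arbitrary placeholders.

```python
import numpy as np
import faiss  # assumes a faiss-gpu build with a visible CUDA device

d, nb = 128, 1_000_000
xb = np.random.rand(nb, d).astype("float32")

cpu_index = faiss.IndexFlatL2(d)       # brute-force index built on the CPU
cpu_index.add(xb)

res = faiss.StandardGpuResources()                      # manages GPU memory and streams
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)   # copy the index to GPU 0

# Batch many queries together so the GPU's massive parallelism is actually used.
xq = np.random.rand(4096, d).astype("float32")
distances, ids = gpu_index.search(xq, 10)
```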
The key difference lies in how CPUs and GPUs balance parallelism against memory constraints. CPUs rely on thread-level parallelism (e.g., OpenMP) plus SIMD for moderate throughput at low latency, while GPUs exploit massive data parallelism but require careful memory management to avoid bottlenecks. For instance, the CPU IVF implementation assigns queries to their nearest clusters exactly, whereas GPU implementations may approximate intermediate steps to reduce synchronization costs across threads. Similarly, PQ on CPUs stores precomputed lookup tables in RAM, while GPU versions may trade extra compute for reduced memory usage. These choices reflect the hardware trade-offs: CPUs prioritize flexibility and low latency for small batches, while GPUs target high throughput for large-scale, batched queries.
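To see the latency-versus-throughput trade-off directly, a purely illustrative timing sketch could compare single queries against large batches on both devices; the actual numbers depend entirely on the hardware, the index type, and its parameters.

```python
import time
import numpy as np
import faiss  # assumes both CPU support and a faiss-gpu build are available

d, nb = 128, 500_000
xb = np.random.rand(nb, d).astype("float32")

cpu_index = faiss.IndexFlatL2(d)
cpu_index.add(xb)
faiss.omp_set_num_threads(8)           # CPU side: thread-level parallelism via OpenMP

res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)

def timed_search(index, batch_size, k=10):
    """Time one search call over a random query batch of the given size."""
    xq = np.random.rand(batch_size, d).astype("float32")
    start = time.perf_counter()
    index.search(xq, k)
    return time.perf_counter() - start

# Single-query latency tends to favor the CPU; large batches favor the GPU.
print("CPU, 1 query:      %.4fs" % timed_search(cpu_index, 1))
print("GPU, 1 query:      %.4fs" % timed_search(gpu_index, 1))
print("CPU, 4096 queries: %.4fs" % timed_search(cpu_index, 4096))
print("GPU, 4096 queries: %.4fs" % timed_search(gpu_index, 4096))
```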