Increasing the number of probes (e.g., `nprobe` in FAISS IVF indexes) or search depth parameters (e.g., `efSearch` in HNSW graphs) directly increases query latency because the system evaluates more candidate data points or traverses more nodes. For example, a higher `nprobe` forces a vector database to search more clusters, while a larger `efSearch` expands the candidate pool in a graph-based index. This expanded search improves recall by reducing the chance of missing true nearest neighbors, but it also requires more compute operations and memory accesses, which slows down queries. The relationship is often linear or sub-linear: doubling `nprobe` might roughly double latency, depending on the index structure and data distribution.
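The mechanics can be seen in a minimal pure-Python sketch of an IVF-style index. This is a toy illustration, not FAISS itself: random vectors stand in for a real dataset, and randomly sampled points stand in for trained k-means centroids. The number of candidates scanned (a proxy for latency) grows with `nprobe`, and so does recall:

```python
import random

random.seed(0)

DIM, N_VECS, N_CLUSTERS, K = 8, 2000, 20, 10

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Toy dataset; random sample stands in for trained k-means centroids.
data = [[random.random() for _ in range(DIM)] for _ in range(N_VECS)]
centroids = random.sample(data, N_CLUSTERS)

# Assign each vector to its nearest centroid (the IVF inverted lists).
lists = {c: [] for c in range(N_CLUSTERS)}
for i, v in enumerate(data):
    c = min(range(N_CLUSTERS), key=lambda j: dist2(v, centroids[j]))
    lists[c].append(i)

def ivf_search(query, nprobe):
    """Probe only the nprobe nearest clusters, then scan their lists."""
    probed = sorted(range(N_CLUSTERS),
                    key=lambda j: dist2(query, centroids[j]))[:nprobe]
    cand = [i for c in probed for i in lists[c]]
    scanned = len(cand)  # proxy for latency: work grows with nprobe
    top = sorted(cand, key=lambda i: dist2(query, data[i]))[:K]
    return top, scanned

query = [random.random() for _ in range(DIM)]
# Exact ground truth for recall@K, via brute-force scan.
truth = set(sorted(range(N_VECS), key=lambda i: dist2(query, data[i]))[:K])

for nprobe in (1, 5, 20):
    top, scanned = ivf_search(query, nprobe)
    recall = len(set(top) & truth) / K
    print(f"nprobe={nprobe:2d}  scanned={scanned:4d}  recall@{K}={recall:.2f}")
```

With `nprobe` equal to the number of clusters, every vector is scanned and the search degenerates to exact brute force, which is exactly why recall saturates while cost keeps climbing.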
To find an optimal balance, start by benchmarking recall and latency across a range of parameter values. For instance, test `nprobe` from 10 to 200 in increments of 20, or `efSearch` from 50 to 500, while measuring recall (e.g., recall@10) and query time. Plotting these results on a trade-off curve helps identify the "knee" where further increases in the parameter yield diminishing recall gains but significant latency penalties. For example, if recall plateaus at `nprobe=80` while latency continues rising, that value becomes a candidate for optimization. Use application-specific requirements to set thresholds: a recommendation system might tolerate 50ms latency for 95% recall, while a real-time app might cap latency at 10ms even if recall drops to 85%.
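Once the sweep is done, selecting a setting against an SLA can be automated. A small sketch, using hypothetical sweep numbers for illustration (in practice each tuple would come from running your query workload at that setting):

```python
# Hypothetical sweep results: (nprobe, recall@10, p95 latency in ms).
# In practice, measure these by replaying real queries at each setting.
sweep = [
    (10, 0.78, 4.1), (30, 0.88, 7.9), (50, 0.93, 11.6),
    (80, 0.95, 16.8), (100, 0.953, 20.5), (150, 0.955, 29.7),
]

def pick_setting(sweep, min_recall, max_latency_ms):
    """Smallest parameter value meeting both the recall floor and latency cap."""
    for param, recall, latency in sorted(sweep):
        if recall >= min_recall and latency <= max_latency_ms:
            return param
    return None  # no setting satisfies both constraints

print(pick_setting(sweep, min_recall=0.95, max_latency_ms=20))  # 80
print(pick_setting(sweep, min_recall=0.85, max_latency_ms=10))  # 30
```

Note how the two SLAs from the text land on very different settings: the recall-first workload accepts `nprobe=80`, while the latency-capped one backs off to 30.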
Practical tuning also depends on data characteristics. High-dimensional datasets or those with dense clusters often require larger `nprobe` or `efSearch` values to maintain recall. Tools like FAISS's `ParameterSpace` autotuner or community suites such as ann-benchmarks can automate parameter sweeps. Additionally, hardware constraints (e.g., CPU cache size, GPU memory) influence practical limits; an excessive `efSearch` might cause cache thrashing, degrading performance. Validate settings on a holdout query set to ensure they generalize. Iterate by gradually increasing parameters until latency exceeds acceptable bounds, then scale back slightly to find the best compromise.
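That final increase-then-back-off loop can be sketched generically. Here `fake_measure` is a stand-in for a real benchmark call (it just models latency growing linearly with `efSearch`); plug in your own measurement function:

```python
def tune_until_latency_bound(measure, params, max_latency_ms):
    """Walk candidate parameter values upward; stop when measured latency
    exceeds the bound and return the last acceptable value (None if even
    the smallest candidate is too slow)."""
    best = None
    for p in sorted(params):
        if measure(p) > max_latency_ms:
            break
        best = p
    return best

# Stand-in benchmark: latency (ms) grows roughly linearly with efSearch.
fake_measure = lambda ef: 0.05 * ef + 2.0

best_ef = tune_until_latency_bound(fake_measure, range(50, 501, 50),
                                   max_latency_ms=15)
print(best_ef)  # 250: the largest efSearch still under the 15ms budget
```

Because the loop stops at the first violation, it naturally implements "increase until latency exceeds bounds, then scale back"; rerun it against the holdout query set before promoting the setting to production.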