Lexical search can help reduce vector retrieval latency by acting as an efficient pre-filter before performing computationally expensive vector similarity searches. Vector retrieval often involves approximate nearest neighbor (ANN) algorithms, which operate over high-dimensional embeddings stored in systems like Milvus. While ANN indexes are highly optimized, searching across millions or billions of embeddings can still be costly. Lexical search provides a lightweight way to narrow the candidate set quickly, using keyword-based filtering or BM25 scoring to select a subset of potentially relevant documents before performing the vector search.
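As a rough illustration of that lexical stage, the sketch below uses the open-source rank_bm25 package to score a toy corpus against a query and keep only the top candidates. The corpus, the whitespace tokenizer, and the cutoff are placeholder assumptions for the example, not part of any specific system.

```python
from rank_bm25 import BM25Okapi

# Hypothetical mini-corpus; in practice this would be a large collection
# indexed once into an inverted index, not rebuilt per query.
corpus = [
    "milvus is a vector database for similarity search",
    "bm25 ranks documents by term frequency and rarity",
    "gardening tips for growing tomatoes at home",
]
tokenized = [doc.split() for doc in corpus]  # naive whitespace tokenizer (assumption)

bm25 = BM25Okapi(tokenized)

query = "vector similarity search"
scores = bm25.get_scores(query.split())

# Keep only the highest-scoring candidates for the expensive vector stage.
TOP_K = 2  # placeholder cutoff; production systems often use on the order of 1,000
candidate_ids = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)[:TOP_K]
print(candidate_ids)  # the gardening document never reaches the vector stage
```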
For example, a system might first use lexical search to select the top 1,000 documents based on exact or fuzzy term matching, and then pass those candidates into Milvus for semantic ranking. This significantly reduces the number of embeddings that need to be compared, lowering CPU or GPU workload and improving response time. Since lexical search runs on inverted indexes, which are efficient even for large text collections, it can serve as a low-latency stage in hybrid retrieval pipelines.
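A minimal sketch of that hand-off might look like the following, assuming a running Milvus instance with a collection named "docs" that has a doc_id primary key and a 768-dimensional embedding field (all placeholder names and values). The candidate IDs from the lexical stage are passed as a boolean filter expression, so Milvus only compares embeddings within that subset.

```python
from pymilvus import MilvusClient

client = MilvusClient(uri="http://localhost:19530")  # assumed local deployment

# Top doc IDs from the lexical stage (a short placeholder list here;
# in the scenario above this would be ~1,000 IDs).
candidate_ids = [3, 17, 42, 105]
id_filter = f"doc_id in {candidate_ids}"  # e.g. "doc_id in [3, 17, 42, 105]"

# The query vector would come from the same embedding model used at index time.
query_vector = [0.1] * 768  # placeholder embedding

results = client.search(
    collection_name="docs",   # assumed collection name
    data=[query_vector],
    filter=id_filter,         # restricts the ANN search to the lexical candidates
    limit=10,
    output_fields=["doc_id", "text"],
)
```

Because the filter constrains the search to the lexical candidates, the cost of the vector stage scales with the candidate count rather than the full collection size, which is the latency win the two-stage design is after.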
By combining both approaches, developers achieve a balance between speed and semantic accuracy. Lexical search ensures that only textually relevant items reach the vector stage, avoiding wasteful embedding comparisons for unrelated documents. Milvus then performs deeper semantic ranking on a much smaller set, preserving quality without increasing latency. This two-tiered strategy of lexical pre-filtering followed by vector refinement is common in production search systems where low latency and high relevance are both required, such as chatbots, recommendation engines, and code search tools. It is an effective engineering approach to optimizing vector retrieval without sacrificing result quality.
