Lexical search performance is optimized by specialized data structures that efficiently store and retrieve text tokens, with the inverted index being the most essential. An inverted index maps each unique term in the corpus to a list of documents (or positions within documents) that contain that term. This structure allows the system to quickly locate all occurrences of a given word without scanning the entire dataset. Each entry typically includes term frequency, document frequency, and positional data to support scoring algorithms like TF-IDF and BM25. For example, when searching “vector database,” the inverted index directly retrieves all documents containing both “vector” and “database,” enabling fast intersection and ranking.
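The idea above can be sketched in a few lines of Python. This is a minimal toy index, not a production engine: documents and the `search_and` helper are invented for illustration, term frequency is tracked per document, and document frequency falls out as the length of each term's entry.

```python
from collections import defaultdict

# Toy corpus: doc_id -> text. Real systems would tokenize and normalize properly.
docs = {
    0: "milvus is a vector database",
    1: "an inverted index maps each term to documents",
    2: "a vector database stores dense embeddings",
}

# Inverted index: term -> {doc_id: term_frequency}.
# Document frequency for a term is simply len(index[term]).
index = defaultdict(dict)
for doc_id, text in docs.items():
    for token in text.split():
        index[token][doc_id] = index[token].get(doc_id, 0) + 1

def search_and(*terms):
    """Return doc IDs containing every query term (Boolean AND via set intersection)."""
    postings = [set(index.get(t, {})) for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(search_and("vector", "database"))  # -> [0, 2]
```

Scoring functions like TF-IDF or BM25 would then read the stored term and document frequencies to rank the intersected results.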
Within the inverted index itself, posting lists and skip lists further improve query speed. A posting list stores the document IDs associated with each term in sorted order, while skip lists add "jump pointers" that let the engine leap over large runs of irrelevant entries during search. These structures allow the engine to efficiently perform Boolean operations such as AND, OR, and NOT across multiple query terms. Tries (prefix trees) support autocomplete and prefix matching, enabling quick lookups for partial queries like "datab…" while a user is still typing "database." Additionally, Bloom filters can reduce disk reads by quickly checking whether a term might exist in a given document before performing a more expensive lookup.
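A sketch of the classic skip-pointer intersection may make the "jump pointers" concrete. This follows the textbook scheme of placing a skip every roughly √n entries in a sorted posting list; the function name and spacing choice are illustrative, not drawn from any particular engine.

```python
import math

def intersect_with_skips(a, b):
    """Intersect two sorted posting lists of doc IDs, using skip pointers
    spaced every ~sqrt(len) entries to hop past irrelevant runs."""
    skip_a = max(1, math.isqrt(len(a)))
    skip_b = max(1, math.isqrt(len(b)))
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Take the skip only if it does not overshoot the other list's value.
            if i + skip_a < len(a) and a[i + skip_a] <= b[j]:
                i += skip_a
            else:
                i += 1
        else:
            if j + skip_b < len(b) and b[j + skip_b] <= a[i]:
                j += skip_b
            else:
                j += 1
    return result

print(intersect_with_skips(list(range(0, 100, 2)), [4, 8, 50, 51, 98]))  # -> [4, 8, 50, 98]
```

When one list is much shorter than the other, the skips let the long list advance in large strides instead of one ID at a time, which is exactly the win skip lists buy for AND queries.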
When integrated with a vector database such as Milvus, these lexical data structures complement vector indexes like HNSW or IVF by handling the symbolic component of hybrid search: they perform fast exact matching and filtering, while Milvus manages dense embedding comparisons. Developers can apply lexical pre-filters to shrink the candidate set before running semantic similarity computations. Together, inverted indexes and vector indexes form the backbone of modern hybrid retrieval systems, combining lexical precision with semantic relevance efficiently at scale.
