Lexical search handles multilingual content by tokenizing and indexing text according to language-specific rules. It depends on how text is split into tokens and normalized before indexing, and both steps vary between languages. For example, English can largely be tokenized on whitespace and punctuation, whereas Chinese and Japanese are written without spaces between words and need a dedicated segmenter to identify word boundaries. Many lexical systems provide language-specific analyzers that combine a tokenizer with filters such as lowercasing, stemming, and stop-word removal. When a document collection includes multiple languages, developers often create a separate index or field per language, or detect each document's language at indexing time and route it to the matching analyzer.
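The per-language analysis described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine implements it: the `analyze` and `build_indexes` helpers and the tiny stop-word lists are hypothetical, and overlapping character bigrams stand in for a real Chinese or Japanese segmenter.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word lists (assumption); real analyzers ship larger ones.
STOP_WORDS = {
    "en": {"the", "a", "is", "of"},
    "de": {"der", "die", "das", "ist"},
}

def analyze(text, lang):
    """Turn raw text into index tokens using rules chosen by language."""
    text = text.lower()
    if lang in ("zh", "ja"):
        # No spaces between words: fall back to overlapping character
        # bigrams, a simple stand-in for a real segmentation model.
        chars = [c for c in text if not c.isspace()]
        return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)] or chars
    tokens = re.findall(r"\w+", text)
    return [t for t in tokens if t not in STOP_WORDS.get(lang, set())]

def build_indexes(docs):
    """Build one inverted index per language: lang -> token -> set of doc ids."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, lang, text in docs:
        for token in analyze(text, lang):
            indexes[lang][token].add(doc_id)
    return indexes

docs = [
    (1, "en", "The car is fast"),
    (2, "de", "Das Auto ist schnell"),
    (3, "zh", "汽车很快"),
]
indexes = build_indexes(docs)
print(sorted(indexes["en"]))  # tokens stored in the English index
print(sorted(indexes["zh"]))  # bigram tokens stored in the Chinese index
```

Keeping one index per language means a query must be analyzed with the same rules as the index it is sent to, which is why language detection at query time matters as much as at indexing time.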
However, lexical search has an inherent limitation with multilingual data: it matches tokens literally. A search for “car” will not return documents that contain only “auto” in German or “voiture” in French. Developers can mitigate this with language-specific synonym lists or multilingual analyzers that map equivalent words across languages. Another approach is translation preprocessing: translating all queries or documents into a common language before indexing or searching. This keeps matching consistent, but it adds latency and can reduce accuracy when translations are imprecise.
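To make the literal-matching problem and the synonym workaround concrete, here is a small sketch. The `CANONICAL` table and the `canonicalize` and `search` helpers are hypothetical names invented for this example; the idea is simply that equivalent words are mapped to one shared token on both the document side and the query side.

```python
# Hypothetical cross-language synonym table: surface form -> canonical token.
# A real deployment would scope entries by language to avoid false matches.
CANONICAL = {
    "auto": "car",      # German
    "voiture": "car",   # French
}

def canonicalize(tokens):
    """Replace each token with its canonical form when one is known."""
    return [CANONICAL.get(t, t) for t in tokens]

def search(query, docs, use_synonyms=False):
    """Return ids of docs that share at least one token with the query."""
    q = set(canonicalize(query) if use_synonyms else query)
    hits = []
    for doc_id, tokens in docs.items():
        doc_tokens = set(canonicalize(tokens) if use_synonyms else tokens)
        if q & doc_tokens:
            hits.append(doc_id)
    return sorted(hits)

docs = {
    "en-1": ["the", "car", "is", "fast"],
    "de-1": ["das", "auto", "ist", "schnell"],
    "fr-1": ["la", "voiture", "est", "rapide"],
}

print(search(["car"], docs))                     # literal matching only
print(search(["car"], docs, use_synonyms=True))  # synonyms applied on both sides
```

The table only covers words someone has listed, so this approach tends to work for a known vocabulary and degrade on open-ended queries.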
When paired with a vector database such as Milvus, lexical search’s weaknesses in multilingual scenarios can be offset by embedding-based retrieval. Embeddings from a multilingual model represent words and phrases in a shared semantic space where similar meanings align across languages. This means Milvus can retrieve conceptually similar content even when the query and the documents are written in different languages. Developers can then use lexical search for precision filtering and vector retrieval for semantic matching. Together, they provide a balanced solution: lexical search gives exact token control per language, while vector similarity extends matching across linguistic boundaries, enabling robust multilingual search applications.
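The combination can be illustrated with a toy in-memory example. It does not use the Milvus API: the three-dimensional vectors are made-up stand-ins for the output of a multilingual embedding model, and `hybrid_search`, `lexical_score`, and the `alpha` weight are hypothetical names chosen for this sketch. A weighted sum is only one of several ways to fuse the two signals.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lexical_score(query_tokens, doc_tokens):
    """Fraction of query tokens that literally appear in the document."""
    if not query_tokens:
        return 0.0
    return sum(1 for t in query_tokens if t in doc_tokens) / len(query_tokens)

def hybrid_search(query_tokens, query_vec, docs, alpha=0.5):
    """Rank docs by alpha * lexical score + (1 - alpha) * vector similarity."""
    scored = []
    for doc in docs:
        lex = lexical_score(query_tokens, doc["tokens"])
        sem = cosine(query_vec, doc["vector"])
        scored.append((alpha * lex + (1 - alpha) * sem, doc["id"]))
    scored.sort(reverse=True)
    return [(doc_id, round(score, 3)) for score, doc_id in scored]

# Made-up vectors (assumption): the English and German car sentences sit close
# together, as a multilingual embedding model would be expected to place them.
docs = [
    {"id": 1, "tokens": ["the", "car", "is", "fast"], "vector": [0.9, 0.1, 0.0]},
    {"id": 2, "tokens": ["das", "auto", "ist", "schnell"], "vector": [0.85, 0.15, 0.05]},
    {"id": 3, "tokens": ["the", "weather", "is", "nice"], "vector": [0.1, 0.9, 0.1]},
]

results = hybrid_search(["car"], [1.0, 0.0, 0.0], docs)
for doc_id, score in results:
    print(doc_id, score)
```

With `alpha=1.0` the ranking reduces to pure lexical matching and the German document scores zero; lowering `alpha` lets the vector similarity pull it above the unrelated English document.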
