The preprocessing needed for embed-multilingual-v3.0 is intentionally simple, but it must be consistent and language-aware. At a minimum, you should clean obvious noise such as HTML boilerplate, navigation text, repeated headers, and footer content. Normalize whitespace and encoding, but avoid aggressive transformations that change meaning, especially in non-Latin scripts. Unlike keyword search, embeddings benefit from preserving natural sentence structure and context, so the goal is clarity, not heavy normalization.
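As a rough illustration, the sketch below (standard-library Python only) applies this kind of light cleanup. The tag-stripping regex is a crude stand-in for a proper HTML parser, and NFC normalization fixes encoding inconsistencies without transliterating or altering non-Latin scripts:

```python
import html
import re
import unicodedata

def clean_text(raw: str) -> str:
    """Light, language-agnostic cleanup before embedding.

    Removes HTML remnants and normalizes whitespace and encoding without
    lowercasing, stripping accents, or otherwise changing meaning.
    """
    text = html.unescape(raw)                  # decode entities like &amp;
    text = re.sub(r"<[^>]+>", " ", text)       # drop leftover HTML tags (crude)
    text = unicodedata.normalize("NFC", text)  # canonical Unicode form
    text = re.sub(r"[ \t]+", " ", text)        # collapse horizontal whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)     # cap blank lines, keep paragraph breaks
    return text.strip()
```

Note that paragraph breaks are deliberately preserved rather than flattened: they are exactly the structural signal the chunking step below relies on.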
Chunking is the most important preprocessing step. Long documents should be split into semantically coherent chunks rather than arbitrary character lengths. For multilingual content, chunk by structure (headings, paragraphs, sections) rather than language-specific rules. Each chunk should carry enough context to be meaningful on its own. When storing embeddings in a vector database such as Milvus or Zilliz Cloud, always attach metadata: language, doc_id, title, section, source_url, and possibly region or product. This metadata enables language-aware retrieval and filtering later, which is critical in multilingual systems.
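To make this concrete, here is a minimal sketch of structural chunking plus ingestion, assuming the Cohere and pymilvus Python SDKs. The collection name "docs", the API key placeholder, and the sample German document metadata are all hypothetical:

```python
import re

import cohere
from pymilvus import MilvusClient

def chunk_by_structure(text: str, doc_meta: dict, max_chars: int = 1200) -> list[dict]:
    """Split on blank lines (paragraph boundaries) and pack paragraphs into
    chunks of roughly max_chars, so no chunk is cut mid-thought. Each chunk
    record carries the document's shared metadata."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return [{"text": c, "section": i, **doc_meta} for i, c in enumerate(chunks)]

records = chunk_by_structure(
    clean_text(raw_document),  # raw_document: hypothetical input string
    {"language": "de", "doc_id": "kb-0042", "title": "Rückgaberichtlinie",
     "source_url": "https://example.com/returns"},
)

co = cohere.Client("YOUR_API_KEY")  # placeholder key
vectors = co.embed(
    texts=[r["text"] for r in records],
    model="embed-multilingual-v3.0",
    input_type="search_document",  # documents at ingestion; queries use search_query
).embeddings

client = MilvusClient(uri="http://localhost:19530")  # or a Zilliz Cloud URI
client.create_collection(collection_name="docs", dimension=1024)  # v3 vectors are 1024-dim
client.insert(
    collection_name="docs",
    data=[{"id": i, "vector": v, **r} for i, (r, v) in enumerate(zip(records, vectors))],
)
```

With this quick-setup collection, the metadata fields ride along as dynamic fields, which is what makes language-aware filtering possible at query time.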
For queries, preprocessing should be minimal. Trim whitespace, keep the user’s original wording, and avoid forced translation unless your product explicitly requires it. embed-multilingual-v3.0 is designed to handle multilingual input directly, so over-processing queries can actually hurt retrieval. If your users often mix languages, keep that mixed input intact. Over time, you can refine preprocessing by inspecting failure cases: if certain languages or scripts perform poorly, you may add lightweight helpers such as embedding translated titles or summaries alongside original text. The key principle is consistency: whatever preprocessing you apply at ingestion must also be applied, conceptually, at query time.
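At query time, that consistency amounts to very little code. A sketch under the same assumptions as above (the "docs" collection, field names, and API key are placeholders); the only changes versus ingestion are input_type="search_query" and an optional metadata filter:

```python
import cohere
from pymilvus import MilvusClient

co = cohere.Client("YOUR_API_KEY")                   # placeholder key
client = MilvusClient(uri="http://localhost:19530")  # or a Zilliz Cloud URI

def search(query: str, language: str | None = None, top_k: int = 5):
    query = query.strip()  # minimal preprocessing: trim, keep original wording
    emb = co.embed(
        texts=[query],
        model="embed-multilingual-v3.0",
        input_type="search_query",  # query-side counterpart to search_document
    ).embeddings[0]
    return client.search(
        collection_name="docs",
        data=[emb],
        limit=top_k,
        filter=f'language == "{language}"' if language else "",
        output_fields=["text", "title", "source_url"],
    )

hits = search("Wie lange habe ich Zeit für eine Rückgabe?", language="de")
```

The language filter is optional by design: leaving it empty keeps retrieval fully multilingual, while setting it restricts results to the metadata stored at ingestion.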
For more resources, see https://zilliz.com/ai-models/embed-multilingual-v3.0
