The preprocessing that works best with embed-english-light-v3.0 is simple, consistent, and focused on preserving meaning while removing noise. Start by extracting the user-relevant text: remove navigation menus, cookie banners, repeated headers/footers, and template boilerplate. Normalize whitespace, keep sentence structure intact, and avoid transformations that change semantics. For developer-facing content, preserve code identifiers, error messages, configuration keys, and product names, because users often search with those exact strings and they anchor retrieval.
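The cleaning step above can be sketched as a small line filter. This is a minimal illustration, not a production extractor: the boilerplate patterns are hypothetical examples you would tune for your own pages, and real pipelines often use an HTML-aware extraction library instead.

```python
import re

# Hypothetical boilerplate markers -- tune these for your own site templates.
BOILERPLATE_PATTERNS = [
    re.compile(r"(?i)^accept (all )?cookies"),
    re.compile(r"(?i)^(home|docs|blog)( [>/] .+)?$"),  # breadcrumb-like nav lines
]

def clean_page_text(raw: str) -> str:
    """Drop obvious boilerplate lines and normalize whitespace,
    while keeping sentence structure and exact identifiers
    (code names, error messages, config keys) intact."""
    kept = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        if any(p.match(line) for p in BOILERPLATE_PATTERNS):
            continue
        # Collapse runs of spaces/tabs only; do not rewrite the words themselves.
        kept.append(re.sub(r"[ \t]+", " ", line))
    return "\n".join(kept)
```

Note that the filter only removes whole lines and collapses whitespace; it never rewrites tokens, so strings like `MAX_RETRIES=3` survive verbatim for exact-match retrieval.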
Chunking is the most important “preprocessing” decision for embedding-based search. Split documents into coherent passages that each express one idea, and consider adding a short prefix that provides context, such as the page title or section heading. A practical pattern is to embed a string like "Title: ...\nSection: ...\nContent: ..." for each chunk, then store it alongside metadata. If you’re using a vector database such as Milvus or Zilliz Cloud, include metadata fields that enable filters and better UX: doc_id, source_url, product, version, and updated_at. Also consider deduplicating near-identical chunks before embedding to reduce cost and prevent repetitive search results.
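The chunk-plus-metadata pattern above might look like the following sketch. The `Chunk` type, field names, and hash-based deduplication are illustrative assumptions, not a Milvus or Cohere API; the hash catches exact duplicates, and near-duplicate detection (e.g. MinHash) would be a natural extension.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str                                 # the string that actually gets embedded
    metadata: dict = field(default_factory=dict)

def build_chunk(title: str, section: str, content: str,
                doc_id: str, source_url: str, product: str,
                version: str, updated_at: str) -> Chunk:
    """Prefix each passage with its title/section so the embedding carries
    document context, and attach filterable metadata for the vector DB."""
    text = f"Title: {title}\nSection: {section}\nContent: {content}"
    return Chunk(text=text, metadata={
        "doc_id": doc_id, "source_url": source_url, "product": product,
        "version": version, "updated_at": updated_at,
        # A content hash makes exact-duplicate chunks cheap to detect.
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
    })

def dedupe(chunks: list) -> list:
    """Drop chunks whose content hash was already seen, before embedding."""
    seen, unique = set(), []
    for c in chunks:
        h = c.metadata["content_hash"]
        if h not in seen:
            seen.add(h)
            unique.append(c)
    return unique
```

Deduplicating before the embedding call both cuts API cost and keeps one near-identical passage from crowding the top of the search results.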
Finally, keep preprocessing stable across time and environments. If you change chunk size, heading prefixes, or cleaning rules, you effectively change the meaning distribution of your vectors, which can cause quality shifts and make debugging hard. Version your preprocessing pipeline, store the pipeline version with each vector record, and re-embed intentionally when you change it. For queries, prefer minimal processing: trim whitespace and keep the user’s wording, because “helpful normalization” can easily remove valuable intent signals.
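Pipeline versioning and minimal query handling can be captured in a few lines. The version string format and record layout here are assumptions for illustration; the point is that every stored vector carries the version of the pipeline that produced it, so stale records can be found and re-embedded deliberately.

```python
# Bump this whenever chunk size, prefixes, or cleaning rules change
# (the naming scheme is an example, not a convention from the model docs).
PIPELINE_VERSION = "clean-v2-chunk512"

def to_record(chunk_text: str, vector: list, metadata: dict) -> dict:
    """Attach the preprocessing pipeline version to each vector record."""
    return {
        "text": chunk_text,
        "vector": vector,
        **metadata,
        "pipeline_version": PIPELINE_VERSION,
    }

def preprocess_query(q: str) -> str:
    """Queries get minimal processing: trim whitespace, keep the
    user's exact wording so intent signals survive."""
    return q.strip()
```

With the version stored per record, a re-embedding job is just a filtered scan for `pipeline_version != PIPELINE_VERSION`.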
For more resources, see https://zilliz.com/ai-models/embed-english-light-v3.0
