The most important limitation is scope: jina-embeddings-v2-small-en is English-only, so it is not a safe default for multilingual datasets. If your corpus includes Japanese, Chinese, Spanish, or mixed-language content, embeddings may become noisy and similarity search quality will degrade. In a real system, this often shows up as confusing retrieval results where unrelated documents cluster together simply because the model cannot represent the non-English text well. If your data is “mostly English,” you still need to detect and handle non-English segments during ingestion to avoid polluting the vector index.
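As a sketch of that ingestion-time check, the snippet below drops non-English chunks before they reach the embedding model. It assumes the langdetect library; the helper name and chunk list are illustrative, and any language-ID tool would work the same way.

```python
from langdetect import detect, LangDetectException

def keep_english_chunks(chunks):
    """Filter out chunks that are not confidently English before embedding."""
    english_chunks = []
    for chunk in chunks:
        try:
            if detect(chunk) == "en":
                english_chunks.append(chunk)
        except LangDetectException:
            # Raised for very short or symbol-only text; better to skip
            # it than to embed noise into the index.
            continue
    return english_chunks

chunks = ["Rotate your API key every 90 days.", "APIキーは90日ごとに更新してください。"]
print(keep_english_chunks(chunks))  # keeps only the English chunk
```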
A second limitation is that being “small” is a tradeoff. Smaller models are generally fast and cost-effective, but they can struggle with fine-grained distinctions when many documents are very similar. For example, two API reference pages might share 90% overlapping phrasing and differ only in the meaning of a single parameter. The embeddings may be close enough that both pages land in the same top-k results, which is not wrong, but it may force you to add a reranking step or stronger metadata filters. In systems using Milvus or Zilliz Cloud, this often means leaning on structured fields (product, version, language, permissions) to narrow the candidate set before vector similarity is applied.
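Here is a rough sketch of that pattern with pymilvus: a scalar filter shrinks the candidate set, and vector similarity ranks only what survives. The collection name, field names (product, version, vector), and filter values are assumptions for illustration, not a fixed schema; the model load follows the usage shown on its model card.

```python
from pymilvus import connections, Collection
from sentence_transformers import SentenceTransformer

# Embed the query with the model under discussion (trust_remote_code is
# needed because the Jina models ship custom pooling code).
model = SentenceTransformer("jinaai/jina-embeddings-v2-small-en",
                            trust_remote_code=True)
query_embedding = model.encode("how do I rotate an API key?").tolist()

connections.connect(host="localhost", port="19530")
collection = Collection("docs")  # assumed collection with scalar metadata fields

results = collection.search(
    data=[query_embedding],
    anns_field="vector",  # assumed name of the embedding field
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=10,
    # The scalar filter narrows candidates before similarity is scored,
    # so near-duplicate pages from other products/versions never compete
    # in the top-k.
    expr='product == "api-gateway" and version == "2.1"',
    output_fields=["title", "url"],
)
for hit in results[0]:
    print(hit.distance, hit.entity.get("title"))
```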
A third limitation is document structure handling. The model treats whatever you feed it as plain text; it does not inherently “understand” Markdown tables, code blocks, or HTML layout in a special way. If you embed raw scraped pages with lots of navigation text, repeated headers, or code snippets, retrieval quality can suffer. The fix is usually simple but non-optional: clean the text, remove boilerplate, chunk at logical boundaries, and store helpful metadata. Most retrieval issues blamed on “the model” are actually pipeline issues: chunking strategy, preprocessing consistency, and evaluation methodology.
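As one example of chunking at logical boundaries, the sketch below splits cleaned Markdown at heading lines and falls back to paragraph breaks for oversized sections. The max_chars cap and heading depth are arbitrary choices for illustration, not recommendations tied to this model.

```python
import re

def chunk_markdown(text: str, max_chars: int = 2000):
    """Split cleaned Markdown at heading boundaries, then cap chunk size."""
    # Split immediately before lines starting with #, ##, or ### so each
    # chunk begins at a logical section boundary.
    sections = re.split(r"(?m)^(?=#{1,3} )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)
            continue
        # Oversized section: fall back to paragraph boundaries rather
        # than cutting mid-sentence.
        buf = ""
        for para in section.split("\n\n"):
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(buf.strip())
                buf = ""
            buf += para + "\n\n"
        if buf.strip():
            chunks.append(buf.strip())
    return chunks
```

Each chunk can then be stored alongside metadata (source URL, section title, product, version) so the retrieval layer has something to filter on besides raw similarity.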
For more information, see https://zilliz.com/ai-models/jina-embeddings-v2-small-en
