Microsoft / multilingual-e5-large
Milvus Integrated
Task: Embedding
Modality: Text
Similarity Metric: Any (Normalized)
License: MIT
Dimensions: 1024
Max Input Tokens: 512
Price: Free
Introduction to the multilingual-e5-large embedding model
- Tailored for multilingual documents; supports 100+ languages; ideal for multilingual information retrieval and semantic search tasks.
The multilingual-e5-large model is a state-of-the-art text embedding model developed by Microsoft on top of the XLM-RoBERTa-large architecture. With 24 layers and 560 million parameters, it generates 1024-dimensional embeddings and supports 100 languages, offering robust performance in multilingual contexts.
Trained on roughly one billion weakly supervised text pairs and then fine-tuned on supervised datasets, the model excels at multilingual information retrieval and semantic search. It expects inputs prefixed with "query: " or "passage: " so that queries and documents are embedded consistently. It outperforms smaller models and traditional methods on multilingual benchmarks, making it well suited for cross-lingual text analysis, clustering, and similarity comparison.
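Because the model was trained with these prefixes, embedding raw strings without them degrades retrieval quality. A minimal helper can make the convention hard to forget (the function name is illustrative, not part of any library):

```python
def e5_prefix(texts, kind):
    """Prepend the E5 input prefix ("query: " or "passage: ") to each text.

    multilingual-e5-large was trained with these prefixes, so queries and
    documents must be embedded with the matching prefix.
    """
    if kind not in ("query", "passage"):
        raise ValueError("kind must be 'query' or 'passage'")
    return [f"{kind}: {text}" for text in texts]

print(e5_prefix(["When was artificial intelligence founded?"], "query"))
# ['query: When was artificial intelligence founded?']
```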
How to create vector embeddings with the multilingual-e5-large model
There are two primary ways to create vector embeddings with the multilingual-e5-large model:
- PyMilvus: the Python SDK for Milvus, which integrates the multilingual-e5-large model through its model subpackage.
- SentenceTransformer: the Python library from sentence-transformers, which loads the model directly.
Once the vector embeddings are generated, they can be stored in Zilliz Cloud (a fully managed vector database service powered by Milvus) and used for semantic similarity search. Here are four key steps:
- Sign up for a Zilliz Cloud account for free.
- Set up a serverless cluster and obtain the Public Endpoint and API Key.
- Create a vector collection and insert your vector embeddings.
- Run a semantic search on the stored embeddings.
Generate vector embeddings via PyMilvus and insert them into Zilliz Cloud for semantic search
from pymilvus.model.dense import SentenceTransformerEmbeddingFunction
from pymilvus import MilvusClient

# Load the model through PyMilvus' embedding-function wrapper
ef = SentenceTransformerEmbeddingFunction("intfloat/multilingual-e5-large")

docs = [
    "passage: Artificial intelligence was founded as an academic discipline in 1956.",
    # German: "Alan Turing was the first person to conduct comprehensive research in the field of artificial intelligence."
    "passage: Alan Turing war die erste Person, die umfassende Forschungen im Bereich der künstlichen Intelligenz durchgeführt hat.",
    # Chinese: "Turing was born in Maida Vale, London, and grew up in southern England."
    "passage: 图灵出生在伦敦的梅达维尔,他在英格兰南部长大。",
]

# Generate embeddings for documents
docs_embeddings = ef(docs)

queries = [
    "query: When was artificial intelligence founded?",
    # German: "Where was Alan Turing born?"
    "query: Wo wurde Alan Turing geboren?",
]

# Generate embeddings for queries
query_embeddings = ef(queries)

# Connect to Zilliz Cloud with your Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

COLLECTION = "documents"

# Recreate the collection from scratch
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True,
)

# Insert each document together with its embedding
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run a semantic search for both queries
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"],
)
For more information, refer to our PyMilvus Embedding Model documentation.
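To inspect the search output, a small formatting helper can pair each query with its ranked hits. The result shape assumed below (one list of hit dicts per query, each with a "distance" and an "entity" payload holding the requested output fields) is based on MilvusClient's dictionary-style results and may vary across client versions:

```python
def format_hits(queries, results):
    """Pair each query with its ranked hits as readable lines.

    Assumes `results` mirrors MilvusClient.search output: one list of hit
    dicts per query, each with "distance" and an "entity" payload.
    """
    lines = []
    for query, hits in zip(queries, results):
        for rank, hit in enumerate(hits, start=1):
            lines.append(
                f"{query} -> #{rank} ({hit['distance']:.3f}): {hit['entity']['text']}"
            )
    return lines

# Hypothetical result payload, for illustration only
sample_results = [
    [{"id": 1, "distance": 0.892,
      "entity": {"text": "passage: Artificial intelligence was founded as an academic discipline in 1956."}}],
]
for line in format_hits(["query: When was artificial intelligence founded?"], sample_results):
    print(line)
```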
Generate vector embeddings via SentenceTransformer and insert them into Zilliz Cloud for semantic search
from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient

# Load the model directly via sentence-transformers
model = SentenceTransformer("intfloat/multilingual-e5-large")

docs = [
    "passage: Artificial intelligence was founded as an academic discipline in 1956.",
    # German: "Alan Turing was the first person to conduct comprehensive research in the field of artificial intelligence."
    "passage: Alan Turing war die erste Person, die umfassende Forschungen im Bereich der künstlichen Intelligenz durchgeführt hat.",
    # Chinese: "Turing was born in Maida Vale, London, and grew up in southern England."
    "passage: 图灵出生在伦敦的梅达维尔,他在英格兰南部长大。",
]

# Generate L2-normalized embeddings for documents
docs_embeddings = model.encode(docs, normalize_embeddings=True)

queries = [
    "query: When was artificial intelligence founded?",
    # German: "Where was Alan Turing born?"
    "query: Wo wurde Alan Turing geboren?",
]

# Generate L2-normalized embeddings for queries
query_embeddings = model.encode(queries, normalize_embeddings=True)

# Connect to Zilliz Cloud with your Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

COLLECTION = "documents"

# Recreate the collection from scratch
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=1024,  # multilingual-e5-large outputs 1024-dimensional vectors
    auto_id=True,
)

# Insert each document together with its embedding
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run a semantic search for both queries
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"],
)
For more information, refer to the SentenceTransformer documentation.
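The normalize_embeddings=True flag above L2-normalizes each vector, which is why the model card lists the similarity metric as "Any (Normalized)": on unit-length vectors, cosine similarity equals the plain inner product, so either metric ranks results identically. A quick pure-Python sketch of that equivalence:

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

raw_a, raw_b = [3.0, 4.0], [4.0, 3.0]
a, b = l2_normalize(raw_a), l2_normalize(raw_b)

# Cosine similarity of the raw vectors...
cosine = dot(raw_a, raw_b) / (math.sqrt(dot(raw_a, raw_a)) * math.sqrt(dot(raw_b, raw_b)))
# ...equals the plain dot product of the normalized ones.
print(cosine, dot(a, b))  # both 0.96
```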