BAAI / bge-base-en-v1.5
Milvus Integrated
Task: Embedding
Modality: Text
Similarity Metric: Any (Normalized)
License: Apache 2.0
Dimensions: 768
Max Input Tokens: 512
Price: Free
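"Any (Normalized)" means the model's output vectors are L2-normalized, so common similarity metrics agree: cosine similarity equals the inner product, and squared Euclidean distance is a monotone function of both. A minimal numpy sketch (with illustrative unit vectors, not real model output):

```python
import numpy as np

# Two example unit-length vectors, standing in for normalized embeddings
a = np.array([0.6, 0.8])
b = np.array([0.8, 0.6])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
inner = np.dot(a, b)
l2 = np.linalg.norm(a - b)

# For unit-length vectors: cosine similarity == inner product,
# and squared L2 distance == 2 - 2 * cosine
assert np.isclose(cosine, inner)
assert np.isclose(l2**2, 2 - 2 * cosine)
```

This is why you can pick whichever metric (IP, COSINE, or L2) your index supports and get the same ranking.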
Introduction to bge-base-en-v1.5
bge-base-en-v1.5 is a BAAI General Embedding (BGE) model that transforms any given English text into a compact 768-dimensional vector. Compare bge-base-en-v1.5 with other popular BGE models:
Model | Dimensions | Max Tokens | MTEB avg |
---|---|---|---|
bge-large-en-v1.5 | 1024 | 512 | 64.23 |
bge-large-en | 1024 | 512 | 63.98 |
bge-base-en-v1.5 | 768 | 512 | 63.55 |
bge-base-en | 768 | 512 | 63.36 |
bge-small-en-v1.5 | 384 | 512 | 62.17 |
bge-small-en | 384 | 512 | 62.11 |
How to create embeddings with bge-base-en-v1.5
There are two primary ways to create vector embeddings with this model:
- PyMilvus: the Python SDK for Milvus, which integrates bge-base-en-v1.5 through its model subpackage.
- FlagEmbedding: the official Python library offered by BAAI.
Both methods let developers easily incorporate advanced text embedding capabilities into their applications.
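Both libraries can be installed from PyPI (package names as of this writing; the `model` extra pulls in the embedding-model dependencies for PyMilvus):

```shell
pip install "pymilvus[model]" FlagEmbedding
```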
Once the vector embeddings are generated, they can be stored in Zilliz Cloud (a fully managed vector database service powered by Milvus) and used for semantic similarity search. Here are four key steps:
- Sign up for a Zilliz Cloud account for free.
- Set up a serverless cluster and obtain the Public Endpoint and API Key.
- Create a vector collection and insert your vector embeddings.
- Run a semantic search on the stored embeddings.
Generate vector embeddings via PyMilvus and insert them into Zilliz Cloud for semantic search
```python
from pymilvus import model, MilvusClient

# Load bge-base-en-v1.5 through PyMilvus's SentenceTransformers backend
ef = model.dense.SentenceTransformerEmbeddingFunction(
    model_name="BAAI/bge-base-en-v1.5",
    device="cpu",
    query_instruction="Represent this sentence for searching relevant passages:",
)

# Generate embeddings for documents
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
docs_embeddings = ef.encode_documents(docs)

# Generate embeddings for queries (the query instruction is prepended automatically)
queries = [
    "When was artificial intelligence founded?",
    "Where was Alan Turing born?",
]
query_embeddings = ef.encode_queries(queries)

# Connect to Zilliz Cloud with your Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,  # 768 for bge-base-en-v1.5
    auto_id=True,
)

# Insert the documents together with their embeddings
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run a semantic search over the stored embeddings
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"],
)
```
For more information, refer to our PyMilvus Embedding Model documentation.
Generate vector embeddings via FlagEmbedding Python library and insert them into Zilliz Cloud for semantic search
```python
from FlagEmbedding import FlagModel
from pymilvus import MilvusClient

# Load bge-base-en-v1.5 via BAAI's official FlagEmbedding library
model = FlagModel(
    "BAAI/bge-base-en-v1.5",
    query_instruction_for_retrieval="Represent this sentence for searching relevant passages:",
    use_fp16=False,
)

# Generate embeddings for documents
docs = [
    "Artificial intelligence was founded as an academic discipline in 1956.",
    "Alan Turing was the first person to conduct substantial research in AI.",
    "Born in Maida Vale, London, Turing was raised in southern England.",
]
docs_embeddings = model.encode(docs)

# Generate embeddings for queries (the retrieval instruction is prepended automatically)
queries = [
    "When was artificial intelligence founded?",
    "Where was Alan Turing born?",
]
query_embeddings = model.encode_queries(queries)

# Connect to Zilliz Cloud with your Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=768,  # output dimension of bge-base-en-v1.5
    auto_id=True,
)

# Insert the documents together with their embeddings
for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})

# Run a semantic search over the stored embeddings
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"],
)
```
For more information, refer to the model page on Hugging Face.