The guide to jina-embeddings-v2-base-zh

Jina AI / jina-embeddings-v2-base-zh

AI Model Milvus Integrated

Task: Embedding

Modality: Text

Similarity Metric: Any (Normalized)

License: Apache 2.0

Dimensions: 768

Max Input Tokens: 8192

Price: Free
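
The "Any (Normalized)" similarity metric listed above can be read as follows: once vectors are scaled to unit length, inner product, cosine similarity, and L2 distance all produce the same ranking. The sketch below illustrates this with toy 3-dimensional vectors standing in for real 768-dimensional embeddings; the vectors and document names are made up for illustration only.

```python
import math

def normalize(v):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return inner_product(a, b) / math.sqrt(inner_product(a, a) * inner_product(b, b))

def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical toy vectors; real embeddings from this model have 768 dimensions.
query = normalize([0.2, 0.9, 0.1])
candidates = {
    "doc_a": normalize([0.1, 0.8, 0.3]),
    "doc_b": normalize([0.9, 0.1, 0.2]),
    "doc_c": normalize([0.3, 0.7, 0.0]),
}

# Higher is better for inner product and cosine; lower is better for L2.
by_ip = sorted(candidates, key=lambda k: inner_product(query, candidates[k]), reverse=True)
by_cosine = sorted(candidates, key=lambda k: cosine(query, candidates[k]), reverse=True)
by_l2 = sorted(candidates, key=lambda k: l2_distance(query, candidates[k]))

print(by_ip, by_cosine, by_l2)
```

For unit vectors the squared L2 distance equals 2 minus twice the inner product, which is why the three orderings agree.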

Introduction to Jina Embedding v2 Models

Jina Embeddings v2 models are designed to handle long documents, with a maximum input size of 8,192 tokens. As of October 2024, the Jina Embeddings v2 family has several variants, each catering to different embedding needs.

What is jina-embeddings-v2-base-zh

jina-embeddings-v2-base-zh is a bilingual (Chinese/English) text embedding model that can process up to 8,192 tokens per sequence. It is built on JinaBERT, a specialized BERT architecture, and supports both monolingual and cross-lingual applications.

Comparing jina-embeddings-v2-base-zh with other Jina embedding models

Model	Parameter Size	Embedding Dimension	Supported Text
jina-embeddings-v3	570M	Flexible (default: 1024)	Multilingual text embeddings; supports 94 languages in total
jina-embeddings-v2-small-en	33M	512	English monolingual embeddings
jina-embeddings-v2-base-en	137M	768	English monolingual embeddings
jina-embeddings-v2-base-zh	161M	768	Chinese-English bilingual embeddings
jina-embeddings-v2-base-de	161M	768	German-English bilingual embeddings
jina-embeddings-v2-base-code	161M	768	English and programming languages
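
The embedding dimension in the table directly affects storage. As a rough sketch, the arithmetic below estimates the raw vector size for three of the dimensions listed, assuming each value is stored as a 32-bit float and ignoring index and metadata overhead. The corpus size of one million vectors is a hypothetical figure chosen for illustration.

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as 32-bit floats

def vector_storage_bytes(dimension, num_vectors, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw size of the vector data only, ignoring index and metadata overhead."""
    return dimension * num_vectors * bytes_per_value

NUM_VECTORS = 1_000_000  # hypothetical corpus size

models = [
    ("jina-embeddings-v2-small-en", 512),
    ("jina-embeddings-v2-base-zh", 768),
    ("jina-embeddings-v3 (default)", 1024),
]

for name, dim in models:
    size_gib = vector_storage_bytes(dim, NUM_VECTORS) / 1024**3
    print(f"{name}: {dim} dims, about {size_gib:.2f} GiB for {NUM_VECTORS:,} vectors")
```

A single 768-dimensional float32 vector takes 768 × 4 = 3,072 bytes under this assumption.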

How to create embeddings using jina-embeddings-v2-base-zh

There are two primary ways to generate vector embeddings:

PyMilvus: the Python SDK for Milvus that seamlessly integrates the jina-embeddings-v2-base-zh model.
SentenceTransformer library: the sentence-transformers Python library.

Once the vector embeddings are created, they can be stored in a vector database like Zilliz Cloud (a fully managed vector database powered by Milvus) and used for semantic similarity search.

Here are four key steps:

Sign up for a Zilliz Cloud account for free.
Set up a serverless cluster and obtain the Public Endpoint and API Key.
Create a vector collection and insert your vector embeddings.
Run a semantic search on the stored embeddings.

Create embeddings via PyMilvus and insert them into Zilliz Cloud for semantic search

from pymilvus.model.dense import SentenceTransformerEmbeddingFunction
from pymilvus import MilvusClient

ef = SentenceTransformerEmbeddingFunction("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

docs = [
   "人工智能于1956年作为一门学术学科成立。",
   "艾伦·图灵是第一位在人工智能领域进行实质性研究的人。",
   "图灵出生于伦敦的梅达韦尔，在英格兰南部长大。"
]
# Generate embeddings for documents
docs_embeddings = ef(docs)

queries = ["人工智能是什么时候创立的？",
          "艾伦·图灵出生在哪里？"]
# Generate embeddings for queries
query_embeddings = ef(queries)

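# Connect to Zilliz Cloud. Replace the placeholder values below with your
# cluster's Public Endpoint and API Key.
ZILLIZ_PUBLIC_ENDPOINT = "YOUR_PUBLIC_ENDPOINT"
ZILLIZ_API_KEY = "YOUR_API_KEY"
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY)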

COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=ef.dim,
    auto_id=True)

for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})
    
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"])

For details, refer to our PyMilvus Embedding Model documentation.
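
The insert loop above sends one request per document. For larger corpora, the rows could instead be collected and inserted in a single call, since MilvusClient.insert accepts either one dict or a list of dicts. The sketch below shows only the row-building step; build_rows is a hypothetical helper (not part of PyMilvus), and the toy 3-dimensional vectors stand in for real 768-dimensional embeddings.

```python
def build_rows(docs, embeddings):
    """Pair each document with its embedding in the row format used above."""
    if len(docs) != len(embeddings):
        raise ValueError("docs and embeddings must have the same length")
    return [
        # Convert each value to a plain float so NumPy arrays and lists both work.
        {"text": doc, "vector": [float(x) for x in embedding]}
        for doc, embedding in zip(docs, embeddings)
    ]

# Hypothetical sample data for illustration only.
sample_docs = ["first document", "second document"]
sample_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

rows = build_rows(sample_docs, sample_embeddings)
print(rows)

# With a live connection, the whole batch could then be sent in one call:
# client.insert(collection_name=COLLECTION, data=rows)
```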

Create embeddings via the SentenceTransformer library and insert them into Zilliz Cloud for semantic search

from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-zh", trust_remote_code=True)

docs = [
   "人工智能于1956年作为一门学术学科成立。",
   "艾伦·图灵是第一位在人工智能领域进行实质性研究的人。",
   "图灵出生于伦敦的梅达韦尔，在英格兰南部长大。"
]
# Generate embeddings for documents
docs_embeddings = model.encode(docs, normalize_embeddings=True)

queries = ["人工智能是什么时候创立的？",
          "艾伦·图灵出生在哪里？"]
# Generate embeddings for queries
query_embeddings = model.encode(queries, normalize_embeddings=True)

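# Connect to Zilliz Cloud. Replace the placeholder values below with your
# cluster's Public Endpoint and API Key.
ZILLIZ_PUBLIC_ENDPOINT = "YOUR_PUBLIC_ENDPOINT"
ZILLIZ_API_KEY = "YOUR_API_KEY"
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY)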

COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=768,  # jina-embeddings-v2-base-zh produces 768-dimensional vectors
    auto_id=True)

for doc, embedding in zip(docs, docs_embeddings):
    client.insert(COLLECTION, {"text": doc, "vector": embedding})
    
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["text"])
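
Both snippets above end with a results object but do not show how to read it. The sketch below assumes the shape commonly returned by MilvusClient.search: one list of hits per query, where each hit is dict-like with "distance" and "entity" keys. That shape may vary across PyMilvus versions, so treat it as an assumption. summarize_hits is a hypothetical helper, and the sample results are made-up values rather than real search output.

```python
def summarize_hits(queries, results):
    """Return one readable line per query and per hit."""
    lines = []
    for query, hits in zip(queries, results):
        lines.append(f"Query: {query}")
        for hit in hits:
            # "entity" holds the fields requested through output_fields.
            lines.append(f"  {hit['distance']:.4f}  {hit['entity']['text']}")
    return lines

# Made-up results mimicking the assumed return shape; not real search output.
sample_queries = ["人工智能是什么时候创立的？"]
sample_results = [[
    {"id": 1, "distance": 0.8123, "entity": {"text": "人工智能于1956年作为一门学术学科成立。"}},
    {"id": 2, "distance": 0.4310, "entity": {"text": "艾伦·图灵是第一位在人工智能领域进行实质性研究的人。"}},
]]

for line in summarize_hits(sample_queries, sample_results):
    print(line)
```

How to interpret the distance value depends on the collection's metric type: for similarity metrics such as cosine or inner product a higher value means a closer match, while for L2 a lower value does.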