OpenAI / clip-vit-base-patch32
Zilliz Cloud Integrated
Task: Embedding
Modality: Multimodal
Similarity Metric: Any (Normalized)
License: Apache 2.0
Dimensions: 512
Max Input Tokens: 77
Price: Free
Introduction to clip-vit-base-patch32
The CLIP model, developed by OpenAI, was built to study what contributes to robustness in computer vision tasks and to test how well models generalize to new image classification tasks without task-specific training. The clip-vit-base-patch32 variant uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The two encoders are trained with a contrastive loss to maximize the similarity of matching (image, text) pairs, so the model learns to associate images with their corresponding textual descriptions.
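As a quick illustration of this contrastive setup, here is a minimal sketch using the Hugging Face transformers library (the local file example.jpg is a hypothetical placeholder): the image and two candidate captions are encoded into the shared embedding space, and the caption most similar to the image receives the highest probability.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained CLIP model and its processor from Hugging Face
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image file
captions = ["a photo of a cat", "a photo of a dog"]

# Encode the image and both captions, then compare them in the shared embedding space
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))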
How to create multimodal embeddings with clip-vit-base-patch32
There are two primary ways to generate vector embeddings:
- Zilliz Cloud Pipelines: a built-in feature of Zilliz Cloud (the managed Milvus) that integrates the clip-vit-base-patch32 model and provides an out-of-the-box way to create and retrieve text and image vector embeddings.
- SentenceTransformers: the open-source sentence_transformers Python library, which exposes this model under the name clip-ViT-B-32 (the prerequisite installs are sketched after this list).
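Assuming a standard Python environment, the SentenceTransformers route needs only a few packages, matching the imports in the example further down; a minimal sketch of the installs:
# Assumed prerequisites for the SentenceTransformers example below
# (install from PyPI, for example with pip):
#   pip install sentence-transformers pymilvus pillow requests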
Once the vector embeddings are generated, they can be stored in Zilliz Cloud (a fully managed vector database service powered by Milvus) and used for semantic similarity search. Here are four key steps:
- Sign up for a Zilliz Cloud account for free.
- Set up a serverless cluster and obtain the Public Endpoint and API Key.
- Create a vector collection and insert your vector embeddings.
- Run a semantic search on the stored embeddings.
Generate vector embeddings via Zilliz Cloud Pipelines and perform a similarity search
Refer to the Zilliz Cloud Pipelines documentation for step-by-step instructions.
Generate vector embeddings via SentenceTransformer and insert them into Zilliz Cloud for similarity search
from PIL import Image
from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient
import requests

# Load the CLIP model (the sentence-transformers name for clip-vit-base-patch32)
model = SentenceTransformer('clip-ViT-B-32')

# Generate image embeddings
image_urls = [
    "https://raw.githubusercontent.com/milvus-io/milvus-docs/v2.4.x/assets/milvus_logo.png",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
image_embeddings = model.encode(images)

# Generate text embeddings
queries = ["blue logo"]
query_embeddings = model.encode(queries)

# Connect to Zilliz Cloud with your Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

# Recreate the collection; clip-vit-base-patch32 produces 512-dimensional vectors
COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=512,
    auto_id=True,
)

# Insert each image embedding together with its source URL
for image_url, embedding in zip(image_urls, image_embeddings):
    client.insert(COLLECTION, {"url": image_url, "vector": embedding})

# Search the stored image embeddings with the text query embeddings
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["url"],
)
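client.search returns one list of hits per query vector; the short sketch below (using the url field inserted above) prints each match and, since the model card lists the similarity metric as "Any (Normalized)", optionally re-encodes with normalize_embeddings=True so inner-product and cosine similarity rank results identically.
# Print each hit returned for the text queries
for query, hits in zip(queries, results):
    for hit in hits:
        print(query, hit["distance"], hit["entity"].get("url"))

# Optional: L2-normalize embeddings at encode time so IP and cosine
# similarity produce the same ranking ("Any (Normalized)" above)
image_embeddings = model.encode(images, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)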
For more information, refer to the model page on Hugging Face.