OpenAI / clip-vit-base-patch32
Zilliz Cloud Integrated
Task: Embedding
Modality: Multimodal
Similarity Metric: Any (Normalized)
License: Apache 2.0
Dimensions: 512
Max Input Tokens: 77
Price: Free
Introduction to clip-vit-base-patch32
The CLIP model, developed by OpenAI, was built to study what contributes to robustness in computer vision tasks and to test how well models generalize to new image classification tasks without task-specific training. The clip-vit-base-patch32 variant uses a ViT-B/32 Transformer as the image encoder and a masked self-attention Transformer as the text encoder. The two encoders are trained with a contrastive loss to maximize the similarity of matching (image, text) pairs, so the model learns to associate images with their corresponding textual descriptions.
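As a quick illustration of this contrastive setup, here is a minimal sketch using the Hugging Face transformers library (the local file example.jpg is a hypothetical placeholder): the image and two candidate captions are encoded into the shared embedding space, and the caption most similar to the image receives the highest probability.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

# Load the pretrained CLIP model and its processor from Hugging Face
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image file
captions = ["a photo of a cat", "a photo of a dog"]

# Encode the image and both captions, then compare them in the shared embedding space
inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(captions, probs[0].tolist())))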
How to create multimodal embeddings with clip-vit-base-patch32
There are two primary ways to generate vector embeddings:
- Zilliz Cloud Pipelines: a built-in feature of Zilliz Cloud (the managed Milvus) that integrates the clip-vit-base-patch32 model and provides an out-of-the-box way to create and retrieve text and image vector embeddings.
- SentenceTransformers: the open-source sentence_transformers Python library, which exposes this model under the name clip-ViT-B-32 (the prerequisite installs are sketched after this list).
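Assuming a standard Python environment, the SentenceTransformers route needs only a few packages, matching the imports in the example further down; a minimal sketch of the installs:
# Assumed prerequisites for the SentenceTransformers example below
# (install from PyPI, for example with pip):
#   pip install sentence-transformers pymilvus pillow requests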
Once the vector embeddings are generated, they can be stored in Zilliz Cloud (a fully managed vector database service powered by Milvus) and used for semantic similarity search. Here are four key steps:
- Sign up for a Zilliz Cloud account for free.
- Set up a serverless cluster and obtain the Public Endpoint and API Key.
- Create a vector collection and insert your vector embeddings.
- Run a semantic search on the stored embeddings.
Generate vector embeddings via Zilliz Cloud Pipelines and perform a similarity search
Refer to the Zilliz Cloud Pipelines documentation for step-by-step instructions.
Generate vector embeddings via SentenceTransformer and insert them into Zilliz Cloud for similarity search
from PIL import Image
from sentence_transformers import SentenceTransformer
from pymilvus import MilvusClient
import requests

# Load the CLIP model (the sentence-transformers name for clip-vit-base-patch32)
model = SentenceTransformer('clip-ViT-B-32')

# Generate image embeddings
image_urls = [
    "https://raw.githubusercontent.com/milvus-io/milvus-docs/v2.4.x/assets/milvus_logo.png",
]
images = [Image.open(requests.get(url, stream=True).raw) for url in image_urls]
image_embeddings = model.encode(images)

# Generate text embeddings
queries = ["blue logo"]
query_embeddings = model.encode(queries)

# Connect to Zilliz Cloud with your Public Endpoint and API Key
client = MilvusClient(
    uri=ZILLIZ_PUBLIC_ENDPOINT,
    token=ZILLIZ_API_KEY,
)

# Recreate the collection; clip-vit-base-patch32 produces 512-dimensional vectors
COLLECTION = "documents"
if client.has_collection(collection_name=COLLECTION):
    client.drop_collection(collection_name=COLLECTION)
client.create_collection(
    collection_name=COLLECTION,
    dimension=512,
    auto_id=True,
)

# Insert each image embedding together with its source URL
for image_url, embedding in zip(image_urls, image_embeddings):
    client.insert(COLLECTION, {"url": image_url, "vector": embedding})

# Search the stored image embeddings with the text query embeddings
results = client.search(
    collection_name=COLLECTION,
    data=query_embeddings,
    consistency_level="Strong",
    output_fields=["url"],
)
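client.search returns one list of hits per query vector; the short sketch below (using the url field inserted above) prints each match and, since the model card lists the similarity metric as "Any (Normalized)", optionally re-encodes with normalize_embeddings=True so inner-product and cosine similarity rank results identically.
# Print each hit returned for the text queries
for query, hits in zip(queries, results):
    for hit in hits:
        print(query, hit["distance"], hit["entity"].get("url"))

# Optional: L2-normalize embeddings at encode time so IP and cosine
# similarity produce the same ranking ("Any (Normalized)" above)
image_embeddings = model.encode(images, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)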
For more information, refer to the model page on Hugging Face.