Exploring OpenAI CLIP: The Future of Multi-Modal AI Learning
Read the entire series
- OpenAI's ChatGPT
- Unlocking the Secrets of GPT-4.0 and Large Language Models
- Top LLMs of 2024: Only the Worthy
- Large Language Models and Search
- Falcon 180B: Advancing Language Models in the AI Frontier
- OpenAI Whisper: Transforming Speech-to-Text with Advanced AI
- Exploring OpenAI CLIP: The Future of Multi-Modal AI Learning
- What are Private LLMs? Running Large Language Models Privately - privateGPT and Beyond
Artificial intelligence (AI) is undergoing a dramatic shift from traditional single-modality approaches to a new paradigm: multimodal AI learning. Multimodal systems can take input and understand information from several modalities at once, much as humans do. Text, images, and audio can be processed together, leading to a deeper and more refined understanding of the world.
At the forefront of this multimodal learning revolution stands OpenAI's CLIP (Contrastive Language-Image Pre-training), a pioneering model that connects text and image data. CLIP has propelled AI learning to new heights, expanding the horizons of our understanding by giving models both 'eyes' and a grasp of language. Its advancements inspire us to envision a future where AI can truly comprehend the world.
This article will explore CLIP's inner workings and pioneering potential in multimodal learning.
What is OpenAI CLIP?
OpenAI introduced CLIP in 2021. The model focuses on learning visual concepts through natural language supervision. Computer vision models are typically trained with supervised learning, so the labeled training data constrains their performance. OpenAI's team discovered that pre-training a model on raw text descriptions of images enables it to excel across a broader range of vision tasks directly out of the box.
CLIP stands for Contrastive Language-Image Pre-training. It was pre-trained on 400 million text-image pairs collected from the internet. Rather than predicting fixed labels, CLIP jointly trains a text encoder and an image encoder that map both modalities into a shared embedding space, contrasting matching image-text pairs against mismatched ones. This approach marked a pioneering step toward the sophisticated multimodal systems we see today, such as LLaVA, GPT-4 Vision, and others.
The extensive pre-training on text-image pairs and the integration of text and image modalities in a single model enable CLIP to recognize unseen labels: it can draw connections between images and text it has never encountered before, which lets it perform impressively in diverse scenarios.
Figure: Contrastive pre-training
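To make the objective concrete, below is a minimal PyTorch-style sketch of the symmetric contrastive loss described above. It is an illustration rather than CLIP's exact implementation: the encoders are omitted, and the fixed `temperature` stands in for the learned temperature parameter CLIP actually uses.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, temperature=0.07):
    """Sketch of CLIP's symmetric contrastive objective.

    image_features, text_features: [batch, dim] embeddings produced by
    the image and text encoders for matching image-text pairs.
    """
    # Project both modalities onto the unit sphere so that dot products
    # become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity matrix: entry (i, j) scores image i against text j
    logits = image_features @ text_features.t() / temperature

    # The matching pairs sit on the diagonal
    targets = torch.arange(logits.shape[0])

    # Cross-entropy in both directions: each image must pick its text,
    # and each text must pick its image
    loss_images = F.cross_entropy(logits, targets)
    loss_texts = F.cross_entropy(logits.t(), targets)
    return (loss_images + loss_texts) / 2
```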
Zero-Shot Learning with CLIP
Being zero-shot means CLIP can recognize new objects and draw new connections without any training on that specific example. Popular image generation models like DALL-E and Stable Diffusion incorporate CLIP to encode image-text understanding in their architectures.
Figure: Image generation model architecture
Older state-of-the-art (SOTA) image classification models were limited to the datasets they were trained on; for instance, an ImageNet model can only classify the 1,000 classes it was trained on, with no zero-shot capability beyond them.
If one wanted to perform any other vision task, they had to attach a new head to the ImageNet model, curate a labeled dataset, and fine-tune the model. CLIP, by contrast, can be used off the shelf for a variety of vision tasks without any fine-tuning or labeled data.
CLIP is versatile enough to handle numerous visual classification tasks without requiring extra training data. To apply CLIP to a new task, one simply describes the task's visual concepts to its text encoder, and CLIP produces a linear classifier based on its visual representations. Remarkably, the accuracy of this classifier frequently rivals that of fully supervised models.
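As a minimal sketch of that idea using the Hugging Face transformers API: the normalized text embeddings of prompted class names act as the weight matrix of a linear classifier over image embeddings. The class names, the "a photo of a {label}" template (recommended in the CLIP paper), and the local image path are illustrative; a full inference walkthrough follows later in this article.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Wrap illustrative class names in the CLIP paper's prompt template
labels = ["beagle", "siamese cat", "golden retriever"]
prompts = [f"a photo of a {label}" for label in labels]

image = Image.open("pet.jpg")  # hypothetical local image

with torch.no_grad():
    # The normalized text embeddings become the classifier's weights
    text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
    weights = model.get_text_features(**text_inputs)
    weights = weights / weights.norm(dim=-1, keepdim=True)

    # Embed the image with the vision encoder and normalize likewise
    image_inputs = processor(images=image, return_tensors="pt")
    feats = model.get_image_features(**image_inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)

    # Scale cosine similarities by CLIP's learned temperature, then softmax
    probs = (model.logit_scale.exp() * feats @ weights.t()).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```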
Unsplash uses CLIP to label its images. Below is a demo of CLIP's remarkable zero-shot prediction capability on a few random samples from different datasets.
Figure: Image labeling
Efficient indexing of CLIP Embeddings
If we use CLIP for zero-shot tasks over many target classes or large collections of images, doing so manually could exhaust compute resources and time. Instead, we can efficiently index the CLIP embeddings in a vector database. For instance, consider labeling an extensive collection of Unsplash images by category: we can store the computed image embeddings in an efficient vector store like Zilliz, then retrieve the top-k most similar image vectors for each category label.
Using a vector store for massive zero-shot image labeling is just one of many use cases where vector stores like Zilliz help us fully leverage the potential of strong multimodal models like CLIP. The same combination extends to semantic search, unsupervised data exploration, and more, as sketched below.
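As a minimal sketch of this workflow, the snippet below indexes embeddings with pymilvus (Milvus is the open-source vector database created by Zilliz, and Milvus Lite runs from a local file in recent pymilvus versions). The collection name and the random stand-in vectors are illustrative; in practice the vectors would come from CLIP's image and text encoders.

```python
import numpy as np
from pymilvus import MilvusClient

# Milvus Lite: a local, file-backed instance suitable for experimentation
client = MilvusClient("clip_demo.db")

# CLIP ViT-B/32 produces 512-dimensional embeddings
client.create_collection(collection_name="image_search", dimension=512)

# Stand-in vectors; in practice these come from
# model.get_image_features(...) for each image in the collection
image_embeddings = np.random.rand(1000, 512).tolist()
client.insert(
    collection_name="image_search",
    data=[{"id": i, "vector": v} for i, v in enumerate(image_embeddings)],
)

# Query with a CLIP *text* embedding (another stand-in here) to retrieve
# the top-5 images most similar to a category label
label_embedding = np.random.rand(512).tolist()
hits = client.search(collection_name="image_search", data=[label_embedding], limit=5)
print(hits[0])
```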
Implementing CLIP: A guide
Let’s perform zero-shot image classification using a pre-trained CLIP model from Hugging Face. The HF Hub hosts quite a few pre-trained CLIP variants. We will use the openai/clip-vit-base-patch32 model and perform image classification with the transformers library on a few samples from the MS-COCO dataset.
Step 1. Load the OpenAI CLIP ViT-Base model using the transformers library from Hugging Face:
```python
from transformers import CLIPProcessor, CLIPModel

# Load the pre-trained CLIP model and its accompanying processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
```
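Optionally, if a GPU is available, the model can be moved there before inference. This is a standard PyTorch step rather than anything CLIP-specific:

```python
import torch

# Use the GPU when present; remember to move input tensors with .to(device) too
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
```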
Step 2. Now retrieve a couple of images from the COCO dataset:
```python
from PIL import Image
import requests

# Download two sample images from the COCO val2014 split
image_1 = Image.open(requests.get("http://images.cocodataset.org/val2014/COCO_val2014_000000159977.jpg", stream=True).raw)
image_2 = Image.open(requests.get("http://images.cocodataset.org/val2014/COCO_val2014_000000555472.jpg", stream=True).raw)
images = [image_1, image_2]
```
Step 3. Visualize the downloaded images with the following snippet:
```python
import numpy as np
import cv2
from google.colab.patches import cv2_imshow  # Colab-friendly replacement for cv2.imshow

def visualize(image):
    image_arr = np.array(image)
    # PIL images are RGB, while OpenCV expects BGR, so swap the channels
    image_arr = cv2.cvtColor(image_arr, cv2.COLOR_RGB2BGR)
    cv2_imshow(image_arr)

for img in images:
    visualize(img)
```
Step 4. Now perform zero-shot inference on the images with CLIP:
```python
classes = ['giraffe', 'zebra', 'elephant']

# Encode the candidate labels and both images in a single batch
inputs = processor(text=classes, images=images, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Image-text similarity scores, converted to per-image class probabilities
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
```
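Before plotting, you can already read off the predicted label for each image by taking the argmax over its class probabilities; a small optional follow-up:

```python
# Map each image to its highest-probability class
for i, img_probs in enumerate(probs):
    best = img_probs.argmax().item()
    print(f"Image {i+1}: {classes[best]} ({img_probs[best].item():.2%})")
```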
Step 5. Finally, visualize the results for both images as bar charts of class probabilities; the following code displays each image alongside its graph:
```python
# Visualize results for both images using separate graphs
import matplotlib.pyplot as plt

# Set the font family and size
plt.rc('font', family='serif', size=12)

# Loop through both images and their probabilities
for i in range(2):
    # Create a new figure with two subplots
    plt.figure(figsize=(12, 4))
    ax1 = plt.subplot(1, 2, 1)
    ax2 = plt.subplot(1, 2, 2)

    # Display the image in the first subplot
    ax1.imshow(images[i])
    ax1.axis('off')
    ax1.set_title(f"Image {i+1}")

    # Create a bar plot of the probabilities in the second subplot
    ax2.bar(classes, probs[i].detach().numpy(), color="navy")
    ax2.set_xlabel('Class')
    ax2.set_ylabel('Probability')
    ax2.set_title(f"Probabilities for Image {i+1}")
    ax2.grid(True)

    # Show the plot
    plt.show()
```
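As a closing note on the implementation, recent versions of the transformers library also ship a zero-shot-image-classification pipeline that wraps all of the steps above into a few lines; a minimal sketch, assuming the same checkpoint:

```python
from transformers import pipeline

# The pipeline handles preprocessing, encoding, and softmax internally
classifier = pipeline("zero-shot-image-classification",
                      model="openai/clip-vit-base-patch32")

# Accepts a PIL image, a local path, or a URL
results = classifier(images[0], candidate_labels=["giraffe", "zebra", "elephant"])
print(results)  # list of {"score": ..., "label": ...} dicts, highest first
```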
Conclusion
OpenAI’s CLIP demonstrates the power of multimodal AI learning. By pioneering practical zero-shot learning and text-to-image retrieval, CLIP has changed how machines learn about and interact with the world.
CLIP’s contrastive pre-training on text-image pairs enables a wide range of image classification scenarios, and it can be adapted as needed rather than being restricted to a single task. These qualities make CLIP a highly promising tool across domains such as image search, medical diagnostics, and e-commerce, to name a few.
CLIP stands out from other models for its exciting ability to perform new tasks without task-specific training. This has driven exceptional, groundbreaking advances in AI, paving the way for the innovations we witness today, such as image generation models like the Stable Diffusion series.