From Pixels to Embeddings: How Video AI Represents Visual Data
Discover how video AI transforms raw footage into meaningful embeddings, enabling efficient scene search and action recognition. Explore the technology behind the magic.
Introduction
Imagine trying to find the exact scene in a movie where the hero jumps off a building, or scanning hours of surveillance footage to detect when a specific car enters a parking lot. Manually combing through frames would be a slow, tedious, and error-prone process. This is where video AI steps in. But how does it make sense of millions of raw pixels across time? The secret lies in video embeddings: compact, meaningful representations of video content that let machines search, analyze, and understand footage much the way humans do, only far faster.
Video embeddings are the foundation of modern video AI systems. These are numerical vectors that summarize the content of a video clip, capturing essential visual and temporal information. By converting video into embeddings, AI models can efficiently perform tasks such as scene search, action recognition, and content recommendation at scale. In essence, embeddings transform a sea of pixels into a structured space where similar videos cluster together, making analysis fast, flexible, and intelligent.
In this blog, we will explore how raw video data, comprising millions of pixels across space and time, is transformed into vector embeddings that power applications ranging from search engines to intelligent surveillance. We will break down the core concepts, walk through the underlying architectures (2D CNNs, 3D CNNs, Transformers), and demonstrate how these embeddings are stored and used in real-world systems. Whether you're building a video AI application or simply curious about what's happening behind the scenes, this guide will walk you through the journey from pixels to embeddings.
Basics of Visual Data Representation
Understanding the Challenge of Video
Video data is not just a series of pictures. It is a dense stream of high-resolution frames, typically 30, 60, or even 120 frames per second (fps), each filled with color values for every single pixel. One minute of 1080p video at 30 fps already contains 1,800 frames and more than 11 billion raw pixel values, and an hour runs past 100,000 frames. That is a massive amount of unstructured information.
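To put rough numbers on that claim, here is a quick back-of-the-envelope calculation (assuming 30 fps, 1080p, three color channels, and no compression):

fps, seconds = 30, 60
width, height, channels = 1920, 1080, 3

frames_per_minute = fps * seconds                                  # 1,800 frames
values_per_minute = frames_per_minute * width * height * channels  # raw RGB values
print(frames_per_minute, values_per_minute)                        # 1800 11197440000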
What makes video even more complex than image data is time. While an image captures a single moment, video captures change: objects moving, actions unfolding, scenes evolving. AI systems need to process both what is in the frame and how things evolve across frames, a challenge that requires more than traditional image-based analysis.
Figure 1: Video frame rate (fps) (Source)
How AI Converts Pixels to Features
Before a model can understand video, it needs to convert the pixel-level data, represented as arrays of RGB (Red, Green, Blue) color values, into a form it can reason with: numerical features that describe meaningful patterns in the video. This process is called feature extraction. Neural networks, especially the convolutional layers in CNNs, play a key role here. For example:
A filter (also called a kernel) might slide across an image and react strongly where it finds an edge, like the boundary between a person's face and the background.
These reactions are stored as numbers in a feature map, which highlights where certain patterns (edges, corners, textures) appear.
Stacking many layers allows the network to combine simple features into more abstract concepts, like detecting a face, a car, or a jump.
Figure 2: Pixels represented by colors in arrays of RGB (Source)
Later layers can learn even more complex representations, like motion patterns, by combining frame-level features over time. These extracted features are what get transformed into embeddings, compact vectors that summarize visual content in a meaningful and searchable way.
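As a small concrete illustration of both ideas, the sketch below treats a frame as an array of RGB values and runs a single hand-crafted vertical-edge filter over it. This is a minimal sketch with synthetic data; real CNNs learn thousands of filters from data rather than using hand-crafted kernels.

import torch
import torch.nn.functional as F

# One decoded 1080p frame as an array of RGB values: (channels, height, width)
frame = torch.rand(3, 1080, 1920)

# A single hand-crafted 3x3 filter that reacts to vertical edges
edge_kernel = torch.tensor([[-1., 0., 1.],
                            [-2., 0., 2.],
                            [-1., 0., 1.]])
weight = edge_kernel.repeat(3, 1, 1).unsqueeze(0)   # (1, 3, 3, 3): one filter over 3 input channels

feature_map = F.conv2d(frame.unsqueeze(0), weight, padding=1)
print(feature_map.shape)   # torch.Size([1, 1, 1080, 1920]); large values mark vertical edges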
2D Convolutional Neural Networks (CNNs) for Video Embeddings
CNNs for Image Embeddings
2D CNNs revolutionized image understanding. They work by scanning over an image with small filters that look for specific patterns such as edges, corners, textures, or shapes. This operation is called a convolution, and each filter is designed to activate in response to a specific visual feature.
As the image passes through the network:
Early layers detect simple patterns like vertical or horizontal edges.
Mid-level layers combine those features to detect object parts or textures.
Deeper layers recognize complex structures like faces, animals, or scene layouts.
Between convolutional layers, CNNs often include pooling layers, which downsample the feature maps. This helps reduce computational load and provides spatial invariance, allowing features to be recognized regardless of their exact position in the image.
Figure 3: Max-pooling layer, where the maximum intensity value is selected (Source)
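To see numerically what the figure shows, here is a minimal sketch of 2x2 max pooling applied to a tiny, hand-written feature map:

import torch
import torch.nn.functional as F

# A 4x4 feature map, with batch and channel dimensions added for the API
fmap = torch.tensor([[1., 3., 2., 0.],
                     [4., 6., 1., 2.],
                     [0., 2., 9., 5.],
                     [3., 1., 4., 7.]]).reshape(1, 1, 4, 4)

pooled = F.max_pool2d(fmap, kernel_size=2)   # keep the maximum of each 2x2 block
print(pooled.reshape(2, 2))
# tensor([[6., 2.],
#         [3., 9.]])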
After a series of convolution and pooling layers, the resulting multi-dimensional feature maps are flattened, that is, reshaped from a grid of values into a single 1D vector.
Figure 4: Flattening feature maps into a 1D vector (Source)
But flattening is just a reshaping operation, not yet the embedding. The flattened vector is then passed through one or more fully connected (dense) layers, which mix and interpret the features. The output of the final fully connected layer before classification (often called the penultimate layer) is what we call the embedding, a compact, high-dimensional representation that captures the most important visual information about the image.
Figure 5: Typical CNN architecture (Source)
Applying CNNs to Video Frames
To apply 2D CNNs to video, the simplest strategy is to treat each frame as an independent image. Here is how you can extract frame-level embeddings using ResNet-50 in PyTorch.
Extract a frame from the video.
Pass it through a pretrained CNN such as ResNet, VGG, or EfficientNet.
Capture the output embedding for that frame.
Repeat for multiple frames across the video.
import torch
import torchvision.models as models
import torchvision.transforms as transforms
from PIL import Image

# Load a pretrained ResNet-50 and switch it to inference mode
model = models.resnet50(pretrained=True)
model.eval()

# Standard preprocessing for torchvision's ImageNet-pretrained models
transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225])
])

def extract_frame_embedding(frame_path):
    img = Image.open(frame_path).convert("RGB")
    img_t = transform(img).unsqueeze(0)   # add a batch dimension
    with torch.no_grad():
        features = model(img_t)
    return features.squeeze().numpy()

# Output: a 1000-dimensional array (example values)
array([-1.25593758e+00, -1.37986898e+00, -1.34654438e+00, -8.93157005e-01,...,..., 2.03743115e-01, -3.74650449e-01, -4.72319269e+00, -1.80252278e+00])
This gives you frame-level embeddings, each summarizing the content of a single moment in time. You can also extract embeddings from intermediate layers if you want more general visual features.
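The example above uses the 1000-dimensional classification output as the frame embedding. If you prefer the penultimate-layer representation described earlier, one common option (a minimal sketch, reusing the model, transform, and Image import defined above) is to drop ResNet's final classifier and keep the 2048-dimensional pooled features:

import torch.nn as nn

# Reuse the pretrained ResNet-50 but stop before the final classifier, so the
# output is the 2048-dimensional pooled feature rather than 1000 class scores.
backbone = nn.Sequential(*list(model.children())[:-1])
backbone.eval()

def extract_penultimate_embedding(frame_path):
    img = Image.open(frame_path).convert("RGB")
    img_t = transform(img).unsqueeze(0)
    with torch.no_grad():
        features = backbone(img_t)                  # shape: (1, 2048, 1, 1)
    return features.flatten(1).squeeze(0).numpy()   # 2048-dimensional vector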
These embeddings can be used for tasks like scene classification or detecting specific objects in each frame. However, this approach has a limitation: 2D CNNs process spatial features only. They do not capture how things move or change from one frame to the next. That means:
A 2D CNN can recognize that someone is standing in frame 1 and on the ground in frame 10.
But it cannot understand that the person jumped, because it has no concept of time or motion between frames.
To understand actions, events, or motion, we need architectures that can model the temporal dimension, which leads us to 3D convolutions and transformers.
3D Convolutions for Temporal Embeddings
Introducing 3D Convolutions
While 2D convolutions are designed to capture spatial features (height and width), 3D convolutions extend this concept by adding a third dimension: time. This makes 3D convolutions especially useful for video analysis, as they can model not just the spatial structure of individual frames, but also the temporal relationships between frames.
In a 2D convolution, filters (kernels) move across the image in two directions: height and width. The resulting output captures the spatial features of the image. In a 3D convolution, however, the filter slides across three dimensions: height, width, and depth (where depth corresponds to time in the case of video). This allows the network to learn patterns that span both spatial and temporal domains, which is crucial for understanding motion, actions, and events in a video.
This is the key advantage of 3D convolutions: they can capture how objects move, interact, and change across multiple frames, enabling better recognition of dynamic events like actions and interactions.
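To see the shape arithmetic concretely, here is a minimal sketch (synthetic data, arbitrary layer sizes) of a single 3D convolution applied to a short clip:

import torch
import torch.nn as nn

# One clip of 8 RGB frames at 32x32: (batch, channels, time, height, width)
clip = torch.randn(1, 3, 8, 32, 32)

# A 3x3x3 kernel slides over time as well as height and width
conv3d = nn.Conv3d(in_channels=3, out_channels=16, kernel_size=(3, 3, 3), padding=1)
out = conv3d(clip)
print(out.shape)   # torch.Size([1, 16, 8, 32, 32]): each output value mixes 3 neighboring frames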
3D CNN Architectures for Video
Several 3D convolutional architectures have been developed specifically to work with video data. These architectures leverage 3D convolutions to learn both spatial and temporal features from video clips. Let's look at two prominent models:
C3D (3D Convolutional Network)
C3D is one of the pioneering architectures for video classification. It uses 3D convolutions across consecutive video frames to learn both spatial and temporal features. C3D employs several 3D convolution layers, each followed by pooling and activation functions, to capture motion and dynamic changes across time.
Input: A sequence of video frames (e.g., a 16-frame segment).
Output: A vector that represents the content of the video segment, including both spatial and motion-based features.
Advantage: Efficiently learns temporal information from short video clips, making it suitable for action recognition tasks.
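C3D itself does not ship with torchvision, but the input/output contract above is easy to demonstrate with any off-the-shelf 3D CNN. The sketch below uses torchvision's r3d_18 (a small 3D ResNet) as a stand-in, with its classification head removed so the output is a single clip-level vector; weights are left randomly initialized here for brevity, whereas in practice you would load Kinetics-pretrained weights.

import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

# A 3D CNN maps a short clip (B, C, T, H, W) to one vector per clip
video_model = r3d_18()            # stand-in for C3D; load pretrained weights in practice
video_model.fc = nn.Identity()    # drop the classifier to expose the 512-d clip feature
video_model.eval()

clip = torch.randn(1, 3, 16, 112, 112)   # one 16-frame RGB segment
with torch.no_grad():
    segment_embedding = video_model(clip)
print(segment_embedding.shape)           # torch.Size([1, 512])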
I3D (Inflated 3D ConvNet)
I3D, or Inflated 3D ConvNet, takes the concept of 3D convolutions further by inflating 2D convolution filters into 3D. This approach is based on the idea that pre-trained 2D CNN models (such as Inception-v1) can be adapted to process temporal data by simply inflating their 2D filters into 3D versions. This allows I3D to leverage large, pre-trained 2D models for spatial feature extraction, while adding temporal information through the 3D convolutions.
Input: A sequence of video frames, often of higher resolution or longer duration than C3D.
Output: A comprehensive representation that includes both complex spatial features and the temporal dynamics between frames.
Advantage: By using pre-trained 2D networks, I3D combines the benefits of large-scale image models and temporal analysis, which makes it particularly effective for large video datasets.
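The inflation step itself is simple to sketch. Conceptually (simplified here with a random stand-in weight rather than actual Inception-v1 filters), a pretrained 2D kernel is copied T times along a new time axis and rescaled, so that a static clip initially produces the same activations the 2D network would produce on a single frame:

import torch

# A 2D conv weight of shape (out_channels, in_channels, kH, kW),
# standing in for a pretrained 2D filter
w2d = torch.randn(64, 3, 7, 7)

T = 7                                              # temporal extent after inflation
w3d = w2d.unsqueeze(2).repeat(1, 1, T, 1, 1) / T   # copy along time, rescale by 1/T
print(w3d.shape)                                   # torch.Size([64, 3, 7, 7, 7])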
Figure 6: 3D CNN architecture comparison (Source)
Implementing a Basic 3D CNN in PyTorch
Let’s implement a simple 3D CNN that not only classifies video clips but also exposes intermediate representations of spatial and temporal features. The model below consists of:
Two 3D convolutional layers with ReLU activations and max pooling
Three fully connected layers for final classification
Return values that give us access to intermediate and final representations
For video data, we use a 5D tensor with dimensions:
Batch size: Number of video clips (8)
Channels: 3 for RGB video
Depth: Number of frames in each video clip (16)
Height: Frame height (64)
Width: Frame width (64)
import torch
import torch.nn as nn

class Simple3DCNNWithIntermediateOutputs(nn.Module):
    def __init__(self):
        super(Simple3DCNNWithIntermediateOutputs, self).__init__()
        # Define the 3D convolutional layers
        self.conv1 = nn.Conv3d(in_channels=3, out_channels=64,
                               kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
        self.pool1 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
        self.conv2 = nn.Conv3d(in_channels=64, out_channels=128,
                               kernel_size=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1))
        self.pool2 = nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2))
        self.fc1 = nn.Linear(128 * 16 * 16 * 16, 1024)  # Flattened size
        self.fc2 = nn.Linear(1024, 256)
        # Output layer for classification (10 classes)
        self.fc3 = nn.Linear(256, 10)

    def forward(self, x):
        # Apply Conv1, ReLU, and MaxPooling
        x1 = self.pool1(torch.relu(self.conv1(x)))
        # Apply Conv2, ReLU, and MaxPooling
        x2 = self.pool2(torch.relu(self.conv2(x1)))
        # Flatten the tensor for the fully connected layers,
        # keeping the batch dimension separate
        batch_size = x2.size(0)
        x_flattened = x2.view(batch_size, -1)
        # Fully connected layers
        x_fc1 = torch.relu(self.fc1(x_flattened))
        x_fc2 = torch.relu(self.fc2(x_fc1))
        x_final = self.fc3(x_fc2)  # Final output
        # Return the final output and intermediate features
        return x_final, x1, x2

# Create sample video data: 8 videos with 16 frames each (64×64 resolution)
input_data = torch.randn(8, 3, 16, 64, 64)

# Instantiate the model
model_with_intermediate = Simple3DCNNWithIntermediateOutputs()

# Pass the input data through the model
output, frame_features, temporal_features = model_with_intermediate(input_data)

# Print shapes of intermediate and final outputs
print("Output shape:", output.shape)                         # Final classification output
print("Frame-level features shape:", frame_features.shape)   # After first 3D conv block
print("Temporal features shape:", temporal_features.shape)   # After second 3D conv block
# Output
Output shape: torch.Size([8, 10])
Frame-level features shape: torch.Size([8, 64, 16, 32, 32])
Temporal features shape: torch.Size([8, 128, 16, 16, 16])
The key insight of our architecture is that it allows us to access different levels of feature abstraction:
- Early Spatiotemporal Features (After First 3D Conv): These features, with shape (8, 64, 16, 32, 32), represent a blend of spatial and early temporal information:
The model has applied 3×3×3 convolutions, which means each feature has already integrated information from 3 consecutive frames.
These features capture basic spatiotemporal patterns like moving edges, color transitions, and simple motions.
The pooling operation (1, 2, 2) reduces spatial dimensions while preserving temporal resolution, allowing the model to retain frame-by-frame information.
- Advanced Temporal Context (After Second 3D Conv): The second convolution layer builds upon the first, creating more complex representations with shape (8, 128, 16, 16, 16):
By operating on already temporally-aware features from the first layer, this layer captures longer-range dependencies between frames.
The second 3×3×3 convolution effectively sees a total of 5 original frames (expanding the temporal receptive field).
These features can encode more complex motion patterns like acceleration, direction changes, and action sequences.
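To turn either of these intermediate feature maps into a fixed-length clip embedding that can be indexed and compared, one simple option (a minimal sketch using the temporal_features tensor from the example above) is global average pooling over the time and spatial axes:

# Average over time, height, and width: (8, 128, 16, 16, 16) -> (8, 128)
clip_embeddings = temporal_features.mean(dim=[2, 3, 4])
print(clip_embeddings.shape)   # torch.Size([8, 128]): one 128-d vector per clip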
Transformers for Video Embeddings
While CNNs and 3D CNNs have proven powerful for video understanding, transformer-based architectures are now redefining what's possible in spatiotemporal modeling. Originally developed for natural language processing (NLP), transformers have found success in vision tasks, and more recently, in video analysis.
How Transformers Revolutionized Video Analysis
Transformers have transformed the landscape of deep learning by introducing self-attention mechanisms that can model long-range dependencies in data. Their key strength lies in the ability to attend globally, meaning they can relate information across an entire sequence without relying on local operations like convolution or recurrence. In the context of video:
Frames are treated as sequences of image patches (or tokens).
Self-attention enables the model to capture relationships within a frame (spatial) and across frames (temporal).
This makes transformers well-suited for understanding both object appearance and motion over time.
Two standout models in this space are TimeSformer and X3D, both of which introduce novel mechanisms for encoding spatiotemporal information in video sequences.
TimeSformer: Video Attention Across Time
TimeSformer (Time-Space Transformer), released in 2021, brings transformer-style attention into the video domain by decoupling spatial and temporal attention, making it more computationally efficient and easier to train.
The model architecture consists of the following:
Divided Space-Time Attention: Instead of applying full 3D attention (which is expensive), TimeSformer first applies self-attention within each frame (spatial), then attends across frames (temporal). This makes the model scalable to longer video sequences.
Patch-based Input: Like Vision Transformers (ViT), TimeSformer splits each frame into fixed-size patches (e.g., 16×16), flattens them, and embeds them as tokens.
Positional Embedding: TimeSformer adds space-time positional embeddings to each patch to retain spatial and temporal ordering.
TimeSformer excels at modeling long-range dependencies in video thanks to its efficient division of spatial and temporal attention. Its modular, patch-based design allows for scalability and flexibility, while pretrained 2D weights from models like ViT make it easy to fine-tune for video-specific tasks. This makes TimeSformer a powerful and adaptable solution for high-level video understanding.
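To make the patch-based input concrete, here is a minimal sketch (synthetic frames, arbitrary embedding size) of how a short clip is turned into per-frame patch tokens with a learned projection and space-time positional embeddings, which is the form that divided space-time attention operates on:

import torch
import torch.nn as nn

# 8 frames from one clip: (T, C, H, W)
frames = torch.randn(8, 3, 224, 224)
patch, dim = 16, 512

# Cut each frame into non-overlapping 16x16 patches and flatten each patch
T, C, H, W = frames.shape
tokens = frames.unfold(2, patch, patch).unfold(3, patch, patch)               # (T, C, 14, 14, 16, 16)
tokens = tokens.permute(0, 2, 3, 1, 4, 5).reshape(T, -1, C * patch * patch)   # (T, 196, 768)

# Learned patch embedding plus spatial and temporal position embeddings
to_embed = nn.Linear(C * patch * patch, dim)
space_pos = nn.Parameter(torch.zeros(1, tokens.shape[1], dim))
time_pos = nn.Parameter(torch.zeros(T, 1, dim))
x = to_embed(tokens) + space_pos + time_pos   # (8, 196, 512): tokens ready for attention
print(x.shape)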
X3D: Efficient Multiscale Networks for Video
X3D (Expanded 3D Convolutional Network), released in 2020, presents a different approach by focusing on efficiency and scalability. While technically not a pure transformer, X3D incorporates principles that align well with modern transformer-based thinking, especially in multiscale representation learning.
The model architecture consists of the following:
Progressive Expansion: X3D starts from a lightweight 2D network and gradually expands it along:
Depth (more layers),
Width (more channels),
Temporal resolution (longer video clips),
Input resolution (larger frames).
Multiscale Temporal Modeling: Uses temporally-dilated convolutions and attention-like operations to handle actions occurring at different time scales.
Efficient Computation: Designed to match or exceed heavier 3D CNNs such as SlowFast or C3D while using fewer FLOPs and offering faster inference.
Figure 7: X3D networks progressively expand a 2D network across the following axes: Temporal duration γt, frame rate γτ, spatial resolution γs, width γw, bottleneck width γb, and depth γd. (Source)
Although it is based on convolutional layers, X3D adopts hierarchical feature learning across multiple temporal resolutions, similar to how transformers layer self-attention across depth. Its adaptability across different feature scales and temporal ranges makes it highly relevant in modern video AI.
X3D offers an elegant balance between performance and efficiency by progressively expanding a lightweight 2D model into a powerful 3D video encoder. Its multiscale temporal modeling enables it to capture actions at various time scales with fewer computational resources, making it ideal for real-time or edge deployments. As a hybrid model, X3D is both adaptable and complementary to transformer-based approaches.
Storing and Using Video Embeddings for Future Search/Analysis
As video models grow more powerful, so does the need to manage their outputs, particularly the high-dimensional embeddings that represent rich visual and temporal patterns. Whether you are building a video search engine or analyzing behaviors over time, storing and querying embeddings becomes crucial.
Why Vector Databases Matter
Traditional relational databases fall short when dealing with embeddings. These high-dimensional vectors, often hundreds or thousands of dimensions, cannot be efficiently indexed or queried using SQL alone. Searching for “similar” vectors requires specialized techniques like approximate nearest neighbor (ANN) search.
Vector databases, such as OSS Milvus and Zilliz Cloud (a managed solution), are purpose-built for this task. They enable:
Efficient indexing and storage for high-dimensional data
Fast similarity search over millions or even billions of vector embeddings
In this section, we will use Milvus, an open-source and highly scalable vector database, to show how to store video embeddings and run real-time similarity queries over them.
Storing Embeddings in Milvus
To store embeddings, we first extract them from our model and convert them from PyTorch tensors to NumPy arrays. Each vector should be associated with metadata. Below is an example of how to store embeddings using our previously created class (Simple3DCNNWithIntermediateOutputs) and some synthetic video inputs.
from pymilvus import MilvusClient
import numpy as np
import torch

# Define your model and simulate video input
model = Simple3DCNNWithIntermediateOutputs()
video_batch = torch.randn(3, 3, 16, 64, 64)   # 3 video clips
output, _, _ = model(video_batch)             # Run through the model
embeddings = output.detach().numpy()          # Convert to numpy
embedding_dim = embeddings.shape[1]           # Get the embedding dimension

# Create a Milvus Lite client (in-memory or file-backed)
client = MilvusClient("./video_embeddings.db")

# Create the collection
client.create_collection(
    collection_name="video_clips",
    dimension=embedding_dim
)

# Prepare the records: one id and one vector per clip
data = [
    {"id": i, "vector": embeddings[i]}
    for i in range(len(embeddings))
]

# Insert into Milvus Lite
insert_result = client.insert(
    collection_name="video_clips",
    data=data
)
Performing Search & Analysis
Once embeddings are stored, Milvus allows fast vector similarity search. This can be used to:
Find similar video clips
Group related scenes or actions
Power recommendation systems or visual search engines
Let’s perform a similarity search using one of the stored clips as the query.
# Use the first stored embedding as the query vector
query_vector = [embeddings[0]]

# Search for similar vectors
search_result = client.search(
    collection_name="video_clips",   # The collection to search in
    data=query_vector,               # The query vector(s)
    limit=2,                         # Limit the number of results to 2
    output_fields=["id"]             # Fields to retrieve for each result
)

# Display search results
print("Search result:", search_result)
# Output
Search result: data: ["[{'id': 0, 'distance': 1.0000, 'entity': {'id': 0}}, {'id': 1, 'distance': 0.9538, 'entity': {'id': 1}}]"]
The retrieved clips are the ones most similar to the query. The first result is the query clip itself, returned with a cosine similarity of 1.0, which is expected because we searched with an embedding that is already stored in the collection.
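In a real pipeline, the query would usually come from a new, unseen clip rather than one already in the collection. A minimal sketch, reusing the model and client defined above and with a random tensor standing in for a real preprocessed clip:

# Embed a new clip with the same model and use it as the query
new_clip = torch.randn(1, 3, 16, 64, 64)       # stand-in for a real preprocessed clip
new_embedding, _, _ = model(new_clip)

hits = client.search(
    collection_name="video_clips",
    data=[new_embedding.detach().numpy()[0]],
    limit=3,
    output_fields=["id"]
)
print(hits)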
Conclusion
As video analysis models continue to advance, the ability to store and search through high-dimensional embeddings becomes a critical component of modern video processing pipelines. By leveraging vector databases like Milvus, we can efficiently handle large volumes of embeddings and perform fast similarity searches, unlocking powerful capabilities for applications such as video search engines, recommendation systems, and behavior analysis. The integration of deep learning models with scalable vector databases offers a streamlined solution for extracting and querying meaningful insights from complex video data.
In this blog, we explored how to use Milvus to store, query, and analyze embeddings extracted from video clips. The flexibility and performance of vector databases make them ideal for managing the vast amounts of data produced by video models, allowing for real-time searches and scalable analysis. As video-based AI applications continue to evolve, combining advanced deep learning techniques with robust storage solutions will be key to unlocking the full potential of video understanding and analysis.