January 5, 2022
This article is transcreated by Angela Ni.
Deduplication refers to eliminating redundant data in a database. It is not a new idea in the industry, but it remains a necessary solution for managing today's explosive data growth, especially the growth of unstructured data like videos.
Online video streaming platforms such as YouTube store massive volumes of archived videos, and it is very likely that an archive contains many duplicate video resources. This data redundancy brings two problems: it wastes storage and computing resources, and it degrades the user experience.
Therefore, to manage data more efficiently and optimize user experience, a video deduplication system is necessary for online video platforms.
This tutorial aims to explain how to build an intelligent video deduplication system using a vector database designed for vector management and vector similarity search.
Generally, when conducting video similarity search, a system extracts several key frames from a video clip and converts them into multiple feature vectors using deep learning models. This raises a question: how do we calculate the similarity between two groups of feature vectors? One common practice is to combine all the feature vectors of a clip into one single vector and then run vector similarity search on it. However, for small and medium enterprises with limited resources, there is an alternative: treat a video clip as an aggregation of multiple images. The video deduplication system extracts a certain number of key frames (images) from each video clip to create an image set, so that every video clip is represented by an image set. The system then defines the similarity between two videos as the similarity between their two image sets, and uses Milvus, an open-source vector database, to calculate it. Since this tutorial builds an MVP deduplication system, we use the second approach instead of combining the feature vectors.
The following section describes the key steps of building an intelligent video deduplication system in detail.
ffprobe is a tool specifically designed for analyzing multimedia assets. It extracts information from video or audio streams and prints it in a human- and machine-readable fashion. ffprobe can also detect the container format of a multimedia stream as well as the format and type of each asset within it. Use the following command to obtain the duration of a video:
ffprobe -show_format -print_format json -v quiet input.mp4
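If you prefer to drive ffprobe from code, here is a minimal Java sketch (an illustration, not part of the original tutorial; it assumes ffprobe is on the PATH) that runs the command above and prints the line containing the duration field of the JSON output:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class ProbeDuration {
    public static void main(String[] args) throws Exception {
        // Run the same ffprobe command shown above
        Process p = new ProcessBuilder(
                "ffprobe", "-show_format", "-print_format", "json", "-v", "quiet", "input.mp4")
                .redirectErrorStream(true)
                .start();
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // The "duration" field of the "format" object holds the length in seconds
                if (line.contains("\"duration\"")) {
                    System.out.println(line.trim());
                }
            }
        }
        p.waitFor();
    }
}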
FFmpeg is an open-source tool for audio and video processing that supports recording and converting audio and video clips into multiple formats. We use FFmpeg to extract frames from the video at a certain rate. For instance, if a video clip lasts 100 seconds and we extract one frame every 10 seconds, the extraction frame rate is 0.1 frames per second. Run the following command to extract the video frames:
ffmpeg -i input.mp4 -r 0.1 ./images/frames_%02d.jpg
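The extraction command can likewise be invoked from Java. The sketch below (illustrative, assuming ffmpeg is on the PATH and input.mp4 exists in the working directory) runs it and waits for completion:

import java.io.File;

public class ExtractFrames {
    public static void main(String[] args) throws Exception {
        new File("images").mkdirs(); // ensure the output directory exists
        // -r 0.1 sets the output frame rate: one frame every 10 seconds
        Process p = new ProcessBuilder(
                "ffmpeg", "-i", "input.mp4", "-r", "0.1", "./images/frames_%02d.jpg")
                .inheritIO()
                .start();
        int exitCode = p.waitFor();
        System.out.println("ffmpeg exited with code " + exitCode);
    }
}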
Pre-processing of the images includes cropping, enlarging, and shrinking them, adjusting grayscale, and more. After pre-processing, use a pre-trained deep learning model such as VGG or ResNet to convert the images into 1000-dimensional feature vectors.
// Use the NativeImageLoader to convert the image to a numerical matrix
import org.datavec.image.loader.NativeImageLoader;
import org.nd4j.linalg.api.ndarray.INDArray;
import org.nd4j.linalg.dataset.api.preprocessor.DataNormalization;
import org.nd4j.linalg.dataset.api.preprocessor.ImagePreProcessingScaler;
import java.io.File;

File f = new File(absolutePath, "drawn_image.jpg");
NativeImageLoader loader = new NativeImageLoader(height, width, channels);
// Put the image into an INDArray
INDArray image = loader.asMatrix(f);
// Pixel values need to be scaled to the 0-1 range
DataNormalization scalar = new ImagePreProcessingScaler(0, 1);
// Then call that scaler on the image
scalar.transform(image);
// Pass the image through the neural net and store the result in the output array;
// `model` is a pre-trained VGG/ResNet network (see the loading sketch below)
INDArray[] output = model.output(image);
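The snippet above assumes `model` has already been loaded. One possible way to obtain it, shown as a sketch rather than the article's exact method, is through the Deeplearning4j model zoo (this assumes the deeplearning4j-zoo dependency; for VGG16, the loader dimensions should be height = width = 224 and channels = 3):

import org.deeplearning4j.nn.graph.ComputationGraph;
import org.deeplearning4j.zoo.PretrainedType;
import org.deeplearning4j.zoo.model.VGG16;

// Download ImageNet-pretrained VGG16 weights and build the network;
// initPretrained() throws IOException if the weights cannot be fetched
ComputationGraph model = (ComputationGraph) VGG16.builder().build()
        .initPretrained(PretrainedType.IMAGENET);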
As mentioned above, video similarity is equivalent to the similarity of two image sets. However, we cannot calculate the similarity of two sets of vectors directly; only the similarity between two single vectors is calculable. Therefore, we need to define the rules for calculating the similarity between two image sets: the similarity between a single vector and a set is the maximum inner product between that vector and each vector in the set, and the similarity between two sets is the average of these maximum values across all vectors in one set.
To further clarify the rules, suppose we have two vector sets, Set 1 (Vec_1, Vec_2, ..., Vec_10) and Set 2 (Vec_11, Vec_12, ..., Vec_20), and want to determine their similarity.
First, we compare Vec_1 with Vec_11, Vec_12, ..., and Vec_20 respectively and calculate the inner product values, obtaining IP_1, IP_2, ..., and IP_10. The maximum value among them is the similarity between Vec_1 and Set 2, and we rename this maximum value Max_1. For example, if IP_3 is the largest among IP_1 to IP_10, then the similarity between Vec_1 and Set 2 equals IP_3 (the similarity between Vec_1 and Vec_13), and we rename IP_3 as Max_1. Subsequently, we repeat the same process, comparing each remaining vector in Set 1 to every vector in Set 2. After this, we obtain Max_1, Max_2, ..., and Max_10. The similarity of Set 1 and Set 2 equals the average of Max_1, Max_2, ..., and Max_10.
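To make these rules concrete, here is a minimal Java sketch (not from the original article; class and method names are illustrative) that implements the max-inner-product-then-average computation described above:

// Minimal sketch of the set-to-set similarity rule: for each vector in set1,
// take the maximum inner product against all vectors in set2, then average.
public class SetSimilarity {

    // Inner product of two equal-length vectors
    static float innerProduct(float[] a, float[] b) {
        float sum = 0f;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Similarity between two image sets represented as arrays of feature vectors
    static float setSimilarity(float[][] set1, float[][] set2) {
        float total = 0f;
        for (float[] v1 : set1) {
            float max = Float.NEGATIVE_INFINITY;   // Max_i for this vector
            for (float[] v2 : set2) {
                max = Math.max(max, innerProduct(v1, v2));
            }
            total += max;
        }
        return total / set1.length;                // average of Max_1 ... Max_n
    }
}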
As a database specifically designed to handle queries over input vectors, Milvus is capable of indexing vectors on a trillion scale. Unlike existing relational databases, which mainly deal with structured data following a pre-defined pattern, Milvus is designed from the bottom up to handle embedding vectors converted from unstructured data.
As the Internet grew and evolved, unstructured data became more and more common, including emails, papers, IoT sensor data, Facebook photos, protein structures, and much more. In order for computers to understand and process unstructured data, it is converted into vectors using embedding techniques. Milvus stores and indexes these vectors and can analyze the correlation between two vectors by calculating their similarity distance. If two embedding vectors are very similar, the original data sources are similar as well.
In this tutorial, follow the standard Milvus workflow to conduct vector similarity search: create a collection, insert the key frame feature vectors of the archived videos, build an index, and then search the collection with the key frame vectors of a new video, using inner product as the similarity metric. Finally, aggregate the returned results according to the rules above to obtain the video-level similarity. A rough sketch of the search step follows.
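The following Java fragment sketches the search step using the Milvus Java SDK (milvus-sdk-java 2.x). The collection name video_frames and field name embedding are hypothetical, and builder method names may differ between SDK versions, so consult the SDK documentation for the exact API:

import io.milvus.client.MilvusServiceClient;
import io.milvus.grpc.SearchResults;
import io.milvus.param.ConnectParam;
import io.milvus.param.MetricType;
import io.milvus.param.R;
import io.milvus.param.dml.SearchParam;
import java.util.Collections;
import java.util.List;

// Connect to a local Milvus instance
MilvusServiceClient client = new MilvusServiceClient(
        ConnectParam.newBuilder().withHost("localhost").withPort(19530).build());

// Placeholder 1000-dimensional query vector for one key frame
List<Float> queryVector = Collections.nCopies(1000, 0.0f);

// Search the frame-vector collection using inner product, as in the rules above
SearchParam searchParam = SearchParam.newBuilder()
        .withCollectionName("video_frames")     // hypothetical collection name
        .withVectorFieldName("embedding")       // hypothetical vector field name
        .withMetricType(MetricType.IP)
        .withTopK(10)
        .withVectors(Collections.singletonList(queryVector))
        .build();

R<SearchResults> response = client.search(searchParam);
System.out.println(response.getStatus());
client.close();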
This tutorial serves as a basic set of instructions for building a minimum viable product (MVP). However, more work can be done to further optimize the system.