Back

# How to Make Online Shopping More Intelligent with Image Similarity Search?

By Jessi Ji on Dec 21, 2021

Due to the coronavirus pandemic, people increasingly sought to socially distance themselves and avoid contact with strangers. As a result, no-contact delivery became an incredibly desirable option for many consumers. Amazon, the leading online retailer, reported that its net sales increased from $96.1 billion to$110.8 billion in the third quarter of 2021, a 15% year-over-year (YoY) growth. The rising popularity of e-commerce has also led to people buying a greater variety of goods online, including niche items that can be hard to describe using a traditional keyword search.

To help users overcome the limitations of keyword-based queries, companies can build image search engines that allow users to use images instead of words for search. Not only does this allow users to find items that are difficult to describe, but it also helps them shop for things they encounter in real life. This functionality helps build a unique user experience and offers general convenience that customers appreciate.

This article explores how to use AI models and a vector database to build an image search engine for online shopping platforms.

## How does image search work?

An image search system searches in the database inventory for product images that are similar to user uploads. The following chart shows the two stages of the system process, the data import stage (the right column) and the query stage (the left column):

2. Extract feature vectors from the detected targets. This article uses ResNet.
3. Conduct vector similarity search in a vector database. This article used Milvus, the open-source vector database.

System process of an image similarity search system for e-commerce.

### Target detection using the YOLO model

This article uses YOLO (You only look once) to detect objects in user uploaded images. YOLO is a one-stage model, using only one convolutional neural network (CNN) to predict categories and positions of different targets. It is small, compact, and well-suited for mobile use.

YOLO uses convolutional layers to extract features and fully-connected layers to obtain predicted values. Drawing inspiration from the GooLeNet model, YOLO’s CNN includes 24 convolutional layers and two fully-connected layers.

As the following illustration shows, a 448 × 448 input image is converted by a number of convolutional layers and pooling layers to a 7 × 7 × 1024-dimensional tensor (depicted in the third to last cube below), and then converted by two fully-connected layers to a 7 × 7 × 30-dimensional tensor output.

The predicted output of YOLO P is a two-dimensional tensor, whose shape is [batch,7 ×7 ×30]. Using slicing, P[:,0:7×7×20] is the category probability, P[:,7×7×20:7×7×(20+2)] is the confidence, and P[:,7×7×(20+2)]:] is the predicted result of the bounding box.

The YOLO network architecture.

### Image feature vector extraction with ResNet

After the images go through YOLO, they need to be converted into vectors for the system to process. This article adopts the residual neural network (ResNet) model to extract feature vectors from an extensive product image library and user uploaded photos. ResNet is limited because as the depth of a learning network increases, the accuracy of the network decreases. The image below depicts ResNet running the VGG19 model (a variant of the VGG model) modified to include a residual unit through the short circuit mechanism. VGG was proposed in 2014 and includes just 14 layers, while ResNet came out a year later and can have up to 152.

The ResNet structure.

The ResNet structure is easy to modify and scale. By changing the number of channels in the block and the number of stacked blocks, the width and depth of the network can be easily adjusted to obtain networks with different expressive capabilities. This effectively solves the network degeneration effect, where accuracy declines as the depth of learning increases. With sufficient training data, a model with improving expressive performance can be obtained while gradually deepening the network. Through model training, features are extracted for each picture and converted to 256-dimensional floating point vectors.