What is a Convolutional Neural Network? An Engineer's Guide
A Convolutional Neural Network (CNN) is a deep-learning model tailored for visual data like images, videos, and sometimes even audio files.
Traditional neural networks such as the multi-layer perceptron (MLP), also called fully connected networks, treat an image as a flat vector, which discards the spatial relationships between neighboring pixels. Because such a model has no built-in assumptions (inductive bias) suited to images, it often reaches poor accuracy on visual tasks.
CNNs address these issues by preserving the two-dimensional structure of the image: each unit looks only at a local neighborhood of pixels, which makes CNNs efficient at pattern recognition.
This post highlights the advantages of CNNs, explains their architecture, and gives a simple example of designing a CNN model.
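To make the loss of spatial structure concrete, here is a small sketch in plain Python. The 28-by-28 image size and the helper name `flat_index` are assumptions chosen for illustration only.

```python
# Illustrative sketch: how row-major flattening hides 2D neighborhoods.
WIDTH = 28  # assumed image width (a 28x28 grayscale image)

def flat_index(row, col, width=WIDTH):
    """Position of pixel (row, col) after row-major flattening."""
    return row * width + col

pixel = flat_index(10, 10)
right_neighbor = flat_index(10, 11)
lower_neighbor = flat_index(11, 10)

# Horizontally adjacent pixels stay adjacent in the flat vector...
print(right_neighbor - pixel)
# ...but vertically adjacent pixels end up a full row apart.
print(lower_neighbor - pixel)
```

A fully connected layer sees only the flat positions, so it has to learn from data that these two positions are neighbors. A convolution filter gets that relationship for free.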
Key reasons for using a CNN
CNNs excel at extracting meaningful features from raw visual data, outperforming traditional neural networks. Reasons for using a CNN include:
Parameter sharing: A CNN applies the same set of filter weights across every region of the input, so a pattern learned in one location can be detected anywhere in high-dimensional data.
Reduced number of parameters: Because each filter is small and shared across the whole image, a convolution layer needs far fewer parameters than a fully connected layer on the same input. Pooling adds no parameters of its own and shrinks the feature maps that later layers must process.
Hierarchical feature learning: A CNN loosely mimics the hierarchical structure of the human visual system. Early layers respond to simple features such as edges, and deeper layers combine them into more complex shapes and objects.
State-of-the-art performance: CNNs consistently outperform traditional neural networks in tasks like object detection, image processing, speech recognition, and image segmentation. Note that recent advances in computer vision have also introduced Transformer-based models, both with and without convolutions.
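The effect of parameter sharing is easy to check with arithmetic. The sketch below compares one fully connected layer with one convolution layer; the 224-by-224 RGB input, the 64 units or filters, and the helper names are assumptions for illustration.

```python
# Illustrative parameter counts; all layer sizes are assumptions.
def dense_params(n_inputs, n_outputs):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_outputs + n_outputs

def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus biases of a 2D convolution with square filters."""
    return out_channels * (in_channels * kernel_size * kernel_size) + out_channels

height, width, channels = 224, 224, 3

# Fully connected: each of the 64 units has a weight for every input value.
fc = dense_params(height * width * channels, 64)

# Convolution: 64 filters of size 3x3, reused at every image position.
conv = conv_params(channels, 64, 3)

print(f"fully connected: {fc:,} parameters")
print(f"convolution:     {conv:,} parameters")
```

The two layers do not compute the same thing, so this is not a like-for-like comparison. The point is that the convolution count does not depend on the image size at all, while the fully connected count grows with every added pixel.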
CNN architecture and how it works
These properties allow a CNN to find hidden patterns and interpret visual data with high accuracy.
The human visual system processes information in several stages, and each one performs a distinct function. CNNs have a similar architecture, with each layer extracting different features from the input image. Below is an explanation of the layers involved in CNN architecture.
The first few layers are convolution layers, which are responsible for extracting the basic features of the image, such as edges and shapes.
Pooling layers, usually placed between groups of convolution layers, are responsible for reducing the size of the feature maps.
Finally, the last layer is the fully connected (FC) layer, which is responsible for classifying the image into one of the given categories.
Note that nearly all modern, purely convolutional architectures have just one global pooling layer at the end, followed by one fully connected layer.
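As a rough illustration of that last point, global average pooling collapses each feature map into a single number, so the fully connected layer receives one value per channel. The nested-list representation, the helper name, and the tiny 2-by-2 maps below are assumptions, not part of any library.

```python
def global_average_pool(feature_maps):
    """Collapse each H x W feature map (nested lists) into its mean value."""
    pooled = []
    for fmap in feature_maps:
        values = [v for row in fmap for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Two assumed 2x2 feature maps (channels) from the last convolution layer.
feature_maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 2.0], [2.0, 4.0]],
]
pooled = global_average_pool(feature_maps)
print(pooled)  # one number per channel, whatever the spatial size was
```

Because the result has a fixed length regardless of the height and width of the feature maps, the fully connected layer that follows stays small.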
Convolution layer
The purpose of the convolution layer is to find distinctive patterns. It takes the input image and applies a set of filters to it to produce an output.
Each filter is a small matrix of weights that slides across the input image and responds to one particular pattern.
The output of a convolution is called a feature map, which records where, and how strongly, the filter's pattern appears in the input. Different filters pick up features such as edges, textures, and shapes.
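The sliding-filter operation can be sketched in a few lines of plain Python. This is a minimal illustration with stride 1 and no padding; the function name `conv2d`, the toy image, and the hand-picked edge filter are assumptions. In a real CNN the filter weights are learned during training rather than chosen by hand.

```python
def conv2d(image, kernel):
    """Slide a square kernel over a nested-list image (stride 1, no padding)."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it and sum the result.
            total = 0
            for di in range(k):
                for dj in range(k):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        feature_map.append(row)
    return feature_map

# Assumed 4x5 image: dark on the left (0), bright on the right (1).
image = [
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1],
]
# Hand-picked filter that responds to dark-to-bright vertical edges.
kernel = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
feature_map = conv2d(image, kernel)
for row in feature_map:
    print(row)
```

The feature map is large where the patch contains the edge and zero where the patch is uniformly bright. Strictly speaking, this loop computes a cross-correlation (the kernel is not flipped), which is also what deep-learning libraries implement under the name convolution.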
Pooling layer
As the number of feature maps grows with each convolution, you can downsample them by increasing the stride of the convolution, but an often-preferred way is to use a pooling layer, which reduces the size of the feature maps while preserving the most important features.
Pooling reduces the computational cost of a CNN and can help limit overfitting. Techniques like max pooling and average pooling shrink the spatial dimensions so that later layers have less data to process.
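Both pooling techniques can be sketched with the same loop. The helper `pool2d`, its `mode` argument, and the 4-by-4 feature map are assumptions made up for this illustration.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over square windows of a nested-list feature map."""
    if mode == "max":
        reduce_window = max
    else:
        reduce_window = lambda window: sum(window) / len(window)
    pooled = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(reduce_window(window))
        pooled.append(row)
    return pooled

feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 8, 4],
    [2, 0, 6, 2],
]
max_pooled = pool2d(feature_map, mode="max")
avg_pooled = pool2d(feature_map, mode="avg")
print(max_pooled)
print(avg_pooled)
```

With a 2-by-2 window and a stride of 2, the 4-by-4 map shrinks to 2-by-2. Max pooling keeps the strongest response in each window, while average pooling keeps the mean response.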
Fully connected layer
The final layer of a CNN is a fully connected layer that classifies the features extracted by the earlier layers.
It works like a layer in a traditional neural network: it takes the flattened output of the previous layer and connects every value to each of its neurons. These neurons produce one score per class, and the image is assigned to the class with the highest score.
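A minimal sketch of this classification step follows. The feature values, weights, and biases are made-up numbers for three hypothetical classes, and the helper names are assumptions. A trained network would have learned the weights.

```python
import math

def fully_connected(features, weights, biases):
    """One dense layer: a weighted sum of all inputs for each output neuron."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Flattened features from the previous layer (made-up values).
features = [0.5, 1.0, -0.5, 2.0]
# One row of weights and one bias per class (made-up values).
weights = [
    [0.2, -0.1, 0.0, 0.5],
    [-0.3, 0.4, 0.1, 0.0],
    [0.1, 0.1, 0.1, 0.1],
]
biases = [0.0, 0.1, -0.1]

logits = fully_connected(features, weights, biases)
probs = softmax(logits)
predicted_class = probs.index(max(probs))
print(probs, predicted_class)
```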
To follow how a CNN works end to end, it's important to understand a few common terms. They come into play when you tune a network for a complex problem statement or need to keep overfitting under control.
Stride: The step size, in pixels, by which the filter moves across the input during the convolution operation. A larger stride produces a smaller feature map.
Padding: Adding zeros around the borders of the image to preserve its spatial dimensions after convolution. It's done to prevent the feature map from shrinking and to avoid losing information at the edges after each convolution operation.
Epoch: One complete pass through the entire training dataset.
Dropout (regularization): A technique to prevent overfitting by randomly dropping neurons during training, which forces the network to learn robust features rather than rely on any particular neuron.
Stochastic depth: Shortens the network during training by randomly dropping residual blocks and bypassing their transformations through skip connections. At test time, the whole network is used to make predictions. This has been reported to improve test error and significantly reduce training time.
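Stride and padding together determine how large a feature map comes out of a convolution. A small helper makes the standard relationship concrete; the function name and the 32-pixel input are assumptions for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one dimension: floor((n + 2p - k) / s) + 1."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

# A 3x3 filter on a 32-pixel-wide input, under three settings.
no_padding = conv_output_size(32, kernel_size=3)                   # map shrinks
with_padding = conv_output_size(32, kernel_size=3, padding=1)      # size preserved
strided = conv_output_size(32, kernel_size=3, stride=2, padding=1) # size halved

print(no_padding, with_padding, strided)
```

The same formula applies independently to height and width, and it also covers pooling layers if you plug in the pooling window as the kernel size.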
How to design a Convolutional Neural Network
Here's an overview of the choices involved in designing a CNN for an image classification problem.
Choose the input size: Input size represents the size of the images on which the CNN will be trained. The input size should be large enough that the network can extract the features of the objects it aims to classify.
Choose the number of convolution layers: This determines how complex the learned features can be. More convolution layers allow the network to learn more complex features, but the computation time increases.
Choose the size of the filter: The size of the filter, along with the stride of the convolution, determines the size of the features that will be extracted from images. A larger filter covers a larger region of the image at once but needs more parameters and computation.
Choose the number of filters per layer: This determines the number of different features that can be extracted at that layer, because each filter produces one feature map.
Choose the pooling method: The two common pooling techniques are max pooling and average pooling. Max pooling takes the maximum value from a small region of the feature map, while average pooling takes the average value from that region.
Choose the number of fully connected layers: This determines the capacity of the classifier at the end of the network. The number of neurons in the final fully connected layer must equal the number of classes the network can classify.
Choose the activation function: The activation function introduces nonlinearity, which enables the network to learn more complex patterns from the image dataset. For binary classification, it's normal to use the sigmoid function on the output. In a multi-class classification problem, the output of the final FC layer is passed through the softmax function. For the hidden layers, ReLU is a common default, and GELU or Swish are popular choices in more recent architectures.
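For reference, the activation functions named in the last step are short enough to write out directly. This sketch uses only the standard library and the textbook definitions; the function names are my own, and real frameworks ship tuned versions of each.

```python
import math

def sigmoid(x):
    """Squashes any real number into (0, 1); used for binary outputs."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Passes positive values through and zeroes out negative ones."""
    return max(0.0, x)

def gelu(x):
    """Exact GELU: x times the standard normal CDF evaluated at x."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def swish(x):
    """Swish with beta = 1 (also known as SiLU): x times sigmoid(x)."""
    return x * sigmoid(x)

for x in (-2.0, 0.0, 2.0):
    print(f"x={x:+.1f}  relu={relu(x):.3f}  gelu={gelu(x):.3f}  swish={swish(x):.3f}")
```

Unlike ReLU, both GELU and Swish let small negative values through instead of clipping them to exactly zero, which is the main practical difference between them and ReLU.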
Below is a simple CNN implementation in Python that classifies traffic signs. Find the dataset here.
Simple CNN implementation with PyTorch
The process starts with importing the necessary modules as follows:
    import pandas as pd
    import numpy as np
    from cv2 import resize
    from skimage.io import imread
    import matplotlib.pyplot as plt
    # Jupyter notebook magic; remove this line when running as a plain script
    %matplotlib inline

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score
    from tqdm import tqdm

    import torch
    from torch.autograd import Variable
    from torch.nn import (Linear, ReLU, CrossEntropyLoss, Sequential, Conv2d,
                          MaxPool2d, Module, Softmax, BatchNorm2d, Dropout)
    from torch.optim import Adam, SGD
Once that's done, load the dataset and the images with the following code:
    # loading dataset
    train = pd.read_csv('Data/train.csv')

    # loading training images
    train_img = []
    for img_name in tqdm(train['Path']):
        # defining the image path
        image_path = 'Data/' + str(img_name)
        # reading the image as grayscale
        img = imread(image_path, as_gray=True)
        # resize image
        img = resize(img, (28, 28))
        # converting the type of pixel to float 32
        img = img.astype('float32')
        # normalizing the pixel values; as_gray=True already yields floats in
        # [0, 1] for color images, so scale only if values are still in 0-255
        if img.max() > 1.0:
            img /= 255.0
        # feed the image into the list
        train_img.append(img)

    # converting the list to numpy array
    train_x = np.array(train_img)
    # defining the target
    train_y = train['ClassId'].values
    train_x.shape
Once the training data is loaded, you’ll need to create a training and validation dataset using the train_test_split() method from sklearn.
    # create validation set
    train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size=0.1)

    # Check the shapes of the training and validation sets
    (train_x.shape, train_y.shape), (val_x.shape, val_y.shape)
You’ll also need to reshape the data for the Torch model as follows:
    # converting training images into torch format: (batch, channels, height, width)
    train_x = train_x.reshape(-1, 1, 28, 28)
    train_x = torch.from_numpy(train_x)

    # converting the target into torch format
    # (CrossEntropyLoss expects 64-bit integer class labels)
    train_y = train_y.astype('int64')
    train_y = torch.from_numpy(train_y)

    # converting validation images into torch format
    val_x = val_x.reshape(-1, 1, 28, 28)
    val_x = torch.from_numpy(val_x)

    # converting the target into torch format
    val_y = val_y.astype('int64')
    val_y = torch.from_numpy(val_y)
Then define different layers of a CNN as follows:
    class Net(Module):
        def __init__(self):
            super(Net, self).__init__()

            self.cnn_layers = Sequential(
                # Defining a 2D convolution layer
                Conv2d(1, 4, kernel_size=3, stride=1, padding=1),
                BatchNorm2d(4),
                ReLU(inplace=True),
                MaxPool2d(kernel_size=2, stride=2),
                # Defining another 2D convolution layer
                Conv2d(4, 4, kernel_size=3, stride=1, padding=1),
                BatchNorm2d(4),
                ReLU(inplace=True),
                MaxPool2d(kernel_size=2, stride=2),
            )

            # final dense layer for prediction
            self.linear_layers = Sequential(
                Linear(4 * 7 * 7, 43)
            )

        # Defining the forward pass
        def forward(self, x):
            x = self.cnn_layers(x)
            x = x.view(x.size(0), -1)
            x = self.linear_layers(x)
            return x
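The CNN above has two convolution layers, each followed by batch normalization, a ReLU activation, and a 2-by-2 max pooling layer.
The output of the second pooling layer is then flattened and passed to a fully connected layer, which classifies the image of the sign into one of 43 classes.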
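The `4 * 7 * 7` passed to `Linear` is not arbitrary: it is the number of values left after the two convolution and pooling stages. The plain-Python sketch below traces the shapes for a 1-by-28-by-28 input. The helper `out_size` is an assumption introduced here, not part of PyTorch. Batch normalization and ReLU are left out of the trace because they do not change the shape.

```python
def out_size(n, kernel, stride, padding):
    """Spatial size after a convolution or pooling layer."""
    return (n + 2 * padding - kernel) // stride + 1

channels, size = 1, 28  # grayscale 28x28 input, matching the reshape step
shapes = [(channels, size, size)]

for _ in range(2):  # the network has two conv + pool stages
    # Conv2d(..., 4, kernel_size=3, stride=1, padding=1): 4 channels, same size
    channels, size = 4, out_size(size, kernel=3, stride=1, padding=1)
    shapes.append((channels, size, size))
    # MaxPool2d(kernel_size=2, stride=2): halves height and width
    size = out_size(size, kernel=2, stride=2, padding=0)
    shapes.append((channels, size, size))

flattened = channels * size * size
print(shapes)
print(flattened)
```

If you change the input resolution, the number of filters, or the number of pooling stages, the first argument of `Linear` has to change to match the new flattened size.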
Next, let’s decide on the optimizer and the loss function and define the training procedure.
    # defining the model
    model = Net()
    # defining the optimizer
    optimizer = Adam(model.parameters(), lr=0.07)
    # defining the loss function
    criterion = CrossEntropyLoss()
    # checking if GPU is available
    if torch.cuda.is_available():
        model = model.cuda()
        criterion = criterion.cuda()

    print(model)

    def train(epoch):
        model.train()
        # getting the training and validation sets
        # (Variable is a no-op wrapper in current PyTorch versions)
        x_train, y_train = Variable(train_x), Variable(train_y)
        x_val, y_val = Variable(val_x), Variable(val_y)
        # converting the data into GPU format
        if torch.cuda.is_available():
            x_train = x_train.cuda()
            y_train = y_train.cuda()
            x_val = x_val.cuda()
            y_val = y_val.cuda()

        # clear gradients of the model parameters
        optimizer.zero_grad()

        # prediction for train and validation set
        output_train = model(x_train)
        output_val = model(x_val)

        # compute train and validation loss
        loss_train = criterion(output_train, y_train)
        loss_val = criterion(output_val, y_val)
        # store plain floats so the lists do not hold on to the autograd graph
        train_losses.append(loss_train.item())
        val_losses.append(loss_val.item())

        # backpropagation and update model parameters
        loss_train.backward()
        optimizer.step()
        if epoch % 2 == 0:
            # printing the validation loss
            print('Epoch : ', epoch + 1, '\t', 'loss :', loss_val.item())
Finally, train the model for 25 epochs on the training data as follows:
    # defining the number of epochs
    n_epochs = 25
    # empty list to store training losses
    train_losses = []
    # empty list to store validation losses
    val_losses = []
    # training the model
    for epoch in range(n_epochs):
        train(epoch)
Once training is complete, the model is ready to make predictions on the testing data. To learn more about implementing CNNs from scratch, refer to this article.
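Making a prediction means taking the model's raw scores for an image and choosing the class with the highest one. The sketch below shows that step, plus an accuracy calculation, in plain Python with made-up scores. The helper names and numbers are assumptions. With the PyTorch model above, the equivalent would be an argmax over the class dimension of the model output, and `accuracy_score` from sklearn (already imported) computes the same fraction.

```python
def predict(logits_batch):
    """Pick the index of the largest score in each row of class scores."""
    return [row.index(max(row)) for row in logits_batch]

def accuracy(predictions, labels):
    """Fraction of predictions that match the true labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# Made-up scores for 4 images and 3 classes
# (the traffic-sign model above outputs 43 scores per image).
logits_batch = [
    [2.1, 0.3, -1.0],
    [0.2, 0.1, 1.5],
    [-0.5, 0.9, 0.4],
    [1.2, 1.9, 0.0],
]
labels = [0, 2, 1, 0]

predictions = predict(logits_batch)
acc = accuracy(predictions, labels)
print(predictions, acc)
```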
What is the difference between a CNN and a Deep Neural Network?
A CNN is a type of deep neural network specialized for data such as images, speech, and video. Deep neural network (DNN) is the broader term for an artificial neural network with many layers that can learn complex patterns from data. In comparisons like this one, DNN usually refers to a generic, fully connected network.
Below are the key differences between CNNs and DNNs.
A CNN has a specific architecture for processing images. On the other hand, a DNN doesn't have any specific architecture and can work for a variety of tasks.
A CNN learns features from images by using convolution layers, while a DNN learns features with the help of different types of layers.
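A CNN can be computationally expensive to train on large images, because every filter is applied at every position of the input, even though it typically has fewer parameters than a fully connected network on the same input.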
What are the three layers of a CNN?
The three main types of layers in a CNN are the convolution layer, the pooling layer, and the fully connected layer.
Convolution layer: This layer is responsible for extracting features from images. It works by scanning the image with a filter, which is a small matrix of weights. As the filter moves across the image, its weights are multiplied by the pixel values underneath and the products are summed. The result is a feature map that contains the extracted features.
Pooling layer: The pooling layer reduces the size of feature maps. The two common pooling techniques are max pooling and average pooling.
Fully connected layer: This works like a layer in a traditional neural network. Its neurons take the features produced by the earlier layers and classify the image into one of a set of classes.
What is a Convolutional Neural Network in deep learning?
A convolutional neural network is a type of deep neural network that processes images, speech, and video so that you can make real-world predictions from this kind of unstructured data.
For example, CNNs are used in applications that infer human emotion, behavior, interests, and preferences from images and video.