What is a Convolutional Neural Network? An Engineer's Guide
A Convolutional Neural Network (CNN) is a deep-learning model tailored for visual data like images, videos, and sometimes even audio files.
CNNs have transformed fields such as computer vision, image processing, object detection, and even natural language processing (NLP).
Traditional neural networks like the Multi-Layer Perceptron (MLP), or fully connected networks, treat image data as flat vectors. Flattening discards the spatial structure of visual data, and the resulting mismatch between the model's assumptions (its inductive bias) and the data can lead to poor accuracy.
CNNs address these issues by preserving image structure: they exploit local connectivity, processing nearby pixels together, which makes them efficient at pattern recognition.
This post highlights CNN advantages, explains its architecture, and gives a simple example of designing a CNN model.
Key reasons for using a CNN
CNNs excel at extracting meaningful features from raw visual data, outperforming traditional neural networks. Reasons for using a CNN include:
Parameter sharing—A CNN applies the same set of filter weights across every region of the input, which makes it efficient at identifying hidden patterns in high-dimensional data.
Reduced number of parameters—Thanks to parameter sharing and pooling, CNNs have significantly fewer parameters than fully connected networks of comparable capacity (see the sketch after this list).
Hierarchical feature learning—A CNN mimics the hierarchical structure of the human visual system: early layers detect simple features such as edges, while deeper layers combine them into shapes and whole objects.
State-of-the-art performance—CNNs consistently outperform traditional neural networks in tasks like object detection, image processing, speech recognition, and image segmentation. Note that recent advances in computer vision have also introduced Transformer-based models, both with and without convolutional components.
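To make the first two points concrete, here's a minimal sketch (using PyTorch, the framework used later in this post) comparing the parameter count of a small convolution layer with that of a fully connected layer over the same image. The layer sizes are illustrative assumptions:

import torch
from torch.nn import Conv2d, Linear

# sixteen 3 x 3 filters over a 32 x 32 RGB image; the weights are shared across positions
conv = Conv2d(in_channels=3, out_channels=16, kernel_size=3)
# a fully connected layer mapping the same flattened image to 16 neurons
fc = Linear(in_features=3 * 32 * 32, out_features=16)

count = lambda layer: sum(p.numel() for p in layer.parameters())
print(count(conv))  # 448 parameters (16 * 3 * 3 * 3 weights + 16 biases)
print(count(fc))    # 49168 parameters (16 * 3072 weights + 16 biases)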
CNN architecture and how it works
A CNN's layered design is what lets it find hidden patterns and decipher visual data with such accuracy.
The human neural system has several layers, and each one is responsible for performing a unique function. CNNs have a similar architecture, with each layer extracting different features from the input image. Below is a detailed explanation of all the layers involved in CNN architecture.
The first few layers are convolution layers, which are responsible for extracting basic features of the image, such as edges and shapes.
The next few layers are pooling layers, which are responsible for reducing the size of feature maps.
Finally, the last layer is the fully connected (FC) layer, which is responsible for classifying the image into one of the given categories.
Nearly all modern, pure convolutional architectures have just one global pooling layer at the end followed by one fully connected layer.
Convolution layer
This layer's purpose is to find some distinctive patterns. It takes the input image and applies a set of filters to it to produce an output.
The filter is a small matrix of weights that scans the input image and identifies different patterns.
The output of a convolution is called a feature map; it contains the responses of one filter across the whole image, capturing features like shapes, edges, and texture. A sketch of the operation follows.
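Here's a minimal PyTorch sketch of a convolution layer producing feature maps, assuming a single-channel 28-by-28 input (the same shape the full example later uses):

import torch
from torch.nn import Conv2d

# a dummy batch containing one single-channel 28 x 28 image
image = torch.randn(1, 1, 28, 28)
# eight 3 x 3 filters, each producing its own feature map
conv = Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
feature_maps = conv(image)
print(feature_maps.shape)  # torch.Size([1, 8, 28, 28])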
Pooling layer
As the architecture grows, you can downsample after convolution by increasing the stride of the convolution across the image, but an often better approach is a pooling layer, which reduces the size of the feature maps while preserving the most important features.
Pooling reduces the computational complexity of a CNN and helps avoid overfitting. Techniques like max pooling and average pooling shrink the spatial dimensions, as the following sketch shows.
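A minimal sketch of both techniques in PyTorch, reusing the eight 28-by-28 feature maps from the sketch above:

import torch
from torch.nn import MaxPool2d, AvgPool2d

feature_maps = torch.randn(1, 8, 28, 28)
# a 2 x 2 window with stride 2 halves each spatial dimension
print(MaxPool2d(kernel_size=2, stride=2)(feature_maps).shape)  # torch.Size([1, 8, 14, 14])
print(AvgPool2d(kernel_size=2, stride=2)(feature_maps).shape)  # torch.Size([1, 8, 14, 14])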
Fully connected layer
The final layer of a CNN is a fully connected layer that classifies the network's output.
It works like a traditional neural network layer: the feature maps from the previous layers are flattened into a vector and connected to a set of neurons, which classify the image into one of the desired classes, as sketched below.
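Continuing the running sketch, the pooled feature maps are flattened and fed to a fully connected layer (the 10-class output size here is an illustrative assumption):

import torch
from torch.nn import Linear

pooled = torch.randn(1, 8, 14, 14)
# flatten the feature maps into one vector per image
flat = pooled.view(pooled.size(0), -1)  # shape: [1, 1568]
# map the 1568 features to scores for 10 classes
fc = Linear(in_features=8 * 14 * 14, out_features=10)
print(fc(flat).shape)  # torch.Size([1, 10])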
Essential terminology
Before walking through a complete CNN, it's worth understanding a few common terms. They come into play when the problem is complex or the dataset is large, and several of them help prevent overfitting.
Strides—The step size the filter takes as it moves across the image during the convolution operation (see the sketch after this list).
Padding—Adding zeros around the borders of the image to preserve its spatial dimensions after convolution. It prevents the image from shrinking, and therefore losing border information, after each convolution operation.
Epoch—One complete pass through the entire training dataset.
Dropout (regularization)—A technique to prevent overfitting by randomly dropping neurons during training, which forces the network to learn redundant representations rather than rely on any specific neuron.
Stochastic depth—Shortens the network during training by randomly dropping residual blocks and bypassing their transformations through skip connections. At test time, the whole network is used to make predictions. This improves test error and significantly reduces training time.
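To see how stride and padding affect the output size, here's a minimal sketch; the spatial output size follows output = floor((input + 2 × padding − kernel) / stride) + 1:

import torch
from torch.nn import Conv2d

image = torch.randn(1, 1, 28, 28)
# stride 1 with padding 1 preserves the 28 x 28 size: (28 + 2 - 3)/1 + 1 = 28
print(Conv2d(1, 4, kernel_size=3, stride=1, padding=1)(image).shape)  # torch.Size([1, 4, 28, 28])
# stride 2 with no padding shrinks it: floor((28 - 3)/2) + 1 = 13
print(Conv2d(1, 4, kernel_size=3, stride=2, padding=0)(image).shape)  # torch.Size([1, 4, 13, 13])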
How to design a Convolutional Neural Network
Here's a complete overview of how to design a CNN for any image classification problem statement.
Choose the input size—Input size represents the size of an image on which the CNN will be trained. The input size should be large enough so the network is capable of extracting the features of an object that it aims to classify.
Choose the number of convolution layers—This determines how many features the network will be able to learn. More convolution layers allow it to learn more complex features, but the computation time increases.
Choose the size of the filter—The size of the filter, along with the stride of the convolution, determines the scale of the features extracted from images. A larger filter captures larger patterns in a single step, at the cost of coarser localization.
Choose the number of filters per layer—This determines the number of different features that can be extracted from an image.
Choose the pooling method—The two common pooling techniques are max pooling and average pooling. Max pooling takes the maximum value from a small region of the feature map, while average pooling takes the average value from a small region of the feature map.
Choose the number of fully connected layers—Additional fully connected layers add capacity for combining the extracted features; the final one must have one neuron per class the network should distinguish.
Choose the activation function—The activation function enables the network to learn complex, nonlinear patterns from the image dataset. For binary classification, the output typically uses the sigmoid function; in a multi-class problem, the final FC layer uses softmax. Between layers, GeLU or Swish are popular choices these days for introducing nonlinearity (a sketch of a classifier head follows this list).
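As a sketch of the last point, here's a hypothetical classifier head for a 10-class problem. The layer sizes are illustrative assumptions, and note that PyTorch's CrossEntropyLoss applies softmax internally, which is why the full example below omits an explicit Softmax layer:

import torch
from torch.nn import Linear, ReLU, Sequential, Softmax

head = Sequential(
    Linear(128, 64),
    ReLU(),          # nonlinearity between layers; GeLU or Swish are common alternatives
    Linear(64, 10),
    Softmax(dim=1),  # turns raw scores into class probabilities
)
print(head(torch.randn(1, 128)).shape)  # torch.Size([1, 10])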
Below is a simple example of a CNN implemented in Python that classifies traffic signs into 43 classes. You can find the dataset on the Kaggle website.
Simple CNN implementation with PyTorch
To implement a CNN model in Python, use a framework such as PyTorch, TensorFlow, or Keras. These frameworks provide implementations of all the layers required for a CNN.
The process starts with importing the necessary modules as follows:
# dependencies for computation
import pandas as pd
import numpy as np
# dependencies for reading and displaying images
from cv2 import resize
from skimage.io import imread
import matplotlib.pyplot as plt
%matplotlib inline
# dependency to create validation set
from sklearn.model_selection import train_test_split
# dependency to evaluate the model
from sklearn.metrics import accuracy_score
from tqdm import tqdm
# PyTorch libraries and modules
import torch
from torch.nn import (Linear, ReLU, CrossEntropyLoss,
                      Sequential, Conv2d, MaxPool2d, Module,
                      Softmax, BatchNorm2d, Dropout)
from torch.optim import Adam, SGD
Once that's done, load the dataset and the images with the following code:
# loading dataset
train = pd.read_csv('Data/train.csv')
# loading training images
train_img = []
for img_name in tqdm(train['Path']):
    # defining the image path
    image_path = 'Data/' + str(img_name)
    # reading the image; as_gray=True converts color images to floats
    # in [0, 1], so no further division by 255 is needed
    img = imread(image_path, as_gray=True)
    # resizing the image to 28 x 28
    img = resize(img, (28, 28))
    # converting the pixel type to float32
    img = img.astype('float32')
    # appending the image to the list
    train_img.append(img)
# converting the list to a numpy array
train_x = np.array(train_img)
# defining the target
train_y = train['ClassId'].values
train_x.shape
Once the training data is loaded, you’ll need to create a training and validation dataset using the train_test_split() method from sklearn.
# create validation set
train_x, val_x, train_y, val_y = train_test_split(train_x, train_y, test_size = 0.1)
# Check the shapes of the training and validation sets
(train_x.shape, train_y.shape), (val_x.shape, val_y.shape)
You’ll also need to reshape the data for the Torch model as follows:
# converting training images into torch format
train_x = train_x.reshape(-1, 1, 28, 28)
train_x = torch.from_numpy(train_x)
# converting the target into torch format (CrossEntropyLoss expects int64 labels)
train_y = train_y.astype('int64')
train_y = torch.from_numpy(train_y)
# converting validation images into torch format
val_x = val_x.reshape(-1, 1, 28, 28)
val_x = torch.from_numpy(val_x)
# converting the target into torch format
val_y = val_y.astype('int64')
val_y = torch.from_numpy(val_y)
Then define different layers of a CNN as follows:
class Net(Module):
    def __init__(self):
        super(Net, self).__init__()
        self.cnn_layers = Sequential(
            # defining a 2D convolution layer
            Conv2d(1, 4, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(4),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
            # defining another 2D convolution layer
            Conv2d(4, 4, kernel_size=3, stride=1, padding=1),
            BatchNorm2d(4),
            ReLU(inplace=True),
            MaxPool2d(kernel_size=2, stride=2),
        )
        # final dense layer for prediction: 43 traffic-sign classes
        self.linear_layers = Sequential(
            Linear(4 * 7 * 7, 43)
        )

    # defining the forward pass
    def forward(self, x):
        x = self.cnn_layers(x)
        # flattening the feature maps before the fully connected layer
        x = x.view(x.size(0), -1)
        x = self.linear_layers(x)
        return x
The network above has two convolution layers, each followed by batch normalization, a ReLU activation, and 2-by-2 max pooling. Each pooling step halves the spatial size, so a 28-by-28 input becomes 14-by-14 and then 7-by-7, which is why the fully connected layer takes 4 × 7 × 7 inputs.
The feature maps are flattened, and the fully connected layer classifies the sign into one of the 43 classes.
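A quick sanity check confirms the shapes line up:

# one dummy single-channel 28 x 28 image through the network
dummy = torch.randn(1, 1, 28, 28)
print(Net()(dummy).shape)  # torch.Size([1, 43])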
Next, let’s decide on the optimizer and the loss function and define the training procedure.
# defining the model
model = Net()
# defining the optimizer
optimizer = Adam(model.parameters(), lr=0.07)
# defining the loss function
criterion = CrossEntropyLoss()
# moving the model and loss function to the GPU if one is available
if torch.cuda.is_available():
    model = model.cuda()
    criterion = criterion.cuda()
print(model)
def train(epoch):
    model.train()
    # getting the training set
    x_train, y_train = train_x, train_y
    # getting the validation set
    x_val, y_val = val_x, val_y
    # converting the data into GPU format
    if torch.cuda.is_available():
        x_train = x_train.cuda()
        y_train = y_train.cuda()
        x_val = x_val.cuda()
        y_val = y_val.cuda()
    # clearing the gradients of the model parameters
    optimizer.zero_grad()
    # prediction and loss for the training set
    output_train = model(x_train)
    loss_train = criterion(output_train, y_train)
    # backpropagation and parameter update
    loss_train.backward()
    optimizer.step()
    # prediction and loss for the validation set (no gradients needed)
    with torch.no_grad():
        output_val = model(x_val)
        loss_val = criterion(output_val, y_val)
    # storing scalar losses rather than tensors, so the computation graph is freed
    train_losses.append(loss_train.item())
    val_losses.append(loss_val.item())
    if epoch % 2 == 0:
        # printing the validation loss
        print('Epoch :', epoch + 1, '\t', 'loss :', loss_val.item())
Finally, train the model for 25 epochs on the training data as follows:
# defining the number of epochs
n_epochs = 25
# empty list to store training losses
train_losses = []
# empty list to store validation losses
val_losses = []
# training the model
for epoch in range(n_epochs):
    train(epoch)
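To gauge how well the model fits, you can compute the validation accuracy with the accuracy_score function imported earlier. A minimal sketch:

# switching to evaluation mode and disabling gradient tracking
model.eval()
with torch.no_grad():
    x = val_x.cuda() if torch.cuda.is_available() else val_x
    output = model(x)
# the predicted class is the index of the highest score
predictions = output.argmax(dim=1).cpu().numpy()
print('Validation accuracy:', accuracy_score(val_y.numpy(), predictions))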
The trained model can then make predictions on the test data in the same way. For more detail, refer to this blog post on how to write CNNs from scratch in PyTorch.
FAQs
What is the difference between CNN and Deep Neural Networks?
A CNN is a type of neural network specialized for grid-structured data like images, video, and spectrograms, while a deep neural network (DNN) is any artificial neural network with multiple layers that learns complex patterns from data.
Below are the key differences between CNNs and DNNs.
A CNN has a specific architecture designed for processing images. On the other hand, a generic DNN has no task-specific structure and can work for a variety of tasks.
A CNN learns features from images by using convolution layers, while a generic DNN learns features through stacks of fully connected layers.
Thanks to parameter sharing, a CNN typically needs fewer parameters than a fully connected DNN of comparable capacity, though convolutions over large images can still be computationally expensive.
What are the three layers of a CNN?
The three layers of a CNN are the convolution layer, pooling layer, and fully connected layer.
Convolution layer—This layer is responsible for extracting features from images. It works by scanning images with a filter, which is a small matrix of weights. The filter moves across the image, and weights are multiplied by the values of pixels in the image. Finally, it produces a feature map that contains the extracted features.
Pooling layer—The pooling layer reduces the size of the feature maps; the two common techniques are max pooling and average pooling.
Fully connected layer—This works like a traditional neural network: its neurons take the flattened output of the convolution and pooling layers and classify the image into one of a set of classes.
What is a Convolutional Neural Network in deep learning?
A convolutional neural network is a type of deep neural network that processes images, speech, and video, making it useful for real-world predictions on visual data.
For example, CNNs are used to recognize human emotions from faces and to infer behavior and interests from visual content.