# Convolutional neural networks (CNN)

## Introduction

Motivated by human vision and designed to be robust to transformations such as scaling, translation, and rotation, convolutional neural networks (CNNs) have become the de facto standard for computer vision tasks over the past decade. In fact, much of the current deep learning revolution started with deep convolutional networks setting new state-of-the-art records on major image recognition tasks. That said, CNNs and their variants are also widely used in applications beyond computer vision.

## Prerequisites

To understand CNNs, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

## Motivation

Consider an example computer vision task of recognizing handwritten digits — given an image of a handwritten digit, the model needs to accurately recognize the digit in the image.

Suppose we try to address this challenge with a multilayer perceptron (MLP) model. A simple implementation will involve pixel-level inputs and 10 outputs, one for each digit, with several intermediate hidden layers consisting of multiple nodes. With pixel-level input, an $N \times N$ pixel image requires an input layer of $N^2$ units. Training such a model is straightforward, but the resulting model is not invariant to transformations of the input.

Note that digits, handwritten or otherwise, are invariant to several transformations.

• Scaling a digit within an image does not change the digit it represents.
• Translation, that is, moving the digit up, down, left, or right within an image, does not change the digit it represents.
• Minor rotation of the image, say less than 30 degrees in either direction, does not modify the digit.

The MLP for digit recognition that we described earlier may not be able to accurately identify the digit in the presence of scaling, translation, or minor rotation. That is because the MLP learns a map from the input to the output in which each feature of the input is expected at a fixed location. To be invariant, we would have to train the MLP on all possible transformations, and combinations thereof, at all locations in the image. This could be infeasible even for digit recognition, let alone for other image recognition tasks.

What we need is a principled way of incorporating invariance to scaling, translation, and rotation into the model itself. To do so, we first identify an important property of images.

Images are composed of correlated patches: there is a stronger correlation among neighboring pixels than among distant pixels. In the digit recognition example, this nearness property holds before and after transformations such as scaling, translation, or minor rotation; pixels that were close to each other before the transformation remain closer after it than pixels that were distant.

This means we need a model that can exploit the locality within the image by abstracting information from small subregions of the image, for example borders, color patches, and basic shapes. In a multilayered approach, we can then stack layers of such abstraction to compose increasingly higher-level objects, finally resulting in the desired output.

A convolutional neural network incorporates these ideas through local receptive fields, weight sharing, and subsampling.

## Intuition

CNNs are a specialized form of deep feedforward networks. Starting at the input layer, they are composed of multiple alternating convolutional and subsampling layers, finally followed by an output layer that is task dependent.

The convolutional layer is organized into planes, each known as a feature map. Each feature map is composed of units. Each such unit receives input from a small subregion of the input image, known as its receptive field. For example, an $N^2$ unit feature map may be arranged in an $N \times N$ grid of units, where each unit receives an input from an $M \times M$ pixel patch of the input.

The value of a unit is a simple weighted sum of the inputs feeding into that unit. For example, a unit with an $M \times M$ sized receptive field has $M \times M$ weights, all of which are learnable parameters. In CNNs, all units of a feature map share the same weight matrix, known as the kernel of that feature map.

Feature maps feed into a subsampling layer that reduces the dimensionality of its input for subsequent layers. Subsampling reduces an $L \times L$ patch of the feature map to a single number, for example by computing the average, the maximum, or some other summary statistic, depending on the predictive task. In the context of CNNs, the subsampling operation is known as pooling, often prefixed with the type of pooling: max-pooling, average-pooling, etc. Subsampling layers typically have no learnable parameters; the size of the patch being sampled is a hyperparameter, typically tuned by cross-validation.

The final subsampling layer feeds into a fully connected output layer, as in the case of an MLP, to arrive at an output that is relevant to the task, such as classification or regression.

Some modern variants, often called all-convolutional networks, consist of only convolutional layers as intermediate layers, avoiding separate subsampling layers altogether.

## The convolutional kernel

The shared kernel results in an elegant mathematical operation from the input to the feature map. Because each unit in a feature map takes the value of a small receptive field in the input weighted by the kernel, the kernel may be imagined as sliding along the input, each time producing the value of one unit in the feature map. Consider an input image $\mX$ of $P \times Q$ pixels and a kernel $\mW$ of size $M \times N$. The value of a unit $u_{ij}$ in the feature map is calculated as a weighted sum of the inputs contained in a patch of size $M \times N$, the same as the kernel, with the patch starting at the pixel coordinates $(i-M, j-N)$ and ending at the pixel coordinates $(i-1, j-1)$. Mathematically, the calculation is expressed as

$$u_{ij} = \sum_{m=1}^{M} \sum_{n=1}^N \mX_{i-m,j-n} \mW_{mn}$$

This is the well-known mathematical operation called convolution, commonly denoted with the symbol $*$, as

$$u_{ij} = \left(\mW * \mX\right) (i,j) = \sum_{m=1}^{M} \sum_{n=1}^N \mX_{i-m,j-n} \mW_{mn} \label{eqn:convolution-op}$$

It is owing to this convolutional operator that the resulting layer is known as a convolutional layer and the overall network is known as a convolutional neural network.
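
As a minimal sketch of the convolution sum above, the following pure-Python function evaluates it at every position where the kernel fits entirely inside the input. The function name `convolve2d`, the 0-based indexing, and the restriction to fully overlapping positions (no padding) are assumptions made for illustration; the text does not specify how borders are handled.

```python
def convolve2d(X, W):
    """Convolution of image X (P x Q) with kernel W (M x N), no padding.

    Implements u_ij = sum_m sum_n X[i-m][j-n] * W[m][n], re-indexed to
    0-based lists and restricted to positions where the kernel fully
    overlaps the image.
    """
    P, Q = len(X), len(X[0])
    M, N = len(W), len(W[0])
    out = []
    for a in range(P - M + 1):
        row = []
        for b in range(Q - N + 1):
            total = 0
            for m in range(M):
                for n in range(N):
                    # The minus signs in the formula show up as reversed
                    # offsets: the kernel is traversed backwards over the patch.
                    total += X[a + M - 1 - m][b + N - 1 - n] * W[m][n]
            row.append(total)
        out.append(row)
    return out


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
W = [[1, 0],
     [0, -1]]
feature_map = convolve2d(X, W)
print(feature_map)  # [[4, 4], [4, 4]]
```

Every unit of `feature_map` is computed with the same four weights in `W`, which is the weight sharing described earlier.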

Mathematically, the convolution operation is defined as above, but many machine learning and deep learning libraries implement a closely related function, the cross-correlation, as a surrogate for convolution (still referring to it as convolution!), calculated as

$$u_{ij} = \left(\mW * \mX\right) (i,j) = \sum_{m=1}^{M} \sum_{n=1}^N \mX_{i+m,j+n} \mW_{mn} \label{eqn:cross-correlation}$$

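That being said, both have the net effect of sliding the kernel along an input plane to produce a feature map. The two differ only in whether the kernel is flipped along both axes (and in how the output is indexed), and because the kernel weights are learned, either choice leads to an equivalent model: the learned kernel simply comes out flipped.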
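
The relationship between the two operations can be checked numerically. The sketch below, with the hypothetical helper names `cross_correlate2d` and `flip_kernel`, assumes no padding and 0-based indexing. It shows that cross-correlating with a kernel flipped along both axes reproduces the values of the convolution sum, while cross-correlating with the unflipped kernel generally gives different values.

```python
def cross_correlate2d(X, W):
    """Cross-correlation, no padding: u[a][b] = sum X[a+m][b+n] * W[m][n]."""
    P, Q = len(X), len(X[0])
    M, N = len(W), len(W[0])
    return [
        [
            sum(X[a + m][b + n] * W[m][n] for m in range(M) for n in range(N))
            for b in range(Q - N + 1)
        ]
        for a in range(P - M + 1)
    ]


def flip_kernel(W):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in W[::-1]]


X = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
W = [[1, 2],
     [3, 4]]

correlated = cross_correlate2d(X, W)
convolved = cross_correlate2d(X, flip_kernel(W))
print(correlated)  # [[37, 47], [67, 77]]
print(convolved)   # [[23, 33], [53, 63]]
```

For a kernel that is symmetric under this flip, the two results coincide.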

## Benefits of parameter sharing

As explained in the previous section, all units of a feature map share the same weight matrix, known as a kernel. This parameter sharing, or weight sharing, is motivated by multiple reasons.

• If each of the $N^2$ units in a feature map had its own $M \times M$ weight matrix $\mW$, we would end up with $N^2M^2$ parameters for a single feature map! In a deeper architecture with several convolutional layers, that would be a prohibitively large number of parameters to train, leading to limited generalization.
• Making a kernel smaller than the input typically results in sparse interactions or sparse connectivity. For example, an input may have millions of pixels, but the model may only need to focus on a small subset of those pixels, the edges or relevant patches, ignoring the rest. A local receptive field imposed by a kernel followed by subsampling ensures that such relevant information passes through to the feature map for more efficient predictive capability.
• Moreover, without weight sharing, an object that moves to another spot within an image may no longer be recognizable, because it will no longer be transformed by the same weight matrix, resulting in a completely different feature map. The property of a shared kernel that ensures this translational robustness is known as equivariance. Mathematically, a function $f$ is said to be equivariant to a function $g$ if $f(g(x)) = g(f(x))$. The convolution operation described earlier is equivariant to translation: translating the input and then convolving gives the same result as convolving and then translating the feature map.
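
Two of the points above can be illustrated with a short sketch. The helper names are hypothetical, the cross-correlation form is used as the sliding operation, and the example image is assumed to have a zero-valued margin so that nothing is lost at the borders when it is shifted. The first part compares parameter counts with and without sharing; the second checks $f(g(x)) = g(f(x))$, where $f$ slides the kernel and $g$ translates to the right.

```python
def unshared_parameter_count(N, M):
    """Parameters if each of the N*N units had its own M x M weight matrix."""
    return N * N * M * M


def shared_parameter_count(M):
    """Parameters when all units share a single M x M kernel."""
    return M * M


def cross_correlate2d(X, W):
    """Slide kernel W over X (no padding) and return the feature map."""
    P, Q = len(X), len(X[0])
    M, N = len(W), len(W[0])
    return [
        [
            sum(X[a + m][b + n] * W[m][n] for m in range(M) for n in range(N))
            for b in range(Q - N + 1)
        ]
        for a in range(P - M + 1)
    ]


def shift_right(X, k):
    """Translate every row k positions to the right, filling with zeros."""
    return [[0] * k + row[:len(row) - k] for row in X]


print(unshared_parameter_count(28, 5))  # 19600
print(shared_parameter_count(5))        # 25

# A small pattern surrounded by zeros (illustrative values).
X = [[0, 0, 0, 0, 0, 0],
     [0, 1, 2, 0, 0, 0],
     [0, 3, 4, 0, 0, 0],
     [0, 0, 0, 0, 0, 0]]
W = [[1, -1],
     [2, 0]]

translate_then_slide = cross_correlate2d(shift_right(X, 2), W)
slide_then_translate = shift_right(cross_correlate2d(X, W), 2)
print(translate_then_slide == slide_then_translate)  # True
```

The equality holds exactly here because of the zero margin; near the image border, where the pattern would be cut off, it holds only approximately.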

## Pooling

In conventional CNNs, the feature map from the convolutional layer is subsampled in a pooling layer before being passed on to the next convolutional layer.

The pooling layer replaces each small patch in the feature map with a summary statistic of that patch. For example, the popular max-pooling layer reduces the input patch to a single value, the maximum of all values within that patch. Alternative pooling strategies take the average, a weighted average, or the $L_2$ norm of the patch.
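
A minimal sketch of spatial pooling follows. It assumes non-overlapping square patches (the stride equals the patch size) and drops any trailing rows or columns that do not fill a whole patch; the names `pool2d`, `mean`, and `l2_norm` are illustrative and not taken from any library.

```python
def pool2d(feature_map, size, reduce_fn):
    """Summarize each non-overlapping size x size patch with reduce_fn."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - size + 1, size):
        pooled_row = []
        for left in range(0, cols - size + 1, size):
            patch = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(reduce_fn(patch))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


def l2_norm(values):
    return sum(v * v for v in values) ** 0.5


feature_map = [[1, 3, 2, 0],
               [5, 7, 1, 1],
               [0, 2, 4, 8],
               [6, 4, 0, 4]]

print(pool2d(feature_map, 2, max))   # [[7, 2], [6, 8]]
print(pool2d(feature_map, 2, mean))  # [[4.0, 1.0], [3.0, 4.0]]
```

Passing `l2_norm` as `reduce_fn` gives the $L_2$ variant mentioned above. In each case a $4 \times 4$ feature map is reduced to $2 \times 2$.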

The primary goal of the pooling layer is to make the model invariant to small translations of the input. A model is said to be invariant to translation if its output does not change when the input is translated by a small amount. Moreover, subsampling improves computational efficiency, as the next convolutional layer has a much smaller input to deal with.

Sometimes, pooling is essential in tasks that have to deal with variable input size. For example, the output layer in a CNN for classification is typically a fully connected layer that requires a fixed input size. If we are dealing with images of variable resolution or size, then we need intermediate pooling layers to reduce the image to a fixed size before sending it to the output layer. This can be achieved by requiring the final pooling layer to produce a fixed number of outputs, dividing its input into a corresponding number of regions and delivering a summary statistic for each region.
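
One possible way to divide an input of arbitrary size into a fixed grid of regions is sketched below. The region boundaries (floor for the start, ceiling for the end, so neighboring regions may overlap when the size is not evenly divisible) are an assumption chosen for illustration, as are the function names.

```python
def region_bounds(index, length, parts):
    """Start (inclusive) and end (exclusive) of region `index` when a span
    of `length` units is divided into `parts` regions."""
    start = (index * length) // parts
    end = ((index + 1) * length + parts - 1) // parts  # ceiling division
    return start, end


def pool_to_fixed_grid(feature_map, out_rows, out_cols, reduce_fn=max):
    """Pool a feature map of any size down to out_rows x out_cols values."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(out_rows):
        top, bottom = region_bounds(r, rows, out_rows)
        pooled_row = []
        for c in range(out_cols):
            left, right = region_bounds(c, cols, out_cols)
            region = [
                feature_map[i][j]
                for i in range(top, bottom)
                for j in range(left, right)
            ]
            pooled_row.append(reduce_fn(region))
        pooled.append(pooled_row)
    return pooled


small = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
large = [[6 * i + j for j in range(6)] for i in range(4)]  # 4 x 6 input

print(pool_to_fixed_grid(small, 2, 2))  # [[5, 6], [8, 9]]
print(pool_to_fixed_grid(large, 2, 2))  # [[8, 11], [20, 23]]
```

Both inputs, despite their different sizes, yield a $2 \times 2$ output that a fully connected layer of fixed input size could consume.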

Pooling may be applied spatially, along the same feature map, or across feature maps that resulted from different kernels applied to the previous layer. If applied along the same feature map, pooling has the net effect of subsampling to remove irrelevant units and retaining important units for the next convolutional layer. If applied across feature maps from multiple kernels, pooling effectively shortlists the kernels that generated output relevant to the next convolutional layer.

Typically, the convolutional layer is followed by a nonlinear activation function such as the rectified linear unit (ReLU). In the context of CNNs, this intermediate layer is known as the detector stage, which introduces nonlinearity to the linear activations produced by the convolutional kernel before pooling is applied.
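
As a small illustration of the detector stage, the sketch below applies ReLU elementwise to a feature map of linear activations. The input values are made up for the example.

```python
def relu(feature_map):
    """Rectified linear unit applied to every unit: max(0, value)."""
    return [[max(0, value) for value in row] for row in feature_map]


linear_activations = [[-2, 3],
                      [0.5, -1]]
detected = relu(linear_activations)
print(detected)  # [[0, 3], [0.5, 0]]
```

Negative activations are set to zero and positive ones pass through unchanged; the result is what the pooling layer would then summarize.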

## Training

Training a CNN is no different from training any deep feedforward network. After defining a task-specific loss function such as cross-entropy, the kernels of the model are fit using stochastic gradient descent, with gradients computed by the backpropagation algorithm. Hyperparameters, such as the number of layers, kernel dimensions, or pooling layer sizes, are typically tuned by cross-validation.
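
The following is a deliberately simplified sketch of fitting a kernel by stochastic gradient descent, not a full CNN training loop. It assumes a single linear convolutional layer (in cross-correlation form) with one $2 \times 2$ kernel, a squared-error loss in place of cross-entropy, and synthetic targets generated from a hidden kernel. The gradient is written out by hand; for this one layer it is what backpropagation would compute. All names, sizes, and the learning rate are illustrative assumptions.

```python
import random


def cross_correlate2d(X, W):
    """Slide kernel W over X (no padding) and return the feature map."""
    P, Q = len(X), len(X[0])
    M, N = len(W), len(W[0])
    return [
        [
            sum(X[a + m][b + n] * W[m][n] for m in range(M) for n in range(N))
            for b in range(Q - N + 1)
        ]
        for a in range(P - M + 1)
    ]


def loss_and_gradient(X, W, target):
    """Mean squared error of the feature map and its gradient w.r.t. W."""
    M, N = len(W), len(W[0])
    output = cross_correlate2d(X, W)
    count = len(output) * len(output[0])
    loss = 0.0
    grad = [[0.0] * N for _ in range(M)]
    for a, row in enumerate(output):
        for b, value in enumerate(row):
            error = value - target[a][b]
            loss += error * error / count
            for m in range(M):
                for n in range(N):
                    # Because the kernel is shared, every output unit
                    # contributes to the gradient of the same weight.
                    grad[m][n] += 2 * error * X[a + m][b + n] / count
    return loss, grad


def average_loss(images, targets, W):
    return sum(loss_and_gradient(X, W, T)[0]
               for X, T in zip(images, targets)) / len(images)


random.seed(0)
true_kernel = [[1.0, -1.0], [0.5, 2.0]]  # hidden kernel to be recovered
images = [[[random.uniform(-1, 1) for _ in range(6)] for _ in range(6)]
          for _ in range(20)]
targets = [cross_correlate2d(X, true_kernel) for X in images]

W = [[0.0, 0.0], [0.0, 0.0]]  # kernel being learned
learning_rate = 0.1
initial_loss = average_loss(images, targets, W)

for epoch in range(50):
    order = list(range(len(images)))
    random.shuffle(order)
    for idx in order:  # one image per update: stochastic gradient descent
        _, grad = loss_and_gradient(images[idx], W, targets[idx])
        for m in range(2):
            for n in range(2):
                W[m][n] -= learning_rate * grad[m][n]

final_loss = average_loss(images, targets, W)
print(initial_loss, final_loss)
print(W)  # should be close to true_kernel
```

In a real network the same update is applied to every kernel in every layer, with an automatic differentiation library computing the gradients instead of the hand-written loop.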