Convolutional neural networks (CNN)



Motivated by human vision and designed to be robust to transformations such as scaling, translation, and rotation, convolutional neural networks (CNNs) have become the de facto standard for computer vision tasks over the past decade. In fact, much of the current deep learning revolution started with deep convolutional networks setting new state-of-the-art records on major image recognition tasks. That said, CNNs and their variants are also widely used in applications beyond computer vision.



Image classification using MLPs

Consider an example image recognition task: recognizing handwritten digits in pictures.

Suppose we try to address this challenge with a multilayer perceptron (MLP) model. A simple implementation will involve pixel-level inputs and 10 outputs, one for each digit, with several intermediate hidden layers consisting of multiple nodes. A pixel-level input to the MLP means that, for an \( N \times N \) pixel image, the input layer has \( N \times N = N^2 \) units. Training such a model is straightforward.
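To make the size of such a model concrete, the following sketch counts the weights and biases of a fully connected network on a flattened image. The image side length (28) and the hidden layer widths (128 and 64) are illustrative assumptions, not values given in the text.

```python
def mlp_parameter_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, input layer first.
    """
    total = 0
    for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weight matrix plus bias vector
    return total


N = 28  # assumed image side length
# N * N pixel inputs, two assumed hidden layers, 10 digit outputs
layer_sizes = [N * N, 128, 64, 10]
print(mlp_parameter_count(layer_sizes))  # 109386
```

Most of these parameters sit in the first layer, whose size grows with the number of pixels.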

The challenge with images

But the MLP described above, albeit simple, suffers from a major limitation: it is not invariant to transformations of the images. Let's see what that means.

Note that the meaning of an image is invariant to several transformations.

  • Scaling a digit within an image does not change the digit it represents. Even if we scale every image to the same \( N \times N \) pixels, our MLP will be ineffective if the object within is not scaled to the same proportion.
  • Translation, that is, up/down/left/right movement of the object within an image, retains the same meaning for the object.
  • Minor rotation of the image, say less than 30 degrees in either direction, does not modify the object.
  • A mirror image of an object stays the same kind of object and should not affect the classification performance.

The MLP for image recognition that we described earlier may not be able to accurately identify the digit in the presence of scaling, translation, or minor rotation. That is because the MLP learns a map from the input to the output in which each part of the object must occur at a fixed pixel location. To be invariant, we would have to train the MLP with all possible transformations and combinations thereof, at all pixels in the image. This is infeasible for most meaningful image recognition tasks.
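The dependence on fixed pixel locations can be illustrated with a minimal sketch. The 4 × 4 image, the helper names, and the hand-set weights of a single hidden unit are all hypothetical; a trained MLP would have many such units.

```python
def flatten(image):
    """Turn a 2D grid (list of rows) into the 1D input vector an MLP sees."""
    return [pixel for row in image for pixel in row]


def shift_right(image, k):
    """Translate the image k pixels to the right, padding with zeros."""
    return [[0] * k + row[:len(row) - k] for row in image]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# A vertical stroke in column 1 of a 4 x 4 image.
image = [
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]
shifted = shift_right(image, 2)  # the same stroke, now in column 3

# Hypothetical weights of one hidden unit tuned to the stroke in column 1.
weights = flatten(image)

print(dot(weights, flatten(image)))    # 4: strong response to the original
print(dot(weights, flatten(shifted)))  # 0: no response after translation
```

The stroke is unchanged as an object, yet the unit's response vanishes because its weights are tied to specific pixel positions.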

Overcoming the challenge

What we need is a principled way of incorporating invariance arising from scaling, translation, and rotation, into the model itself. To do so, we first identify an important property of images.

Images are composed of correlated patches: there is a stronger correlation among neighboring pixels than among distant pixels. In the digit recognition example, in spite of transformations such as scaling, translation, or minor rotation, the nearness property applies before and after the transformation; pixels that were close to each other before the transformation remain closer after it than pixels that were distant.

This means we need a model that can exploit the locality within the image by abstracting information from small subregions of the image, for example borders, color patches, and basic shapes. In a multilayered approach, we can then build layers of such abstraction to compose increasingly higher-level objects, finally resulting in the desired output.

A convolutional neural network incorporates these ideas through local receptive fields, weight sharing, and subsampling.


CNNs are a specialized form of deep feedforward networks. Starting at the input layer, they are composed of multiple alternating convolutional and subsampling layers, finally followed by an output layer that is task dependent.

The convolutional layer is organized into planes, each known as a feature map. Each feature map is composed of units. Each unit receives input from a small subregion of the input image, known as its receptive field. For example, a feature map of \( N^2 \) units may be arranged in an \( N \times N \) grid, where each unit receives input from an \( M \times M \) pixel patch of the input.

The value of the unit is a simple weighted sum of the inputs feeding into that unit. For example, for a unit with an \( M \times M \) receptive field, we have \( M \times M \) weights, all of which are learnable parameters. In CNNs, all units of a feature map share the same weight matrix, known as the kernel of that feature map.
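As a minimal sketch of a single feature map, the code below computes each unit as the weighted sum of its receptive field using one shared kernel. The example image, the kernel values, zero-based indexing, and anchoring each patch at its top-left pixel are assumptions made for illustration.

```python
def unit_value(image, kernel, i, j):
    """Weighted sum of the M x M receptive field whose top-left pixel is (i, j)."""
    M = len(kernel)
    return sum(
        image[i + m][j + n] * kernel[m][n]
        for m in range(M)
        for n in range(M)
    )


image = [
    [1, 2, 3, 0],
    [4, 5, 6, 0],
    [7, 8, 9, 0],
    [0, 0, 0, 0],
]
# One shared 2 x 2 kernel: M * M = 4 learnable weights for the whole feature map.
kernel = [
    [1, 0],
    [0, -1],
]

# Every unit of the 3 x 3 feature map reuses the same kernel on a different patch.
feature_map = [[unit_value(image, kernel, i, j) for j in range(3)] for i in range(3)]
print(feature_map)  # [[-4, -4, 3], [-4, -4, 6], [7, 8, 9]]
```

Without padding, a 4 × 4 input and a 2 × 2 kernel leave room for a 3 × 3 grid of units.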

Feature maps feed into a subsampling layer that reduces the dimensionality of its input for subsequent layers. Subsampling reduces an \( L \times L \) patch of the feature map to a single number, for instance by computing the average, the maximum, or some other summary statistic, depending on the predictive task. In the context of CNNs, the subsampling operation is known as pooling, often prefixed with the type of pooling: max-pooling, average-pooling, etc. Subsampling layers have no learnable parameters; the size of the patch being sampled is a hyperparameter, typically tuned by cross-validation.
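The subsampling step can be sketched as follows. The non-overlapping patches (stride equal to the patch size) and the example values are assumptions; the text does not specify how patches are laid out.

```python
def pool(feature_map, size, reduce_fn):
    """Replace each non-overlapping size x size patch with reduce_fn(patch values)."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for i in range(0, rows - size + 1, size):
        pooled_row = []
        for j in range(0, cols - size + 1, size):
            patch = [feature_map[i + a][j + b] for a in range(size) for b in range(size)]
            pooled_row.append(reduce_fn(patch))
        pooled.append(pooled_row)
    return pooled


def mean(values):
    return sum(values) / len(values)


feature_map = [
    [1, 3, 2, 0],
    [5, 7, 1, 1],
    [0, 2, 4, 8],
    [2, 0, 6, 6],
]
print(pool(feature_map, 2, max))   # max-pooling: [[7, 2], [2, 8]]
print(pool(feature_map, 2, mean))  # average-pooling: [[4.0, 1.0], [1.0, 6.0]]
```

Passing the reduction as a function keeps the patch traversal identical across pooling types.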

The final subsampling layer feeds into a fully connected output layer, as in the case of an MLP, to arrive at an output that is relevant to the task, such as classification or regression.

Some modern variants, known as fully convolutional networks, consist of only convolutional layers as intermediate layers, completely avoiding subsampling layers.

The convolutional kernel

The shared kernel results in an elegant mathematical operation from the input to the feature map. With each unit in a feature map taking the value of a small receptive field in the input weighted by the kernel, the kernel may be imagined to be sliding along the input, each time producing the value of one unit in the feature map. Consider an input image \( \mX \) of \( P \times Q \) pixels and a kernel \( \mW \) of size \( M \times N \). The value of a unit \( u_{ij} \) in the feature map is calculated as a weighted sum of the inputs contained in a patch of size \( M \times N \), the same as the kernel, with the patch starting at the pixel coordinates \( (i-M, j-N) \) and ending at the pixel coordinates \( (i-1,j-1) \). Mathematically, the calculation is expressed as

$$ u_{ij} = \sum_{m=1}^{M} \sum_{n=1}^N \mX_{i-m,j-n} \mW_{mn} $$

This specialized operation is a well-known mathematical operation known as the convolution operation, commonly denoted with the symbol \( * \), as

\begin{equation} u_{ij} = \left(\mW * \mX\right) (i,j) = \sum_{m=1}^{M} \sum_{n=1}^N \mX_{i-m,j-n} \mW_{mn} \label{eqn:convolution-op} \end{equation}

It is owing to this convolutional operator that the resulting layer is known as a convolutional layer and the overall network is known as a convolutional neural network.
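The convolution sum can be sketched directly in code. This version uses zero-based indices, so the sums run over \( m = 0, \ldots, M-1 \) and \( n = 0, \ldots, N-1 \), and it evaluates only positions where the patch lies fully inside the image. Both choices are assumptions, since the text does not specify border handling.

```python
def convolve2d(X, W):
    """'Valid' 2D convolution: u[i][j] = sum over m, n of X[i-m][j-n] * W[m][n]."""
    P, Q = len(X), len(X[0])
    M, N = len(W), len(W[0])
    out = []
    for i in range(M - 1, P):
        row = []
        for j in range(N - 1, Q):
            row.append(sum(
                X[i - m][j - n] * W[m][n]
                for m in range(M)
                for n in range(N)
            ))
        out.append(row)
    return out


X = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
W = [
    [1, 0],
    [0, -1],
]
print(convolve2d(X, W))  # [[4, 4], [4, 4]]
```

Note that the image index decreases as the kernel index increases, which is what distinguishes convolution from a plain sliding weighted sum.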

Mathematically, the convolution operation is defined as above, but many machine learning and deep learning libraries implement a closely related function, the cross-correlation, as a surrogate for convolution (still referring to it as convolution!), calculated as

\begin{equation} u_{ij} = \left(\mW * \mX\right) (i,j) = \sum_{m=1}^{M} \sum_{n=1}^N \mX_{i+m,j+n} \mW_{mn} \label{eqn:cross-correlation} \end{equation}

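That said, both have the net effect of sliding the kernel along an input plane to produce a feature map. Because the kernel weights are learned, the choice rarely matters in practice: cross-correlation with a kernel gives the same result as convolution with that kernel flipped along both axes, so training simply learns the flipped weights.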
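The relationship between the two operations can be checked with a small sketch: cross-correlating with a kernel matches convolving with that kernel flipped along both axes. Zero-based indexing, "valid" positions only, and the example values are assumptions.

```python
def cross_correlate2d(X, W):
    """'Valid' cross-correlation: u[i][j] = sum over m, n of X[i+m][j+n] * W[m][n]."""
    M, N = len(W), len(W[0])
    return [
        [sum(X[i + m][j + n] * W[m][n] for m in range(M) for n in range(N))
         for j in range(len(X[0]) - N + 1)]
        for i in range(len(X) - M + 1)
    ]


def convolve2d(X, W):
    """'Valid' convolution: u[i][j] = sum over m, n of X[i-m][j-n] * W[m][n]."""
    M, N = len(W), len(W[0])
    return [
        [sum(X[i - m][j - n] * W[m][n] for m in range(M) for n in range(N))
         for j in range(N - 1, len(X[0]))]
        for i in range(M - 1, len(X))
    ]


def flip(W):
    """Reverse the kernel along both axes."""
    return [row[::-1] for row in W[::-1]]


X = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
W = [
    [1, 2],
    [3, 4],
]
print(cross_correlate2d(X, W))                            # [[37, 47], [67, 77]]
print(cross_correlate2d(X, W) == convolve2d(X, flip(W)))  # True
```

For an asymmetric kernel such as this one, convolving with the unflipped kernel gives a different feature map.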

Benefits of parameter sharing

As explained in the previous section, all units of a feature map share the same weight matrix, known as a kernel. This parameter sharing, or weight sharing, is necessary for multiple reasons.

  • If each of the \( N^2 \) units in a feature map had its own \( M \times M \) weight matrix \( \mW \), we would end up with \( N^2M^2 \) parameters for a single feature map, compared with only \( M^2 \) when the kernel is shared! In a deeper architecture with several convolutional layers, that would be a prohibitively large number of parameters to train, leading to limited generalization.
  • Making a kernel smaller than the input typically results in sparse interactions or sparse connectivity. For example, an input may have millions of pixels, but the model may only need to focus on a small subset of those pixels, the edges or relevant patches, ignoring the rest. A local receptive field imposed by a kernel followed by subsampling ensures that such relevant information passes through to the feature map for more efficient predictive capability.
  • Moreover, without weight sharing, an object that moves to another spot within an image may no longer be recognizable, because it is no longer transformed by the same weight matrix, resulting in a completely different feature map. The property of a shared kernel that ensures this translational robustness is known as equivariance. Mathematically, a function \( f(x) \) is said to be equivariant to a function \( g(x) \) if \( f(g(x)) = g(f(x)) \). The convolution operation described earlier is equivariant to the translation function.
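The equivariance property \( f(g(x)) = g(f(x)) \) from the last point can be checked numerically. In this sketch, \( f \) is a "valid" cross-correlation and \( g \) shifts a grid one pixel to the right with zero padding. The image, kernel, and helper names are assumptions, and the object is kept away from the image border so that edge effects do not interfere.

```python
def cross_correlate2d(X, W):
    """'Valid' cross-correlation with a shared kernel W."""
    M, N = len(W), len(W[0])
    return [
        [sum(X[i + m][j + n] * W[m][n] for m in range(M) for n in range(N))
         for j in range(len(X[0]) - N + 1)]
        for i in range(len(X) - M + 1)
    ]


def shift_right(grid, k):
    """Translate a grid k positions to the right, padding with zeros."""
    return [[0] * k + row[:len(row) - k] for row in grid]


image = [
    [0, 0, 0, 0, 0],
    [0, 1, 2, 0, 0],
    [0, 3, 4, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
kernel = [
    [1, 0],
    [0, 1],
]

map_of_shifted = cross_correlate2d(shift_right(image, 1), kernel)  # f(g(x))
shifted_map = shift_right(cross_correlate2d(image, kernel), 1)     # g(f(x))
print(map_of_shifted == shifted_map)  # True
```

The feature map itself does change when the object moves; it moves along with the object, which is equivariance rather than invariance.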


In conventional CNNs, the feature map from the convolutional layer is subsampled in a pooling layer before being passed on to the next convolutional layer.

The pooling layer replaces a small patch in the feature map with a summary statistic of that patch. For example, the popular max-pooling layer reduces the input patch to a single value, the maximum of all values within that patch. Alternative pooling strategies involve taking the average, weighted average, or \( L_2 \) norm of the patch as a subsampling technique.

The primary goal of the pooling layer is to make the model invariant to small translations of the input. A model is said to be invariant to translation if its outputs do not change when the input is translated by a small amount. Moreover, subsampling also results in computational efficiency, as the next convolutional layer has a much smaller input to deal with.
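The following sketch illustrates this invariance with max-pooling over non-overlapping 2 × 2 patches. The feature map values and helper names are assumptions. The invariance is only approximate: it holds here because a one-pixel shift keeps each strong activation inside its original pooling patch.

```python
def max_pool(feature_map, size):
    """Maximum over non-overlapping size x size patches."""
    return [
        [max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
         for j in range(0, len(feature_map[0]) - size + 1, size)]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


def shift_right(grid, k):
    """Translate a grid k positions to the right, padding with zeros."""
    return [[0] * k + row[:len(row) - k] for row in grid]


feature_map = [
    [9, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 5, 0],
    [0, 0, 0, 0],
]

print(max_pool(feature_map, 2))                  # [[9, 0], [0, 5]]
print(max_pool(shift_right(feature_map, 1), 2))  # [[9, 0], [0, 5]]: unchanged
print(max_pool(shift_right(feature_map, 2), 2))  # [[0, 9], [0, 0]]: a larger shift changes the output
```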

Sometimes, pooling is essential in tasks that have to deal with variable input size. For example, the output layer in a CNN for classification is typically a fully connected layer that requires a fixed input size. If we are dealing with images of variable resolution or size, then we need intermediate pooling layers to reduce the image to a fixed size before sending it to the output layer. This can be achieved by requiring the final pooling layer to produce a fixed number of outputs, by dividing its input into a corresponding number of regions and delivering a summary statistic for each such region.
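A minimal sketch of such fixed-size pooling follows. The function name `adaptive_max_pool` and the rule for choosing region boundaries (floor for the start, ceiling for the end) are assumptions made for illustration; the input is assumed to be at least as large as the requested output.

```python
def adaptive_max_pool(feature_map, out_rows, out_cols):
    """Max-pool an input of any size down to a fixed out_rows x out_cols grid."""
    rows, cols = len(feature_map), len(feature_map[0])

    def bounds(index, size, parts):
        start = (index * size) // parts
        end = -((-(index + 1) * size) // parts)  # ceiling of (index + 1) * size / parts
        return start, end

    pooled = []
    for i in range(out_rows):
        r0, r1 = bounds(i, rows, out_rows)
        pooled_row = []
        for j in range(out_cols):
            c0, c1 = bounds(j, cols, out_cols)
            pooled_row.append(max(
                feature_map[r][c] for r in range(r0, r1) for c in range(c0, c1)
            ))
        pooled.append(pooled_row)
    return pooled


small = [
    [1, 0, 2, 5, 0, 3],
    [4, 1, 0, 0, 7, 1],
]
large = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
print(adaptive_max_pool(small, 2, 2))  # [[2, 5], [4, 7]]
print(adaptive_max_pool(large, 2, 2))  # [[6, 8], [14, 16]]
```

Both inputs, although of different shapes, yield a 2 × 2 output that a fully connected layer with four inputs could consume.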

Pooling may be applied spatially, along the same feature map, or across feature maps that resulted from different kernels applied to the previous layer. If applied along the same feature map, pooling has the net effect of subsampling to remove irrelevant units and retaining important units for the next convolutional layer. If applied across feature maps from multiple kernels, pooling effectively shortlists the kernels that generated output relevant to the next convolutional layer.

Sometimes, the convolutional layer is followed by a nonlinear activation function such as the rectified linear unit (ReLU). In the context of CNNs, this intermediate layer is known as the detector stage; it introduces nonlinearity to the linear activations produced by the convolutional kernel, before the application of pooling.
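Putting the three stages together, a single block of a CNN can be sketched as a cross-correlation, followed by ReLU, followed by max-pooling. The image, kernel, pooling size, and helper names are illustrative assumptions.

```python
def cross_correlate2d(X, W):
    """'Valid' cross-correlation with a shared kernel W."""
    M, N = len(W), len(W[0])
    return [
        [sum(X[i + m][j + n] * W[m][n] for m in range(M) for n in range(N))
         for j in range(len(X[0]) - N + 1)]
        for i in range(len(X) - M + 1)
    ]


def relu(feature_map):
    """Detector stage: clamp negative activations to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size):
    """Maximum over non-overlapping size x size patches."""
    return [
        [max(feature_map[i + a][j + b] for a in range(size) for b in range(size))
         for j in range(0, len(feature_map[0]) - size + 1, size)]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


image = [
    [1, 2, 0, 1, 3],
    [0, 1, 3, 2, 0],
    [2, 0, 1, 0, 1],
    [1, 3, 0, 2, 2],
    [0, 1, 2, 1, 0],
]
kernel = [
    [1, -1],
    [-1, 1],
]

linear = cross_correlate2d(image, kernel)  # 4 x 4 linear activations
detected = relu(linear)                    # negatives removed
pooled = max_pool(detected, 2)             # 2 x 2 summary
print(pooled)  # [[4, 3], [4, 3]]
```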


Training a CNN is no different from training any deep feedforward network. After defining a task-specific loss function such as cross-entropy, the kernels of the model are fit using stochastic gradient descent, with the backpropagation algorithm computing the gradients. Hyperparameters, such as the number of layers, kernel dimensions, or pooling layer sizes, are typically tuned by cross-validation.
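As a heavily simplified sketch of fitting a kernel by gradient descent, the code below learns a single 2 × 2 kernel so that its feature map matches a target feature map. Several assumptions depart from the general recipe above: the loss is mean squared error rather than cross-entropy, the gradient is derived by hand for this one-layer case instead of using backpropagation through a deep network, the update uses the full (single-image) batch rather than stochastic mini-batches, and the image, target kernel, learning rate, and step count are arbitrary choices.

```python
def cross_correlate2d(X, W):
    """'Valid' cross-correlation with a shared kernel W."""
    M, N = len(W), len(W[0])
    return [
        [sum(X[i + m][j + n] * W[m][n] for m in range(M) for n in range(N))
         for j in range(len(X[0]) - N + 1)]
        for i in range(len(X) - M + 1)
    ]


def mse(U, T):
    """Mean squared error between two feature maps of equal shape."""
    diffs = [(u - t) ** 2 for row_u, row_t in zip(U, T) for u, t in zip(row_u, row_t)]
    return sum(diffs) / len(diffs)


def kernel_gradient(X, W, T):
    """Gradient of the mean squared error with respect to each kernel weight."""
    U = cross_correlate2d(X, W)
    count = len(U) * len(U[0])
    return [
        [sum(2 * (U[i][j] - T[i][j]) * X[i + m][j + n]
             for i in range(len(U)) for j in range(len(U[0]))) / count
         for n in range(len(W[0]))]
        for m in range(len(W))
    ]


X = [
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]
true_kernel = [[1, 0], [0, -1]]  # assumed ground truth, used only to generate targets
T = cross_correlate2d(X, true_kernel)

W = [[0.0, 0.0], [0.0, 0.0]]  # initial kernel
initial_loss = mse(cross_correlate2d(X, W), T)
learning_rate = 0.2
for step in range(500):
    grad = kernel_gradient(X, W, T)
    W = [[W[m][n] - learning_rate * grad[m][n] for n in range(2)] for m in range(2)]
final_loss = mse(cross_correlate2d(X, W), T)
print(initial_loss, final_loss)
```

Because every unit shares the kernel, the gradient for each weight sums contributions from all positions in the feature map.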
