Consider the computer vision task of recognizing handwritten digits: given an image of a handwritten digit, the model must accurately identify which digit the image shows.
Suppose we try to address this challenge with a multilayer perceptron (MLP) model.
A simple implementation would use pixel-level inputs and 10 outputs, one per digit, with several hidden layers of multiple nodes in between.
Pixel-level input means that, for an \( N \times N \) pixel image, the flattened input layer has \( N \times N \) (that is, \( N^2 \)) units.
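As a concrete illustration, here is a minimal sketch of such an MLP, assuming PyTorch; the \( 28 \times 28 \) input size (as in MNIST) and the hidden-layer widths are illustrative assumptions, not prescriptions.

```python
# A minimal MLP sketch for digit recognition; sizes are illustrative.
import torch
import torch.nn as nn

N = 28  # assumed image side length, e.g. MNIST

mlp = nn.Sequential(
    nn.Flatten(),            # N x N image -> N*N pixel-level inputs
    nn.Linear(N * N, 128),   # first hidden layer
    nn.ReLU(),
    nn.Linear(128, 64),      # second hidden layer
    nn.ReLU(),
    nn.Linear(64, 10),       # 10 outputs, one per digit
)

logits = mlp(torch.randn(1, 1, N, N))  # forward pass on one random "image"
print(logits.shape)  # torch.Size([1, 10])
```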
Training such a model is straightforward, but the resulting model is not invariant to transformations of the input.
Note that the identity of a digit, handwritten or otherwise, is invariant under several transformations (a short sketch after the following list illustrates them):
- Scaling a digit within an image does not change the digit it represents.
- Translation, moving the digit up, down, left, or right within the image, preserves its meaning.
- Minor rotation of the image, say less than 30 degrees in either direction, does not modify the digit.
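The sketch below applies these three transformations, assuming NumPy and SciPy; the synthetic stroke pattern and the parameter values are made up for illustration.

```python
# Applying the three transformations to a synthetic 28 x 28 "digit".
import numpy as np
from scipy import ndimage

image = np.zeros((28, 28))
image[8:20, 12:16] = 1.0  # a crude vertical stroke, standing in for a "1"

scaled = ndimage.zoom(image, 1.2)                         # scaling (array grows to ~34 x 34)
translated = ndimage.shift(image, shift=(3, -2))          # move down 3 px, left 2 px
rotated = ndimage.rotate(image, angle=15, reshape=False)  # minor rotation

# A human reads the same digit in all four arrays; an MLP tied to fixed
# pixel locations receives four very different input vectors.
```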
The MLP for digit recognition that we described earlier may not be able to accurately identify the digit in the presence of scaling, translation, or minor rotation.
That is because the MLP learns a map from input to output in which each input feature must occur at a fixed pixel location.
To make it invariant, we would have to train the MLP on all possible transformations, and combinations thereof, at every position in the image.
This could be infeasible for digit recognition, let alone for other image recognition tasks.
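The sensitivity to fixed locations is easy to quantify. The sketch below, assuming NumPy and the same synthetic stroke as above, shifts the image by two pixels and measures how much of the flattened input stays in place.

```python
# How much of the MLP's input survives a 2-pixel translation?
import numpy as np

image = np.zeros((28, 28))
image[8:20, 12:16] = 1.0              # synthetic 4-pixel-wide stroke
shifted = np.roll(image, 2, axis=1)   # translate two pixels to the right

x, x_shifted = image.ravel(), shifted.ravel()
overlap = np.dot(x, x_shifted) / np.dot(x, x)
print(f"input overlap after a 2-pixel shift: {overlap:.2f}")  # 0.50
# Half the active inputs have moved to different coordinates, so the MLP
# sees a substantially different input vector for the very same digit.
```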
What we need is a principled way of incorporating invariance to scaling, translation, and rotation into the model itself.
To do so, we first identify an important property of images.
Images are composed of correlated patches: neighboring pixels are more strongly correlated than distant ones.
In the digit recognition example, this nearness property is preserved under transformations such as scaling, translation, or minor rotation:
pixels that were close before the transformation remain close after it, relative to pixels that were distant.
This means we need a model that can exploit locality within the image by abstracting information from small subregions, for example borders, color patches, and basic shapes.
In a multilayered approach, we can then stack such abstractions layer by layer to compose increasingly higher-level objects, finally producing the desired output.
A convolutional neural network incorporates these ideas through local receptive fields, weight sharing, and subsampling.
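To make this concrete, here is a minimal CNN sketch in the same setting, again assuming PyTorch; the kernel sizes, channel counts, and \( 28 \times 28 \) input are illustrative assumptions.

```python
# A small CNN with local receptive fields, weight sharing, and subsampling.
import torch
import torch.nn as nn

cnn = nn.Sequential(
    # Local receptive field + weight sharing: one 5x5 kernel is slid over
    # the whole image, so a stroke is detected wherever it occurs.
    nn.Conv2d(1, 8, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),  # subsampling: tolerance to small shifts (28 -> 14)
    nn.Conv2d(8, 16, kernel_size=5, padding=2),
    nn.ReLU(),
    nn.MaxPool2d(2),  # 14 -> 7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),  # 10 outputs, one per digit
)

logits = cnn(torch.randn(1, 1, 28, 28))
print(logits.shape)  # torch.Size([1, 10])
```

Unlike the MLP above, shifting the stroke by a few pixels changes which output location of the feature map fires, not whether it fires, and the pooling layers absorb much of that remaining positional change.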