Intuition
CNNs are a specialized form of deep feedforward networks.
Starting at the input layer, they are composed of multiple alternating convolutional and subsampling layers, finally followed by an output layer that is task dependent.
The convolutional layer is organized into planes, each known as a feature map.
Each feature map is composed of units. Each such unit receives input from a small subregion of the input image, known as its receptive field.
For example, a feature map with \( N^2 \) units may be arranged as an \( N \times N \) grid, where each unit receives input from an \( M \times M \) pixel patch of the input.
The value of a unit is a simple weighted sum of the inputs in its receptive field.
For example, a unit with an \( M \times M \) receptive field has \( M \times M \) weights, each of which is a learnable parameter.
In CNNs, all units of a feature map share the same weight matrix, known as the kernel of that feature map.
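The weight sharing described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a reference implementation: the function name `feature_map`, the stride of 1, the absence of padding, and the averaging kernel are all assumptions made for the example. Under those assumptions, an \( H \times W \) image and an \( M \times M \) kernel give a feature map of \( (H - M + 1) \times (W - M + 1) \) units.

```python
import numpy as np

def feature_map(image, kernel):
    """Compute one feature map by sliding a single shared kernel over the image.

    Assumes stride 1 and no padding (choices made for this sketch only).
    """
    M = kernel.shape[0]
    rows = image.shape[0] - M + 1
    cols = image.shape[1] - M + 1
    out = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = image[i:i + M, j:j + M]      # receptive field of unit (i, j)
            out[i, j] = np.sum(patch * kernel)   # weighted sum using the shared weights
    return out

image = np.arange(25, dtype=float).reshape(5, 5)  # toy 5 x 5 "image"
kernel = np.ones((3, 3)) / 9.0                    # hypothetical 3 x 3 averaging kernel
fmap = feature_map(image, kernel)
print(fmap.shape)  # (3, 3)
```

Every unit in `fmap` is computed with the same nine weights, so the feature map has only \( M \times M = 9 \) learnable parameters regardless of how many units it contains.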
Feature maps feed into a subsampling layer, which reduces the dimensionality of its input for subsequent layers.
Subsampling reduces an \( L \times L \) patch of the feature map to a single number, for example by computing the average, maximum, or another summary statistic of the patch, depending on the predictive task.
In the context of CNNs, the subsampling operation is known as pooling, often prefixed with the type of pooling, as in max-pooling or average-pooling.
Subsampling layers have no learnable parameters. The size of the patch being sampled is a hyperparameter, typically tuned by cross-validation.
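As a rough illustration of pooling, the sketch below reduces each non-overlapping \( L \times L \) patch of a feature map to a single number. The helper name `pool`, the use of non-overlapping patches, and the decision to drop trailing rows and columns that do not fill a whole patch are assumptions made for this example.

```python
import numpy as np

def pool(fmap, L, reduce_fn=np.max):
    """Reduce each non-overlapping L x L patch of fmap to a single number."""
    rows, cols = fmap.shape[0] // L, fmap.shape[1] // L
    # Assumption: drop trailing rows/columns that do not fill a whole patch.
    trimmed = fmap[:rows * L, :cols * L]
    # Axes 1 and 3 index positions inside each L x L patch.
    patches = trimmed.reshape(rows, L, cols, L)
    return reduce_fn(patches, axis=(1, 3))

fmap = np.array([[ 1.,  2.,  5.,  6.],
                 [ 3.,  4.,  7.,  8.],
                 [ 9., 10., 13., 14.],
                 [11., 12., 15., 16.]])

max_pooled = pool(fmap, 2, np.max)    # max-pooling:     [[4, 8], [12, 16]]
avg_pooled = pool(fmap, 2, np.mean)   # average-pooling: [[2.5, 6.5], [10.5, 14.5]]
```

The function holds no weights. The only choices are the patch size `L` and the reduction function, both of which are fixed before training.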
The final subsampling layer feeds into a fully-connected output layer, as in an MLP, which produces an output relevant to the task, such as classification or regression.
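Putting the pieces together, a forward pass through one convolutional layer, one pooling layer, and a fully-connected output layer might look like the sketch below. Everything specific here is an assumption for illustration: the helper names, the random weights, the four \( 3 \times 3 \) kernels, the \( 2 \times 2 \) max-pooling, and the softmax output for a hypothetical three-class problem. Following the description above, the units are plain weighted sums with no nonlinearity applied.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(image, kernel):
    """One feature map: shared M x M kernel, stride 1, no padding."""
    M = kernel.shape[0]
    rows, cols = image.shape[0] - M + 1, image.shape[1] - M + 1
    return np.array([[np.sum(image[i:i + M, j:j + M] * kernel)
                      for j in range(cols)] for i in range(rows)])

def max_pool(fmap, L):
    """Max over non-overlapping L x L patches."""
    rows, cols = fmap.shape[0] // L, fmap.shape[1] // L
    return fmap[:rows * L, :cols * L].reshape(rows, L, cols, L).max(axis=(1, 3))

def forward(image, kernels, W, b):
    # Convolutional layer followed by subsampling, one feature map per kernel.
    pooled = [max_pool(conv2d(image, k), 2) for k in kernels]
    # Flatten all pooled maps into one vector for the fully-connected layer.
    features = np.concatenate([p.ravel() for p in pooled])
    logits = W @ features + b
    # Softmax output, assuming a classification task.
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

image = rng.random((8, 8))                # toy 8 x 8 input
kernels = rng.standard_normal((4, 3, 3))  # 4 feature maps, each with a 3 x 3 kernel
# Conv gives 6 x 6 maps, pooling gives 3 x 3 maps, so 4 * 9 = 36 features.
W = rng.standard_normal((3, 36)) * 0.1    # fully-connected weights for 3 classes
b = np.zeros(3)
probs = forward(image, kernels, W, b)
print(probs.shape)  # (3,)
```

For a regression task, the softmax step would typically be dropped and the weighted sum used directly as the output.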
Some modern variants, known as fully convolutional networks, consist only of convolutional layers as intermediate layers, completely avoiding subsampling layers.