Introduction to deep feedforward networks


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)


Introduction

Multilayer perceptrons, also known as deep feedforward networks, are the most basic of deep neural networks. Their name arises from their networked design consisting of multiple layers of perceptrons resulting in one-directional flow of inputs — forward — through the model towards the final output. To appreciate more evolved deep neural networks such as convolutional neural networks or recurrent neural networks, it is crucial to first thoroughly understand deep feedforward networks.

Prerequisites

To understand multilayer perceptrons, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

Deep feedforward networks can be used for many predictive tasks, including, but not limited to, classification and regression.

In classification, the goal of the predictive model is to identify the class that generated a particular instance. In regression, the model is required to predict a continuous valued output for a given multivariate instance.

Consider an instance $ \vx \in \real^\ndim $, a vector consisting of $ \ndim $ features, $\vx = [x_1, x_2, \ldots, x_\ndim] $. In classification, the predictive model needs to assign the instance to one of the $ \nclass $ classes $ C_1, C_2, \ldots, C_\nclass $ depending on the values of the $ \ndim $ features. In regression, we need to predict a real-valued output $ \hat{y} \in \real $ that is as close as possible to the true target $ y \in \real $. The hat $ \hat{ } $ denotes that $ \hat{y} $ is an estimate, to distinguish it from the truth.

For both supervised learning settings, the predictive model is inferred over a collection of labeled observations provided as tuples $ (\vx_i,y_i) $ containing the instance vector $ \vx_i $ and the true target variable $ y_i $. For classification tasks, $ y_i \in \set{C_1,\ldots,C_\nclass} $ and in regression tasks, $ y_i \in \real $, for all $ y_i $. This collection of labeled observations is known as the training set or labeled set $ \labeledset = \set{(\vx_1,y_1), \ldots, (\vx_\nlabeled,y_\nlabeled)} $.

Rosenblatt's perceptron: A quick recap

The predictive model of the perceptron is

\begin{equation} \yhat = f \left(\vw^T \vx + b \right) \label{eqn:class-pred} \end{equation}

where,

$ \vw \in \real^{N} $ is the parameter, the so-called weights of the model,
$ b $ is the bias of the model.
$ f $ is known as the activation function. It is a step function of the form

\begin{equation} f(a) = \begin{cases} +1, ~~~~a \ge 0, \\\\ -1, ~~~~a < 0. \end{cases} \end{equation}

In the upcoming sections, we will use the term perceptron to mean the linear model $ \vw^T \vx + b $ with parameters $ \set{\vw, b} $ and investigate the role and nature of the activation function separately.
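As a minimal sketch of the prediction rule recapped above, written in plain Python with no external libraries: the function names `step` and `perceptron_predict` and the example parameter values are illustrative assumptions, not part of the original formulation.

```python
def step(a):
    """Step activation: +1 for a >= 0, otherwise -1."""
    return 1 if a >= 0 else -1


def perceptron_predict(w, b, x):
    """Return f(w^T x + b) for weight list w, bias b, and input list x."""
    a = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return step(a)


# Hypothetical parameters, chosen only to exercise the function.
w = [1.0, -2.0]
b = 0.5
print(perceptron_predict(w, b, [1.0, 0.0]))  # w^T x + b = 1.5
print(perceptron_predict(w, b, [0.0, 1.0]))  # w^T x + b = -1.5
```

Dropping the call to `step` gives the linear model $ \vw^T \vx + b $ that the following sections refer to as the perceptron.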

The XOR problem

The XOR problem is an easy-to-understand but challenging classification problem. The classic problem involves a 2-dimensional binary input vector mapped to the positive or negative class, as follows:

When both dimensions are equal (both 1 or both 0), then the instance belongs to the negative class or class 0.
When the dimensions are unequal, then the instance belongs to the positive class or class 1.

Given these cases, there are only 4 possible examples in the dataset

$ \vx $	$y$
$ [0,0] $	$0$
$ [0,1] $	$1$
$ [1,0] $	$1$
$ [1,1] $	$0$

Can the perceptron algorithm accurately fit this training set? Let's find out.

Addressing the challenge of nonlinearity

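The XOR problem is a simple example of a problem that is not linearly separable: no straight line in the input space places both positive examples on one side and both negative examples on the other. A simple linear model, for example a perceptron, cannot correctly classify all instances of such problems. Most problems of practical importance are not linearly separable. We need a principled strategy to address this challenge.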
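The claim can be illustrated numerically. The sketch below tries every linear threshold classifier on a small grid of weights and biases and reports the best number of XOR examples any of them classifies correctly. The grid, the 0/1 encoding of the two classes, and the helper name `count_correct` are assumptions made for this illustration. A grid search is not a proof, only a sanity check.

```python
from itertools import product

# The four XOR examples: (input, class label in {0, 1}).
XOR_DATA = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]


def count_correct(w1, w2, b):
    """Number of XOR examples a linear threshold classifier gets right."""
    correct = 0
    for (x1, x2), y in XOR_DATA:
        predicted = 1 if w1 * x1 + w2 * x2 + b >= 0 else 0
        correct += int(predicted == y)
    return correct


# Candidate values -2.0, -1.5, ..., 2.0 for each of w1, w2, and b.
grid = [k / 2 for k in range(-4, 5)]
best = max(count_correct(w1, w2, b) for w1, w2, b in product(grid, repeat=3))
print(best)  # 3: no setting on the grid classifies all four examples
```

The underlying reason is short: classifying all four points correctly would require $ b < 0 $, $ w_1 + b \ge 0 $, $ w_2 + b \ge 0 $, and $ w_1 + w_2 + b < 0 $. Adding the two middle inequalities gives $ w_1 + w_2 + b \ge -b > 0 $, which contradicts the last one.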

Attempt 1: Scalar multiplication

Consider a perceptron model with parameters $ \vw $ and $ b $. The perceptron output, $ \vw^T \vx + b $, is a linear function of the inputs $ \vx $. Multiplying it by a scalar only scales the output, which remains linear in $ \vx $. We need something more than a multiplicative scalar.

Attempt 2: Adding two perceptrons

Instead of one, let's use two perceptrons, say $ \sP = \set{\vw_p, b_p} $ and $ \sQ = \set{\vw_q, b_q} $. Suppose these outputs are $ h_p = \vw_p^T \vx + b_p $ and $ h_q = \vw_q^T \vx + b_q $. Again, both $ h_p $ and $ h_q $ are linear functions of the input $ \vx $.

What if we add the outputs of these two perceptrons? Do we get a nonlinear output? Let's find out.

\begin{aligned} o &= h_p + h_q \\\\ &= (\vw_p^T\vx + b_p) + (\vw_q^T\vx + b_q) \\\\ &= \left[\vw_p + \vw_q \right]^T\vx + (b_p + b_q) \\\\ \end{aligned}

Thus, our final output $ o $ is still a linear function of the input. We need something more than additive perceptrons.

Attempt 3: Stacking perceptrons

Let's treat the outputs of our two perceptrons, $ h_p $ and $ h_q $, as a two-dimensional vector $ \vh = [h_p, h_q] $. If we apply another perceptron, say $ \sO = \set{\vw_o, b_o} $ to the vector $ \vh $, we get $ o = \vw_o^T \vh + b_o $.

Is $ o $ a linear function of the original inputs $ \vx $? Let's find out.

\begin{aligned} o &= \vw_o^T \vh + b_o \\\\ &= \vw_o^T [h_p,h_q]^T + b_o \\\\ &= w_{o1}h_p + w_{o2}h_q + b_o \\\\ &= w_{o1}(\vw_p^T\vx + b_p) + w_{o2}(\vw_q^T\vx + b_q) + b_o \\\\ &= \left[w_{o1}\vw_p + w_{o2}\vw_q \right]^T\vx + w_{o1}b_p + w_{o2}b_q + b_o \\\\ \end{aligned}

where we have explicitly written out the elements of the vector $ \vw_o = [w_{o1}, w_{o2}] $.

Thus, $ o $ is still a linear function of the inputs $ \vx $.
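The collapse derived above can be checked numerically. In this sketch, all parameter values are hypothetical and the helper name `affine` is an assumption. The stacked model and the single collapsed perceptron, with weights $ w_{o1}\vw_p + w_{o2}\vw_q $ and bias $ w_{o1}b_p + w_{o2}b_q + b_o $, produce the same output for every tested input.

```python
import math


def affine(w, b, x):
    """Perceptron without activation: w^T x + b."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x)) + b


# Hypothetical parameters for the perceptrons P, Q, and O.
w_p, b_p = [1.0, -2.0], 0.5
w_q, b_q = [3.0, 0.25], -1.0
w_o, b_o = [2.0, -4.0], 0.75

# Collapsed single perceptron, following the last line of the derivation.
w_c = [w_o[0] * wp + w_o[1] * wq for wp, wq in zip(w_p, w_q)]
b_c = w_o[0] * b_p + w_o[1] * b_q + b_o

for x in ([0.0, 0.0], [1.0, -1.0], [0.3, 2.5]):
    stacked = affine(w_o, b_o, [affine(w_p, b_p, x), affine(w_q, b_q, x)])
    assert math.isclose(stacked, affine(w_c, b_c, x), abs_tol=1e-12)
```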

Merely scaling, adding, or stacking perceptrons does not address the challenge of nonlinearity. What can we do to introduce nonlinearity into this model?

Introducing nonlinearity into stacked perceptrons

Consider the following piecewise-linear function that works on scalar input.

$$ \phi(a) = \max\set{0,a} $$

It is piecewise-linear because it stays flat at $ 0 $ for negative values of $ a $ and is equal to $ a $ for positive values.

Let's apply this as a transformation on the outputs of the first layer of linear transforms in our previous thought experiment. The outputs of the first layer are now

$$ h_p = \phi(\vw_p^T \vx + b_p) = \max\set{0, \vw_p^T \vx + b_p}$$ $$ h_q = \phi(\vw_q^T \vx + b_q) = \max\set{0, \vw_q^T \vx + b_q}$$

These outputs are now already nonlinear with respect to the input. Even if we apply a linear transform on them, the final output will be nonlinear.

\begin{aligned} o &= \vw_o^T \vh + b_o \\\\ &= \vw_o^T [h_p,h_q]^T + b_o \\\\ &= w_{o1}h_p + w_{o2}h_q + b_o \\\\ &= w_{o1}\max\set{0, \vw_p^T\vx + b_p} + w_{o2}\max\set{0, \vw_q^T\vx + b_q} + b_o \\\\ \end{aligned}

We can no longer represent this as a linear function of the inputs $ \vx $.
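One way to see this concretely: any function of the form $ \vw^T \vx + b $ maps the midpoint of two inputs to the average of their outputs. The sketch below, with hypothetical parameter values chosen only for this check, shows that the network with the intermediate $ \max\set{0, \cdot} $ violates that property.

```python
def relu(a):
    """phi(a) = max{0, a}."""
    return max(0.0, a)


def network(x):
    """Two perceptrons with phi applied, followed by a linear perceptron."""
    h_p = relu(1.0 * x[0] + 1.0 * x[1] + 0.0)   # hypothetical w_p, b_p
    h_q = relu(1.0 * x[0] + 1.0 * x[1] - 1.0)   # hypothetical w_q, b_q
    return 1.0 * h_p - 2.0 * h_q + 0.0          # hypothetical w_o, b_o


u, v = [0.0, 0.0], [2.0, 2.0]
midpoint = [(u_i + v_i) / 2 for u_i, v_i in zip(u, v)]

average_of_outputs = (network(u) + network(v)) / 2
output_at_midpoint = network(midpoint)
print(average_of_outputs, output_at_midpoint)  # the two values differ
```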

By introducing a nonlinear function between the stacked perceptrons, we were able to introduce nonlinearity into the overall model. But does this strategy help us address our XOR challenge? Let's find out.

Cracking the XOR problem

Consider the parameter values as follows:

\begin{aligned} \vw_p &= [0.5,0.5] \\\\ b_p &= 0.0 \\\\ \vw_q &= [0.5, 0.5] \\\\ b_q &= -0.5 \\\\ \vw_o &= [2.0, -4.0] \\\\ b_o &= 0 \end{aligned}

For our 4 examples in the XOR problem, these parameter values result in the following transformations.

$ \vx $	$y$	$ \vw_p^T \vx + b_p $	$ \vw_q^T \vx + b_q $	$ h_p = \max\set{0, \vw_p^T \vx + b_p} $	$ h_q = \max\set{0, \vw_q^T \vx + b_q} $	$ \vh = [h_p, h_q] $	$ o = \vw_o^T \vh + b_o $
$ [0,0] $	$0$	0.0	-0.5	0.0	0.0	$[0.0, 0.0] $	0.0
$ [0,1] $	$1$	0.5	0.0	0.5	0.0	$[0.5, 0.0] $	1.0
$ [1,0] $	$1$	0.5	0.0	0.5	0.0	$[0.5, 0.0] $	1.0
$ [1,1] $	$0$	1.0	0.5	1.0	0.5	$[1.0, 0.5] $	0.0

As you can observe in the final column, the outputs $ o $ match the true target variables $ y $ for all the rows of data. Perfect! We were able to accurately predict a nonlinear output using perceptrons as building blocks with an intermediate piecewise linear function.
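The table can be reproduced with a few lines of plain Python. This is a minimal sketch using the parameter values listed above. The helper names `relu`, `dot`, and `xor_network` are assumptions made for the illustration.

```python
def relu(a):
    """phi(a) = max{0, a}."""
    return max(0.0, a)


def dot(w, x):
    """Inner product w^T x of two equal-length lists."""
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


# Parameter values from the worked example.
w_p, b_p = [0.5, 0.5], 0.0
w_q, b_q = [0.5, 0.5], -0.5
w_o, b_o = [2.0, -4.0], 0.0


def xor_network(x):
    """Hidden layer of two ReLU perceptrons, then a linear output perceptron."""
    h = [relu(dot(w_p, x) + b_p), relu(dot(w_q, x) + b_q)]
    return dot(w_o, h) + b_o


for x, y in [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]:
    print(x, y, xor_network(x))  # the last value equals y in every row
```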

Activation functions

The piecewise linear function that helped us introduce nonlinearity in the model is known as the rectified linear unit (ReLU). There are many such functions to choose from for the purpose of making nonlinear predictive models. They are known as activation functions. These include ReLU, sigmoid, hyperbolic tangent $\tanh$, leaky ReLU, to name a few. That being said, the default recommendation in modern deep learning and consequently the most widely used activation function is the ReLU activation function.
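For reference, the activation functions named above can be sketched for scalar inputs as follows. The leaky ReLU slope of 0.01 is a commonly used value but is an assumption here, and the branch in `sigmoid` is only a guard against overflow for large negative inputs.

```python
import math


def relu(a):
    """Rectified linear unit: max{0, a}."""
    return max(0.0, a)


def leaky_relu(a, slope=0.01):
    """Like ReLU, but with a small nonzero slope for negative inputs."""
    return a if a > 0 else slope * a


def sigmoid(a):
    """Logistic sigmoid 1 / (1 + e^{-a}), with outputs between 0 and 1."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    e = math.exp(a)  # avoids computing exp of a large positive number
    return e / (1.0 + e)


def tanh(a):
    """Hyperbolic tangent, with outputs between -1 and 1."""
    return math.tanh(a)


for a in (-2.0, 0.0, 2.0):
    print(a, relu(a), leaky_relu(a), sigmoid(a), tanh(a))
```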

For a more detailed overview of the variety and purposes of activation functions, refer to our comprehensive article on activation functions.

A multilayer architecture

The example we demonstrated used perceptrons that followed one another with an intermediate activation function. Each perceptron in the model is known as a node. The group of perceptrons that apply on the same vector of inputs is considered a layer. In our example, the two perceptrons $ \sP = \set{\vw_p, b_p} $ and $ \sQ = \set{\vw_q, b_q} $ form a layer that works on the same input vector $ \vx $.

The same concept can be generalized to include many more layers, one after the other, to build a model that can conform to any complicated nonlinear function. Models with several layers are known as multilayer perceptrons (MLP). Because individual perceptrons form a network in an MLP with a one-way flow from input to output through the network, they are also known as feedforward networks (FFN). In modern times, the number of layers has increased significantly. Such deeper networks are now commonly known as deep feedforward networks (DFFN).

The number of layers in the model is known as the depth. The number of perceptrons in a single layer is known as its width. The overall structure of an MLP is known as its architecture.
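As a small illustration of these terms, an architecture can be summarized by the number of inputs and the width of each layer. The sketch below makes two assumptions: the depth counts the hidden and output layers but not the input layer, and the helper name `describe_architecture` is hypothetical. It also counts the parameters, since each perceptron has one weight per input plus one bias.

```python
def describe_architecture(n_inputs, layer_widths):
    """Return (depth, number of parameters) of an MLP.

    layer_widths lists the number of perceptrons in each hidden layer,
    followed by the number of perceptrons in the output layer.
    """
    depth = len(layer_widths)
    # The inputs to a layer are the outputs of the previous one.
    fan_ins = [n_inputs] + layer_widths[:-1]
    n_params = sum(
        (fan_in + 1) * width for fan_in, width in zip(fan_ins, layer_widths)
    )
    return depth, n_params


# The XOR network: 2 inputs, one hidden layer of width 2, one output perceptron.
print(describe_architecture(2, [2, 1]))  # (2, 9)
```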

Irrespective of how deep they are, MLPs typically share the following common structure:

Input layer: As the name suggests, this layer is just the input vector $ \vx $. No perceptrons here.
Hidden layer(s): These are the intermediate layers. In our XOR example, we had a single hidden layer with two perceptrons. We can have many more perceptrons in a hidden layer, and many more such hidden layers, one after the other, each working on the output of the previous hidden layer.
Output layer: The final layer provides the output of the overall model — the predictions that we desire.

Each of these layers has recommended design criteria. We will elaborate on these next.

The input layer

The input layer is merely the inputs to the predictive model $ \vx $. Not much happens in the input layer, except some desirable preprocessing that is relevant to the task at hand.

Common across most tasks is the practice of standardizing each input variable to zero mean and unit variance. Alternatively, each input variable may be scaled to a fixed range $ [0,1] $. Such preprocessing typically keeps the model weights well-behaved during training. More details about preprocessing can be found in our comprehensive article on preprocessing techniques.
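Both preprocessing options can be sketched for a single input variable using only the Python standard library. The function names and the sample values are assumptions for illustration, and the sketch assumes the variable is not constant, since otherwise the divisions below would fail.

```python
import statistics


def standardize(values):
    """Shift and scale values to zero mean and unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale values linearly so that they span the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical values of one input variable across eight instances.
feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(standardize(feature))
print(min_max_scale(feature))
```

In practice, the mean, standard deviation, minimum, and maximum are usually computed on the training set and then reused unchanged for new instances.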

The hidden layer(s)

These are the intermediate layers of the model. These layers consist of typical perceptrons or nodes. Each perceptron is equipped with a weight vector and a bias. In our example, the two perceptrons $ \sP = \set{\vw_p, b_p} $ and $ \sQ = \set{\vw_q, b_q} $ formed a hidden layer that worked on the input layer $ \vx $.

For notational convenience, the weights in a layer can be collectively represented as a weight matrix $ \mW = [\vw_p,\vw_q] $, whose columns are the weight vectors of the perceptrons in the layer. Similarly, the biases of all perceptrons in a layer can be collectively represented as a bias vector $ \vb = [b_p, b_q] $. In a multilayered architecture with $ L $ layers, we can succinctly denote the per-layer weights as matrices $ \mW_1, \mW_2, \ldots, \mW_L $ and the per-layer biases as vectors $ \vb_1, \vb_2, \ldots, \vb_L $.

With this concise notation, the output of the $l$-th hidden layer can be represented with the following compact equation.

$$ \vh_l = \phi\left(\mW_l^T \vh_{l-1} + \vb_l\right) $$

Note that the $ l $-th layer takes the output of the $ (l-1) $-th layer, $ \vh_{l-1} $, as its input. The input to the first hidden layer is the input $ \vx $, so that $ \vh_0 = \vx $.

In the above equation, $ \phi $ denotes the nonlinear activation function. As we motivated earlier, it is crucial to have such a nonlinear activation function after each hidden layer. Stacked hidden layers devoid of any nonlinear activation function are ineffective because such a stack is a linear function of its inputs. It is common practice to have the same activation function on all hidden layers to simplify model development and representation. The ReLU activation function is the default recommendation for modern deep networks.
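The layer equation translates into a short loop over layers. The following is a minimal sketch in plain Python under a few assumptions: $ \phi $ is ReLU, each weight matrix is stored as a list of rows with `W[i][j]` the weight from input `i` to perceptron `j` (so that the code computes $ \mW_l^T \vh_{l-1} $), and the second layer's parameter values are hypothetical.

```python
def relu(a):
    return max(0.0, a)


def layer_forward(W, b, h_prev):
    """Compute phi(W^T h_prev + b) for one hidden layer."""
    return [
        relu(sum(W[i][j] * h_prev[i] for i in range(len(h_prev))) + b[j])
        for j in range(len(b))
    ]


def hidden_forward(x, weights, biases):
    """Apply the hidden layers in order, starting from h_0 = x."""
    h = x
    for W, b in zip(weights, biases):
        h = layer_forward(W, b, h)
    return h


# First layer: the two perceptrons of the XOR example.
W_1, b_1 = [[0.5, 0.5], [0.5, 0.5]], [0.0, -0.5]
# Second layer: a single perceptron with hypothetical parameters.
W_2, b_2 = [[1.0], [-1.0]], [0.0]

print(hidden_forward([1.0, 1.0], [W_1], [b_1]))
print(hidden_forward([1.0, 1.0], [W_1, W_2], [b_1, b_2]))
```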

The output layer

The final layer — the output layer — provides the output of the predictive model. It is just another perceptron, $ \sO = \set{\vw_o, b_o} $ that works on the final hidden layer in the network.

Typically, activation functions are not applied to the output layer, unless the task specifically requires constraining the output in a certain way. Some examples of output types and the corresponding choice of output activation are as follows:

unconstrained output: No activation
binary or categorical output: softmax activation
output in the range $ [0,1] $: sigmoid activation
output in the range $ [-1, 1] $: tanh activation
positive real-valued output in the range $ [0,\infty) $: ReLU activation
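As an illustration of the categorical case, softmax turns a list of raw output scores, one per class, into positive values that sum to 1. This is a minimal sketch. It assumes the output layer has one perceptron per class, which goes beyond the single output perceptron described above, and the scores are hypothetical.

```python
import math


def softmax(scores):
    """Map raw scores to positive values that sum to 1."""
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores for a three-class problem.
probs = softmax([2.0, 1.0, 0.1])
print(probs)
print(probs.index(max(probs)))  # index of the most probable class
```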

For more details, refer to our comprehensive article on activation functions.

Training

In the worked-out example in the previous section, we used specific pre-calculated values of the parameters to demonstrate that it is possible to get nonlinear outputs from two layers of perceptrons with ReLU activation. In practical scenarios, we would have to train such a model to fit the parameters to the available training data.

To train a multilayer perceptron, as with any supervised model, we first define a loss function that we wish to minimize. In deep learning, the most common loss for classification is the cross-entropy loss, while for regression the default recommendation is the mean squared error loss. For more details, refer to our comprehensive article on loss functions.
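Both losses are short to write down. The sketch below is a minimal illustration in plain Python. It assumes that, for classification, the model outputs a list of class probabilities and the true class is given as an index. The function names are hypothetical.

```python
import math


def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_entropy(class_probs, true_class):
    """Negative log of the probability assigned to the true class."""
    return -math.log(class_probs[true_class])


# Regression: two hypothetical targets and predictions.
print(mean_squared_error([1.0, 2.0], [1.0, 4.0]))  # 2.0

# Classification: a confident correct prediction gives a smaller loss
# than a hesitant one.
print(cross_entropy([0.1, 0.9], 1), cross_entropy([0.4, 0.6], 1))
```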

Because an MLP is a highly parameterized model, regularization plays an important role in achieving better results from training. We cover a variety of regularization approaches, such as weight decay and dropout, in our detailed article on regularization techniques in deep learning.

With the loss function defined and the regularization set up, the parameters of the model can now be fit to the data such that the parameter values minimize the loss. In deep learning, this is typically achieved with gradient-based optimization, such as stochastic gradient descent, where the required gradients are computed by backpropagation. Refer to our article on backpropagation to understand the detailed mechanics of the approach.
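To give a flavor of what backpropagation computes, the sketch below applies the chain rule by hand to the small XOR network and compares one of the resulting gradients against a finite-difference estimate. The squared-error loss, the target value, and all function names are assumptions made for this illustration. It is not a full training loop.

```python
def relu(a):
    return max(0.0, a)


def forward(params, x):
    """Return pre-activations, hidden outputs, and the network output."""
    w_p, b_p, w_q, b_q, w_o, b_o = params
    a = [w_p[0] * x[0] + w_p[1] * x[1] + b_p,
         w_q[0] * x[0] + w_q[1] * x[1] + b_q]
    h = [relu(a[0]), relu(a[1])]
    o = w_o[0] * h[0] + w_o[1] * h[1] + b_o
    return a, h, o


def squared_error(params, x, y):
    return (forward(params, x)[2] - y) ** 2


def backprop(params, x, y):
    """Gradients of the squared error, obtained with the chain rule."""
    w_o = params[4]
    a, h, o = forward(params, x)
    d_o = 2.0 * (o - y)                        # dL/do
    d_w_o = [d_o * h[0], d_o * h[1]]           # dL/dw_o
    # ReLU passes the gradient through only where its input was positive.
    d_a = [d_o * w_o[k] * (1.0 if a[k] > 0 else 0.0) for k in range(2)]
    d_w_p = [d_a[0] * x[0], d_a[0] * x[1]]     # dL/dw_p
    d_w_q = [d_a[1] * x[0], d_a[1] * x[1]]     # dL/dw_q
    return {"w_o": d_w_o, "b_o": d_o, "w_p": d_w_p, "b_p": d_a[0],
            "w_q": d_w_q, "b_q": d_a[1]}


params = ([0.5, 0.5], 0.0, [0.5, 0.5], -0.5, [2.0, -4.0], 0.0)
x, y = [1.0, 1.0], 1.0  # hypothetical target, chosen so the loss is nonzero
grads = backprop(params, x, y)

# Central finite difference for the first component of w_p.
eps = 1e-6
plus = ([0.5 + eps, 0.5],) + params[1:]
minus = ([0.5 - eps, 0.5],) + params[1:]
numeric = (squared_error(plus, x, y) - squared_error(minus, x, y)) / (2 * eps)
print(grads["w_p"][0], numeric)  # the two values should agree closely
```

A gradient-based optimizer would then move each parameter a small step against its gradient and repeat over the training set.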

Versatility

MLPs are extremely versatile. They can be made to conform to complicated nonlinear functions by adjusting their depth and the width of their layers. Moreover, as we outlined in the section on the output layer, by choosing the output layer activation and appropriately defining the loss function for training, they can be designed for a variety of predictive tasks; for example, classification, regression, and anomaly detection, to name a few.

Owing to this versatility, they have been successfully deployed in a variety of applications in diverse domains. Most of the famous deep networks are specialized extensions of MLPs with specific architectural considerations and building blocks.
