Introduction to autoencoders


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)


Introduction

Autoencoders are neural networks that are optimized to convert the input into a version of itself through a hidden layer acting as a learned code. They are designed to infer the useful characteristics of data by learning to ignore the noise and focus on the relevant, generalizable patterns. This is typically achieved by limiting the code to fewer degrees of freedom than the original input. For example, this could be done by introducing a bottleneck hidden layer that is much smaller than the input layer and the output layer. This constrains the code, that is, the activations of the hidden layer, to focus only on the differentiating patterns in the input, as opposed to explaining all its idiosyncrasies.

Intuitively, autoencoders are a nonlinear form of dimensionality reduction, in contrast to linear transforms such as principal component analysis.

Prerequisites

To understand autoencoders, we recommend familiarity with the concepts in

Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
Multilayer perceptrons
Backpropagation
Principal component analysis

Follow the above links to first get acquainted with the corresponding concepts.

Intuition

Dimensionality reduction requires representing data in fewer dimensions than the original. How do we discover the limited dimensions that can faithfully reproduce the data? Useful data have patterns in them. If we can discover these patterns and encode them in fewer dimensions than the input, then we can attempt to reverse engineer, or decode, this reduced-dimensional representation to faithfully recover the input. This is the primary goal of autoencoders: to discover an intermediate coding language that yields a good reconstruction of the input.

$$ \text{input} \xrightarrow{\text{encode}} \text{code} \xrightarrow{\text{decode}} \text{input} $$

If we allow the code to have many degrees of freedom, it will conform to all the idiosyncrasies in the input and, on decoding, quite possibly reproduce the original input exactly. Such an autoencoder has a very high capacity for conforming to the input, in effect merely memorizing it. That is not useful, because memorization reveals nothing about the underlying patterns.

If we constrain the code to be much smaller than the input, we may not be able to recover the input exactly. But we can expect a good code to abstract out the differentiating and salient patterns in the data, while ignoring the common noise or unimportant variations among the input examples. Such an autoencoder, with a constrained code that is much smaller than the input, is known as an undercomplete autoencoder. Much of the art in building an autoencoder involves choosing the appropriate level of undercompleteness to discover discerning and important patterns in the data.

A simple autoencoder

Building a simple undercomplete autoencoder is quite straightforward. We use a network architecture shaped like an hourglass. Starting at a wide input layer, create a stack of progressively narrower hidden layers to build the encoder portion of the network. The code for an input is the activation of the final layer of the encoder, which is the narrowest hidden layer of the overall network. For the decoder portion, widen the layers again up to an output layer that is as wide as the input layer.

Consider an $ \ndim $-dimensional input vector $ \vx \in \real^\ndim $. Let $ e: \real^\ndim \to \real^\nclass $ denote the encoder function and $ d: \real^\nclass \to \real^\ndim $ denote the decoder function, such that the code dimension $ \nclass $ is much smaller than $ \ndim $. That is, $ \nclass \ll \ndim $.

$$ \vx \xrightarrow{\text{encode}} \vh = e(\vx) \xrightarrow{\text{decode}} \overset{\sim}{\vx} = d(e(\vx)) $$
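As a minimal sketch of the mapping above, the following NumPy snippet builds a one-layer encoder and a one-layer decoder with randomly initialized, untrained weights. The sizes `N = 8` and `M = 3`, the `tanh` activation, and the names `encode`, `decode`, `W_enc`, and `W_dec` are illustrative assumptions rather than choices made in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 8, 3  # assumed input size N and code size M, with M < N

# Randomly initialized (untrained) parameters; the names are hypothetical.
W_enc = rng.normal(scale=0.1, size=(M, N))
b_enc = np.zeros(M)
W_dec = rng.normal(scale=0.1, size=(N, M))
b_dec = np.zeros(N)

def encode(x):
    """Encoder e: R^N -> R^M, here a single tanh layer."""
    return np.tanh(W_enc @ x + b_enc)

def decode(h):
    """Decoder d: R^M -> R^N, here a single linear layer."""
    return W_dec @ h + b_dec

x = rng.normal(size=N)
h = encode(x)        # the code
x_tilde = decode(h)  # the reconstruction d(e(x))

print(h.shape, x_tilde.shape)  # (3,) (8,)
```

The code `h` has fewer entries than `x`, while the reconstruction `x_tilde` has the same shape as the input. Until the weights are trained, the reconstruction is of course arbitrary.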

Let the loss of this deep feedforward network be written as $ \loss(\vx, d, e) $, which compares the input $ \vx $ with its reconstruction $ d(e(\vx)) $. An example loss for training the autoencoder is the $L_2$-norm of the reconstruction error:

$$ \loss(\vx, d, e) = \norm{\vx - d(e(\vx))}{2} $$

With this loss, training an autoencoder follows the same strategy as any deep feedforward network — gradient-based optimization, with gradients computed using backpropagation.
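The training procedure can be sketched in NumPy under some simplifying assumptions: both the encoder and the decoder are single linear layers, the data are synthetic, and the loss is the squared $L_2$-norm averaged over the dataset (squaring is an assumption made here to keep the gradients simple). All names, sizes, and the learning rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, n_samples = 10, 2, 200  # assumed sizes

# Synthetic data lying close to an M-dimensional subspace of R^N.
latent = rng.normal(size=(n_samples, M))
mixing = rng.normal(size=(M, N))
X = latent @ mixing + 0.01 * rng.normal(size=(n_samples, N))

# Linear encoder and decoder, stored so that rows of X are examples.
W_enc = rng.normal(scale=0.1, size=(N, M))
W_dec = rng.normal(scale=0.1, size=(M, N))

def mean_loss(X, W_enc, W_dec):
    """Mean squared L2 reconstruction error over the dataset."""
    residual = X @ W_enc @ W_dec - X
    return np.mean(np.sum(residual ** 2, axis=1))

initial_loss = mean_loss(X, W_enc, W_dec)
learning_rate = 0.005
for step in range(2000):
    H = X @ W_enc             # codes, shape (n_samples, M)
    residual = H @ W_dec - X  # reconstruction minus input
    # Backpropagation through the two linear layers.
    grad_out = 2.0 * residual / n_samples
    grad_W_dec = H.T @ grad_out
    grad_W_enc = X.T @ (grad_out @ W_dec.T)
    # Plain gradient descent update.
    W_enc -= learning_rate * grad_W_enc
    W_dec -= learning_rate * grad_W_dec

final_loss = mean_loss(X, W_enc, W_dec)
print(initial_loss, final_loss)
```

In practice, an automatic differentiation library would compute these gradients, and the layers would usually be nonlinear. The manual version is shown only to make the backpropagation step explicit.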

Sparse autoencoder

In some applications, we wish to introduce sparsity into the coding language, so that different input examples activate separate elements of the coding vector. Sparsity in the coding language can be achieved by regularizing the autoencoder with a sparsifying penalty on the code $ \vh $.

$$ \loss_{\text{sparse}}(\vx, d, e) = \loss(\vx, d, e) + \Omega(e(\vx)) $$

where $ \Omega(\vh) $ is some sparsifying penalty on the code $ \vh = e(\vx) $. For example, we can use the $ L_1 $-norm of the code, $ \Omega(\vh) = \norm{\vh}{1} $, usually scaled by a tunable weight.
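To illustrate how such a penalty favors sparse codes, here is a small sketch of the sparse loss with an $L_1$ penalty. The function name `sparse_loss` and the weighting parameter `penalty_weight` are assumptions introduced for this example.

```python
import numpy as np

def sparse_loss(x, x_tilde, h, penalty_weight=0.1):
    """L2 reconstruction error plus a weighted L1 penalty on the code h."""
    reconstruction = np.linalg.norm(x - x_tilde, ord=2)
    sparsity = penalty_weight * np.linalg.norm(h, ord=1)
    return reconstruction + sparsity

x = np.array([1.0, 2.0, 2.0])
x_tilde = np.array([1.0, 2.0, 0.0])  # reconstruction error has L2 norm 2

# Two codes with the same L2 norm (1.0) but different L1 norms.
dense_code = np.array([0.5, -0.5, 0.5, -0.5])  # L1 norm 2.0
sparse_code = np.array([0.0, 1.0, 0.0, 0.0])   # L1 norm 1.0

dense_value = sparse_loss(x, x_tilde, dense_code)
sparse_value = sparse_loss(x, x_tilde, sparse_code)
print(dense_value, sparse_value)
```

For the same reconstruction error, the code with a single active element receives the lower loss, which is the behavior the sparsifying penalty is meant to encourage.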

Denoising autoencoder

As we mentioned earlier, a desirable property of an autoencoder is undercompleteness. This is typically achieved by constraining the size of the code by controlling the capacity of the network. An autoencoder with high capacity may just memorize the input, in effect learning an identity function to map the input to itself exactly. This is useless for most tasks.

We need ways to avoid learning the identity mapping. Undercompleteness is one way. Another is to deliberately introduce noise into the input and require the autoencoder to recover the original, uncorrupted input. If $ \dash{\vx} $ denotes the input with some added noise, then the loss function of the denoising autoencoder is

$$ \loss_{\text{denoising}}(\vx, d, e) = \norm{\vx - d(e(\dash{\vx}))}{2} $$

With such a setup, the network is unlikely to learn the identity mapping because the input it sees and the output it must produce are actually different. With a denoising autoencoder, we can therefore hope that the network learns something useful about the structure of the data.
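The following sketch shows why the identity mapping no longer achieves zero loss in this setting. It assumes additive Gaussian noise with a hypothetical standard deviation `noise_std`; the article does not prescribe a noise model, and the function names are illustrative.

```python
import numpy as np

def denoising_loss(x, encode, decode, rng, noise_std=0.1):
    """L2 error between the clean x and the reconstruction of a noisy x."""
    # Corrupt the input (assumed noise model: additive Gaussian noise).
    x_noisy = x + rng.normal(scale=noise_std, size=x.shape)
    x_tilde = decode(encode(x_noisy))
    # Compare against the clean input, not the noisy one.
    return np.linalg.norm(x - x_tilde, ord=2)

def identity(v):
    """A degenerate encoder or decoder that copies its input."""
    return v

rng = np.random.default_rng(0)
x = np.ones(100)
loss = denoising_loss(x, identity, identity, rng)
print(loss)  # nonzero, on the order of noise_std * sqrt(100)
```

An identity autoencoder copies the noise along with the signal, so its loss stays near the noise level. To do better, the network has to learn enough about the data to remove the corruption.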

Relation of autoencoders to PCA

We have covered principal component analysis (PCA) extensively before. Intuitively, PCA is a linear dimensionality reduction approach that works by finding the principal components or basis that lead to the least reconstruction error on the dataset.

Autoencoders are a nonlinear approach that likewise tries to discover the code that minimizes the reconstruction error of the data. In fact, if the decoder is linear and the loss function is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA. In other words, under these circumstances, the autoencoder will discover the principal subspace of the training data, much like PCA.
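The phrase "same subspace" deserves a small illustration. The sketch below does not train an autoencoder. Instead, it constructs a linear encoder and decoder whose code is an invertible remix of the PCA scores and checks that the reconstruction matches that of PCA. The data, the sizes, and the mixing matrix `A` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, N, M = 500, 6, 2  # assumed sizes

X = rng.normal(size=(n_samples, N)) @ rng.normal(size=(N, N))
X -= X.mean(axis=0)  # PCA operates on centered data

# PCA: the top-M right singular vectors span the principal subspace.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:M].T          # shape (N, M)
X_pca = X @ V @ V.T   # PCA reconstruction

# A linear autoencoder whose code is an invertible remix of the PCA scores.
A = np.array([[2.0, 1.0],
              [0.5, 3.0]])        # hypothetical invertible mixing matrix
W_enc = V @ A                     # encoder: R^N -> R^M
W_dec = np.linalg.inv(A) @ V.T    # decoder: R^M -> R^N
X_ae = X @ W_enc @ W_dec          # autoencoder reconstruction

same_reconstruction = np.allclose(X_pca, X_ae)
same_code = np.allclose(X @ V, X @ W_enc)
print(same_reconstruction, same_code)  # True False
```

The two models reconstruct the data identically even though their codes differ. This is why a linear autoencoder is said to recover the principal subspace rather than the individual principal components.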
