Principal component analysis (PCA)

Machine Learning

Introduction

Principal component analysis (PCA), also known as Karhunen-Loéve transform. It is primarily used as a dimensionality reduction approach with applications in data visualization, feature extraction, and lossy data compression.

Prerequisites

To understand PCA, we recommend familiarity with the concepts in

  • Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
  • Linear algebra: A good understanding of linear algebra, complete with projections and orthonormal vectors.

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In dimensionality reduction, the goal of the inferred model is to represent the input data with the fewest number of dimensions, while still retaining information that is relevant to the task at hand.

Consider observations of the form \( \vx \in \real^\ndim \) — vectors consisting of \( \ndim \) features, \(\vx = [x_1, x_2, \ldots, x_\ndim] \). A collection of such observations is provided in the unlabeled set \( \unlabeledset = \set{\vx_1,\ldots,\vx_\nunlabeled} \).

We need to represent these points as approximations \( \tilde{\vx}_\nunlabeledsmall \in \real^M\), where \( M < \ndim \), for all \( \nunlabeledsmall=1,\ldots,\nunlabeled \).

Intuitive approach

Principal component analysis is probably among the easiest implemented dimensionality reduction approaches. The idea is straightforward.

To approximate to \( M \) dimensions,

  1. Estimate the covariance matrix of the input data.
  2. Perform eigendecomposition on the covariance matrix.
  3. Find the eigenvectors corresponding to the largest \( M \) eigenvalues. These chosen eigenvectors are the orthonormal basis for the reduced dimensional space.
  4. Project the input data points onto the reduced orthonormal basis.

Read on to find out why the largest eigenvectors of the data covariance matrix lead to a good approximation of the input data.

The orthonormal basis

As a dimensionality reduction approach, PCA can be viewed as arriving at a projection of the original data into fewer dimensions while incurring minimal loss.

Consider a set of data points \( \vx_\nunlabeledsmall \in \real^\ndim \) in an \(\ndim\)-dimensional space, where \( \nunlabeledsmall = 1, \ldots, \nunlabeled \). Suppose the desired projection space can be spanned by the complete orthonormal basis vectors \( \vu_\ndimsmall \in \real^\ndim \) where \( \ndimsmall \in \set{1, \ldots, \ndim} \)

This means, each data point \( \vx_\nunlabeledsmall \) in this space can be represented as a linear combination of these orthonormal basis.

\begin{equation} \vx_\nunlabeledsmall = \sum_{\ndimsmall = 1}^{\ndim} \alpha_{\nunlabeledsmall\ndimsmall} \vu_\ndimsmall \label{eqn:proj-alpha} \end{equation}

Observe that the coefficients \( \alpha_{\nunlabeledsmall\ndimsmall} \) will be specific to each data point \( \vx_\nunlabeledsmall \).

Taking an inner product of the above equation with \( \vu_\ndimsmall \) on both sides, we get

\begin{align} & \vx_\nunlabeledsmall^T \vu_\ndimsmall = \sum_{\ndimsmall = 1}^{\ndim} \alpha_{\nunlabeledsmall\ndimsmall} \vu_\ndimsmall^T \vu_\ndimsmall \\\\ \Rightarrow& \vx_\nunlabeledsmall^T \vu_\ndimsmall = \sum_{\ndimsmall = 1}^{\ndim} \alpha_{\nunlabeledsmall\ndimsmall} \vu_\ndimsmall^T \vu_\ndimsmall \\\\ \Rightarrow& \vx_\nunlabeledsmall^T \vu_\ndimsmall = \alpha_{\nunlabeledsmall\ndimsmall} \end{align}

The final equation is the result of two properties of the orthonormal basis

  1. The inner product of vectors that are orthogonal to each other is zero. This means, all the terms with \( \vu_\ndimsmall^T \vu_j \), where \( j \ne \ndimsmall \) vanish because they are zero. The only term remaining on the right hand side is \( \alpha_{\nunlabeledsmall\ndimsmall} \vu_\ndimsmall^T \vu_\ndimsmall \).
  2. The inner product of orthornomal vectors, \( \vu_\ndimsmall^T \vu_\ndimsmall = \norm{\vu_\ndimsmall}{} = 1 \). So, the right hand side turns to \( \alpha_{\nunlabeledsmall\ndimsmall} \).

Substituting this value of \( \alpha_{\nunlabeledsmall\ndimsmall} \) into the Equation \eqref{eqn:proj}, we get

\begin{equation} \vx_\nunlabeledsmall = \sum_{\ndimsmall = 1}^{\ndim} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall \label{eqn:proj-xvu} \end{equation}

Reducing dimensions

The formulation in Equation \eqref{eqn:proj-xvu} presents the input \( \vx_\nunlabeledsmall \) in terms of all the \( \ndim \) dimensions. With dimensionality reduction, our goal is to come up with a projection that involves fewer dimensions \( M < \ndim \). Without loss of generality, the reduced dimensions can be represented by the first \( M \) basis vectors in \( \set{\vu_\ndimsmall}_{\ndimsmall=1}^{\ndim} \).

In other words, we are splitting the representation from the \( \ndim\)-dimensional space into two components — the first \( M \) dimensions and then the remaining \( M+1, \ldots, \ndim \) dimensions.

\begin{equation} \vx_\nunlabeledsmall = \sum_{\ndimsmall = 1}^{M} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall + \sum_{\ndimsmall = M+1}^{\ndim} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall
\label{eqn:proj-xvu-split} \end{equation}

For dimensionality reduction, we are desiring an approximation \( \tilde{\vx}_{\nunlabeledsmall} \) of the original data point \( \vx_\nunlabeledsmall \).

\begin{equation} \tilde{\vx}_\nunlabeledsmall = \sum_{\ndimsmall = 1}^{M} z_{\nunlabeledsmall\ndimsmall} \vu_\ndimsmall + \sum_{\ndimsmall = M+1}^{\ndim} b_{\ndimsmall} \vu_\ndimsmall
\label{eqn:proj-xvu-split-zb} \end{equation}

The weights \( z_{\nunlabeledsmall\ndimsmall} \) are specific for the data point \( \vx_{\nunlabeledsmall} \), while the weights \( b_{\nunlabeledsmall} \) are the same across data points. In other words, this second term is like a constant that we add to all data points as a bias term.

Fitting the PCA: Reconstruction error

From Equation \eqref{eqn:proj-xvu-split-zb}, it should be clear that to fit the PCA to data, we need to infer three kinds of parameters — \(\vu_\ndimsmall, ~b_\ndimsmall, ~z_{\nunlabeledsmall\ndimsmall} \) for all dimensions \( \ndimsmall = 1,\ldots,\ndim \) and data points \( \nunlabeledsmall=1,\ldots,\nunlabeled \).

Since \( \tilde{\vx}_\nunlabeledsmall \) is an approximation of \( \vx_\nunlabeledsmall \), a natural loss function to minimize is the mean squared error, written below in terms of the norm

$$ \loss = \frac{1}{\nunlabeled} \sum_{\nunlabeledsmall=1}^\nunlabeled \norm{\vx_\nunlabeledsmall - \tilde{\vx}_\nunlabeledsmall}{}^2 $$

Fitting a PCA: Optimal values for \( z \) and \( b \)

To minimize this loss, we differentiate the above equation with respect to the parameters to arrive at their optimal values. These are

$$ z_{\nunlabeledsmall\ndimsmall} = \vx_\nunlabeledsmall^T \vu_\ndimsmall, ~~~~\forall \ndimsmall=1,\ldots,M, \nunlabeledsmall=1,\ldots,\nunlabeled$$

$$ b_\ndimsmall = \bar{\vx}^T \vu_\ndimsmall, ~~~~\forall \ndimsmall=M+1,\ldots,\ndim $$

where \( \bar{\vx} \) denotes the mean of the data points \( \vx_1, \ldots, \vx_\nunlabeled \) as

$$ \bar{\vx} = \frac{1}{\nunlabeled} \sum_{\nunlabeledsmall=1}^\nunlabeled \vx_\nunlabeledsmall $$

Reconstruction error in terms of data covariance

We are now ready to calculate the per data point approximation error using the form of \( \vx_\nunlabeledsmall \) from Equation \eqref{eqn:proj-xvu}.

\begin{align} \vx_\nunlabeledsmall - \tilde{\vx}_\nunlabeledsmall &= \left[ \sum_{\ndimsmall = 1}^{\ndim} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall\right] - \left[\sum_{\ndimsmall = 1}^{M} z_{\nunlabeledsmall\ndimsmall} \vu_\ndimsmall + \sum_{\ndimsmall = M+1}^{\ndim} b_{\ndimsmall} \vu_\ndimsmall \right] \\\\ &= \left[ \sum_{\ndimsmall = 1}^{\ndim} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall\right] - \left[\sum_{\ndimsmall = 1}^{M} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall + \sum_{\ndimsmall = M+1}^{\ndim} \left(\bar{\vx}^T \vu_\ndimsmall\right)\vu_\ndimsmall \right] \\\\ &= \left[ \sum_{\ndimsmall=1}^{M} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall - \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall \right] + \left[\sum_{\ndimsmall = M+1}^{\ndim} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall - \left(\bar{\vx}^T \vu_\ndimsmall\right)\vu_\ndimsmall \right]$$ \\\\ &= \left[\sum_{\ndimsmall = M+1}^{\ndim} \left(\vx_\nunlabeledsmall^T \vu_\ndimsmall\right) \vu_\ndimsmall - \left(\bar{\vx}^T \vu_\ndimsmall\right)\vu_\ndimsmall \right] \end{align}

Substituting this result into the loss, we get the loss in terms of the basis vectors \( \vu \) only.

\begin{align} \loss &= \frac{1}{\nunlabeled} \sum_{\nunlabeledsmall=1}^\nunlabeled \sum_{\ndimsmall=M+1}^{\ndim} \left( \vx_\nunlabeledsmall^T\vu_\ndimsmall - \bar{\vx}^T\vu_\ndimsmall\right)^2 \\\\ &= \sum_{\ndimsmall=M+1}^{\ndim} \frac{1}{\nunlabeled} \sum_{\nunlabeledsmall=1}^\nunlabeled \left[ \left(\vx_\nunlabeledsmall - \bar{\vx} \right)^T \vu_\ndimsmall \right]^2 \\\\ &= \sum_{\ndimsmall=M+1}^{\ndim} \vu_\ndimsmall^T \left[ \frac{1}{\nunlabeled} \sum_{\nunlabeledsmall=1}^\nunlabeled \left(\vx_\nunlabeledsmall - \bar{\vx} \right)^2 \right] \vu_\ndimsmall \\\\ \implies \loss &= \sum_{\ndimsmall=M+1}^{\ndim} \vu_\ndimsmall^T \mC \vu_\ndimsmall \\\\ \end{align}

where, \( \mC \) is the covariance of the data points

$$ \mC = \frac{1}{\nunlabeled} \sum_{\nunlabeledsmall=1}^\nunlabeled \left(\vx_\nunlabeledsmall - \bar{\vx} \right)^T\left(\vx_\nunlabeledsmall - \bar{\vx} \right) $$

Fitting the PCA: constrained optimization

If we directly try to minimize the unconstrained loss above, we will get the vacuous result \( \vu_\ndimsmall = 0 \), for all \( \ndimsmall = M+1,\ldots,\ndim \). Therefore, we need to include the constraint on the vectors \( \vu_\ndimsmall \) — that they are orthonormal and therefore \( \vu_\ndimsmall^T \vu_\ndimsmall = 1 \).

Including such a constraint with a Lagrange multiplier \( \gamma \) into the loss \( \loss \), we get our optimization criteria

\begin{align} \loss_\text{constrained} &= \sum_{\ndimsmall=M+1}^{\ndim} \left[ \vu_\ndimsmall^T \mC \vu_\ndimsmall + \gamma_\ndimsmall (1 - \vu_\ndimsmall^T\vu_\ndimsmall) \right] \\\\ \end{align}

Turns out, this optimization function can be minimized when the basis vectors \( \vu_\ndimsmall \) are chosen to be the eigenvectors of the covariance matrix \( \mC \), so that

$$ \mC\vu_\ndimsmall = \lambda_\ndimsmall\vu_\ndimsmall $$

Distortion of the PCA

Substituting the relation \( \mC\vu_\ndimsmall = \lambda_\ndimsmall\vu_\ndimsmall \) into the loss, we get

\begin{align} \loss_\text{constrained} &= \sum_{\ndimsmall=M+1}^{\ndim} \left[ \lambda_\ndimsmall \vu_\ndimsmall^T\vu_\ndimsmall + \gamma_\ndimsmall (1 - \vu_\ndimsmall^T\vu_\ndimsmall) \right] \\\\ &= \sum_{\ndimsmall=M+1}^{\ndim} \lambda_\ndimsmall \end{align}

This overall approximation error is known as the distortion of the PCA. It is simply the sum of the eigenvalues of the covariance matrix of the data being fit.

More specifically, it is the sum of those eigenvalues that correspond to the eigenvectors that do not appear among the reduced dimensions. Those eigenvectors that were ignored.

This means, to reduce the distortion of the PCA, we should ignore the dimensions \( M+1,\ldots,\ndim \) that correspond to the lowest eigenvalues of the covariance matrix. In other words, the dimensions chosen by the PCA should correspond to the eigenvectors with the largest eigenvalues.

The variance perspective

Note that the unconstrained loss we defined in terms of the covariance matrix is \begin{align} \loss &= \sum_{\ndimsmall=M+1}^{\ndim} \vu_\ndimsmall^T \mC \vu_\ndimsmall \end{align}

We also studied that this loss, known as the distortion error is minimized when we choose to discard the directions corresponding to the lowest eigenvalues of the covariance matrix.

Thus, if we were reducing the input data to a single dimension, we will be choosing the direction that corresponds to the largest eigenvalue of the data covariance matrix. And what is that direction?

That is the direction in the orthonormal basis where the variance in the projected data is the maximum. Why? Let's find out.

Suppose we project the input data to a single dimension, \( M = 1 \), represented by the eigenvector \( \vu_1 \) corresponding to the largest eigenvalue \( \lambda_1 \) of the data covariance matrix \( \mC \). The projection of \(\vx_\nunlabeledsmall\) along \( \vu_1 \) is \( \vu_1^T\vx_\nunlabeledsmall \).

If \( \bar{\vx} \) denotes the mean of the input data, the variance of the projected data is simply

\begin{align} \text{Var}_{\text{projection}} &= \frac{1}{\nunlabeled} \sum_{\nunlabeledsmall=1}^{\nunlabeled} \left[\vu_1^T\vx_\nunlabeledsmall - \vu_1^T \bar{\vx} \right]^2 \\\\ &= \vu_1^T \mC \vu_1 \\\\ &= \vu_1^T \lambda_1 \vu_1 \\\\
\implies\text{Var}_{\text{projection}} &= \lambda_1 \\\\
\end{align}

where, we have used the properties of the eigenvector \( \vu_1 \) that

  1. \( \mC\vu_1 = \lambda_1\vu_1 \)
  2. \( \vu_1^T\vu = 1 \), as it is an eigenvector.

This means, the variance of the projected input data is the eigenvalue of the chosen eigenvector of the data covariance matrix. In this case, the variance of the projection is equal to \( \lambda_1 \), the largest eigenvalue. If we had chosen any other eigenvector of the data covariance matrix for our projection, the variance would have been lower.

In other words, we have chosen to project the data onto a direction that has the largest variance. From this perspective, the PCA chooses to project the input data along the \( M \) directions with the largest variance.

Practical application

A typical practical challenge with application of PCA is the computational complexity of eigendecomposition. The data covariance matrix is a \( \ndim \times \ndim \) matrix. As we have studied in our comprehensive linear algebra article, the eigendecomposition on this matrix is of the order of \( \BigOsymbol(\ndim^3) \). This can be quite prohibitive for many high-dimensional applications.

The key idea to overcome this challenge involves the observation that we do not need to compute all the eigenvectors of the data covariance matrix. Only the \( M \) largest eigenvectors need to be computed for the projection. And in dimensionality reduction, it is typically the case that \( M \) is much smaller than \( \ndim \). This can be done much efficiently by sequential approaches to discovering eigenvectors, such as the power method that has a computational complexity of \( \BigOsymbol(M\ndim^2) \).

Please share

Let your friends, followers, and colleagues know about this resource you discovered.

Subscribe for article updates

Stay up to date with new material for free.