# Linear discriminant analysis

## Introduction

Linear discriminant analysis is a linear classification approach. Its development follows the same intuition as the naive Bayes classifier, but it arrives at a different formulation by using a multivariate Gaussian distribution to model the class-conditional densities.

## Prerequisites

To understand linear discriminant analysis, we recommend familiarity with the concepts in

• Probability: A sound understanding of conditional and marginal probabilities and Bayes Theorem is desirable.
• Gaussian distribution: The underlying class-conditional distribution for this model.
• Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
• Naive Bayes classifier: For some understanding of the similarity of the motivation behind the model.

Follow the above links to first get acquainted with the corresponding concepts.

## Problem setting

In classification, the goal of the predictive model is to identify the class that generated a particular instance.

Consider such an instance $\vx \in \real^N$, a vector consisting of $N$ features, $\vx = [x_1, x_2, \ldots, x_N]$.

We need to assign it to one of the $M$ classes $C_1, C_2, \ldots, C_M$ depending on the values of the $N$ features.

In the probabilistic sense, we need to discover the probability of the instance belonging to one of these classes. That is, if we can calculate $P(C_m | \vx)$ for all the classes, we can assign the instance to the class with the highest probability. Thus, the predicted class will be

$$\hat{y} = \argmax_{m \in \set{1,\ldots,M}} P(C_m | \vx) \label{eqn:class-pred}$$

The conditional probability $P(C_m|\vx)$ for each class is computed using the Bayes rule.

$$P(C_m | \vx) = \frac{P(\vx | C_m) P(C_m)}{P(\vx)} \label{eq:class-conditional-prob}$$

In this equation, $P(C_m)$ is the class-marginal probability.

In Equation \eqref{eq:class-conditional-prob}, the term $P(\vx)$ is the marginal probability of the instance $\vx$. Since this will be the same across all the classes, we can ignore this term.
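As a quick numerical check of why the marginal $P(\vx)$ can be ignored, the following sketch (with hypothetical likelihood and prior values) shows that the class with the highest unnormalized product $P(\vx|C_m)P(C_m)$ is the same as the class with the highest fully normalized posterior:

```python
import numpy as np

# Hypothetical likelihoods P(x | C_m) and priors P(C_m) for three classes.
likelihoods = np.array([0.02, 0.10, 0.05])
priors = np.array([0.5, 0.2, 0.3])

# Unnormalized posteriors: P(x | C_m) * P(C_m).
unnormalized = likelihoods * priors

# Full posteriors: divide by the marginal P(x) = sum_m P(x | C_m) P(C_m),
# which is the same positive constant for every class.
posteriors = unnormalized / unnormalized.sum()

# The argmax is unchanged by the shared denominator P(x).
assert np.argmax(unnormalized) == np.argmax(posteriors)
print(np.argmax(posteriors))  # prints 1, the index of the predicted class
```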

Now, the key quantity remaining is $P(\vx|C_m)$, the class-conditional density of $\vx$. Up to this point, the motivation is the same as that of the naive Bayes classifier. In the naive Bayes classifier, we make the naive assumption that the class-conditional density of $\vx$ factorizes over the individual features.

In the case of linear discriminant analysis, we do it a bit differently.

## Multivariate Gaussian as class-conditional density

In the case of linear discriminant analysis, we model the class-conditional density $P(\vx | C_m)$ as a multivariate Gaussian.

$$P(\vx|C_m) = \frac{1}{(2\pi)^{N/2} |\mSigma_m|^{1/2}} \expe{-\frac{1}{2}(\vx - \vmu_m)^T \mSigma_m^{-1} (\vx - \vmu_m)}$$

Here, $\vmu_m$ is the mean of the training examples for the class $m$ and $\mSigma_m$ is the covariance for those training examples.

In the case of linear discriminant analysis, the covariance is assumed to be the same for all the classes. This means, $\mSigma_m = \mSigma, \forall m$.
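The class-conditional density above is straightforward to evaluate numerically. The following sketch implements it with numpy (for hypothetical values of $\vmu_m$ and $\mSigma_m$) and checks it against scipy's reference implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, sigma):
    """Multivariate Gaussian density with the (2*pi)^(N/2) |Sigma|^(1/2) normalizer."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    expo = -0.5 * diff @ np.linalg.inv(sigma) @ diff
    return np.exp(expo) / norm

# Hypothetical class mean and covariance for a 2-dimensional feature space.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, -0.2])

# Should agree with scipy's reference implementation.
assert np.isclose(gaussian_density(x, mu, sigma),
                  multivariate_normal(mu, sigma).pdf(x))
```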

In comparing two classes, say $C_p$ and $C_q$, it suffices to check the log-ratio

$$\log \frac{P(C_p | \vx)}{P(C_q | \vx)}$$

Let's look at this log-ratio in further detail by expanding it with appropriate substitutions.

\begin{align} \log \frac{P(C_p | \vx)}{P(C_q | \vx)} &= \log \frac{P(C_p)}{P(C_q)} + \log \frac{P(\vx|C_p)}{P(\vx|C_q)} \\\\ &= \log\frac{P(C_p)}{P(C_q)} - \frac{1}{2}(\vmu_p + \vmu_q)^T \mSigma^{-1} (\vmu_p - \vmu_q) + \vx^T \mSigma^{-1}(\vmu_p - \vmu_q) \label{eqn:log-ratio-expand} \end{align}

This equation is linear in $\vx$, hence the name linear discriminant analysis.

The normalizing factors in the two densities cancelled in the division since they were both $(2\pi)^{N/2} |\mSigma|^{1/2}$. The quadratic term $\vx^T\mSigma^{-1}\vx$ also appeared in both densities and cancelled, leaving a classifier based only on terms linear in $\vx$. Neither cancellation happens if $\mSigma_p \ne \mSigma_q$; that extension is known as quadratic discriminant analysis. Quadratic discriminant analysis is then no longer a linear classifier, owing to the quadratic term $-\frac{1}{2}\vx^T(\mSigma_p^{-1} - \mSigma_q^{-1})\vx$ that survives in the log-ratio.
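The expansion in Equation \eqref{eqn:log-ratio-expand} can be verified numerically. This sketch (with hypothetical means, shared covariance, and priors) compares the direct log-ratio of posteriors against the linear expression:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-class parameters with a shared covariance.
mu_p, mu_q = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
prior_p, prior_q = 0.6, 0.4
x = np.array([0.3, -0.7])

# Direct log-ratio of posteriors (the marginal P(x) cancels).
direct = (np.log(prior_p / prior_q)
          + multivariate_normal(mu_p, sigma).logpdf(x)
          - multivariate_normal(mu_q, sigma).logpdf(x))

# Linear expansion of the log-ratio.
sigma_inv = np.linalg.inv(sigma)
linear = (np.log(prior_p / prior_q)
          - 0.5 * (mu_p + mu_q) @ sigma_inv @ (mu_p - mu_q)
          + x @ sigma_inv @ (mu_p - mu_q))

assert np.isclose(direct, linear)  # the two agree to numerical precision
```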

## Predictive model

The prediction follows from the following three conditions on the log-ratio in Equation \eqref{eqn:log-ratio-expand}.

• If the ratio is greater than 0, then the prediction is class $C_p$.
• If less than 0, it is class $C_q$.
• If the log-ratio is zero, then the instance lies on the decision-boundary between the two classes.

From Equation \eqref{eqn:log-ratio-expand}, we see that each class $m$ contributes the following term to the equation.

$$\delta_m(\vx) = \vx^T\mSigma^{-1}\vmu_m - \frac{1}{2}\vmu_m^T\mSigma^{-1}\vmu_m + \log P(C_m)$$

This linear formula is known as the linear discriminant function for class $m$. Equipped with this, the prediction can be further summarized as

$$\yhat = \argmax_{m} \delta_m(\vx)$$
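The discriminant-function prediction rule can be sketched directly from the formula above, here with hypothetical means, a shared identity covariance, and equal priors:

```python
import numpy as np

def discriminant_scores(x, means, sigma_inv, log_priors):
    """delta_m(x) = x^T Sigma^{-1} mu_m - 0.5 mu_m^T Sigma^{-1} mu_m + log P(C_m)."""
    return np.array([
        x @ sigma_inv @ mu - 0.5 * mu @ sigma_inv @ mu + log_prior
        for mu, log_prior in zip(means, log_priors)
    ])

# Hypothetical two-class setup with a shared identity covariance.
means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
sigma_inv = np.eye(2)
log_priors = np.log([0.5, 0.5])

x = np.array([1.8, 1.9])
scores = discriminant_scores(x, means, sigma_inv, log_priors)
print(np.argmax(scores))  # prints 1: x is far closer to the second mean
```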

## Training linear discriminant analysis

Training a linear discriminant analysis model requires the inference of three parameter types — class priors $P(C_m)$, class conditional means, $\vmu_m$, and the common covariance $\mSigma$.

The prior $P(C_m)$ is estimated as the fraction of training instances that belong to the class $C_m$.

$$P(C_m) = \frac{\text{Number of training instances belonging to } C_m}{\text{Total number of training examples}}$$

The mean of the class-conditional density for class $m$, that is $\vmu_m$, is computed as

$$\vmu_m = \frac{1}{L_m} \sum_{y_i = C_m} \vx_i$$

where, $L_m$ is the number of labeled examples of class $C_m$ in the training set.

The common covariance, $\mSigma$, is computed by pooling the within-class scatter across all classes

$$\mSigma = \frac{1}{L-M} \sum_{m=1}^{M} \sum_{y_i = C_m} (\vx_i - \vmu_m)(\vx_i - \vmu_m)^T$$

where $L = \sum_{m} L_m$ is the total number of training examples.
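The three estimation formulas above can be sketched in a few lines of numpy (a minimal illustration on a tiny hypothetical dataset, not a production implementation):

```python
import numpy as np

def fit_lda(X, y):
    """Estimate priors, class means, and the pooled covariance from labeled data."""
    classes = np.unique(y)
    L, N = X.shape
    M = len(classes)
    # Priors: fraction of training instances per class.
    priors = np.array([np.mean(y == c) for c in classes])
    # Class-conditional means.
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    # Pooled covariance: within-class scatter summed over classes, divided by L - M.
    sigma = np.zeros((N, N))
    for c, mu in zip(classes, means):
        diff = X[y == c] - mu
        sigma += diff.T @ diff
    sigma /= (L - M)
    return priors, means, sigma

# Tiny hypothetical training set: two classes of two points each.
X = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0], [4.0, 2.0]])
y = np.array([0, 0, 1, 1])
priors, means, sigma = fit_lda(X, y)
print(priors)  # [0.5 0.5]
print(means)   # [[0. 1.]  [4. 1.]]
```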

## Training demo: binary classification

Let us fit a linear discriminant analysis model to some training data. In the first plot we show the learned classification boundary. In the other, we show the probability contours of the two Gaussian distributions.

Being a linear classifier, the trained model comfortably separates linearly separable classes. However, it is ineffective in scenarios that are not linearly separable.

Note that by design the covariance is exactly the same for both classes, but the means differ.
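A similar binary experiment can be reproduced with scikit-learn's `LinearDiscriminantAnalysis` (a sketch on synthetic blobs; the class centers and sample sizes here are arbitrary choices, not the demo's actual data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two linearly separable Gaussian blobs sharing the same covariance.
cov = np.array([[1.0, 0.2], [0.2, 1.0]])
X0 = rng.multivariate_normal([-3.0, 0.0], cov, size=100)
X1 = rng.multivariate_normal([3.0, 0.0], cov, size=100)
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.score(X, y))  # near-perfect training accuracy on well-separated blobs
```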

## Dealing with feature types

Note that the predictive model involves the calculations of class-conditional means and the common covariance matrix. This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, the class-conditional means and covariances cannot be computed directly. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
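One-hot encoding maps each category to its own binary indicator feature. A minimal sketch (the category values here are made up for illustration):

```python
import numpy as np

def one_hot(values, categories):
    """Map a categorical column to binary indicator features, one per category."""
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

colors = ["red", "green", "red", "blue"]
encoded = one_hot(colors, ["red", "green", "blue"])
print(encoded)  # 4 x 3 matrix with a single 1 per row
```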

## Multiclass classification

Dealing with multiclass problems with linear discriminant analysis is straightforward. In the development of the model, we never made any simplifying assumption that necessitates a binary classification scenario. As we explained in the section on the predictive model, the unlabeled instance gets assigned to the class $C_m$ with the maximum value of the linear discriminant function $\delta_m(\vx)$.

## Training demo: Multiclass classification

Let us fit a linear discriminant analysis model to some multiclass training data. In the first plot we show the learned classification boundary. In the other, we show the probability contours of the per-class Gaussian distributions.

Thus, just inferring the mean for each class is sufficient to extend LDA to the multiclass setting.

Again, note that by design the covariance is exactly the same for all the classes, but the means differ.
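A multiclass variant of the same experiment can be sketched with scikit-learn (again on synthetic blobs with arbitrary centers, not the demo's actual data):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
cov = np.eye(2)

# Three classes sharing one covariance, distinguished only by their means.
centers = [np.array([-4.0, 0.0]), np.array([0.0, 4.0]), np.array([4.0, 0.0])]
X = np.vstack([rng.multivariate_normal(c, cov, size=80) for c in centers])
y = np.repeat([0, 1, 2], 80)

clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.score(X, y))  # high accuracy: the three means are well separated
```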

## Number of parameters

For linear discriminant analysis, altogether, there are $M$ class priors, $M$ class-conditional means, and 1 shared covariance matrix. For the $N$-dimensional feature space, each mean is $N$-dimensional and the covariance matrix is $N \times N$ in size. This results in $M + M\times N + N\times N$ parameters in total, which is $\BigOsymbol(M \times N)$ when $M > N$.

In the case of quadratic discriminant analysis, there will be many more parameters, $(M-1) \times \left(N (N+3)/2 + 1\right)$.
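As a quick arithmetic check, here are both counts evaluated for hypothetical sizes $M = 3$ classes and $N = 4$ features:

```python
# Hypothetical problem sizes: M classes, N features.
M, N = 3, 4

# LDA: M priors, M means of dimension N, one shared N x N covariance.
lda_params = M + M * N + N * N
print(lda_params)  # 3 + 12 + 16 = 31

# QDA count quoted above: (M - 1) * (N * (N + 3) / 2 + 1).
qda_params = (M - 1) * (N * (N + 3) // 2 + 1)
print(qda_params)  # 2 * 15 = 30
```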