# Gaussian mixture models

## Introduction

Gaussian mixture models (GMM), as the name implies, are a linear superposition of a mixture of Gaussian distributions. They are an effective soft clustering tool, when we wish to model the examples as being partially belonging to multiple clusters. Compare this with the rigidity of the K-means model that assigns each example to a single cluster.

## Prerequisites

To understand the Gaussian mixture model, we recommend familiarity with the concepts in

• Probability: A sound understanding of conditional and marginal probabilities and Bayes Theorem is desirable.
• Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.

Follow the above links to first get acquainted with the corresponding concepts.

## Problem setting

In clustering, the goal of the model is to group the available data into clusters — groups of observations that are similar in some sense.

Consider observations of the form $\vx \in \real^\ndim$ — vectors consisting of $\ndim$ features, $\vx = [x_1, x_2, \ldots, x_\ndim]$. A collection of such observations is provided in the unlabeled set $\unlabeledset = \set{\vx_1,\ldots,\vx_\nunlabeled}$.

The GMM clustering algorithm needs to infer the assignments of these $\nunlabeled$ examples into $K$ groups or clusters.

## Mixture model

In a mixture model, as the name implies, each observation is assumed to be generated from a mixture of participating distributions. In the case of a Gaussian mixture model, the participants of the mixture are Gaussian distributions.

A GMM is a such a mixture of $K$ Gaussian distributions. The probability density function for such a mixture is written as follows:

$$p(\vx) = \sum_{k=1}^K P(k) \Gauss(\vx|\vmu_k,\mSigma_k) \label{eqn:gmm-pdf}$$

In this equation

1. $P(k)$ denotes the mixture proportion of the $k$-th distribution. In other words, this factor weighs the contribution of the $k$-the distribution into the mixture. The mixing proportions add up to 1, so that $\sum_{k=1}^K P(k) = 1$.
2. $\Gauss(\vx|\vmu_k,\mSigma_k)$ denotes the conditional probability of the instance $\vx$ for the $k$-th Gaussian distribution $\Gauss(\vmu_k,\mSigma_k)$. Here, $\vmu_k$ and $\mSigma_k$ are the mean and covariance of that Gaussian distribution, respectively.

## Nature of the mixture model: A demo

We understand the mixture model mathematically. Now let us study how a mixture manifests.

In this next interactive demo, you can change the relative mixing proportions $P(k_i)$ for $i = 1, 2, 3$ by affecting the appropriate slider. We normalize the mixing proportions so that they sum to 1.

For simplicity of this demo, we have restricted the covariance of each Gaussian to be an identity matrix. Thus, for all $\Sigma_k = \mI$, for all $k$.

The mean $\mu_k$ of each Gaussian can be changed in the adjoining chart by dragging the corresponding circle.

## Alternative representation

An alternative representation often found in the literature uses a $K$-dimensional one-hot vector $\vz$ to represent the cluster memberships. The vector $\vz = [z_1,\ldots,z_K]$ is a one-hot vector, in which a particular element is 1 and all the other elements are zero. Therefore, $z_k \in \set{0,1}$ for all $k$ and

$$\sum_{k=1}^K z_k = 1$$

There are $k$ possible states for the vector $\vz$ as exactly one element can be set to 1 at a time.

Thus, the mixing proportion $P(k)$ of the $k$-th Gaussian can now be presented as $p(z_k = 1)$ instead of $P(k)$. As before, it is the case that

$$\sum_{k=1}^K P(k) = \sum_{k=1}^K p(z_k = 1) = 1$$

Since $\vz$ is a one-hot vector, the overall probability of the vector $\vz$ can be written as the product

$$p(\vz) = \prod_{k=1}^K P(k)^{z_k}$$

For the same reason, the conditional probability $p(\vx|\vz)$ can also be written as

$$p(\vx|\vz) = \prod_{k=1}^{K} \Gauss(\vx|\vmu_k,\mSigma_k)^{z_k}$$

Finally, the marginal distribution of the instance $\vx$ can be now written as, just as in Equation \eqref{eqn:gmm-pdf}

$$p(\vx) = \sum_{k=1}^K p(\vz) p(\vx|\vz) = \sum_{k=1}^K P(k) \Gauss(\vx|\vmu_k,\mSigma_k)$$

## Fitting a GMM to data

From the nature of the mixture model defined in Equation \eqref{eqn:gmm-pdf}, it should be noted that fitting a GMM involves the inference of three kinds of parameters for all the participating Gaussians $k = 1, \ldots, K$.

1. the mixing proporitions $P(k)$
2. the means $\vmu_k$
3. the covariances $\Sigma_k$

As with most parametric probabilistic models, the GMM parameters are inferred using maximum likelihood estimation (MLE).

## Maximum likelihood estimation (MLE)

Given a dataset of observations, $\unlabeledset = \set{\vx_1,\ldots,\vx_\nunlabeled}$, the log-likelihood of the dataset in the case of Gaussian mixture model is

$$\ln p\left(\unlabeledset|\set{P(k),\vmu_k,\mSigma_k}_{k=1}^K\right) = \sum_{\nunlabeledsmall = 1}^{\nunlabeled} \ln \left\{ \sum_{k=1}^K P(k)\Gauss(\vx_\nunlabeledsmall | \vmu_k, \mSigma_k) \right\} \label{eqn:log-lik}$$

The best values of the parameters $\set{P(k),\vmu_k,\mSigma_k}_{k=1}^K$ are those that lead to the maximum likelihood of the observed data. And that is how we can infer these parameters — by maximizing the likelihood.

## Solving the MLE

As we have learnt in calculus, at the maximum of the log-likelihood defined in Equation \eqref{eqn:log-lik}, the derivative with respect to the parameters, say $\vmu_k$ should be 0. This means, at the maximum

\begin{align} \frac{\partial \ln p\left(\unlabeledset|\set{P(k),\vmu_k,\mSigma_k}_{k=1}^K\right)}{\partial \vmu_k} &= 0 \\\\ -\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \frac{P(k)\Gauss(\vx_\nunlabeledsmall | \vmu_k, \mSigma_k)}{\sum_{j=1}^K P(j)\Gauss(\vx_\nunlabeledsmall | \vmu_j, \mSigma_j)} \mSigma_k(\vx_\nunlabeledsmall - \vmu_k) &= 0 \label{eqn:log-lik-deriv} \end{align}

For simplicity, let's represent the fraction term in the above equation with a $\gamma_{\nunlabeledsmall k}$ as

$$\gamma_{\nunlabeledsmall k} = \frac{P(k)\Gauss(\vx_\nunlabeledsmall | \vmu_k, \mSigma_k)}{\sum_{j=1}^K P(j)\Gauss(\vx_\nunlabeledsmall | \vmu_j, \mSigma_j)} \label{eqn:gamma-uk}$$

It can be observed that $\gamma_{\nunlabeledsmall k}$ is calculating the proportion of $\vx_\nunlabeledsmall$ in the cluster $k$. Substituting this representation into the Equation \eqref{eqn:log-lik-deriv}, we get

\begin{align} -\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} \mSigma_k(\vx_\nunlabeledsmall - \vmu_k) = 0 \label{eqn:log-lik-deriv-gamma} \end{align}

Being a covariance matrix, we can assume $\mSigma^{-1}$ to be nonsingular and multiply both sides of the above equation with it.

\begin{align} -\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} (\vx_\nunlabeledsmall - \vmu_k) = 0 \label{eqn:log-lik-deriv-gamma-2} \end{align}

Re-arranging terms of this equation, we get

\begin{align} \vmu_k = \frac{\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} \vx_\nunlabeledsmall}{\sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k}} \label{eqn:muk-soln-first} \end{align}

Note the normalization term. It sums over all the examples in the dataset, with a weighing factor $\gamma_{\nunlabeledsmall k}$ for each example. Intuitively, we can imagine this term being the soft number of examples that belong to the cluster $k$. Thus, we can represent this term as $\nunlabeled_k$.

$$\nunlabeled_k = \sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} \label{eqn:u-k}$$

Given this form, the optimal value for $\vmu_k$ can be written as

\begin{align} \vmu_k = \frac{1}{\nunlabeled_k} \sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} \vx_\nunlabeledsmall \label{eqn:muk-soln} \end{align}

Along similar lines, taking derivative with respect to the parameters $\mSigma_k$, we arrive at its solution as

\begin{align} \mSigma_k = \frac{1}{\nunlabeled_k} \sum_{\nunlabeledsmall = 1}^{\nunlabeled} \gamma_{\nunlabeledsmall k} (\vx_\nunlabeledsmall - \vmu_k)(\vx_\nunlabeledsmall - \vmu_k)^T \label{eqn:sigmak-soln} \end{align}

and similarly, for $P(k)$, we arrive at its solution as

\begin{align} P(k) = \frac{\nunlabeled_k}{\nunlabeled} \label{eqn:pk-soln} \end{align}

## The need for expectation-maximization

Note that the solutions for $\vmu_k, \mSigma_k,$ and $P(k)$ using MLE all contain the term $\nunlabeled_k$. The calculation of this term involves all the parameters of the model, including $\vmu_k, \mSigma_k$ and $P(k)$.

Thus, the solutions above have a big problem. They are interdependent. They are not closed form solutions!

The reason for this? In the log-likelihood, there is a summation inside the log!

This means, a derivative-based approach to optimization will not be able to decompose the summation that is inside the log. All parameters, $\set{P(k),\vmu_k,\mSigma_k}_{k=1}^K$, will appear together in that summation whether we are differentiating with respect to $P(k)$ or $\vmu_k$ or $\mSigma_k$. In other words, the summation inside the logarithm will not lead to a closed form solution for the optimization problem.

And when there is no closed form solution, we will likely have to come up with an iterative solution. An elegant iterative solution in the case of GMM is the Expectation-Maximization approach.

## Expectation-maximization for GMM

Expectation-maximization (EM) is an iterative approach that alternates between two steps — expectation and maximization — to convergence.

In the case of GMM

• In the expectation step, we will find the memberships of examples in the various clusters. For GMM, this means the proportions $\gamma_{\nunlabeledsmall k}$ according to Equation \eqref{eqn:gamma-uk}. This can help calculate the crucial normalization term $\nunlabeled_k$ using Equation \eqref{eqn:u-k}.
• In the maximization step, we will infer the values of the parameters of the model $\set{\vmu_k, \mSigma_k, P(k)}_{k=1}^K$ as defined in Equations \eqref{eqn:muk-soln}, \eqref{eqn:sigmak-soln}, and \eqref{eqn:pk-soln}.

Thus, the overall EM algorithm for fitting a GMM to a dataset is as follows.

1. Initialize the parameters $\set{\vmu_k, \mSigma_k, P(k)}_{k=1}^K$. Evaluate the initial log-likelihood of the model.
2. E step: Calculate $\gamma_{\nunlabeledsmall k}$ for all $\nunlabeledsmall \in \set{1, \ldots, \nunlabeled}$ and $k \in \set{1, \ldots, K}$. Calculate $\nunlabeled_k$.
3. M step: Calculate all the parameters $\vmu_k, \mSigma_k, P(k)$ for all $k \in \set{1, \ldots, K}$.
4. Compute log-likelihood with the most recent parameters from step 3. If it is different from the previous calculation of log-likelihood, go back to step 2, else terminate iterations.

## Gaussian mixture model: demo

In this next interactive demo, load a dataset, choose a suitable number of components, and fit the GMM to the data. We are purposely coloring the dataset to help you visualize the inherent clusters in the dataset.

Note how the number of components is a crucial hyperparameter for the GMM. Also note that sometimes, the GMM may end up with a singleton situation — a single component fitting all the data and other components rendered meaningless.

## Dealing with feature types

Note that GMM algorithm relies on the Gaussian distribution. This makes it suitable for continuous or real-valued features.

In the case of categorical features, a modeling with Gaussian distributions is not possible. Even if we first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation, it may be inadequate to represent with a GMM.

Binary variables can be better represented with other mixtures, such as those involving Bernoulli distributions instead of Gaussians.