# Hidden Markov models

## Introduction

The hidden Markov model (HMM) is a supervised machine learning approach for applications involving sequential observations. Before the advent of deep learning approaches, it was one of the most popular and strong approach for a wide variety of applications such as speech recognition, natural language processing, on-line handwriting recognition, and biological sequence analysis, such as those involving protein or DNA data.

## Prerequisites

To understand hidden Markov models (HMM), we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

## Problem setting

HMMs work with sequential observations. A typical sequence mining task suitable for HMM application is as follows: Given a sequence of observations, determine the sequence of states that are most likely responsible for generating the observed sequence.

Consider a multivariate $\ndim$-dimensional observation represented as the vector $\vx$. Given a sequence of $\nlabeled$ such observations $\seq{\vx_1,\ldots,\vx_\nlabeled}$, the goal is the discover the sequence of states $\seq{s_1,\ldots,s_\nlabeled}$ that generated these observations.

It is common practice to denote each state as a latent multinomial discrete variable $\vz$, a $K$-dimensional binary vector to denote the $K$ discrete states. Thus, the sequence of states is a sequence of these latent variables, denoted as $\seq{\vz_1,\ldots,\vz_\nlabeled}$. We will follow the same convention in this article.

## Intuition

The HMM works on the assumption of the following generative process.

1. Start from an initial state.
2. Generate some observation based on the current state.
3. Move to the next state.
4. Go back to step 2 and continue till termination.

This generative process is quite evident in some practical examples

• In natural language processing, the grammar dictates the parts of speech (states) that will follow each other. The observation (words) will be dependent on the current state, finally resulting in the full coherent sentence (sequence).
• In speech recognition, the states are words and the utterences (rather the resultant sounds) are the observations. The goal is to infer the sentence (sequence of words) that resulted in the observed sounds.

This process suggests that to model an HMM, we need 3 quantities.

1. A starting state. Let's represent this as $\vz_1$
2. A distribution to generate an observation $\vx$ based on the current state $\vz$. This is the probability $p(\vx|\vz)$.
3. A transition probability of moving from a state $\dash{\vz}$ to $\vz$. This is the probability $p(\vz|\dash{\vz})$.

Learning an HMM involves the fitting these probabilities to the training set. Prediction involves discovering the sequence of states that best explains the observations of interest.

## Why hidden? Why Markov?

The discrete state variables, $\vz$, are unobserved. The are latent or hidden. This explains the hidden in the name of these models.

We have studied the Markov assumption in depth in our articles on Markov chains and Bayesian networks in the context of d-separation. In a probabilistic graphical model, the Markov assumption states that the conditional distribution of a variable is independent of all other variables in the graph if the parent nodes of that variable are known.

In the case of HMMs, we are making the Markov assumption in two cases.

1. The observation $\vx$ in the current state $\vz$ is only dependent on that state. It is independent of all other states and observations. In other words, the information to generate the current observation is all encoded in the current state. This means,

$$p(\vx_\nlabeledsmall|\vx_1,\ldots,\vx_{\nlabeledsmall-1}, \vx_{\nlabeledsmall+1},\ldots,\vx_{\nlabeled}, \vz_1, \ldots, \vz_{\nlabeledsmall-1}, \vz_{\nlabeledsmall+1},\ldots,\vz_{\nlabeled}) = p(\vx_\nlabeledsmall|\vz_\nlabeledsmall)$$

1. The current state also encodes the complete information necessary to generate the state in the next step. The next step is independent of all of the previous steps or the upcoming states or observations. This means,

$$p(\vz_\nlabeledsmall|\vx_1,\ldots,\vx_{\nlabeledsmall-1}, \vx_{\nlabeledsmall+1},\ldots,\vx_{\nlabeled}, \vz_1, \ldots, \vz_{\nlabeledsmall-1}, \vz_{\nlabeledsmall+1},\ldots,\vz_{\nlabeled}) = p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1})$$

Hence, hidden Markov model.

## The emission probabilities $p(\vx|\vz)$

Let's first understand the nature and modeling of the emission probability $p(\vx|\vz)$ — the probability of generating an observation $\vx$ given the current state $\vz$.

The number of states being modeled is $K$. So, $\vz$ is a $K$-dimensional binary vector. Moreover, we effectively have $K$ conditional distributions $p(\vx|\vphi_k)$ for $k=1,\ldots,K$ to model. Here, $\vphi_k$ represents the set of parameters governing the conditional distribution These conditional probabilities can be modeled with appropriate distributions; for example with a Gaussian distribution if the observations are continuous or with conditional probability tables if the observations are discrete. In the case of Gaussian distribution, $\vphi_k$ will represent the mean and variance of the distribution, for example.

The latent variable $\vz$ is a binary vector. If the $k$-th bit is 1, that is if $z_{\nlabeledsmall k} = 1$, then it implies that the $k$-th state is active in the current step. This means, the emission probability can be factorized as

$$p(\vx_\nlabeledsmall|\vz_\nlabeledsmall,\vphi) = \prod_{k=1}^K p(\vx_\nlabeledsmall|\vphi_k)^{z_{\nlabeledsmall k}} \label{eqn:emission-probability}$$

## The transition probability $p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1})$

Now, let's understand the nature of the transition probability $p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1})$ — the probability of going to the next state from the current state.

In a hidden Markov model, the states are discrete variables. This means, the transition probability is always modeled as a conditional probability table.

More specifically, it is represented as a matrix $\mA$, a $K \times K$ matrix. The element $A_{ij}$, the element in the $i$-th row and $j$-th column of the matrix $\mA$, denotes the probability of moving from the state $i$ to the state $j$. In terms of our binary latent variable representation, this means

$$A_{ij} = p(z_{\nlabeledsmall j} = 1 |z_{\nlabeledsmall-1, i} = 1)$$

Being a probability, it is the case that

1. $0 \le A_{ij} \le 1$
2. $\sum_{j=1}^{K} A_{ij} = 1$

Similar to our formulation for emission probabilities in Equation \eqref{eqn:emission-probability}, the transition probability can be factored as

$$p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1},\mA) = \prod_{i=1}^K \prod_{j=1}^K A_{ij}^{z_{\nlabeledsmall-1, i}z_{\nlabeledsmall j}} \label{eqn:transition-probability}$$

## The initial probability $p(\vz_1)$

The initial state $\vz_1$ is special since it does not have a parent. Thus, this initialization probability is not represented in the transition probability matrix $\mA$.

Instead, we must learn the probability of starting in any state $k=1,\ldots,K$ as the initialization probability $\tau_k$. Again, being discrete, the initialization probability of each state is also modeled using a probability table $\vtau$.

Then, the probability of the first state can be written as

$$p(\vz_1|\vtau) = \prod_{k=1}^K \tau_k^{z_{1 k}} \label{eqn:initial-probability}$$

## Joint probability of an HMM

Consider a sequence $\mX = \seq{\vx_1,\ldots,\vx_\nlabeled}$ and the corresponding sequence of hidden states $\mZ = \seq{\vz_1,\ldots,\vz_\nlabeled}$. Based on our formulation of initial probability, the transition probabilities and the emission probabilities, the joint probability distribution of the HMM is

$$p(\mX,\mZ|\mTheta) = p(\vz_1|\vtau)p(\vx_1|\vz_1) \left[\prod_{\nlabeledsmall=2}^\nlabeled p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1}) p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) \right] \label{eqn:joint-probability}$$

Here, $\mTheta$ denotes the set of parameters governing the model. This means, $\mTheta = \set{\vtau, \mA, \vphi}$.

## Training a hidden Markov model

Training an HMM involves learning the parameters $\mTheta$. As with most probabilistic models, the parameters of the model are fit via maximum likelihood estimation (MLE) on the likelihood of the training data.

Since the latent variables are never observed, the MLE should not be maximized to the joint probability. Instead, the likelihood function is obtained by marginalizing over the latent variables

$$p(\mX|\mTheta) = \sum_{\mZ} p(\mX,\mZ|\mTheta)$$

MLE with this likelihood function is not straightforward. We cannot simply set the derivatives with respect to the parameters to zero and arrive at their optimal values. Here's why.

The joint probability $p(\mX,\mZ|\mTheta)$, defined in Equation \eqref{eqn:joint-probability} is a product over steps. The summation over these joint probabilities in the likelihood function does not factorize over the steps by isolating terms involving a single variable, say $\vz_\nlabeledsmall$.

We have faced a similar problem before in the case of Gaussian mixture models (GMM). We dealt with that challenge with an expectation maximization (EM) approach. We will do the same in the case of HMM.

## Expectation-maximization for HMM

The EM approach to train a HMM works as follows:

1. Start with randomly initialized values for the parameters $\mTheta = \set{\vtau, \mA, \vphi}$.
2. In the E step, compute the posterior distribution of the latent variables based on the current values of the parameters — $p(\mZ|\mX,\Theta)$
3. Compute the log-likelihood of the data based on the distribution of $\mZ$ inferred in step 2. But with a twist. Since we do not know the optimal parameters, the likelihood should be written in terms of the newer version of the parameters $\mTheta_\text{new}$. The log-likelihood is written as $$Q(\mTheta_\text{new}, \mTheta) = \sum_{\mZ} p(\mZ|\mX,\Theta) \ln p(\mX,\mZ|\mTheta_{\text{new}}) \label{eqn:q-log-lik}$$
4. In the M step, maximize the above equation with respect to the newer version of parameters $\mTheta_{\text{new}}$ to get updated values of $\vtau, \mA$ and $\vphi$.
5. Re-label these new parameters as $\mTheta$ and repeat from step 2 till convergence is achieved in the log-likelihood.

In the case of HMM, the EM approach to training is known as the the Baum-Welch algorithm or the forward-backward algorithm for the reasons that will be clear in the upcoming sections.

The maximization with respect to parameters is dependent on the specific instantiation of the conditional distribution. It is straightforward. The key challenge is efficiently computing the E step. We will study this next.

## The E-step

In the E-step, we are estimating the posterior distribution over the latent variables and then computing the expectation of the log-likelihood as $Q(\mTheta_{\text{new}}, \mTheta)$. Substituting the value of the joint likelihood from Equation \eqref{eqn:joint-probability}, we get

\begin{align} Q(\mTheta_\text{new}, \mTheta) &= \sum_{\mZ} p(\mZ|\mX,\Theta) \ln p(\mX,\mZ|\mTheta_{\text{new}}) \\\\ &= \sum_{\mZ} p(\mZ|\mX,\Theta) \ln \left\{ p(\vz_1|\mX,\mTheta_\text{new})p(\vx_1|\vz_1,\mX,\mTheta_{\text{new}}) \left[\prod_{\nlabeledsmall=2}^\nlabeled p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1},\mX,\mTheta_{\text{new}}) p(\vx_\nlabeledsmall|\vz_\nlabeledsmall,\mX,\mTheta_{\text{new}}) \right] \right\} \\\\ &= \sum_{\mZ} p(\mZ|\mX,\Theta) \ln p(\vz_1|\mX,\mTheta_\text{new}) \\\\ &~~~~ + \sum_{\mZ} \left[\sum_{\nlabeledsmall=2}^\nlabeled p(\mZ|\mX,\Theta) \ln p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1},\mX,\mTheta_\text{new}) \right] \\\\ &~~~~ + \sum_{\mZ} \left[\sum_{\nlabeledsmall=1}^\nlabeled p(\mZ|\mX,\Theta) \ln p(\vx_\nlabeledsmall|\vz_\nlabeledsmall,\mX,\mTheta_\text{new}) \right] \label{eqn:q-log-lik-2} \end{align}

To make the equation less intimidating and concise, let's introduce some new variables to represent more compound quantities.

## Notational convenience: $\gamma(\vz)$

First, we represent the marginal distribution of the latent variable $\vz_\nlabeledsmall$ as $\gamma(\vz_\nlabeledsmall)$.

$$\gamma(\vz_\nlabeledsmall) = p(\vz_\nlabeledsmall | \mX, \mTheta)$$

We also use $\gamma(z_{\nlabeledsmall k})$ to denote the probability of the $k$-th bit being one in the vector $\vz_\nlabeledsmall$. Because the expectation of a binary valued variable is equal to its probability, it is the case that

$$p(z_{\nlabeledsmall k} = 1 | \mX, \mTheta) = \gamma(z_{\nlabeledsmall k}) = \expect{}{z_{\nlabeledsmall k}} = \sum_{\vz} \gamma(\vz) z_{\nlabeledsmall k}$$

With this representation, the first term of $Q(\mTheta_{\text{new}}, \mTheta)$, can be written as

\begin{align} \sum_{\mZ} p(\mZ|\mX,\Theta) \ln p(\vz_1|\mX,\mTheta_\text{new}) &= \sum_{k=1}^K \gamma(z_{1 k}) \ln \tau_k \label{eqn:q-log-lik-term-1} \end{align}

Similarly, the third term in $Q(\mTheta_{\text{new}}, \mTheta)$ can be written as

\begin{align} &\sum_{\mZ} \left[\sum_{\nlabeledsmall=1}^\nlabeled p(\mZ|\mX,\Theta) \ln p(\vx_\nlabeledsmall|\vz_\nlabeledsmall,\mX,\mTheta_\text{new}) \right] \\\\ &~~~~ = \sum_{\nlabeledsmall=1}^\nlabeled \sum_{k=1}^K \gamma(z_{1 k}) \ln p(\vx_\nlabeledsmall|\vphi_k) \label{eqn:q-log-lik-term-3} \end{align}

## Notational convenience: $\xi(\vz_{\nlabeledsmall-1},\vz_\nlabeledsmall)$

Next, we represent the joint posterior distribution of two successive latent variables as $\xi(\vz_{\nlabeledsmall-1},\vz_\nlabeledsmall)$

$$\xi(\vz_{\nlabeledsmall-1},\vz_\nlabeledsmall) = p(\vz_{\nlabeledsmall-1}, \vz_\nlabeledsmall | \mX, \mTheta)$$

Just as with $\gamma$, we will also use $\xi(z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j})$ to denote the joint probability that the $i$-th bit of $\vz_{\nlabeledsmall-1}$ and the $j$-th bit of $\vz_\nlabeledsmall$ is 1. Since $\vz$ is binary, the expectation is equal to the probability, resulting in

$$\xi(z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j}) = \expect{}{z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j}} = \sum_{\vz} \gamma(\vz) z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j}$$

Using this notation, we can represent the second term of the log-likelihood $Q(\mTheta_{\text{new}}, \mTheta)$ as

\begin{align} &\sum_{\mZ} \left[\sum_{\nlabeledsmall=2}^\nlabeled p(\mZ|\mX,\Theta) \ln p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1},\mX,\mTheta_\text{new}) \right] \\\\ &~~~~ = \sum_{\nlabeledsmall=2}^\nlabeled \sum_{i=1}^K \sum_{j=1}^K \xi(z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j}) \ln A_{ij} \label{eqn:q-log-lik-term-2} \end{align}

## Same $Q$, new notation

Using the terms we calculated earlier, we can re-write the log-likelihood as

\begin{align} Q(\mTheta_\text{new}, \mTheta) &= \sum_{k=1}^K \gamma(z_{1 k}) \ln \tau_k \\\\ &~~~~+ \sum_{\nlabeledsmall=2}^\nlabeled \sum_{i=1}^K \sum_{j=1}^K \xi(z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j}) \ln A_{ij} \\\\ &~~~~+ \sum_{\nlabeledsmall=1}^\nlabeled \sum_{k=1}^K \gamma(z_{1 k}) \ln p(\vx_\nlabeledsmall|\vphi_k) \label{eqn:q-log-lik-new} \end{align}

Now, for the E-step, we have to solve for $\gamma$ and $\xi$.

## Solving for $\gamma(\vz)$

Using the Bayes theorem, we can expand the definition of $\gamma(\vz)$ as follows (we will omit $\Theta_\text{new}$ to reduce clutter, but it is there!):

\begin{align} \gamma(\vz_\nlabeledsmall) &= P(\vz_\nlabeledsmall|\mX) \\\\ &= \frac{p(\mX|\vz_\nlabeledsmall)p(\vz_\nlabeledsmall)}{p(\mX)} \\\\ &= \frac{p(\vx_1,\ldots,\vx_\nlabeledsmall|\vz_\nlabeledsmall)p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled|\vz_\nlabeledsmall)p(\vz_\nlabeledsmall)}{p(\mX)} \\\\ &= \frac{p(\vx_1,\ldots,\vx_\nlabeledsmall,\vz_\nlabeledsmall)p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled|\vz_\nlabeledsmall)}{p(\mX)} \\\\ &= \frac{\alpha(\vz_\nlabeledsmall)\beta(\vz_\nlabeledsmall)}{p(\mX)} \label{eqn:solving-gamma} \end{align}

Here, we have introduced two new terms for brevity, defined as

\begin{align} \alpha(\vz_\nlabeledsmall) &= p(\vx_1,\ldots,\vx_\nlabeledsmall,\vz_\nlabeledsmall) \\\\ \beta(\vz_\nlabeledsmall) &= p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled|\vz_\nlabeledsmall) \end{align}

Great! The solution for $\gamma(\vz_\nlabeledsmall)$ can now be found, if we can solve for $\alpha(\vz_\nlabeledsmall)$ and $\beta(\vz_\nlabeledsmall)$.
This is an unending notational nightmare, but we must endure.

## Solving for $\alpha(\vz_\nlabeledsmall)$

\begin{align} \alpha(\vz_\nlabeledsmall) &= p(\vx_1,\ldots,\vx_\nlabeledsmall,\vz_\nlabeledsmall) \\\\ &= p(\vx_1,\ldots,\vx_\nlabeledsmall|\vz_\nlabeledsmall)p(\vz_\nlabeledsmall) \\\\ &= p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) p(\vx_1,\ldots,\vx_{\nlabeledsmall-1}|\vz_\nlabeledsmall)p(\vz_\nlabeledsmall) \\\\ &= p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) p(\vx_1,\ldots,\vx_{\nlabeledsmall-1},\vz_\nlabeledsmall) \\\\ &= p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) \sum_{\vz_{\nlabeledsmall-1}} p(\vx_1,\ldots,\vx_{\nlabeledsmall-1},\vz_{\nlabeledsmall-1}, \vz_\nlabeledsmall) \\\\ &= p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) \sum_{\vz_{\nlabeledsmall-1}} p(\vx_1,\ldots,\vx_{\nlabeledsmall-1},\vz_\nlabeledsmall)|\vz_{\nlabeledsmall-1})p(\vz_{\nlabeledsmall-1}) \\\\ &= p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) \sum_{\vz_{\nlabeledsmall-1}} p(\vx_1,\ldots,\vx_{\nlabeledsmall-1}|\vz_{\nlabeledsmall-1}) p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1})p(\vz_{\nlabeledsmall-1}) \\\\ &= p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) \sum_{\vz_{\nlabeledsmall-1}} p(\vx_1,\ldots,\vx_{\nlabeledsmall-1},\vz_{\nlabeledsmall-1}) p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1}) \\\\ &= p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) \sum_{\vz_{\nlabeledsmall-1}} \alpha(\vz_{\nlabeledsmall-1}) p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1}) \\\\ \end{align}

where, we have used the definition of $\alpha(\vz_{\nlabeledsmall-1})$ to arrive at the recursion relation for $\alpha(\vz_\nlabeledsmall)$.

This recursive relation is initiated with

\begin{align} \alpha(\vz_1) &= p(\vx_1,\vz_1) \\\\ &= p(\vz_1)p(\vx_1|\vz_1) \\\\ &= \prod_{k=1}^K \left[ \tau_k p(\vx_1|\phi_k)\right]^{z_{1k}} \label{eqn:alpha-vz1} \end{align}

Thus, calculating $\alpha$ should be easy for known parameter values of $\vphi, \vtau,$ and $\mA$.

Note that for calculating $\alpha$ we move forward through the sequence, starting with $\alpha(\vz_1)$, all the way up to $\alpha(\vz_\nlabeled)$.

## Solving for $\beta(\vz_\nlabeledsmall)$

A similar strategy can be used to find a recurrence relation for $\beta(\vz_\nlabeledsmall)$.

\begin{align} \beta(\vz_\nlabeledsmall) &= p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled|\vz_\nlabeledsmall) \\\\ &= \sum_{\vz_{\nlabeledsmall+1}} p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled,\vz_{\nlabeledsmall+1}|\vz_\nlabeledsmall) \\\\ &= \sum_{\vz_{\nlabeledsmall+1}} p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled|\vz_{\nlabeledsmall+1},\vz_\nlabeledsmall) p(\vz_{\nlabeledsmall+1}|\vz_\nlabeledsmall) \\\\ &= \sum_{\vz_{\nlabeledsmall+1}} p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled|\vz_{\nlabeledsmall+1}) p(\vz_{\nlabeledsmall+1}|\vz_\nlabeledsmall) \\\\ &= \sum_{\vz_{\nlabeledsmall+1}} p(\vx_{\nlabeledsmall+2},\ldots,\vx_\nlabeled|\vz_{\nlabeledsmall+1})p(\vx_{\nlabeledsmall+1}|\vz_{\nlabeledsmall+1}) p(\vz_{\nlabeledsmall+1}|\vz_\nlabeledsmall) \\\\ &= \sum_{\vz_{\nlabeledsmall+1}} \beta(\vz_{\nlabeledsmall+1}) p(\vx_{\nlabeledsmall+1}|\vz_{\nlabeledsmall+1}) p(\vz_{\nlabeledsmall+1}|\vz_\nlabeledsmall) \\\\ \end{align}

In contrast to the forward calculation for $\alpha$, in the case of $\beta$, we have to move backward through the sequence, starting with $\beta(\vz_\nlabeled)$. We can find this initialization value by noting from Equation \eqref{eqn:solving-gamma} that

$$p(\vz_\nlabeled|\mX) = \frac{\alpha(\vz_\nlabeled)\beta(\vz_\nlabeled)}{p(\mX)}$$

This can only be correct if $\beta(\vz_\nlabeled) = 1$, providing us with the initialization value for the recurrence relation.

Owing the directional calculations of $\alpha$ (forward) and $beta$ (backward), this approach is also known as the forward-backward algorithm.

## Solving for $\xi(z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j})$

Finally, we are ready to solve for $\xi$.

\begin{align} \xi(z_{\nlabeledsmall-1, i},z_{\nlabeledsmall j}) &= p(\vz_{\nlabeledsmall-1},\vz_\nlabeledsmall|\mX) \\\\ &= \frac{p(\mX|\vz_{\nlabeledsmall-1},\vz_\nlabeledsmall)p(\vz_{\nlabeledsmall-1},\vz_\nlabeledsmall)}{p(\mX)} \\\\ &= \frac{p(\vx_1,\ldots,\vx_{\nlabeledsmall-1}|\vz_{\nlabeledsmall-1})p(\vx_\nlabeledsmall|\vz_\nlabeledsmall) p(\vx_{\nlabeledsmall+1},\ldots,\vx_\nlabeled|\vz_\nlabeledsmall) p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1}) p(\vz_{\nlabeledsmall-1})}{p(\mX)} \\\\ &= \frac{\alpha(\vz_{\nlabeledsmall-1}) p(\vx_{\nlabeledsmall}|\vz_\nlabeledsmall) p(\vz_\nlabeledsmall|\vz_{\nlabeledsmall-1}) \beta(\vz_\nlabeledsmall)}{p(\mX)} \label{eqn:solving-xi} \end{align}

Thus, $\xi$ can be easily calculated with our previous calculations of $\alpha$ and $\beta$.

## Calculating $p(\mX)$

Note that the calculations of $\gamma$ and $\xi$ involve the term $p(\mX)$. Calculating this is quite straightforward.

Note from the Equation \eqref{eqn:solving-gamma}, that

\begin{align} &\gamma(\vz_\nlabeledsmall) = \frac{\alpha(\vz_\nlabeledsmall)\beta(\vz_\nlabeledsmall)}{p(\mX)} \\\\ &\implies \sum_{\vz_\nlabeledsmall} \gamma(\vz_\nlabeledsmall) = \sum_{\vz_\nlabeledsmall} \frac{\alpha(\vz_\nlabeledsmall)\beta(\vz_\nlabeledsmall)}{p(\mX)} \\\\ &\implies 1 = \sum_{\vz_\nlabeledsmall} \frac{\alpha(\vz_\nlabeledsmall)\beta(\vz_\nlabeledsmall)}{p(\mX)} \\\\ &\implies p(\mX) = \sum_{\vz_\nlabeledsmall} \alpha(\vz_\nlabeledsmall)\beta(\vz_\nlabeledsmall) \\\\ \label{eqn:solving-px} \end{align}

## Most probable state sequence

What can we do with a trained HMM? Well, we can infer the most probable hidden state sequence for given observed sequence.

$$\star{\mZ} = \argmax_{\vz_1,\ldots,\vz_\nlabeled} p(\vx_1,\ldots,\vx_\nlabeled, \vz_1,\ldots,\vz_\nlabeled)$$

To solve this optimization problem, trying out all possible values of the state sequences is computationally infeasible for many applications.

One efficient approach that incrementally discovers the most probable state sequence is known as the Viterbi algorithm. It is a dynamic programming approach that breaks down the overall problem into incremental sequence discovery. It works by only considering the most promising paths at any point in time.

Note that for any $\nlabeledsmall$, to discover the best $\vz_\nlabeledsmall$, we need to know the $K$ different paths that lead to each of the $K$ possible values of $\vz_\nlabeledsmall$. The best probability that the model will be in the $k$-th state at the $\nlabeledsmall$ step is

$$p(z_{\nlabeledsmall k}) = \max_{\vz_1,\ldots,\vz_{\nlabeledsmall-1}} p(\vx_1,\ldots,\vx_\nlabeledsmall, \vz_1,\ldots,\vz_{\nlabeledsmall-1}, z_{\nlabeledsmall k} = 1)$$

When we move to next step of discovering the best value for $\vz_{\nlabeledsmall+1}$, we only need to retain the path that lead to the best value of $\vz_\nlabeledsmall$.

By the time we reach the end of the sequence, we are left with a single unique path. We can backtrack this path to discover the best values of all the intermediate states in the sequence.