Naive Bayes classifier

Machine Learning

Introduction

The naive Bayes classifier is a generative model for classification. Before the advent of deep learning and its easy-to-use libraries, the naive Bayes classifier was one of the most widely deployed classifiers in machine learning applications. Despite its simplicity, it performs quite well in many applications. In this article, we will cover the formulation of the model and some of its extensions.

Prerequisites

To understand the naive Bayes classifier, we recommend familiarity with the concepts in

  • Probability: A sound understanding of conditional and marginal probabilities and Bayes Theorem is desirable.
  • Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In classification, the goal of the predictive model is to identify the class that generated a particular instance.

Consider such an instance \( \vx \in \real^N \), a vector consisting of \( N \) features, \(\vx = [x_1, x_2, \ldots, x_N] \).

We need to assign it to one of the \( M \) classes \( C_1, C_2, \ldots, C_M \) depending on the values of the \( N \) features.

Predictive model

In the probabilistic sense, we need to discover the probability of the instance belonging to one of these classes. That is, if we can calculate \( P(C_m | \vx) \) for all classes, we can assign the instance to the class with the highest probability. Thus, the predicted class will be

\begin{equation} \hat{y} = \argmax_{m \in \set{1,\ldots,M}} P(C_m | \vx) \label{eqn:class-pred} \end{equation}
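As a minimal sketch in Python, the prediction rule is simply an argmax over the class posteriors; the posterior values here are hypothetical placeholders for the quantities we compute next.

```python
import numpy as np

# Hypothetical posterior probabilities P(C_m | x) for M = 3 classes.
posteriors = np.array([0.2, 0.7, 0.1])

# The predicted class is the one with the highest posterior probability.
y_hat = np.argmax(posteriors)
print(y_hat)  # 1, i.e. the second class
```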

Invoking the Bayes rule

The conditional probability \( P(C_m|\vx) \) for each class is computed using the Bayes rule.

\begin{equation} P(C_m | \vx) = \frac{P(\vx | C_m) P(C_m)}{P(\vx)} \label{eq:class-conditional-prob} \end{equation}

In this equation, \( P(C_m) \) is the class-marginal probability, also known as the class prior. It is estimated as the fraction of training instances that belong to the class \( C_m \).

$$ P(C_m) = \frac{\text{Number of training instances belonging to } C_m}{\text{Total number of training examples}} $$

In Equation \eqref{eq:class-conditional-prob}, the term \( P(\vx) \) is the marginal probability of the instance \( \vx \). Since it must belong to some class, it is the sum of the probabilities of it being generated by each of the classes.

$$ P(\vx) = \sum_{m \in \set{1,\ldots,M}} P(C_m)P(\vx|C_m) $$
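Putting the last two formulas together, a small Python sketch of the posterior computation could look like this; the prior and likelihood values here are hypothetical placeholders for quantities estimated from the training data.

```python
import numpy as np

# Hypothetical values for M = 3 classes.
priors = np.array([0.5, 0.3, 0.2])          # P(C_m)
likelihoods = np.array([1e-4, 5e-4, 2e-4])  # P(x | C_m) for one instance x

# Bayes rule: posterior = likelihood * prior / evidence,
# where the evidence P(x) is the sum over all classes.
evidence = np.sum(priors * likelihoods)
posteriors = priors * likelihoods / evidence

print(posteriors, posteriors.sum())  # the posteriors sum to 1
```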

Now, the key quantity remaining is \( P(\vx|C_m) \). Note that \( \vx \) is \( N \)-dimensional, where \( N \) can be quite large. Depending on the value of \( N \), it may be infeasible to model a multivariate distribution, for example a multivariate Gaussian, directly for the estimation of this probability. Moreover, the probability under such a joint multivariate model could be minuscule, and the resulting calculation could suffer from underflow errors.

Naive assumption

One way to address the challenge of complexity is to adopt simplicity! In probability, a common strategy to achieve simplicity involves factoring a complicated multivariate probability into univariate probabilities of its constituents.

This means we factorize the complicated multivariate probability \( P(\vx|C_m) \) into the class-conditional probability of each of the \( N \) constituents of \( \vx \) as follows:

$$ P(\vx|C_m) = \prod_{n \in \set{1,\ldots,N}} P(x_n|C_m) $$

In other words, we have assumed that any pair of features \( x_i \) and \( x_j \) are independent given the class, mathematically expressed as \( x_i \perp x_j | C_m \). This class-conditional independence is a naive assumption, hence the name naive Bayes classifier.

Note that for each of the \( N \) features, it will be much easier to compute \( P(x_n|C_m) \) from some univariate distribution suited to that feature.
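In code, this factorization is typically evaluated in log space, which also sidesteps the underflow issue mentioned earlier. A minimal sketch, with hypothetical per-feature likelihood values:

```python
import numpy as np

# Hypothetical per-feature likelihoods P(x_n | C_m) for one instance and one class.
feature_likelihoods = np.array([0.3, 0.05, 0.6, 0.12])

# Naive assumption: the joint likelihood is the product of the per-feature terms.
joint = np.prod(feature_likelihoods)

# Equivalent, but numerically safer, computation in log space.
log_joint = np.sum(np.log(feature_likelihoods))

print(joint, np.exp(log_joint))  # the two values agree
```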

Training naive Bayes

Training naive Bayes is easy. The training process infers two quantities:

  • \( P(C_m) \) for all classes \( C_m \in [C_1, C_2, \ldots, C_M] \)
  • \( P(x_n | C_m) \) for all features \( x_n \in [x_1, x_2, \ldots, x_N] \), for all classes \(C_m \in [C_1, C_2, \ldots, C_M] \)

We have seen earlier that the calculation for \( P(C_m) \) is a straightforward ratio.

$$ P(C_m) = \frac{\text{Number of training instances belonging to } C_m}{\text{Total number of training examples}} $$
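A sketch of this estimate, assuming the training labels are stored in an integer array y with values 0 through M - 1:

```python
import numpy as np

# Hypothetical training labels for M = 3 classes.
y = np.array([0, 1, 1, 2, 0, 1, 2, 2, 1, 0])
M = 3

# P(C_m): fraction of training instances belonging to each class.
priors = np.bincount(y, minlength=M) / len(y)
print(priors)  # [0.3 0.4 0.3]
```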

The calculation of \( P(x_n | C_m) \) depends on the specific univariate model being employed for the classification task at hand. We provide some scenarios here.

Binary features

Some classification scenarios involve binary features. This setting was quite popular in short text classification in the early days of machine learning. If a term was present in the text, the corresponding feature was on (its value was 1); otherwise it was off (the value was 0).

The calculation of \( P(x_n | C_m) \) is straightforward in this case. We just need to figure out the fraction of times the particular feature \( x_n \) took a certain value in the training examples belonging to the class \( C_m \).

$$ P(x_n=1 | C_m) = \frac{\text{Number of training examples with } x_n = 1 \text{ in class } C_m}{\text{Total number of training examples of class } C_m} $$

Similarly,

$$ P(x_n=0 | C_m) = \frac{\text{Number of training examples with } x_n = 0 \text{ in class } C_m}{\text{Total number of training examples of class } C_m} $$

Note that in this case, since the variable can take only one of the two mutually exclusive values, it holds that

$$ P(x_n = 1 | C_m) = 1 - P(x_n = 0 | C_m) $$

Thus, this is usually implemented as a univariate Bernoulli distribution, learning only the success probability \( P(x_n = 1 | C_m) \).
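A sketch of these estimates, assuming a hypothetical binary feature matrix X (one row per instance, one column per feature) and integer class labels y:

```python
import numpy as np

# Hypothetical binary training data: 6 instances, 3 features, 2 classes.
X = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 1],
              [1, 1, 1],
              [0, 0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

# P(x_n = 1 | C_m): fraction of class-m instances in which feature n is on.
p_on = np.array([X[y == m].mean(axis=0) for m in (0, 1)])

# P(x_n = 0 | C_m) follows from the Bernoulli constraint.
p_off = 1.0 - p_on
print(p_on)
```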

Categorical features

When each of the features can take on one among multiple discrete values, for example, \( [\text{yes}, \text{no}, \text{maybe}] \), then the Bernoulli distribution above is replaced with its multi-valued generalization, the categorical distribution.

In this case, for each possible value \( x_n = v \), the probabilities are calculated as before.

$$ P(x_n=v | C_m) = \frac{\text{Number of training examples with } x_n = v \text{ in class } C_m}{\text{Total number of training examples of class } C_m} $$

Note that if there are \( K_n \) distinct categorical values \( [v_1, v_2, \ldots, v_{K_n}] \) for the feature \( x_n \), then the following relationship holds.

$$ \sum_{k=1}^{K_n} P(x_n = v_k | C_m) = 1 $$
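A sketch of these per-value estimates, using one hypothetical categorical feature stored as strings and integer class labels:

```python
import numpy as np

# Hypothetical categorical feature (one of the N features) and class labels.
x_n = np.array(["yes", "no", "maybe", "yes", "no", "yes"])
y = np.array([0, 0, 0, 1, 1, 1])
values = ["yes", "no", "maybe"]

# P(x_n = v | C_m): fraction of class-m instances in which the feature equals v.
probs = {m: {v: np.mean(x_n[y == m] == v) for v in values} for m in (0, 1)}

print(probs[0])  # probabilities for class 0; they sum to 1 over the K_n values
```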

Continuous features

In the case of real-valued features, a suitable continuous distribution is used to model the probability \( P(x_n | C_m) \). Some choices are the Gaussian distribution or the uniform distribution.

As an example, in the case of the Gaussian distribution, the learning process will infer the essential parameters of the distribution --- the mean \( \mu_m \) and the variance \( \sigma_m^2 \) of the feature \( x_n \) among all examples that belong to the class \(C_m\). With these parameters, the pdf can be computed using the standard formula

\begin{equation} \pdf{x_n = v| C_m} = \frac{1}{\sqrt{2\pi \sigma_m^2}}~\exp\left( -\frac{(v - \mu_m)^2}{2\sigma_m^2}\right) \end{equation}

Note that we have replaced the probability with a probability density function, the PDF \( p(x_n = v | C_m) \), because the Gaussian distribution is a continuous distribution and pointwise probability \( P(x_n = v | C_m) \) at any value \( v \) is zero! But that is not a challenge because we only need \( p(x_n = v | C_m ) \) for comparison of relative probabilities of each class \( C_m \). The exact probability is unnecessary and the PDF suffices.
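A sketch of the Gaussian case, with a hypothetical real-valued feature column and class labels; the per-class mean and variance are estimated from the training data and plugged into the PDF:

```python
import numpy as np

# Hypothetical real-valued feature (one column of the training data) and labels.
x_n = np.array([1.2, 0.8, 1.0, 3.1, 2.9, 3.3])
y = np.array([0, 0, 0, 1, 1, 1])

def gaussian_pdf(v, mu, var):
    """Density of a univariate Gaussian with mean mu and variance var, at v."""
    return np.exp(-(v - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Per-class parameters inferred during training.
params = {m: (x_n[y == m].mean(), x_n[y == m].var()) for m in (0, 1)}

# Class-conditional density p(x_n = v | C_m) for a new value v.
v = 1.1
densities = {m: gaussian_pdf(v, mu, var) for m, (mu, var) in params.items()}
print(densities)  # the density is much higher under class 0 at v = 1.1
```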

Gaussian naive Bayes training: demo

In the following demo, we examine the nature of the Gaussian naive Bayes classifier.

Figure: Gaussian naive Bayes classification boundary, with the likelihood contours of the two classes inferred by the model.

Note that in this case the decision boundary is quadratic, not linear as we previously explained. This is because we chose different class-conditional variance matrices for the two classes; when the class variances differ, the decision boundary becomes quadratic, as you can see here. For more perspective, refer to this discussion on StackExchange.

But even such a quadratic decision boundary is not sufficient to separate our non-linear classification dataset!
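As a reproducible stand-in for the interactive demo, here is a minimal sketch using scikit-learn's GaussianNB on a synthetic two-moons dataset (assumed here as a proxy for the non-linear dataset in the figure):

```python
from sklearn.datasets import make_moons
from sklearn.naive_bayes import GaussianNB

# Synthetic non-linear, two-class dataset.
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)

# Gaussian naive Bayes fits a per-class, per-feature mean and variance,
# which yields a quadratic decision boundary when the class variances differ.
model = GaussianNB()
model.fit(X, y)

print("Training accuracy:", model.score(X, y))  # noticeably below 1.0
```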
