# Kernel density estimation

## Introduction

Kernel density estimation (KDE) is a kernel method. It is an unsupervised approach to estimating the probability density function at a given point from a set of observations drawn from the generating distribution.

## Prerequisites

To understand kernel density estimation, we recommend familiarity with the concepts in

• Probability: A sound understanding of conditional and marginal probabilities and Bayes' theorem is desirable.
• Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
• Kernel methods

Follow the above links to first get acquainted with the corresponding concepts.

## Problem setting

In density estimation, the goal of the predictive model is to estimate the probability density function at given data points.

Consider observations of the form $\vx \in \real^\ndim$ — vectors consisting of $\ndim$ features, $\vx = [x_1, x_2, \ldots, x_\ndim]$. A collection of such observations is provided in the unlabeled set $\unlabeledset = \set{\vx_1,\ldots,\vx_\nunlabeled}$.

The kernel density estimation task involves the estimation of the probability density function $f$ at a given point $\vx$.

## Motivation

A simple local estimate could just count the number of training examples $\dash{\vx} \in \unlabeledset$ in the neighborhood of the given data point $\vx$.

$$f(\vx) = \frac{\text{Number of } \dash{\vx} \in \text{Neighborhood}(\vx)}{\nunlabeled\lambda} \label{eqn:pred-local}$$

But this estimate is not smooth: the estimated density jumps abruptly as training points enter or leave the neighborhood. Moreover, we need to define the neighborhood of each point, for instance as a window of width $\lambda$ around it.
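The windowed count estimate above can be sketched as follows. This is a minimal one-dimensional illustration, assuming the neighborhood is a window of width $\lambda$ centered at the query point; the function names are ours, not from the text.

```python
def local_estimate(x, data, lam):
    """Windowed count estimate: number of training points within a
    width-lam window centered at x, normalized by N * lam."""
    count = sum(1 for xp in data if abs(xp - x) <= lam / 2)
    return count / (len(data) * lam)
```

As the query point `x` slides along the axis, the count changes in discrete steps, which is exactly the lack of smoothness noted above.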

## Predictive model

In kernel density estimation, we prefer a smoother Parzen estimate, facilitated by a kernel $K$.

$$f(\vx) = \frac{1}{\nunlabeled\lambda} \sum_{\nunlabeledsmall=1}^{\nunlabeled} K_\lambda(\vx,\vx_{\nunlabeledsmall}) \label{eqn:pred}$$

As with all kernel methods, the kernel is a similarity function — it assigns a high score to nearby examples and a low score to distant examples. Thus, if the given point lies in a highly populated region (surrounded by numerous other examples), it will receive a higher density value; otherwise the density will be lower.
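The Parzen estimate can be sketched in a few lines for one-dimensional data. This is a hedged illustration, not the article's implementation: we take the kernel to be the standard normal density evaluated at the scaled distance $|\vx - \dash{\vx}|/\lambda$, one common convention under which the $1/(\nunlabeled\lambda)$ normalization makes the estimate a proper density.

```python
import math

def gaussian_kernel(x, xp, lam):
    """K_lambda(x, x'): standard normal density at the scaled distance
    (x - x') / lam. The 1/lam factor lives in the estimator below."""
    u = (x - xp) / lam
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def parzen_estimate(x, data, lam):
    """f(x) = 1 / (N * lam) * sum_i K_lambda(x, x_i)."""
    return sum(gaussian_kernel(x, xp, lam) for xp in data) / (len(data) * lam)
```

Unlike the windowed count, this estimate varies smoothly with the query point, since every training example contributes a continuously decaying weight.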

## The Gaussian kernel

A popular choice for the kernel is the Gaussian kernel

$$K_\lambda(\vx,\dash{\vx}) = \phi_\lambda\left(|\vx - \dash{\vx}|\right)$$

where $\phi_\lambda(\cdot)$ is the density function of $\Gauss(0,\lambda^2)$ — a Gaussian with zero mean and standard deviation $\lambda$.

## KDE as a classifier

Classification is a straightforward application of KDE. In classification, we need to assign the given example $\vx$ to one of the $\nclass$ classes $C_1, \ldots, C_\nclass$ depending on the values of the $\ndim$ features. To train the classifier, we are provided with a training set of labeled examples — each a tuple $(\vx,y)$ consisting of the feature vector $\vx$ and the label $y \in \set{C_1,\ldots,C_\nclass}$. Denote the labeled set as $\labeledset = \set{(\vx_1,y_1),\ldots,(\vx_\nlabeled,y_\nlabeled)}$.

For kernel density classification, we assign a point to the class with the highest density for that point. For this, we first estimate the density of the given example $\vx$ in each of the classes as $f_\nclasssmall(\vx)$

$$f_\nclasssmall(\vx) = \frac{1}{\nlabeled_\nclasssmall\lambda} \sum_{\dash{\vx} \in \labeledset_\nclasssmall} K_\lambda(\vx,\dash{\vx}) \label{eqn:class-density}$$

where, $\labeledset_\nclasssmall$ is the set of training examples belonging to the class $\nclasssmall$ and $\nlabeled_\nclasssmall$ is the number of training examples of that class.

Having calculated the per-class density estimates, we use Bayes' rule to calculate the conditional probability of a class for the given example as

$$P(C_\nclasssmall|\vx) = \frac{P(C_\nclasssmall)f_{\nclasssmall}(\vx)}{\sum_{\dash{\nclasssmall}=1}^{\nclass}P(C_{\dash{\nclasssmall}})f_{\dash{\nclasssmall}}(\vx)} \label{eqn:class-bayes-rule}$$

Here, the marginal probability of the class, $P(C_\nclasssmall)$, is easy to calculate as

$$P(C_\nclasssmall) = \frac{\nlabeled_\nclasssmall}{\nlabeled}$$

Finally, the example is assigned to the class with the highest probability.

$$\hat{y} = \argmax_{\nclasssmall \in \set{1,\ldots,\nclass}} P(C_\nclasssmall | \vx) \label{eqn:class-pred}$$
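The full classification pipeline — per-class densities, class priors from counts, and the argmax over posteriors — can be sketched as below for one-dimensional features. This is a minimal illustration under our own naming and the scaled-distance Gaussian kernel convention; since the denominator of Bayes' rule is shared across classes, it cancels in the argmax and is omitted.

```python
import math

def gaussian_kernel(x, xp, lam):
    # standard normal density at the scaled distance (x - x') / lam
    u = (x - xp) / lam
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def class_density(x, points, lam):
    # f_m(x) = 1 / (N_m * lam) * sum over the class's training points
    return sum(gaussian_kernel(x, xp, lam) for xp in points) / (len(points) * lam)

def kde_classify(x, labeled, lam):
    """labeled: list of (feature, label) pairs.
    Returns the class maximizing the (unnormalized) posterior P(C_m) * f_m(x)."""
    by_class = {}
    for xp, y in labeled:
        by_class.setdefault(y, []).append(xp)
    n = len(labeled)
    scores = {c: (len(pts) / n) * class_density(x, pts, lam)
              for c, pts in by_class.items()}
    return max(scores, key=scores.get)
```

A query point near one cluster of labeled examples is assigned that cluster's class, since its per-class density dominates there.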

## KDE classifier: demo

In the next interaction, explore the nature of the classifier based on kernel density estimation. In particular, observe the effect of the bandwidth $\lambda$ and of the kernel type on the prediction function.

Note that increasing the value of $\lambda$ leads to a smoother predictive model, as it allows more training data points to affect the function at a given point. Conversely, small values of $\lambda$ assign high weights to nearby examples and limit the impact of distant ones, effectively resulting in a less smooth, noisier predictor.

Also note that the Epanechnikov and tricube kernels both have limited support regions. If there are no training points within the support region around the query point, the kernel method fails to produce a good prediction. On the other hand, the Gaussian kernel has infinite support and does not fail even if some region is devoid of training examples.
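The contrast between compact and infinite support is easy to see in code. Below is a small sketch, assuming the standard scaled-distance forms of the Epanechnikov and Gaussian kernels: the former is identically zero past the bandwidth, while the latter is positive everywhere.

```python
import math

def epanechnikov(x, xp, lam):
    """Epanechnikov kernel: weight drops to exactly zero
    once |x - x'| exceeds the bandwidth lam."""
    u = abs(x - xp) / lam
    return 0.75 * (1.0 - u * u) if u <= 1.0 else 0.0

def gaussian(x, xp, lam):
    """Gaussian kernel: never exactly zero, however far apart x and x' are."""
    u = (x - xp) / lam
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
```

With the Epanechnikov kernel, a query point farther than $\lambda$ from every training example receives a density of exactly zero, which is why the estimate breaks down in empty regions.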

If you have studied some of our other classification demos, you will notice that the KDE-based classifier is extremely slow. This is to be expected: there is no training phase, so at prediction time the test instance must be compared to every example in the training set to arrive at kernel-based similarity scores, leading to a computationally very expensive prediction phase.

## Dealing with feature types

Note that the predictive model involves a kernel, which in the case of the Gaussian kernel requires real-valued features. Thus, kernel density estimation is straightforward for continuous features.

In the case of categorical or binary features, we need to utilize kernels that are suitable for such features.
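The text leaves the choice of such kernels open. One established option for unordered categorical features, shown here as a hedged sketch, is the Aitchison-Aitken kernel, which assigns one weight to a category match and a smaller shared weight to every mismatch.

```python
def aitchison_aitken(x, xp, lam, num_categories):
    """Aitchison-Aitken kernel for a categorical feature with
    num_categories levels: weight 1 - lam on a match, lam / (c - 1)
    on each mismatch, so the weights over all categories sum to 1."""
    if x == xp:
        return 1.0 - lam
    return lam / (num_categories - 1)
```

At $\lambda = 0$ the kernel reduces to an exact-match indicator; larger $\lambda$ smooths probability mass onto the other categories, playing the same role the bandwidth plays for continuous features.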