Bias-variance tradeoff


Introduction

The bias-variance tradeoff is a central concept in the frequentist view of machine learning. In any machine learning task, the primary goal is to improve predictive performance on previously unobserved examples. For example, in the case of regression, we want to minimize the expected loss on unseen test examples. The bias-variance tradeoff provides an argument for choosing suitable models by decomposing the expected loss into bias, variance, and irreducible noise components.

Although conceptually important, it is of limited practical use in choosing models or model families. Nevertheless, it is crucial to understand this concept to better appreciate the need for choosing models with appropriate complexity on a particular machine learning task.

Prerequisites

To understand the bias-variance tradeoff, we recommend first getting acquainted with the concepts of regression, expectation, variance, and covariance.

Problem setting

To understand the bias-variance decomposition, we will use the regression task as an example.

Suppose the input features are \(\ndim\)-dimensional vectors, \( \vx \in \real^\ndim \), so that \(\vx = [x_1, x_2, \ldots, x_\ndim] \). The true relationship to the targets is governed by a deterministic function \( f: \real^\ndim \to \real \) that is unknown to us. What we instead observe is the dataset \( \dataset = \set{(\vx_1,y_1), \ldots, (\vx_\ndata,y_\ndata)} \), where the target variables are related to the true outputs as

$$ y = f(\vx) + \epsilon $$

This means that the observed data always contain some random intrinsic noise \( \epsilon \), drawn from a zero-mean distribution with variance \( \sigma^2 \), for example \( \Gauss(0,\sigma^2) \).

Our goal in regression is to use the available training data to guess a function \( g_\dataset: \real^\ndim \to \real \) that accurately maps the inputs to the outputs.

$$ \yhat = g_\dataset(\vx) $$

The hat \( \hat{ } \) denotes that \( \hat{y} \) is an estimate, to distinguish it from the observation \( y \). The subscript \( \dataset \) on \( g_\dataset \) indicates that the function \( g_\dataset \) has been trained on samples from the dataset \( \dataset \).
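
To make this setting concrete, here is a minimal sketch that generates a synthetic dataset of the form \( y = f(\vx) + \epsilon \). The particular choice of \( f \), the noise level, and all names are hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(X):
    # A stand-in for the true function, which is unknown to us in practice.
    return np.sin(3.0 * X[:, 0])

def make_dataset(n, n_dim=1, sigma=0.3):
    # Observed targets are the true outputs corrupted by zero-mean Gaussian noise.
    X = rng.uniform(-1.0, 1.0, size=(n, n_dim))
    y = f(X) + rng.normal(0.0, sigma, size=n)
    return X, y

X_train, y_train = make_dataset(n=50)
print(X_train.shape, y_train.shape)  # (50, 1) (50,)
```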

What constitutes a good guess?

Oftentimes, we need to choose the function \( g_\dataset(\vx) \) from different model families or from different hyperparameter settings within the same model family. For example, we may try to choose \( g_\dataset(\vx) \) from among families such as \(K\)-nearest neighbor regression, linear regression, kernel regression, or random forest regression. Even within the \(K\)-nearest-neighbor regression family, our model would be different depending on the value of \( K \).

How do we know if we are making a good guess? And how do we make the appropriate selection among the varied choices for \( g_\dataset \)?

Well, we choose the option with the best prediction performance — the one with the lowest prediction error.

Prediction error

We want our guess function \( g_\dataset(\vx) \) to accurately model \( y \). In other words, we want our prediction \( g_\dataset(\vx) \) to be very close to the true value \( f(\vx) \). We can measure the error in our prediction with the squared loss \( \ell(y,\yhat) \).

\begin{aligned} \ell(y, \yhat) &= \left(y - \yhat\right)^2 \\\\ &= \left(f(\vx) + \epsilon - g_\dataset(\vx)\right)^2 \\\\ &= \left(f + \epsilon - g\right)^2 \\\\ \label{eqn:pred-error} \end{aligned}

In the final step, to reduce clutter, we have used the shorthand \( f \) for \( f(\vx) \) and \( g \) for \( g_\dataset(\vx) \).
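
As a tiny sketch, this loss for a single prediction is just the squared difference:

```python
def squared_loss(y, y_hat):
    # Squared error between an observed target and a prediction.
    return (y - y_hat) ** 2

print(squared_loss(2.0, 1.5))  # 0.25
```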

Expected loss

The loss in Equation \eqref{eqn:pred-error} is over a single example. What we are really interested in is a low expected loss. In this case, the expectation is taken over the datasets \( \dataset \) that could be observed, since we do not have access to the true data-generating function \( f(\vx) \).

The expected loss of the predictions of \( g \), with the expectation computed over the dataset \( \dataset \), is

\begin{align} \expect{\dataset}{\ell(y, \yhat)} &= \expect{\dataset}{\left(f + \epsilon - g\right)^2} \\\\ &= \expect{}{\left(f + \epsilon - g + \expect{}{g} - \expect{}{g}\right)^2} \\\\ &= \expect{}{\left((f - \expect{}{g}) + \epsilon + (\expect{}{g} - g)\right)^2} \\\\ &= \expect{}{(f - \expect{}{g})^2 + \epsilon^2 + (\expect{}{g} - g)^2 + 2\epsilon(f - \expect{}{g}) + 2\epsilon(\expect{}{g}-g) + 2(f - \expect{}{g})(\expect{}{g} - g)} \\\\ &= \expect{}{(f - \expect{}{g})^2} + \expect{}{\epsilon^2} + \expect{}{(\expect{}{g} - g)^2} + 2\expect{}{\epsilon(f - \expect{}{g})} + 2\expect{}{\epsilon(\expect{}{g}-g)} + 2\expect{}{(f - \expect{}{g})(\expect{}{g} - g)} \\\\ &= \expect{}{(f - \expect{}{g})^2} + \expect{}{\epsilon^2} + \expect{}{(\expect{}{g} - g)^2} + 2\expect{}{\epsilon}\expect{}{(f - \expect{}{g})} + 2\expect{}{\epsilon}\expect{}{(\expect{}{g}-g)} + 2(f - \expect{}{g})\expect{}{(\expect{}{g} - g)} \\\\ &= \expect{}{(f - \expect{}{g})^2} + \expect{}{\epsilon^2} + \expect{}{(\expect{}{g} - g)^2} \\\\ &= (f - \expect{}{g})^2 + \sigma^2 + \text{Variance}(g) \\\\ &= \left[\text{Bias}(g)\right]^2 + \sigma^2 + \text{Variance}(g) \\\\ \label{eqn:expected-error} \end{align}

To reduce clutter, we have not included \( \dataset \) in the subscript of each expectation, but it is there, and every expectation should be read as being over \( \dataset \). The cross terms vanish because the noise \( \epsilon \) has zero mean and is independent of the training dataset, and because \( f - \expect{}{g} \) is a constant, so that \( \expect{}{\expect{}{g} - g} = 0 \).

This is known as the bias-variance decomposition of the expected squared loss. Such a decomposition is always possible for expected squared loss, irrespective of the machine learning model being used. It helps in understanding the complexity of the predictive model, as we will study next.
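
Since the decomposition holds for any model, we can verify it numerically. The following sketch repeatedly resamples training datasets for a simple estimator and checks that the empirical expected loss at a fixed test point roughly equals \( \text{Bias}^2 + \sigma^2 + \text{Variance} \). Everything here (the true function, the noise level, the estimator) is an illustrative assumption; in practice we cannot resample datasets like this.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.3                      # noise standard deviation
x0 = np.array([[0.5]])           # a fixed test point

def f(X):
    # Stand-in for the true (unknown) function.
    return np.sin(3.0 * X[:, 0])

def fit_and_predict(X, y, X_test):
    # A deliberately simple estimator g_D: a straight-line (degree-1) fit.
    coeffs = np.polyfit(X[:, 0], y, deg=1)
    return np.polyval(coeffs, X_test[:, 0])

preds, losses = [], []
for _ in range(5000):
    # Resample a fresh training dataset D and a fresh noisy test target.
    X = rng.uniform(-1.0, 1.0, size=(30, 1))
    y = f(X) + rng.normal(0.0, sigma, size=30)
    y0 = f(x0)[0] + rng.normal(0.0, sigma)
    pred = fit_and_predict(X, y, x0)[0]
    preds.append(pred)
    losses.append((y0 - pred) ** 2)

preds = np.array(preds)
bias_sq = (f(x0)[0] - preds.mean()) ** 2
variance = preds.var()
print("empirical expected loss    :", np.mean(losses))
print("bias^2 + sigma^2 + variance:", bias_sq + sigma**2 + variance)
```

The two printed quantities should agree up to Monte Carlo error.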

Implications of the bias-variance decomposition

From the decomposition, we observe that the error of our predictive model consists of three terms — bias of the model, variance of the model, and variance of the noise. Of these, the variance of the noise, \( \sigma^2 \), is beyond our control. We cannot reduce it during the modeling phase.

We can control the other two — the bias and the variance of our predictive model. Ideally, we would reduce both bias and variance to achieve a lower error. But this comes with a challenge.

When we attempt to reduce the bias, the variance of the model should not increase. Otherwise, the reduction in bias is futile, because the overall error may remain the same or, in the worst case, even increase.

Let us understand this in the context of the \(K\)-nearest neighbor model for regression.

Bias of the \(K\)-nearest neighbor model

We have studied in our comprehensive article on \(K\)-nearest neighbor models that the predictive model for regression is the average of the target variables \( y_i \) of the training observations \( \vx_i \) that are among the \( K \) nearest examples to the test point.

$$ \yhat_0 = g(\vx_0) = \frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} y_i $$

where, \( \mathbb{N}_K(\vx_0) \) is the neighborhood containing the \( K \)-nearest neighbors of the test point \( \vx_0 \).
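
As a minimal sketch of this prediction rule (assuming Euclidean distance defines "nearest"):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, K):
    # Distances from the test point x0 to every training point.
    dists = np.linalg.norm(X_train - x0, axis=1)
    # Indices of the K nearest neighbors, i.e. the neighborhood N_K(x0).
    neighbors = np.argsort(dists)[:K]
    # The prediction is the average of their target values.
    return y_train[neighbors].mean()

# Tiny usage example with made-up data.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.1, 0.9, 2.1, 2.9])
print(knn_predict(X_train, y_train, x0=np.array([1.6]), K=2))  # averages the targets at x=1.0 and x=2.0
```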

The bias of this model, as a function of the hyperparameter \( K \), can be calculated as follows:

\begin{aligned} \text{Bias}_K &= f - \expect{}{g} \\\\ &= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} y_i} \\\\ &= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \left(f(\vx_i) + \epsilon_i\right)} \\\\ &= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} f(\vx_i)} - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \epsilon_i} \\\\ &= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} f(\vx_i)} - \frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \expect{}{\epsilon_i} \\\\ &= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} f(\vx_i)} \end{aligned}

where, we have used the fact that \( \expect{}{\epsilon} = 0 \).

Let us understand the effect of changing the value of \( K \).

Since we need at least one neighbor to make an informed prediction, the minimum value is \( K = 1 \). In this case, the predictive model depends on the single nearest example. If we have enough representative samples in the training set, the nearest neighbor of the test point will actually be very close to the test point. Mathematically, \( \vx_1 \approx \vx_0 \). Consequently, if \( f \) is a smooth function, \( f(\vx_1) \approx f(\vx_0) \). This means that the bias of the one-nearest-neighbor model is, asymptotically, zero.

On the other hand, if \( K \) is large, the neighborhood \( \mathbb{N}_K(\vx_0) \) will be very wide. Smooth or not, unless \( f \) is constant over this wide neighborhood, the actual value of \( f(\vx_0) \) will be very different from that of the prediction \( g(\vx_0) \). This behavior will only get worse as we increase \( K \).

This means that the bias of the \(K\)-nearest neighbor model increases as \( K \) increases. Does this mean \( K = 1 \) is always the best choice? Let's find out.

Variance of the \(K\)-nearest neighbor model

The variance of the \(K\)-nearest neighbor model for regression, as a function of \(K\), can be calculated as follows:

\begin{aligned} \text{Variance}_K &= \text{Variance}\left(g(\vx_0)\right) \\\\ &= \text{Variance} \left( \frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} y_i \right)\\\\ &= \frac{1}{K^2} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \text{Variance} \left(y_i\right) + \frac{1}{K^2} \sum_{\vx_i, \vx_j \in \mathbb{N}_K(\vx_0),\, i \ne j} \text{Covariance} \left(y_i, y_j \right)\\\\ &= \frac{1}{K^2} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \text{Variance} \left(f(\vx_i) + \epsilon_i \right)\\\\ &= \frac{1}{K^2} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \left[ \text{Variance} \left(f(\vx_i)\right) + \text{Variance}\left(\epsilon_i\right) \right]\\\\ &= \frac{1}{K^2} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \left[ 0 + \sigma^2 \right] \\\\ &= \frac{1}{K^2} K\sigma^2 \\\\ &= \frac{\sigma^2}{K} \\\\ \end{aligned}

To arrive at this result, we have used the following formulae.

  1. \( \text{Variance}(aZ) = a^2 \text{Variance}(Z) \), if \( a \) is a scalar multiplier on the random variable \( Z \).
  2. Variance of a sum of random variables decomposes as follows: \(\text{Variance}\left(\sum_{m=1}^M z_m\right) = \sum_{m=1}^M \text{Variance}(z_m) + \sum_{m=1}^M \sum_{l=1,\, l \ne m}^{M} \text{Covariance}(z_m, z_l) \)
  3. In our case, \( \text{Covariance}(y_i,y_j) = 0 \), for all \( i \ne j \), because the data points have been sampled I.I.D.
  4. For the noise, \( \text{Variance}(\epsilon) = \sigma^2 \)
  5. In this case, \( \text{Variance}(f(\vx)) = 0 \), because \( f \) is a deterministic function of its input.
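
Under the same idealized assumptions as the derivation above (the \( K \) neighbor locations held fixed, only the noise resampled across datasets), a quick simulation confirms the \( \sigma^2 / K \) behavior. The function and constants here are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, K, n_sims = 0.3, 5, 100_000

def f(x):
    # Stand-in for the true function (for illustration only).
    return np.sin(3.0 * x)

# Fixed neighbor locations; only the noise on their targets is resampled.
x_neighbors = np.linspace(0.4, 0.6, K)
noise = rng.normal(0.0, sigma, size=(n_sims, K))
preds = (f(x_neighbors) + noise).mean(axis=1)  # the K-NN prediction in each simulated dataset

print("empirical variance of the prediction:", preds.var())
print("sigma^2 / K                         :", sigma**2 / K)
```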

Let us now understand the effect of changing the value of \( K \).

The variance of the model is inversely proportional to the value of \( K \). When \( K \) increases, the variance decreases; when \( K \) is reduced, the variance becomes larger.

This is also intuitive to observe: As we increase \( K \), the predictive model is dependent on more training examples. For very large values of \( K \), the predictive model is not going to change much if some of those points are missing from the training set. In other words, larger values of \( K \) make the predictive model less susceptible to changes in the training samples, thereby leading to lower variance.

Conversely, smaller values of \( K \) make the predictive model heavily dependent on the closest points. Missing a single training point from a sample may lead to drastically different prediction for the same test point.

Bias-variance trade-off

As we saw in the case of the \(K\)-nearest neighbor model, the effect of \( K \) on bias and variance is opposite — larger values of \( K \) lead to higher bias and lower variance, and smaller values of \( K \) lead to lower bias and higher variance. This trade-off is not limited to just \(K\)-nearest neighbors.

In fact, it is an example of the bias-variance trade-off that is common to all machine learning models. More complex models, those that can conform closely to the training examples, typically have higher variance and lower bias. Simpler models have lower variance but higher bias. Training good predictive models involves dealing with this trade-off to arrive at models with just the right mix of variance and bias.
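
To see both effects at once, the following sketch (under the same illustrative assumptions as before: a hypothetical \( f \), Gaussian noise, and the freedom to resample training sets) estimates the squared bias and the variance of the \(K\)-nearest neighbor prediction at a fixed test point for several values of \( K \).

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n_train, n_sims = 0.3, 60, 2000
x0 = 0.5   # a fixed test point

def f(x):
    # Stand-in for the true function (illustrative assumption).
    return np.sin(3.0 * x)

def knn_predict(x_train, y_train, x_test, K):
    neighbors = np.argsort(np.abs(x_train - x_test))[:K]
    return y_train[neighbors].mean()

for K in [1, 5, 15, 45]:
    preds = np.empty(n_sims)
    for s in range(n_sims):
        x_train = rng.uniform(-1.0, 1.0, size=n_train)
        y_train = f(x_train) + rng.normal(0.0, sigma, size=n_train)
        preds[s] = knn_predict(x_train, y_train, x0, K)
    bias_sq = (f(x0) - preds.mean()) ** 2
    variance = preds.var()
    print(f"K={K:2d}  bias^2={bias_sq:.4f}  variance={variance:.4f}  "
          f"sum={bias_sq + variance:.4f}")
```

In this simulation, the variance column shrinks and the squared-bias column grows as \( K \) increases, and the smallest sum is attained at an intermediate value of \( K \).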

Practical implications

Although useful for a theoretical understanding of the trade-off, the bias and variance are seldom, if ever, calculated in practice. That is because computing them requires taking expectations of the predictive model over many samples of the training dataset, and the true function \( f \) is unknown to us.

How do we choose good models in practice then, in the light of the bias and variance?

Variance is typically the result of model variability due to changing training samples. Complicated models with many free parameters (many degrees of freedom) are more susceptible to such changes in the training set. This means that if you have a very small training sample, you should be wary of model variance and opt for high-bias, low-variance models. In other words, when learning with limited training data, it is better to choose simpler models with few parameters. If you have large datasets with millions of training examples, be bold and train complex models. In fact, the availability of large datasets is one of the primary reasons for the recent successes in deep learning, a family of high-complexity models.

Instead of computing bias and variance explicitly, it is typically a good idea to just compare predictive models on their expected loss by using methods such as cross-validation.
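
For instance, one common approach, sketched here with scikit-learn (assuming it is available; the data and the candidate values of \( K \) are illustrative), is to compare models by their cross-validated squared error and pick the best.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic data standing in for a real training set.
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.3, size=200)

for K in [1, 5, 15, 45]:
    model = KNeighborsRegressor(n_neighbors=K)
    # 5-fold cross-validated mean squared error (scikit-learn reports it negated).
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    print(f"K={K:2d}  cross-validated MSE: {-scores.mean():.4f}")
```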
