Bias of the \(K\)-nearest neighbor model
We studied in our comprehensive article on \( K \)-nearest neighbor models that the predictive model for regression is the average of the target values \( y_i \) of the training observations \( \vx_i \) that are among the \( K \) nearest examples to the test point.
$$ \yhat_0 = g(\vx_0) = \frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} y_i $$
where \( \mathbb{N}_K(\vx_0) \) is the neighborhood containing the \( K \) nearest neighbors of the test point \( \vx_0 \).
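As a concrete illustration, here is a minimal NumPy sketch of this predictor. The toy one-dimensional dataset, the function \( f(x) = \sin(x) \), the noise level, and the test point are all assumptions made purely for illustration.

```python
# A minimal sketch of the K-NN regression prediction above, using plain NumPy.
# The toy data, the true function, and the test point are illustrative assumptions.
import numpy as np

def knn_predict(x0, x_train, y_train, K):
    """Average the targets of the K training points closest to x0."""
    dists = np.linalg.norm(x_train - x0, axis=1)   # distances to the test point
    neighbors = np.argsort(dists)[:K]              # indices of the K nearest points
    return y_train[neighbors].mean()               # g(x0) = (1/K) * sum of their y_i

rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=(200, 1))                  # 200 one-dimensional inputs
y_train = np.sin(x_train[:, 0]) + rng.normal(0, 0.3, 200)    # y = f(x) + noise
x0 = np.array([0.5])
print(knn_predict(x0, x_train, y_train, K=5))                # should be close to sin(0.5) ≈ 0.479
```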
The bias of this model, as a function of the hyperparameter \( K \), can be calculated as follows:
\begin{aligned}
\text{Bias}_K &= f(\vx_0) - \expect{}{g(\vx_0)} \\\\
&= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} y_i} \\\\
&= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \left(f(\vx_i) + \epsilon_i\right)} \\\\
&= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} f(\vx_i)} - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \epsilon_i} \\\\
&= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} f(\vx_i)} - \frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} \expect{}{\epsilon_i} \\\\
&= f(\vx_0) - \expect{}{\frac{1}{K} \sum_{\vx_i \in \mathbb{N}_K(\vx_0)} f(\vx_i)}
\end{aligned}
where we have used the fact that \( \expect{}{\epsilon_i} = 0 \).
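To sanity-check the final expression, we can estimate the remaining expectation by simulation: draw many training sets, average \( f \) over the \( K \) nearest neighbors of \( \vx_0 \) in each, and compare with \( f(\vx_0) \). The sketch below assumes \( f(x) = \sin(x) \), inputs drawn uniformly from \( [-3, 3] \), \( n = 200 \), \( \vx_0 = 0.5 \), and \( K = 5 \); the noise term is omitted because its expectation is zero.

```python
# A rough Monte Carlo check of the final bias expression, under assumed settings.
# We average f over the K nearest neighbors of x0 across many simulated training
# sets, which estimates E[(1/K) * sum f(x_i)] in the last line above.
import numpy as np

rng = np.random.default_rng(0)
f = np.sin
x0, K, n, n_sims = 0.5, 5, 200, 2000

neighbor_means = []
for _ in range(n_sims):
    x_train = rng.uniform(-3, 3, size=n)               # fresh training inputs each run
    nearest = np.argsort(np.abs(x_train - x0))[:K]     # K nearest neighbors of x0
    neighbor_means.append(f(x_train[nearest]).mean())  # (1/K) * sum of f(x_i)

bias = f(x0) - np.mean(neighbor_means)                 # f(x0) - E[average of f over N_K(x0)]
print(f"estimated bias at x0={x0} with K={K}: {bias:.4f}")
```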
Let us understand the effect of changing the value of \( K \).
Since we need at least one neighbor to make an informed prediction, the minimum value is \( K = 1\).
In this case, the predictive model depends on a single nearest example.
If the training set contains enough representative samples, the nearest neighbor of the test point will lie very close to it.
Mathematically, \( \vx_0 \approx \vx_1 \). Consequently, if \( f \) is a smooth function, \( f(\vx_0) \approx f(\vx_1) \), so the expectation in the bias expression is approximately \( f(\vx_0) \).
This means that the bias of the one-nearest neighbor model is asymptotically zero, as the simulation below suggests.
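The sketch checks this asymptotic claim: as the training set grows, the nearest neighbor of the test point drifts toward it, and \( f(\vx_1) \) approaches \( f(\vx_0) \). The settings (\( f(x) = \sin(x) \), uniform inputs, the particular sample sizes) are assumptions for illustration only.

```python
# As the training set grows, the single nearest neighbor of x0 moves closer to x0,
# so the gap |f(x0) - f(x_1)|, which drives the 1-NN bias, shrinks toward zero.
import numpy as np

rng = np.random.default_rng(0)
f, x0 = np.sin, 0.5

for n in [10, 100, 1000, 10000]:
    gaps = []
    for _ in range(500):                                   # average over many training sets
        x_train = rng.uniform(-3, 3, size=n)
        x_nearest = x_train[np.argmin(np.abs(x_train - x0))]
        gaps.append(abs(f(x0) - f(x_nearest)))             # gap bounding the 1-NN bias
    print(f"n={n:6d}  mean |f(x0) - f(x_1)| = {np.mean(gaps):.5f}")
```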
On the other hand, if \( K \) is large, the neighborhood \( \mathbb{N}_K(\vx_0) \) will be very wide.
Smooth or not, unless \( f \) is nearly constant over this wide neighborhood, the average of \( f \) over the neighbors, and hence the expected prediction \( \expect{}{g(\vx_0)} \), will differ substantially from \( f(\vx_0) \).
This behavior will only get worse as we increase \( K \).
This means that the bias of the \( K \)-nearest neighbor model increases as \( K \) increases.
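We can check this trend numerically by repeating the Monte Carlo estimate above over a range of \( K \) values, using the same illustrative assumptions (\( f(x) = \sin(x) \), uniform inputs on \( [-3, 3] \), \( n = 200 \), \( \vx_0 = 0.5 \)).

```python
# Estimated bias f(x0) - E[(1/K) * sum of f over the K nearest neighbors] for several K.
# The magnitude of the bias should grow as K grows, since wider neighborhoods average
# f over regions where it is far from f(x0).
import numpy as np

rng = np.random.default_rng(0)
f, x0, n, n_sims = np.sin, 0.5, 200, 2000

for K in [1, 5, 25, 100, 200]:
    means = []
    for _ in range(n_sims):
        x_train = rng.uniform(-3, 3, size=n)
        nearest = np.argsort(np.abs(x_train - x0))[:K]
        means.append(f(x_train[nearest]).mean())
    print(f"K={K:3d}  estimated bias = {f(x0) - np.mean(means):+.4f}")
```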
Does this mean that \( K = 1 \) is always the best choice? Let's find out.