## Minimizing the training loss

In matrix notation, the sum of squared errors is written as

$$ \loss(D) = \left(\vy - \mX\vw\right)^T (\vy - \mX\vw) $$

Here, \( \mX \in \real^{\nlabeled \times (\ndim+1)}\) is a matrix containing the training instances such that each row of \( \mX \) is a training instance \( \vx_\nlabeledsmall \) for all \( \nlabeledsmall \in \set{1, 2, \ldots, \nlabeled} \).
Similarly, \( \vy \in \real^\nlabeled \) is a vector containing the target variables \( y_\nlabeledsmall \) for all \( \nlabeledsmall \in \set{1, 2, \ldots, \nlabeled} \).
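
As a minimal sketch, the matrix form of the loss can be evaluated directly with NumPy. The function name `sse_loss` and the toy data are hypothetical, and the sketch assumes that the extra column of \( \mX \) is a constant 1 that absorbs the bias term.

```python
import numpy as np

def sse_loss(X, y, w):
    """Sum of squared errors (y - Xw)^T (y - Xw)."""
    residual = y - X @ w
    return float(residual @ residual)

# Hypothetical toy data: 3 instances, 1 feature, plus an assumed
# leading column of ones for the bias term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

# w = (1, 2) reproduces y = 1 + 2x exactly, so the loss is zero.
loss_exact = sse_loss(X, y, np.array([1.0, 2.0]))

# Dropping the bias leaves a residual of 1 on every instance.
loss_off = sse_loss(X, y, np.array([0.0, 2.0]))
```

Here `loss_exact` evaluates to 0 and `loss_off` to 3, one unit of squared error per training instance.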

Note that the **loss function is a quadratic function** of the parameters \( \vw \). As a sum of squares, it is also convex and bounded below by zero. Therefore, a minimum always exists, but the minimizer may not be unique.

Because the loss is a quadratic function, we find its minimizer by differentiating it with respect to the model parameters \( \vw \) and setting the derivative to zero.
The resulting *normal equation* is

\begin{aligned}
\doh{\loss(D)}{\vw} &= -2 \mX^T \left(\vy - \mX \vw\right) = 0 \\\\
\implies& \mX^T \left(\vy - \mX \vw\right) = 0 \\\\
\implies& \mX^T\vy - \mX^T\mX\vw = 0 \\\\
\implies& \mX^T\vy = \mX^T\mX\vw
\end{aligned}
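
The derivative used in this derivation is \( -2\mX^T(\vy - \mX\vw) \); the constant factor \( -2 \) drops out once the expression is set to zero. The following sketch checks that expression against a central finite-difference approximation on random data. The helper names (`sse_loss`, `sse_gradient`, `numerical_gradient`) and the data are illustrative assumptions, not part of the original text.

```python
import numpy as np

def sse_loss(X, y, w):
    """Sum of squared errors (y - Xw)^T (y - Xw)."""
    residual = y - X @ w
    return float(residual @ residual)

def sse_gradient(X, y, w):
    """Analytic gradient of the loss: -2 X^T (y - Xw)."""
    return -2.0 * X.T @ (y - X @ w)

def numerical_gradient(f, w, eps=1e-6):
    """Central finite-difference approximation of the gradient of f at w."""
    grad = np.zeros_like(w)
    for i in range(w.size):
        step = np.zeros_like(w)
        step[i] = eps
        grad[i] = (f(w + step) - f(w - step)) / (2 * eps)
    return grad

# Hypothetical random problem: 20 instances, 3 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)

analytic = sse_gradient(X, y, w)
numeric = numerical_gradient(lambda v: sse_loss(X, y, v), w)
gradients_agree = np.allclose(analytic, numeric, atol=1e-4)
```

Because the loss is quadratic, the central difference is exact up to rounding error, so the two gradients should agree closely.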

If \( \mX^T\mX \) is nonsingular (invertible), then the unique solution can be obtained by rearranging the terms of the above equation as

$$ \vw = \left(\mX^T\mX\right)^{-1} \mX^T \vy $$
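
A minimal NumPy sketch of this closed-form solution follows. It assumes \( \mX^T\mX \) is invertible and solves the linear system with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the derivation requires. The function name `fit_normal_equation` and the toy data are hypothetical.

```python
import numpy as np

def fit_normal_equation(X, y):
    """Solve X^T X w = X^T y for w, assuming X^T X is nonsingular."""
    gram = X.T @ X
    return np.linalg.solve(gram, X.T @ y)

# Hypothetical toy data generated from y = 1 + 2x, with an assumed
# leading column of ones for the bias term.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

w_hat = fit_normal_equation(X, y)

# At the solution, the normal equation X^T (y - X w) = 0 should hold.
normal_eq_residual = X.T @ (y - X @ w_hat)
```

For this data, `w_hat` recovers the generating parameters, approximately `[1, 2]`, and `normal_eq_residual` is zero up to rounding.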

What if \( \mX^T\mX \) is singular, and hence not invertible? Then the normal equation has infinitely many solutions, so the minimizer is not unique. We deal with this case in ridge regression.
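
To make the singular case concrete, the sketch below builds a hypothetical design matrix whose third column is twice its second. The columns are then linearly dependent, \( \mX^T\mX \) is rank deficient, and two different parameter vectors produce identical predictions and therefore the same loss. All names and values are illustrative assumptions.

```python
import numpy as np

# Hypothetical design matrix: third column = 2 * second column,
# so the columns are linearly dependent.
X = np.array([[1.0, 1.0, 2.0],
              [1.0, 2.0, 4.0],
              [1.0, 3.0, 6.0]])
y = np.array([2.0, 3.0, 4.0])

gram = X.T @ X
rank = np.linalg.matrix_rank(gram)  # less than 3, so gram is singular

# Two distinct parameter vectors that both fit y = 1 + x exactly.
w_a = np.array([1.0, 1.0, 0.0])
w_b = np.array([1.0, -1.0, 1.0])

loss_a = float((y - X @ w_a) @ (y - X @ w_a))
loss_b = float((y - X @ w_b) @ (y - X @ w_b))
```

Both `loss_a` and `loss_b` are zero even though `w_a` and `w_b` differ, which illustrates why the minimum exists but need not be unique.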