Training
Training a lasso regression model involves discovering suitable weights \( \vw \) and bias \( b \).
The training approach fits the weights to minimize the squared prediction error on the training data.
Specifically in the case of lasso regression, there is an additional term in the loss function — a penalty on the sum of absolute values of the weights.
Suppose \( \labeledset = \set{(\vx_1, y_1), \ldots, (\vx_\nlabeled, y_\nlabeled)} \) denotes the training set consisting of \( \nlabeled \) training instances.
If \( \yhat_\nlabeledsmall \) denotes the prediction of the model for the instance \( (\vx_\nlabeledsmall, y_\nlabeledsmall) \), then the squared error over a single training example is
\begin{aligned}
\ell(y_\nlabeledsmall, \yhat_\nlabeledsmall) &= \left( y_\nlabeledsmall - \yhat_\nlabeledsmall \right)^2 \\\\
&= \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T\vw - b \right)^2
\end{aligned}
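For concreteness, here is a minimal NumPy sketch of this per-example squared error; the feature vector, weights, bias, and target below are made-up values purely for illustration.

```python
import numpy as np

# Hypothetical values for a single training instance (illustration only).
x_i = np.array([1.0, -2.0, 0.5])   # feature vector x_i
w = np.array([0.3, 0.1, -0.4])     # weight vector w
b = 0.2                            # bias b
y_i = 1.0                          # target y_i

y_hat = x_i @ w + b                # model prediction
squared_error = (y_i - y_hat) ** 2
print(squared_error)
```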
The overall loss over the training set is the sum of these squared errors plus the penalty involving the sum of absolute values of the weights.
\begin{equation}
\mathcal{L}(\labeledset) = \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw - b\right)^2 + \lambda |\vw|
\label{eqn:lasso-loss}
\end{equation}
The \( L_1 \)-norm of the weights is simply the sum of absolute values of the weights, so that
$$ L_1(\vw) = |\vw| = \sum_{\ndimsmall=1}^{\ndim} |w_\ndimsmall| $$
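As a concrete check, here is a minimal NumPy sketch that evaluates the loss in Equation \eqref{eqn:lasso-loss}; the function name `lasso_loss` and its argument names are illustrative choices, not from any library.

```python
import numpy as np

def lasso_loss(X, y, w, b, lam):
    # X: (n, d) feature matrix, y: (n,) targets,
    # w: (d,) weights, b: scalar bias, lam: penalty strength lambda.
    residuals = y - X @ w - b        # y_n - x_n^T w - b for each example
    l1_norm = np.sum(np.abs(w))      # L1(w) = sum over m of |w_m|
    return np.sum(residuals ** 2) + lam * l1_norm
```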
The hyperparameter \( \lambda \) controls the strength of the penalty on the weights.
Larger values of \( \lambda \) enforce a stronger reduction in the magnitude of the weight vector.
Smaller values have the opposite effect, permitting weights of larger magnitude.
As a hyperparameter, \( \lambda \) is typically chosen via cross-validation.
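In practice this search is usually delegated to a library; for instance, scikit-learn's `LassoCV` selects the penalty strength by cross-validation. Note that scikit-learn names the penalty `alpha` and scales the squared-error term by \( 1/(2\nlabeled) \), so its values are not directly comparable to the \( \lambda \) in Equation \eqref{eqn:lasso-loss}; the data below is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Hypothetical synthetic data (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Choose the penalty strength by 5-fold cross-validation.
model = LassoCV(cv=5).fit(X, y)
print(model.alpha_)   # selected penalty strength
print(model.coef_)    # fitted weights; some are driven exactly to zero
```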
The model parameters are fit to the training data by minimizing the loss above.
$$ \star{\vw} = \argmin_{\vw} \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw - b\right)^2 + \lambda |\vw| $$
Just as in the case of ridge regression, we center the target variables and fit the model without the bias term.
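A minimal sketch of this preprocessing step, under the assumption (carried over from ridge regression) that the feature columns are centered along with the targets, so that the bias can be dropped during fitting and recovered afterwards:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # hypothetical features
y = X @ np.array([2.0, 0.0, -1.0]) + 3.0      # targets with a nonzero bias

x_mean, y_mean = X.mean(axis=0), y.mean()
X_c, y_c = X - x_mean, y - y_mean             # center features and targets

# Fit w on (X_c, y_c) with no bias term; the bias for the original data
# is then recovered as b = y_mean - x_mean @ w.
```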
Note that the loss function is no longer a quadratic function of the parameters \( \vw \).
Because the absolute-value penalty makes the loss non-differentiable wherever a weight equals zero, minimizing it requires a different strategy from the derivative-based approaches used for ridge regression and linear least squares regression.
Instead, the minimization is cast as a quadratic program and handed to a quadratic programming solver.
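A minimal sketch of this route using the `cvxpy` modeling library, which hands the problem to a convex (quadratic programming) solver; `cvxpy` is one possible choice of library, and the data and penalty strength below are hypothetical.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.5, 0.0, -2.0, 0.0, 0.5]) + rng.normal(scale=0.1, size=100)
X_c = X - X.mean(axis=0)                 # centered features
y_c = y - y.mean()                       # centered targets, so no bias term
lam = 1.0                                # penalty strength (hypothetical)

w = cp.Variable(5)
objective = cp.Minimize(cp.sum_squares(y_c - X_c @ w) + lam * cp.norm1(w))
cp.Problem(objective).solve()
print(w.value)                           # some weights are driven (near) zero
```

Dedicated solvers such as coordinate descent are also widely used for the lasso in practice; scikit-learn's implementation, for example, relies on coordinate descent rather than a general-purpose quadratic programming solver.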