Lasso regression


Introduction

Just like ridge regression, lasso regression is a coefficient shrinkage approach to linear least squares. While ridge regression penalizes the sum of squares of the model coefficients, the lasso penalizes the \( L_1 \) norm of the coefficients: the sum of their absolute values. This leads to subtle but important differences from ridge regression.

Prerequisites

To understand lasso regression, we recommend familiarity with the concepts in linear least squares regression and ridge regression.

First get acquainted with those concepts before proceeding.

Problem setting

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance.

Consider such an instance \( \vx \in \real^N \), a vector consisting of \( N \) features, \(\vx = [x_1, x_2, \ldots, x_N] \).

We need to predict a real-valued output \( \hat{y} \in \real \) that is as close as possible to the true target \( y \in \real \). The hat \( \hat{ } \) denotes that \( \hat{y} \) is an estimate, to distinguish it from the truth.

Predictive model

The predictive model of lasso regression is the same as that of linear least squares regression and ridge regression. It is a linear combination of the input features with an additional bias term.

\begin{equation} \hat{y} = \vx^T \vw + b \label{eqn:reg-pred} \end{equation}

where \( \vw \) are known as the weights or parameters of the model and \( b \) is known as the bias of the model. The parameters are an \(N\)-dimensional vector, \( \vw \in \real^N \), just like the input. The bias term is a real-valued scalar, \( b \in \real \).
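As a concrete sketch, the prediction in the equation above is just a dot product plus a scalar. The feature values, weights, and bias below are illustrative, not fitted values.

```python
import numpy as np

# Illustrative (not fitted) values for an instance with N = 3 features.
x = np.array([1.0, 2.0, 3.0])   # input instance
w = np.array([0.5, -1.0, 2.0])  # weights
b = 0.25                        # bias

# Prediction: linear combination of the features plus the bias.
y_hat = x @ w + b
print(y_hat)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75
```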

Training

Training a lasso regression model involves discovering suitable weights \( \vw \) and bias \( b \).

The training approach fits the weights to minimize the squared prediction error on the training data. Specifically in the case of lasso regression, there is an additional term in the loss function — a penalty on the sum of absolute values of the weights.

Suppose \( \labeledset = \set{(\vx_1, y_1), \ldots, (\vx_\nlabeled, y_\nlabeled)} \) denotes the training set consisting of \( \nlabeled \) training instances. If \( \yhat_\nlabeledsmall \) denotes the prediction of the model for the instance \( (\vx_\nlabeledsmall, y_\nlabeledsmall) \), then the squared error over a single training example is

\begin{aligned} \ell(y_\nlabeledsmall, \yhat_\nlabeledsmall) &= \left( y_\nlabeledsmall - \yhat_\nlabeledsmall \right)^2 \\\\ &= \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T\vw - b \right)^2 \end{aligned}

The overall loss over the training set is the sum of these squared errors and the penalty involving the sum of absolute values of the weights.

\begin{equation} \mathcal{L}(\labeledset) = \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw - b\right)^2 + \lambda |\vw| \label{eqn:lasso-loss} \end{equation}

The \( L_1 \)-norm of the weights is simply the sum of absolute values of the weights, so that

$$ L_1(\vw) = |\vw| = \sum_{\ndimsmall=1}^{\ndim} |w_\ndimsmall| $$
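Putting the two equations above together, the lasso loss can be sketched in a few lines of Python. The tiny training set and parameter values below are hypothetical.

```python
import numpy as np

def lasso_loss(X, y, w, b, lam):
    """Sum of squared errors plus the L1 penalty on the weights."""
    residuals = y - (X @ w + b)     # per-instance prediction errors
    sse = np.sum(residuals ** 2)    # squared-error term
    l1 = np.sum(np.abs(w))          # L1 norm of the weights
    return sse + lam * l1

# Hypothetical training set with 3 instances and N = 2 features.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([1.0, 2.0])
b = 0.0

# SSE is 0 for these values, so the loss equals the penalty, 0.1 * 3.
print(lasso_loss(X, y, w, b, lam=0.1))
```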

The hyperparameter \( \lambda \) controls the amount of penalty on the weights. Larger values of \( \lambda \) enforce a stricter reduction in the magnitude of the weight vector; smaller values have the opposite effect of allowing weights with larger magnitudes. As a hyperparameter, \( \lambda \) is typically chosen via cross-validation.
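As a sketch of that selection step, scikit-learn's LassoCV searches over a grid of penalty strengths and keeps the one with the best cross-validated error; its alpha parameter plays the role of \( \lambda \) (up to the library's own scaling of the loss). The synthetic data below is purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data with a sparse ground-truth weight vector.
rng = np.random.default_rng(2)
X = rng.normal(size=(150, 8))
true_w = np.array([3.0, 0.0, 0.0, -1.5, 0.0, 0.0, 0.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=150)

# LassoCV fits the model over a grid of penalty strengths and
# selects the best one by 5-fold cross-validation.
model = LassoCV(cv=5).fit(X, y)
print(model.alpha_)  # the selected penalty strength
```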

The model parameters are fit to the training data by minimizing the loss above.

$$ \star{\vw} = \argmin_{\vw} \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw - b\right)^2 + \lambda |\vw| $$

Just as in the case of ridge regression, we center the target variables and fit the model without the bias term.

Note that the loss function is no longer a smooth quadratic function of the parameters \( \vw \): the \( L_1 \) penalty is not differentiable at zero. Therefore, minimizing it requires a different strategy from the derivative-based, closed-form solutions of ridge regression and linear least squares regression. Instead, the problem can be posed as a quadratic program, and in practice it is solved with iterative methods such as coordinate descent.
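One such iterative strategy can be made concrete with a minimal coordinate-descent sketch: each weight is updated in turn by soft-thresholding, which is the closed-form solution of the one-dimensional lasso subproblem. The data below is synthetic and illustrative.

```python
import numpy as np

def soft_threshold(rho, t):
    """Shrink rho towards zero by t; clamp to zero if |rho| <= t."""
    return np.sign(rho) * max(abs(rho) - t, 0.0)

def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimize sum((y - Xw)^2) + lam * sum(|w|), one coordinate at a time.
    Assumes y has been centered, so the bias term is dropped."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        for j in range(d):
            # Partial residual: remove feature j's current contribution.
            r_j = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ r_j
            z = X[:, j] @ X[:, j]
            w[j] = soft_threshold(rho, lam / 2.0) / z
    return w

# Illustrative synthetic problem with a sparse ground truth.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -3.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)
y = y - y.mean()   # center the targets, as described above

w_hat = lasso_coordinate_descent(X, y, lam=5.0)
print(w_hat)
```

Each coordinate update is exact, so the loss decreases monotonically and the iterates settle near the sparse ground truth.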

Effect of \(L_1 \) penalty

Ridge regression uses the \( L_2 \) penalty on the coefficients of the model. It has the net effect of shrinking all the coefficients towards zero. What about lasso?

The \( L_1 \) penalty not only results in shrinkage, but also in the complete suppression of some coefficients to exactly zero. To understand the reasoning behind this, explore our comprehensive article on regularization techniques.

The suppression of some coefficients to zero results in the implicit selection of the remaining coefficients. Owing to this shrinkage and selection of coefficients, \( L_1 \)-penalized regression is known as the least absolute shrinkage and selection operator, or by the resulting acronym, LASSO.
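This contrast can be seen empirically. In the sketch below (using scikit-learn with synthetic data, alpha standing in for \( \lambda \)), the lasso sets the irrelevant coefficients exactly to zero, while ridge regression only shrinks them.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Synthetic data: only 3 of 10 features actually matter.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[:3] = [1.5, -2.0, 1.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Count coefficients that are exactly zero in each model: the lasso
# suppresses the irrelevant ones; ridge merely makes them small.
print(np.sum(lasso.coef_ == 0.0), np.sum(ridge.coef_ == 0.0))
```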

This coefficient selection strategy is generally applicable and has been studied in the context of many machine learning models.

Training demo

As you will see in this demo, the training is not instantaneous, unlike that of vanilla linear least-squares regression or ridge regression. In those cases, the optimal solution for the parameters had a closed form. In the case of lasso, we have to use an iterative optimization approach.

Demo: automatically fit a linear model (blue line) to the training data.

Note that increasing the value of \( \lambda \) increases the effect of regularization, reducing the magnitude of the weight \( w \). This comes at the cost of a slight increase in the sum of squared errors (SSE).

Moreover, due to the sparsifying effect of the \( L_1 \) norm, a high penalty effectively results in \( w = 0 \) — a predictive model that is parallel to the input axis.

Dealing with feature types

Note that the predictive model involves a simple dot product between the weight vector \( \vw \) and the instance \( \vx \). This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, the dot product cannot be computed directly. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
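As a sketch of that preprocessing step (using pandas; the column and category names are made up for illustration), one-hot encoding expands each categorical column into one binary indicator column per category:

```python
import pandas as pd

# A made-up dataset with one continuous and one categorical feature.
df = pd.DataFrame({
    "size":  [1.0, 2.0, 1.5, 3.0],
    "color": ["red", "green", "blue", "green"],
})

# One-hot encode the categorical column into binary indicator columns.
encoded = pd.get_dummies(df, columns=["color"])
print(encoded.columns.tolist())
# ['size', 'color_blue', 'color_green', 'color_red']
```

The resulting frame is entirely numeric, so it can be fed to the dot product in the predictive model.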
