Introduction to ridge regression


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)

        Ridge regression
        Machine Learning
      

Introduction

Ridge regression is a regularized version of linear least squares regression. It works by shrinking the coefficients or weights of the regression model towards zero. This is achieved by imposing a squared penalty on their size.

Prerequisites

To understand ridge regression, we recommend familiarity with the concepts in

Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
Linear algebra
Linear least squares regression

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance.

Consider such an instance $ \vx \in \real^N $, a vector consisting of $ N $ features, $\vx = [x_1, x_2, \ldots, x_N] $.

We need to predict a real-valued output $ \hat{y} \in \real $ that is as close as possible to the true target $ y \in \real $. The hat $ \hat{ } $ denotes that $ \hat{y} $ is an estimate, to distinguish it from the truth.

Predictive model

The predictive model of ridge regression is the same as that of linear least squares regression. It is a linear combination of the input features with an additional bias term.

\begin{equation} \hat{y} = \vx^T \vw + b \label{eqn:reg-pred} \end{equation}

where $ \vw $ are known as the weights or parameters of the model and $ b $ is known as the bias of the model. The parameters are an $N$-dimensional vector, $ \vw \in \real^N $, just like the input. The bias term is a real-valued scalar, $ b \in \real $.
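
As a minimal sketch of this predictive model, the NumPy snippet below evaluates $ \vx^T \vw + b $. The function name `predict` and the numeric values are made up for illustration and are not part of the original article.

```python
import numpy as np

def predict(x, w, b):
    """Linear prediction y_hat = x^T w + b for one instance or a batch of rows."""
    return np.asarray(x) @ np.asarray(w) + b

# Made-up example with N = 3 features
x = np.array([1.0, 2.0, 3.0])    # input instance
w = np.array([0.5, -1.0, 2.0])   # weights, same dimension as x
b = 0.25                         # scalar bias

y_hat = predict(x, w, b)
print(y_hat)  # 0.5 - 2.0 + 6.0 + 0.25 = 4.75

# The same function handles a matrix whose rows are instances
batch = np.array([x, 2.0 * x])
y_hat_batch = predict(batch, w, b)
```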

Training

Training a ridge regression model involves discovering suitable weights $ \vw $ and bias $ b $.

The training approach fits the weights to minimize the squared prediction error on the training data. Specifically in the case of ridge regression, there is an additional term in the loss function — a penalty on the sum of squares of the weights.

Suppose $ \labeledset = \set{(\vx_1, y_1), \ldots, (\vx_\nlabeled, y_\nlabeled)} $ denotes the training set consisting of $ \nlabeled $ training instances. If $ \yhat_\nlabeledsmall $ denotes the prediction of the model for the instance $ (\vx_\nlabeledsmall, y_\nlabeledsmall) $, then the squared error over a single training example is

\begin{aligned} \ell(y_\nlabeledsmall, \yhat_\nlabeledsmall) &= \left( y_\nlabeledsmall - \yhat_\nlabeledsmall \right)^2 \\\\ &= \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T\vw - b \right)^2 \end{aligned}

The overall loss over the training set is the sum of these squared errors and the penalty involving the sum of squares of the weights.

\begin{equation} \mathcal{L}(\labeledset) = \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw - b\right)^2 + \lambda \norm{\vw}{}^2 \label{eqn:ridge-loss} \end{equation}

Here, the hyperparameter $ \lambda \ge 0 $ controls the amount of penalty on the weights. Larger values of $ \lambda $ enforce a stricter reduction in the magnitude of the weight vector. Smaller values have the opposite effect of allowing weights with larger magnitudes.
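
The loss in Equation \eqref{eqn:ridge-loss} can be sketched in a few lines of NumPy. The function name `ridge_loss` and the toy data are assumptions chosen so that the arithmetic is easy to verify by hand.

```python
import numpy as np

def ridge_loss(X, y, w, b, lam):
    """Sum of squared errors plus lam times the squared norm of w (bias unpenalized)."""
    residuals = y - (X @ w + b)
    return np.sum(residuals ** 2) + lam * np.sum(w ** 2)

# Toy data: three instances with two features each
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.0])
b = 0.0

# Predictions are 0.5, 1.5, 2.5, so every residual is 0.5 and the SSE is 0.75
loss_no_penalty = ridge_loss(X, y, w, b, lam=0.0)
# The penalty adds lam * ||w||^2 = 2.0 * 0.25 = 0.5
loss_with_penalty = ridge_loss(X, y, w, b, lam=2.0)
print(loss_no_penalty, loss_with_penalty)  # 0.75 1.25
```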

The model parameters are fit to the training data by minimizing the loss above.

$$ \star{\vw} = \argmin_{\vw} \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw - b\right)^2 + \lambda \norm{\vw}{}^2 $$

Note that the loss is a quadratic function of the parameters and is bounded below by zero, so its minimum always exists. Moreover, in the case of ridge regression with $ \lambda > 0 $, the solution is unique. This was not the case with vanilla linear least squares. We will study the reason for this in a bit.

Why penalize the weights?

We cover the general motivations behind coefficient penalization in more detail in our comprehensive article on regularization techniques. Here, we provide some intuition on coefficient shrinkage in the context of ridge regression.

An alternative way to express the loss in Equation \eqref{eqn:ridge-loss} is as follows:

\begin{align} \star{\vw} =& \argmin_{\vw} \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw - b\right)^2 \\\\ & \text{subject to } \sum_{\ndimsmall=1}^{\ndim} w_\ndimsmall^2 \le s \label{eqn:ridge-loss-alternate} \end{align}

This is similar in spirit to the loss in Equation \eqref{eqn:ridge-loss} because the size constraint $ s $ has a similar effect to that of $ \lambda $. Larger values of $ s $ allow coefficients with larger magnitudes, just like a smaller value of $ \lambda $ would. Smaller values of $ s $ constrain the weights to smaller magnitudes, similar to a larger value of $ \lambda $.

Now, imagine there were no size constraint. In other words, $ s = \infty $. What would happen? Some weights could become large and positive. To maintain the same loss value and predictive capability, some other weights would become large and negative to counter these weights and still lead to the same $ \yhat $. This can happen, for example, when features are strongly correlated. In such cases, there is no limit to the variation in the weights that solve this minimization problem.

This means, with no constraint, a unique solution to the optimization problem cannot be guaranteed. This problem is prevented by imposing a size constraint, or regularization penalty on the weights.
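
A small numerical illustration of this effect, under the assumption of an extreme case where two features are exact copies of each other. The data is synthetic and the candidate weight vectors are hand-picked. All candidates fit the training data equally well, but their sizes differ wildly, which is exactly what a size constraint rules out.

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=20)
X = np.column_stack([x1, x1])   # two identical (perfectly correlated) features
y = 3.0 * x1                    # targets depend only on the shared feature

def sse(w):
    """Unpenalized sum of squared errors for weights w (no bias needed here)."""
    return np.sum((y - X @ w) ** 2)

# Any pair of weights that sums to 3 reproduces the targets
candidates = [
    np.array([3.0, 0.0]),
    np.array([103.0, -100.0]),  # large positive weight cancelled by a large negative one
    np.array([1.5, 1.5]),
]
errors = [sse(w) for w in candidates]
sizes = [np.sum(w ** 2) for w in candidates]

for w, err, size in zip(candidates, errors, sizes):
    print(w, err, size)
```

The squared sizes are 9, 20609, and 4.5 respectively, so a constraint on $ \sum_{\ndimsmall} w_\ndimsmall^2 $ favors the balanced solution.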

Why is the bias term not penalized?

Notice that the bias term has been left out of the penalty term of the loss in Equation \eqref{eqn:ridge-loss}.

It is natural to ask the question: If the bias term is also a parameter of the regression model, then why don't we regularize it?

Consider this thought experiment. If a constant term $ c $ is added to each of the target $ y_i$'s, then the entire predictive model should shift accordingly. During training, the bias term will adapt to include this constant term, so that predictions from the trained model also reflect a constant addendum of $ c $.

Thus, intuitively, the bias term centers the linear predictive model. Any constant added to all targets merely shifts the center of the target variables, which affects only the bias term.

Now, if we imposed a shrinking penalty on the bias term, it would be forced towards zero. It would be unable to fully model this centering effect.

Therefore, we do not penalize the bias term.
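
The thought experiment can be checked numerically. The sketch below is one possible way to do it, not the article's own implementation: it appends a column of ones to the inputs so that the bias becomes the last parameter, and solves the resulting linear system for the minimizer of the loss. The helper `fit_ridge_with_bias` and its `penalize_bias` flag are hypothetical names, and the data is synthetic.

```python
import numpy as np

def fit_ridge_with_bias(X, y, lam, penalize_bias=False):
    """Minimize sum (y - Xw - b)^2 + lam * ||w||^2, optionally also penalizing b."""
    n_samples, n_features = X.shape
    X_aug = np.column_stack([X, np.ones(n_samples)])  # last column multiplies b
    penalty = lam * np.eye(n_features + 1)
    if not penalize_bias:
        penalty[-1, -1] = 0.0                          # leave the bias unpenalized
    theta = np.linalg.solve(X_aug.T @ X_aug + penalty, X_aug.T @ y)
    return theta[:-1], theta[-1]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=50)
c = 10.0  # constant added to every target

# Unpenalized bias: the weights stay put and the bias absorbs the shift
w, b = fit_ridge_with_bias(X, y, lam=1.0)
w_shift, b_shift = fit_ridge_with_bias(X, y + c, lam=1.0)

# Penalized bias: the bias cannot fully absorb the shift
_, b_pen = fit_ridge_with_bias(X, y, lam=1.0, penalize_bias=True)
_, b_pen_shift = fit_ridge_with_bias(X, y + c, lam=1.0, penalize_bias=True)

print(b_shift - b, b_pen_shift - b_pen)
```

With the bias left out of the penalty, the first difference equals $ c $. With the bias penalized, the second difference falls short of $ c $.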

Preprocessing the input

Unlike in linear least squares regression, preprocessing the input features is particularly important for ridge regression.

We are constraining the weights. For this to work without compromising performance, we should standardize or scale the input features, so that all weights are on the same scale and equally amenable to regularization. For more details, refer to our comprehensive article on standardizing and scaling machine learning datasets.
We noted that the bias term in the predictive model centers the target variables. This means, if we pre-center the data, we may not even need the bias term! First, we center the input vectors as $ \vx_i^{(c)} = \vx_i - \bar{\vx} $, for all input vectors $ \vx_i $ in the dataset. Here, $ \bar{\vx} $ is the mean of all the input observations. Then, we estimate the centered bias term $ b^{(c)} $ as

$$ b^{(c)} = \frac{1}{\nlabeled} \sum_{\nlabeledsmall=1}^{\nlabeled} y_\nlabeledsmall $$
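
A minimal sketch of these two preprocessing steps in NumPy, using synthetic data with deliberately mismatched feature scales. The variable names are assumptions. Standardization here means subtracting the per-feature mean and dividing by the per-feature standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two features on very different scales and with nonzero means
X = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0]) + np.array([5.0, -300.0])
y = rng.normal(size=100) + 7.0

# Standardize the inputs: zero mean and unit standard deviation per feature
x_mean = X.mean(axis=0)
x_std = X.std(axis=0)
X_std = (X - x_mean) / x_std

# Centered bias is the mean of the targets; the centered targets have zero mean
b_c = y.mean()
y_c = y - b_c
```

The statistics `x_mean`, `x_std`, and `b_c` are computed on the training set only. The same values would then be reused to transform any new instance before prediction.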

Why would this centering lead to the same solution as the original problem? Let's find out.

Effect of centering on the solution

Here's the loss for ridge regression from Equation \eqref{eqn:ridge-loss}, written in its non-vectorized form.

\begin{align} \mathcal{L}(\labeledset) &= \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \sum_{\ndimsmall=1}^\ndim x_{\nlabeledsmall\ndimsmall} w_\ndimsmall - b\right)^2 + \lambda \sum_{\ndimsmall=1}^\ndim w_\ndimsmall^2 \\\\ &= \sum_{\nlabeledsmall=1}^\nlabeled \left[y_\nlabeledsmall - \left(\sum_{\ndimsmall=1}^\ndim (x_{\nlabeledsmall\ndimsmall} - \bar{x}_\ndimsmall) w_\ndimsmall \right) - \sum_{\ndimsmall=1}^\ndim \bar{x}_{\ndimsmall} w_\ndimsmall - b\right]^2 + \lambda \sum_{\ndimsmall=1}^\ndim w_\ndimsmall^2 \\\\ &= \sum_{\nlabeledsmall=1}^\nlabeled \left[y_\nlabeledsmall - \left(\sum_{\ndimsmall=1}^\ndim (x_{\nlabeledsmall\ndimsmall} - \bar{x}_\ndimsmall) w_\ndimsmall^{(c)} \right) - b^{(c)}\right]^2 + \lambda \sum_{\ndimsmall=1}^\ndim \left(w_\ndimsmall^{(c)}\right)^2 \label{eqn:ridge-loss-centering} \end{align}

Here, we have defined the centered coefficients and bias as

\begin{align} b^{(c)} &= \sum_{\ndimsmall=1}^\ndim \bar{x}_{\ndimsmall} w_\ndimsmall + b \\\\ w_\ndimsmall^{(c)} &= w_\ndimsmall \end{align}

Clearly, $ w_\ndimsmall^{(c)} $ will minimize the loss with centered inputs exactly when $ w_\ndimsmall $ minimizes the uncentered one.

What about the centered bias $ b^{(c)} $? What value minimizes the loss? Let's find out.

For that, we take the derivative of the loss with respect to $ b^{(c)} $ and set it to zero. That is

$$ \sum_{\nlabeledsmall=1}^\nlabeled \left[y_\nlabeledsmall - \left(\sum_{\ndimsmall=1}^\ndim (x_{\nlabeledsmall\ndimsmall} - \bar{x}_\ndimsmall) w_\ndimsmall^{(c)} \right) - b^{(c)}\right] = 0 $$

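The centered inputs sum to zero over the training set, that is, $ \sum_{\nlabeledsmall=1}^\nlabeled (x_{\nlabeledsmall\ndimsmall} - \bar{x}_\ndimsmall) = 0 $ for every feature $ \ndimsmall $, so the middle term vanishes. This implies $ b^{(c)} = \frac{1}{\nlabeled} \sum_{\nlabeledsmall=1}^\nlabeled y_\nlabeledsmall = \bar{\vy}$, the average of all the target variables in the training set. Thus, if we center the target variables, we do not even have to include the bias in the loss, because, continuing from Equation \eqref{eqn:ridge-loss-centering}, we can substitute the solution of $ b^{(c)} $ to get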

\begin{align} \mathcal{L}(\labeledset) &= \sum_{\nlabeledsmall=1}^\nlabeled \left[y_\nlabeledsmall - \left(\sum_{\ndimsmall=1}^\ndim (x_{\nlabeledsmall\ndimsmall} - \bar{x}_\ndimsmall) w_\ndimsmall^{(c)} \right) - \bar{\vy} \right]^2 + \lambda \sum_{\ndimsmall=1}^\ndim \left(w_\ndimsmall^{(c)}\right)^2 \\\\ &= \sum_{\nlabeledsmall=1}^\nlabeled \left[y_\nlabeledsmall^{(c)} - \left(\sum_{\ndimsmall=1}^\ndim (x_{\nlabeledsmall\ndimsmall} - \bar{x}_\ndimsmall) w_\ndimsmall^{(c)} \right) \right]^2 + \lambda \sum_{\ndimsmall=1}^\ndim \left(w_\ndimsmall^{(c)}\right)^2 \label{eqn:ridge-loss-centering-2} \end{align}

where we have centered the target variable as $ y_\nlabeledsmall^{(c)} = y_\nlabeledsmall - \bar{\vy} $.

For the remaining analysis of ridge regression, we will assume that the inputs and targets have been centered, so that the bias term can be dropped from the analysis.
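
The reparameterization in Equation \eqref{eqn:ridge-loss-centering} can be sanity-checked numerically. The sketch below uses synthetic data and an arbitrary (non-optimal) choice of weights and bias. It confirms that the loss is unchanged when the inputs are centered and the bias is replaced by $ b^{(c)} $, and that the derivative with respect to $ b^{(c)} $ vanishes at the mean of the targets. The function and variable names are assumptions.

```python
import numpy as np

def ridge_loss(X, y, w, b, lam):
    """Ridge loss with an unpenalized bias term."""
    return np.sum((y - X @ w - b) ** 2) + lam * np.sum(w ** 2)

rng = np.random.default_rng(1)
X = rng.normal(loc=3.0, size=(30, 4))   # uncentered inputs
y = rng.normal(loc=-2.0, size=30)
w = rng.normal(size=4)                  # arbitrary weights, not the optimum
b = 0.7
lam = 0.5

x_bar = X.mean(axis=0)
X_c = X - x_bar                         # centered inputs
b_c = x_bar @ w + b                     # centered bias, as defined above

loss_original = ridge_loss(X, y, w, b, lam)
loss_centered = ridge_loss(X_c, y, w, b_c, lam)

# For fixed weights, the derivative with respect to the centered bias
# is zero at the mean of the targets
best_b_c = y.mean()
grad_b = -2.0 * np.sum(y - X_c @ w - best_b_c)
loss_at_best = ridge_loss(X_c, y, w, best_b_c, lam)
```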

Loss in the matrix form

With centered inputs, we can collect the training instances into a matrix $ \mX \in \real^{\nlabeled \times \ndim}$, where each row of $ \mX $ is a training instance $ \vx_\nlabeledsmall $ for $ \nlabeledsmall \in \set{1, 2, \ldots, \nlabeled} $.

Similarly, the centered target variables can be represented as the vector $ \vy \in \real^\nlabeled $, whose $ \nlabeledsmall $-th element is the target variable $ y_\nlabeledsmall $ of the $ \nlabeledsmall $-th row of $ \mX $, for all $ \nlabeledsmall \in \set{1, 2, \ldots, \nlabeled} $.

With this notation, as we no longer have to worry about the bias term, we can write the ridge regression loss function as

$$ \mathcal{L}(\labeledset) = \left(\vy - \mX\vw\right)^T (\vy - \mX\vw) + \lambda \vw^T\vw $$

Pretty convenient, isn't it?
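
As a quick check that the matrix form agrees with the element-wise sums used earlier, the sketch below evaluates both on small synthetic, centered data. The variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_features = 6, 3
X = rng.normal(size=(n_samples, n_features))
X = X - X.mean(axis=0)                  # centered inputs
y = rng.normal(size=n_samples)
y = y - y.mean()                        # centered targets
w = rng.normal(size=n_features)
lam = 0.3

# Matrix form: (y - Xw)^T (y - Xw) + lam * w^T w
residual = y - X @ w
loss_matrix = residual @ residual + lam * (w @ w)

# Non-vectorized form with explicit sums over instances and features
loss_sum = 0.0
for l in range(n_samples):
    prediction = sum(X[l, n] * w[n] for n in range(n_features))
    loss_sum += (y[l] - prediction) ** 2
loss_sum += lam * sum(w[n] ** 2 for n in range(n_features))
```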

Finding optimal coefficients $ \star{\vw} $

In the matrix form, it is even easier to express the steps to find the solution to the ridge regression loss.

We just take the derivative of the loss with respect to the parameters $ \vw $ and set it to zero. This results in the following steps towards the solution for $ \vw $:

\begin{align} &\frac{\partial \loss(\labeledset)}{\partial \vw} = 0 \\\\ \implies& -2\mX^T(\vy - \mX\vw) + 2\lambda\vw = 0 \\\\ \implies& -\mX^T\vy + \mX^T\mX\vw + \lambda\vw = 0 \\\\ \implies& \left(\mX^T\mX + \lambda\mI\right)\vw = \mX^T\vy \\\\ \implies& \vw = \left(\mX^T\mX + \lambda\mI\right)^{-1}\mX^T\vy \end{align}

In the last step, we have taken the inverse of $ \left(\mX^T\mX + \lambda\mI\right) $. Even if $ \mX^T\mX $ is not full rank, the matrix $ \left(\mX^T\mX + \lambda\mI\right) $ is invertible for any $ \lambda > 0 $. This is because $ \mX^T\mX $ is positive semidefinite, so adding a positive value $ \lambda $ along its diagonal makes every eigenvalue at least $ \lambda $, turning it into a nonsingular matrix. To understand this better, refer to our comprehensive article on singular matrices.

In fact, avoiding the possibility of a singular matrix $ \mX^T\mX $ was the primary motivation behind introducing the $ \lambda $ term in the solution. Compare this to the solution for vanilla linear least squares, wherein we had to assume that the matrix $ \mX^T\mX $ is invertible. In the case of ridge regression, we make no such assumption, because $ \mX^T\mX + \lambda\mI $ is always invertible for $ \lambda > 0 $.
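
To make the invertibility argument concrete, here is a small sketch with a hand-built rank-deficient input matrix, where the second feature is exactly twice the first. The numbers are made up for illustration. The eigenvalues of $ \mX^T\mX $ include a zero, and adding $ \lambda\mI $ shifts every eigenvalue up by $ \lambda $.

```python
import numpy as np

x1 = np.array([1.0, 2.0, 3.0, 4.0])
X = np.column_stack([x1, 2.0 * x1])     # second feature is a multiple of the first

gram = X.T @ X                          # [[30, 60], [60, 120]], which is singular
lam = 0.1
regularized = gram + lam * np.eye(2)

rank = np.linalg.matrix_rank(gram)
eig_gram = np.linalg.eigvalsh(gram)         # approximately [0, 150]
eig_reg = np.linalg.eigvalsh(regularized)   # approximately [0.1, 150.1]

print(rank, eig_gram, eig_reg)
```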

There we have it: a closed-form solution for the optimal coefficients of the ridge regression model.
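
A minimal NumPy sketch of this closed-form solution, assuming centered inputs and targets as in the derivation. The function name `fit_ridge` and the synthetic data are assumptions. The sketch solves the linear system $ \left(\mX^T\mX + \lambda\mI\right)\vw = \mX^T\vy $ with `np.linalg.solve` rather than forming the inverse explicitly, which is a common numerical choice and not something the derivation requires.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y for centered X and y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=50)
X = X - X.mean(axis=0)                  # center inputs
y = y - y.mean()                        # center targets

lam = 2.0
w_star = fit_ridge(X, y, lam)

# The gradient from the derivation should vanish at the solution
gradient = -2.0 * X.T @ (y - X @ w_star) + 2.0 * lam * w_star

# With lam = 0 and full-rank X, this reduces to ordinary least squares
w_ols = fit_ridge(X, y, 0.0)
w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]
```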

Training demo

As you will see in this demo, the training is instantaneous due to the closed-form solution for the optimal value of the parameters that we arrived at in the previous section.

Automatically fit linear model (blue line) to the training data.

Note that increasing the value of $ \lambda $ increases the effect of regularization, leading to a reduction in the magnitude of the weight $ w $. This has the undesirable side effect of increasing the sum of squared errors (SSE) on the training data.
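
The same trade-off can be reproduced offline. The sketch below fits the closed-form solution on synthetic, centered data for a few hand-picked values of $ \lambda $ and records the weight norm and the training SSE. The data and the grid of values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.3 * rng.normal(size=40)
X = X - X.mean(axis=0)                  # center inputs
y = y - y.mean()                        # center targets

lams = [0.0, 1.0, 10.0, 100.0]
norms, sses = [], []
for lam in lams:
    # Closed-form ridge solution for this value of lambda
    w = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)
    norms.append(np.linalg.norm(w))
    sses.append(np.sum((y - X @ w) ** 2))

for lam, norm, sse in zip(lams, norms, sses):
    print(f"lambda={lam:6.1f}  ||w||={norm:.4f}  SSE={sse:.4f}")
```

As $ \lambda $ grows, the weight norm shrinks while the training SSE rises.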

Dealing with feature types

Note that the predictive model involves a dot product of the weight vector $ \vw $ and the instance vector $ \vx $. This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, a direct dot product with the weight vector is not meaningful. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
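
A minimal sketch of one-hot encoding with plain NumPy, using a hypothetical categorical feature named `colors`. Each category gets its own binary column, so the dot product with a weight vector simply picks out the weight of the active category.

```python
import numpy as np

colors = ["red", "green", "blue", "green"]   # hypothetical categorical feature
categories = sorted(set(colors))              # ['blue', 'green', 'red']
index = {category: i for i, category in enumerate(categories)}

# One row per instance, one binary column per category
one_hot = np.zeros((len(colors), len(categories)))
for row, value in enumerate(colors):
    one_hot[row, index[value]] = 1.0

print(one_hot)
# [[0. 0. 1.]
#  [0. 1. 0.]
#  [1. 0. 0.]
#  [0. 1. 0.]]

# With made-up weights, the dot product selects one weight per instance
w = np.array([0.2, -0.4, 1.0])
contributions = one_hot @ w
```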

Where to next?

An alternative to ridge regression is the lasso regression model, another regularized linear model for regression. To model nonlinear functions, a popular alternative is kernel regression.

Regression methods deal with real-valued outputs. For categorical outputs, it is better to use classification models such as logistic regression.
