# Linear least squares

## Introduction

Linear least squares is probably the earliest and most studied approach to regression — predicting continuous-valued outputs from multivariate inputs. Before the advent of deep learning and its easy-to-use libraries, linear least squares regression and its variants were among the most widely deployed regression approaches in statistical applications.

## Prerequisites

To understand the linear regression model, we recommend familiarity with the linked prerequisite concepts.

Follow those links to first get acquainted with the corresponding concepts.

## Problem setting

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance.

Consider such an instance $\vx \in \real^N$, a vector consisting of $N$ features, $\vx = [x_1, x_2, \ldots, x_N]$.

We need to predict a real-valued output $\hat{y} \in \real$ that is as close as possible to the true target $y \in \real$. The hat $\hat{ }$ denotes that $\hat{y}$ is an estimate, to distinguish it from the truth.

## Predictive model

Linear least-squares regression, as the name suggests, uses a linear form for the predictive model. The predictive model is

$$\hat{y} = \vx^T \vw + b$$

where $\vw$ is known as the weight vector or the parameters of the model and $b$ is known as the bias of the model. The parameter vector is $N$-dimensional, $\vw \in \real^N$, just like the input. The bias term is a real-valued scalar, $b \in \real$.
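The predictive model is a single dot product plus a scalar. A minimal sketch in NumPy, with purely illustrative values for $\vx$, $\vw$, and $b$:

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])   # input instance, x in R^3 (N = 3)
w = np.array([0.4, 1.2, -0.7])   # weight vector, w in R^3
b = 0.1                          # bias term, a scalar

y_hat = x @ w + b                # the predictive model: x^T w + b
print(y_hat)                     # close to -0.65
```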

## Nature of the predictive model: Univariate perspective

Now, let us try to understand the effect of changing the weight $w$ and the bias $b$ on the predictive model in a univariate setting, where $x \in \real, w \in \real, b \in \real$. In the interactive below, you can modify $w$, $b$, and $x$ using the corresponding slider or circle to understand their impact on the predictions from the linear model — the blue line.

Observe a few characteristics of the predictive model here.

• The weight, $w$, has the net effect of rotating the predictive model — the line.
• The bias term, $b$, has the net effect of moving the line away from the origin.
• At $b = 0$, the predictive model passes through the origin.
• At $w = 0$, the function $wx + b$ is parallel to the input axis. Its distance from the input axis is controlled by the bias term. This is to be expected as values of the input $x$ stop having any influence on the output $y$.
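The last two observations can be checked numerically. This sketch evaluates $wx + b$ at a few inputs for illustrative parameter settings:

```python
import numpy as np

x = np.array([-2.0, 0.0, 2.0])   # a few illustrative inputs

# With b = 0, the line passes through the origin: f(0) = 0.
assert 1.5 * 0.0 + 0.0 == 0.0

# With w = 0, the output is constant at b, whatever the input.
y = 0.0 * x + 3.0
assert np.all(y == 3.0)
```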

As we will see in the next interactive demonstration, this behavior will extend to the multivariate setting.

## Nature of the predictive model: Multivariate perspective

Now, let us try to understand the effect of changing the weight vector $\vw$ and the bias $b$ on the predictive model. Control the weight vector $\vw$ by dragging the arrowhead. The bias may be modified by adjusting the slider. Being a multivariate function, we represent $\vw^T\vx + b$ as a colored plane over the input feature space, to indicate variation in the values of the linear model as a function of the input features.

To better understand the multivariate function, it is beneficial to look at it from the univariate perspectives along each input axis — $x_1$ and $x_2$.

Observe the following with this interactive demonstration.

• The weight vector $\vw$, as before, controls the direction of the growth of the function, the so-called gradient.
• When one component of $\vw$ is set to zero, for example, $\vw = [0,1]$, the corresponding perspective $x_1 \rightarrow \vw^T\vx + b$ becomes parallel to the corresponding input axis, $x_1$.
• The bias term again plays the role of moving the function plane away from the origin.

## Simplifying the notation

A simpler representation that facilitates computation and implementation involves extending the input vector $\vx$ with a leading 1, such that $\vx' = [1, \vx]$. This way, the bias term $b$ can be included in the parameter vector as $\vw' = [b, \vw]$. It can be easily verified that

$$\vx'^T \vw' = \vx^T \vw + b$$

In the rest of this article, we will use the notation $\vx$ to denote $\vx'$ and $\vw$ to denote $\vw'$ to keep the presentation clear and concise.
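The identity $\vx'^T \vw' = \vx^T \vw + b$ is easy to verify numerically; the values below are illustrative:

```python
import numpy as np

x = np.array([2.0, -1.0, 0.5])
w = np.array([0.4, 1.2, -0.7])
b = 0.1

x_aug = np.concatenate(([1.0], x))   # x' = [1, x]
w_aug = np.concatenate(([b], w))     # w' = [b, w]

# The single dot product x'^T w' reproduces x^T w + b.
assert np.isclose(x_aug @ w_aug, x @ w + b)
```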

## Training linear least squares

Training a linear regression model involves discovering suitable weights $\vw$ and bias $b$.

As the name linear least squares suggests, the training approach fits the weights to minimize the squared prediction error.

Suppose $\labeledset = \set{(\vx_1, y_1), \ldots, (\vx_\nlabeled, y_\nlabeled)}$ denotes the training set consisting of $\nlabeled$ training instances. If $\yhat_\nlabeledsmall$ denotes the prediction of the model for the instance $(\vx_\nlabeledsmall, y_\nlabeledsmall)$, then the squared error is

\begin{aligned} \ell(y_\nlabeledsmall, \yhat_\nlabeledsmall) &= \left( y_\nlabeledsmall - \yhat_\nlabeledsmall \right)^2 \\\\ &= \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T\vw \right)^2 \end{aligned}

The overall loss over the training set is the sum of squared errors (SSE)

$$\mathcal{L}(\labeledset) = \sum_{\nlabeledsmall=1}^\nlabeled \left(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw\right)^2$$
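The SSE is straightforward to compute with the augmented notation. A sketch on a small, made-up training set:

```python
import numpy as np

X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])          # rows are augmented instances [1, x]
y = np.array([3.1, 3.9, 5.2])       # targets
w = np.array([3.0, 1.0])            # candidate parameters [b, w]

residuals = y - X @ w               # y_n - x_n^T w for each instance
sse = np.sum(residuals ** 2)        # sum of squared errors
```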

## Manual training demo

Let's see if you can manually estimate good values of the parameters to minimize the squared error in the next demo.

Given some data points as the training set, your goal is to adjust the parameters of the predictive model such that the sum of squared errors is minimized. For simplicity of understanding, we will limit this interactive demonstration to the univariate perspective.

Adjust the value of $w$ and $b$ using the sliders to minimize the value of the sum of squared errors (SSE). Note that the data does not perfectly lie along a line. There is some inherent noise — a scenario common to machine learning problems.

Did your estimated model get close to $w = 1$ and $b = 3$?

Training a linear regression model on a given training dataset involves an optimization approach to adjust the values of $w$ and $b$ to achieve a similar result.
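A hypothetical reenactment of the demo: noisy data generated around the line $y = 1 \cdot x + 3$ (the true parameters, matching the demo's answer, are an assumption of this sketch), with the SSE compared at two candidate parameter settings.

```python
import numpy as np

# Synthetic noisy data around y = 1*x + 3 (illustrative ground truth).
rng = np.random.default_rng(0)
x = np.linspace(0, 5, 20)
y = 1.0 * x + 3.0 + rng.normal(scale=0.3, size=x.size)

def sse(w, b):
    """Sum of squared errors of the line w*x + b on the data."""
    return np.sum((y - (w * x + b)) ** 2)

# The setting near the true parameters gives the smaller error.
assert sse(1.0, 3.0) < sse(2.0, 0.0)
```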

## Minimizing the training loss

In matrix notation, the sum of squared errors is written as

$$\loss(\labeledset) = \left(\vy - \mX\vw\right)^T (\vy - \mX\vw)$$

Here, $\mX \in \real^{\nlabeled \times (\ndim+1)}$ is a matrix containing the training instances such that each row of $\mX$ is a training instance $\vx_\nlabeledsmall$ for all $\nlabeledsmall \in \set{1, 2, \ldots, \nlabeled}$. Similarly, $\vy \in \real^\nlabeled$ is a vector containing the target variables $y_\nlabeledsmall$ for all $\nlabeledsmall \in \set{1, 2, \ldots, \nlabeled}$.

Note that the loss function is a quadratic function of the parameters $\vw$. Therefore, a minimum always exists, but the minimizer may not be unique.

Because the loss is a quadratic function, we can find the minimizer by differentiating it with respect to the parameters of the model — $\vw$. Setting the derivative to zero yields the normal equation

\begin{aligned} \doh{\loss(\labeledset)}{\vw} &= 0 \\\\ \implies& \mX^T \left(\vy - \mX \vw\right) = 0 \\\\ \implies& \mX^T\vy - \mX^T\mX\vw = 0 \\\\ \implies& \mX^T\vy = \mX^T\mX\vw \end{aligned}

If $\mX^T\mX$ is nonsingular (invertible), then the unique solution can be obtained by rearranging the terms of the above equation as

$$\vw = \left(\mX^T\mX\right)^{-1} \mX^T \vy$$

What if $\mX^T\mX$ is singular, and hence not invertible? We deal with this in ridge regression.
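The closed-form solution can be sketched on synthetic data. The true parameters ($b = 3$, $w = 1$) are an assumption of this example; with noise, the estimate is close to, but not exactly, the truth. Note that we solve the normal equation $\mX^T\mX\vw = \mX^T\vy$ directly rather than forming the inverse explicitly, which is both cheaper and numerically more stable.

```python
import numpy as np

# Synthetic data around y = 1*x + 3 (illustrative ground truth).
rng = np.random.default_rng(42)
n = 100
x = rng.uniform(0, 5, size=n)
y = 1.0 * x + 3.0 + rng.normal(scale=0.2, size=n)

X = np.column_stack([np.ones(n), x])        # augmented design matrix
w_hat = np.linalg.solve(X.T @ X, X.T @ y)   # solves X^T X w = X^T y

print(w_hat)   # approximately [3.0, 1.0], i.e. [b_hat, w_hat]
```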

## Automatic training demo

As you will see in this demo, the training is instantaneous due to the closed-form solution for the optimal value of the parameters that we arrived at in the previous section.

## Dealing with feature types

Note that the predictive model involves a dot product of the weight vector $\vw$ and the instance vector $\vx$. This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, a direct dot product with the weight vector is not meaningful. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
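A minimal one-hot encoding sketch for a categorical feature with three levels; the category names are illustrative:

```python
import numpy as np

categories = ["red", "green", "blue"]           # the feature's levels
values = ["green", "blue", "green", "red"]      # observed instances

# Map each value to its category index, then pick the matching row of
# the identity matrix to get a binary indicator vector per instance.
codes = np.array([categories.index(v) for v in values])
one_hot = np.eye(len(categories))[codes]        # shape (4, 3)

print(one_hot)
```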

## Where to next?

Improvements upon the linear regression are suggested in ridge regression and lasso regression, both of which are still linear models for regression. To model nonlinear functions, a popular alternative is kernel regression.

Regression methods deal with real-valued outputs. For categorical outputs, it is better to use classification models such as logistic regression.