Introduction to Bayesian linear regression


\(\DeclareMathOperator*{\argmax}{arg\,max} \DeclareMathOperator*{\argmin}{arg\,min} \DeclareMathOperator*{\asterisk}{\ast} \newcommand{\sup}{\text{sup}} \newcommand{\inf}{\text{inf}} \newcommand{\min}{\text{min}\;} \newcommand{\max}{\text{max}\;} \newcommand{\maxunder}[1]{\underset{#1}{\max}} \newcommand{\minunder}[1]{\underset{#1}{\min}} \newcommand{\real}{\mathbb{R}} \newcommand{\natural}{\mathbb{N}} \newcommand{\integer}{\mathbb{Z}} \newcommand{\rational}{\mathbb{Q}} \newcommand{\irrational}{\mathbb{I}} \newcommand{\complex}{\mathbb{C}} \newcommand{\cardinality}[1]{|#1|} \newcommand{\vec}[1]{\mathbf{#1}} \newcommand{\mat}[1]{\mathbf{#1}} \newcommand{\star}[1]{#1^*} \newcommand{\inv}[1]{#1^{-1}} \newcommand{\indicator}[1]{\mathcal{I}(#1)} \renewcommand{\BigO}[1]{\mathcal{O}(#1)} \renewcommand{\BigOsymbol}{\mathcal{O}} \renewcommand{\smallo}[1]{\mathcal{o}(#1)} \renewcommand{\smallosymbol}[1]{\mathcal{o}} \newcommand{\set}[1]{\mathbb{#1}} \newcommand{\complement}[1]{#1^c} \newcommand{\powerset}[1]{\mathcal{P}(#1)} \newcommand{\setdiff}{\setminus} \newcommand{\setsymmdiff}{\oplus} \newcommand{\dash}[1]{#1^{'}} \newcommand{\permutation}[2]{{}_{#1} \mathrm{ P }_{#2}} \newcommand{\combination}[2]{{}_{#1} \mathrm{ C }_{#2}} \newcommand{\prob}[1]{P(#1)} \newcommand{\pmf}[1]{P(#1)} \newcommand{\pdf}[1]{p(#1)} \newcommand{\cdf}[1]{F(#1)} \newcommand{\expect}[2]{E_{#1}\left[#2\right]} \newcommand{\entropy}[1]{\mathcal{H}\left[#1\right]} \newcommand{\expe}[1]{\mathrm{e}^{#1}} \newcommand{\textexp}[1]{\text{exp}\left(#1\right)} \def\independent{\perp\!\!\!\perp} \def\notindependent{\not\!\independent} \newcommand{\yhat}{\hat{y}} \newcommand{\vs}{\vec{s}} \newcommand{\vt}{\vec{t}} \newcommand{\vu}{\vec{u}} \newcommand{\vv}{\vec{v}} \newcommand{\vw}{\vec{w}} \newcommand{\vx}{\vec{x}} \newcommand{\vy}{\vec{y}} \newcommand{\vz}{\vec{z}} \newcommand{\va}{\vec{a}} \newcommand{\vb}{\vec{b}} \newcommand{\vc}{\vec{c}} \newcommand{\vd}{\vec{d}} \newcommand{\ve}{\vec{e}} 
\newcommand{\vg}{\vec{g}} \newcommand{\vh}{\vec{h}} \newcommand{\vi}{\vec{i}} \newcommand{\vk}{\vec{k}} \newcommand{\vo}{\vec{o}} \newcommand{\vp}{\vec{p}} \newcommand{\vq}{\vec{q}} \newcommand{\vr}{\vec{r}} \newcommand{\vs}{\vec{s}} \newcommand{\vmu}{\vec{\mu}} \newcommand{\vsigma}{\vec{\sigma}} \newcommand{\vphi}{\vec{\phi}} \newcommand{\vtau}{\vec{\tau}} \newcommand{\vtheta}{\vec{\theta}} \newcommand{\mA}{\mat{A}} \newcommand{\mB}{\mat{B}} \newcommand{\mC}{\mat{C}} \newcommand{\mD}{\mat{D}} \newcommand{\mE}{\mat{E}} \newcommand{\mH}{\mat{H}} \newcommand{\mK}{\mat{K}} \newcommand{\mP}{\mat{P}} \newcommand{\mQ}{\mat{Q}} \newcommand{\mR}{\mat{R}} \newcommand{\mS}{\mat{S}} \newcommand{\mU}{\mat{U}} \newcommand{\mV}{\mat{V}} \newcommand{\mW}{\mat{W}} \newcommand{\mX}{\mat{X}} \newcommand{\mY}{\mat{Y}} \newcommand{\mZ}{\mat{Z}} \newcommand{\mI}{\mat{I}} \newcommand{\mLambda}{\mat{\Lambda}} \newcommand{\mSigma}{\mat{\Sigma}} \newcommand{\mTheta}{\mat{\theta}} \newcommand{\setsymb}[1]{#1} \newcommand{\sA}{\setsymb{A}} \newcommand{\sB}{\setsymb{B}} \newcommand{\sC}{\setsymb{C}} \newcommand{\sO}{\setsymb{O}} \newcommand{\sP}{\setsymb{P}} \newcommand{\sQ}{\setsymb{Q}} \newcommand{\sH}{\setsymb{H}} \newcommand{\sX}{\setsymb{X}} \newcommand{\sY}{\setsymb{Y}} \newcommand{\norm}[2]{||{#1}||_{#2}} \newcommand{\infnorm}[1]{\norm{#1}{\infty}} \newcommand{\fillinblank}{\text{ }\underline{\text{ ? 
}}\text{ }} \newcommand{\lbrace}{\left\{} \newcommand{\rbrace}{\right\}} \newcommand{\set}[1]{\lbrace #1 \rbrace} \newcommand{\seq}[1]{\left( #1 \right)} \newcommand{\ndim}{N} \newcommand{\ndimsmall}{n} \newcommand{\dataset}{\mathbb{D}} \newcommand{\ndata}{D} \newcommand{\ndatasmall}{d} \newcommand{\labeledset}{\mathbb{L}} \newcommand{\nlabeled}{L} \newcommand{\nlabeledsmall}{l} \newcommand{\unlabeledset}{\mathbb{U}} \newcommand{\nunlabeled}{U} \newcommand{\nunlabeledsmall}{u} \newcommand{\nclass}{M} \newcommand{\nclasssmall}{m} \newcommand{\loss}{\mathcal{L}} \newcommand{\sign}{\text{sign}} \newcommand{\Gauss}{\mathcal{N}} \newcommand{\hadamard}{\circ} \newcommand{\doh}[2]{\frac{\partial #1}{\partial #2}} \newcommand{\dox}[1]{\doh{#1}{x}} \newcommand{\doy}[1]{\doh{#1}{y}} \newcommand{\doxx}[1]{\doh{#1}{x^2}} \newcommand{\doyy}[1]{\doh{#1}{y^2}} \newcommand{\doxy}[1]{\frac{\partial #1}{\partial x \partial y}} \newcommand{\doyx}[1]{\frac{\partial #1}{\partial y \partial x}} \newcommand{\qed}{\tag*{$\blacksquare$}}\)


Introduction

In this article, we introduce a Bayesian analysis of the standard linear regression model with Gaussian noise. Unlike linear least squares regression, the Bayesian treatment does not seek a single optimal value of the parameters. Instead, we perform inference on the Bayesian model to obtain the posterior distribution over the weights. For predictions on test examples, we average over all possible parameter values, weighted by their posterior probability.

Understanding this model is crucial for studying the more general Gaussian processes for regression.

Prerequisites

To understand Bayesian linear regression, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

Problem setting

In regression, the goal of the predictive model is to predict a continuous valued output for a given multivariate instance. In this article, for simplicity, we will work with real-valued input observations.

Consider such an instance $ \vx \in \real^\ndim $, a vector consisting of $ \ndim $ features, $\vx = [x_1, x_2, \ldots, x_\ndim] $.

We need to predict a real-valued output $ \hat{y} \in \real $ that is as close as possible to the true target $ y \in \real $. The hat $ \hat{ } $ denotes that $ \hat{y} $ is an estimate, to distinguish it from the truth.

In the standard linear regression model with Gaussian noise, the actual target $ y $ is related to the input $ \vx $ through some function $ f: \real^\ndim \to \real $ such that

$$ y = f(\vx) + \epsilon $$

where, $ \epsilon $ is zero-centered Gaussian noise with variance $ \sigma^2 $. This means, $ \epsilon \sim \Gauss(0,\sigma^2) $.

The predictive model is inferred from a collection of supervised observations provided as tuples $ (\vx_i,y_i) $, each containing an instance vector $ \vx_i $ and its true target $ y_i $. This collection of labeled observations is known as the training set $ \labeledset = \set{(\vx_1,y_1), \ldots, (\vx_\nlabeled,y_\nlabeled)} $. Typically, these examples are assumed to be independent and identically distributed (IID) random variables.

Bayesian model for linear regression

In the Bayesian model for linear regression CITE[rasmussen-226], we assume the function $ f(\vx) $ to be a linear function of its input.

$$ f(\vx) = \vx^T \vw $$

where, $ \vw $ is the real-valued vector representing the parameters of the model, the so-called weights of the model. It has the same dimensionality as the input. This means, $ \vw \in \real^\ndim $.

In the Bayesian approach, we place a prior over the weights. In this analysis, we use a Gaussian prior with zero mean and covariance matrix $ \mSigma $. This means,

$$ \vw \sim \Gauss(0, \mSigma) $$
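To make the generative model concrete, the following is a minimal sketch that draws one weight vector from the prior and then generates noisy targets from it. The sizes, the noise level, the isotropic choice of $ \mSigma $, the random seed, and the column-wise storage of the inputs are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, chosen only for reproducibility

n_features, n_examples = 3, 50          # N features and L examples (illustrative values)
sigma = 0.5                             # noise standard deviation (assumed value)
Sigma = np.eye(n_features)              # prior covariance (assumed isotropic here)

# Draw one weight vector from the prior w ~ N(0, Sigma).
w_true = rng.multivariate_normal(np.zeros(n_features), Sigma)

# Inputs stored column-wise, so X has shape (N, L).
X = rng.normal(size=(n_features, n_examples))

# Targets y_i = x_i^T w + eps_i with eps_i ~ N(0, sigma^2).
y = X.T @ w_true + sigma * rng.normal(size=n_examples)

print(X.shape, y.shape)
```

Each draw of `w_true` corresponds to a different linear function that the prior considers plausible before any data is seen.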

Predictive model

In non-Bayesian approaches to linear regression, we typically choose a single value of the parameter vector $ \vw $ that best fits the available training data, subject to some loss. For example, in the linear least squares model, the best parameter setting, $ \star{\vw} $, is chosen to minimize the mean squared error between the predictions and the actual targets on the training data. The prediction in such non-Bayesian schemes is then simply $ \star{f}(\vx) = \vx^T \star{\vw} $.

In the Bayesian approach, we do not discover a single value of the model parameters. We instead infer the posterior distribution over the weights, given the training set. That is, $ p(\vw | \labeledset) $. Then, the prediction on a single test instance $ \vx $ involves averaging over all possible parameter settings, weighted by their posterior probability.

So, in the Bayesian approach, the predictive model is

\begin{equation} \hat{y} = \int p(f(\vx) | \vw) p(\vw | \labeledset) d\vw \label{eqn:reg-pred} \end{equation}

Thus, to be able to make predictions, we need to infer $ p(\vw | \labeledset) $. We will arrive at this result in the upcoming sections.

Training

As mentioned earlier, training in non-Bayesian approaches amounts to finding a single best value of $ \vw $ by minimizing a loss, typically mean-squared error of the predictions to the actual target values in the training data.

In the Bayesian approach, we instead perform inference — the estimation of the posterior probability of the weights, $ p(\vw | \labeledset) $.

Using Bayes' rule, this can be calculated as

$$ p(\vw | \labeledset) = \frac{p(\vy_\labeledset | \mX_\labeledset, \vw) p(\vw)}{p(\vy_\labeledset|\mX_\labeledset)} $$

where,

$ p(\vw | \labeledset) $ is the posterior probability of the weights,
$ p(\vw) $ is the prior probability of the weights,
$ p(\vy_\labeledset | \mX_\labeledset, \vw) $ is the likelihood of the target variables given the input and the weights. For simplicity, we have combined all the actual target values into the vector $ \vy_\labeledset $ so that $ \vy_\labeledset = [y_1,\ldots,y_\nlabeled] $. Similarly, all inputs have been combined into a single matrix, $ \mX_\labeledset = [\vx_1,\ldots,\vx_\nlabeled] $.
$ p(\vy_\labeledset | \mX_\labeledset) $ is the marginal likelihood. Being a marginal, it does not depend on the model parameters. It is therefore a normalizing constant. It is given by, $$ p(\vy_\labeledset | \mX_\labeledset) = \int p(\vy_\labeledset | \mX_\labeledset, \vw) p(\vw) d\vw $$

To infer the posterior distribution, we typically do not have to calculate the normalizing constant explicitly. We inspect the product of the prior and the likelihood to identify the form of the distribution. By completing the square in the exponent, we can usually recognize a known density and read off its normalizing constant.

The likelihood $ p(\vy_\labeledset | \mX_\labeledset, \vw) $

Because the examples in the training set are IID, the likelihood factorizes over the examples in the training set. This means,

\begin{aligned} p(\vy_\labeledset | \mX_\labeledset, \vw) &= \prod_{\nlabeledsmall=1}^{\nlabeled} p(y_\nlabeledsmall | \vx_\nlabeledsmall, \vw) \\\\ &= \prod_{\nlabeledsmall=1}^{\nlabeled} \frac{1}{\sqrt{2 \pi \sigma^2}} \textexp{- \frac{(y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw)^2}{2\sigma^2}} \\\\ &= \frac{1}{(2 \pi \sigma^2)^{\nlabeled/2}} \textexp{- \frac{\sum_{\nlabeledsmall=1}^\nlabeled (y_\nlabeledsmall - \vx_\nlabeledsmall^T \vw)^2}{2\sigma^2}} \\\\ &= \frac{1}{(2 \pi \sigma^2)^{\nlabeled/2}} \textexp{- \frac{\norm{\vy_\labeledset - \mX_\labeledset^T \vw}{2}^2}{2\sigma^2}} \\\\ &= \Gauss(\mX_\labeledset^T \vw, \sigma^2 \mI) \end{aligned}

The penultimate expression is the probability density function of a multivariate Gaussian over $ \vy_\labeledset $ with mean $ \mX_\labeledset^T \vw $ and covariance $ \sigma^2 \mI $. This lets us treat the likelihood as a single Gaussian distribution in the derivations that follow.
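As a quick numerical sanity check of the factorization, the sketch below evaluates the log-likelihood in two ways: as a sum of univariate Gaussian log-densities, and as a single multivariate Gaussian with covariance $ \sigma^2 \mI $. The data are synthetic and the sizes, seed, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, sigma = 2, 5, 0.3                  # illustrative sizes and noise level
X = rng.normal(size=(N, L))              # columns are the inputs x_i
w = rng.normal(size=N)
y = X.T @ w + sigma * rng.normal(size=L)

resid = y - X.T @ w                      # residuals y_i - x_i^T w

# Per-example form: sum of univariate Gaussian log-densities.
loglik_sum = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - resid**2 / (2 * sigma**2))

# Vector form: one multivariate Gaussian with covariance sigma^2 I.
loglik_vec = -0.5 * L * np.log(2 * np.pi * sigma**2) - resid @ resid / (2 * sigma**2)

print(np.isclose(loglik_sum, loglik_vec))
```

Working with the logarithm rather than the density itself avoids numerical underflow when the number of examples is large.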

The posterior $ p(\vw | \labeledset) $

Ignoring the normalizing constant, we can now write the posterior distribution as being proportional to the product of the likelihood and the prior.

\begin{aligned} p(\vw | \labeledset) &\propto p(\vy_\labeledset | \mX_\labeledset, \vw) p(\vw) \\\\ &\propto \Gauss(\mX_\labeledset^T \vw, \sigma^2 \mI) \Gauss(0, \mSigma) \\\\ &\propto \textexp{- \frac{(\vy_\labeledset - \mX_\labeledset^T \vw)^T(\vy_\labeledset - \mX_\labeledset^T \vw)}{2\sigma^2}} \textexp{- \frac{1}{2} \vw^T \mSigma^{-1} \vw} \\\\ &\propto \textexp{- \frac{1}{2}(\vw - \bar{\vw})^T \left( \frac{1}{\sigma^2} \mX_\labeledset \mX_\labeledset^T + \mSigma^{-1} \right) (\vw - \bar{\vw})} \end{aligned}

where we have absorbed all normalizing constants (terms that do not involve $ \vw $) into the proportionality and, for simplicity, set

$$ \bar{\vw} = \frac{1}{\sigma^2} \left( \frac{\mX_\labeledset \mX_\labeledset^T}{\sigma^2} + \mSigma^{-1} \right)^{-1} \mX_\labeledset \vy_\labeledset $$
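The step from the third to the last line of the expansion is a completion of the square in $ \vw $. As a sketch of the intermediate algebra, write $ \mat{M} = \frac{1}{\sigma^2} \mX_\labeledset \mX_\labeledset^T + \mSigma^{-1} $ (a shorthand introduced only for this derivation) and collect the terms of the exponent by their order in $ \vw $:

\begin{aligned} &- \frac{1}{2\sigma^2} (\vy_\labeledset - \mX_\labeledset^T \vw)^T(\vy_\labeledset - \mX_\labeledset^T \vw) - \frac{1}{2} \vw^T \mSigma^{-1} \vw \\\\ &= - \frac{1}{2} \left( \vw^T \mat{M} \vw - \frac{2}{\sigma^2} \vw^T \mX_\labeledset \vy_\labeledset + \frac{1}{\sigma^2} \vy_\labeledset^T \vy_\labeledset \right) \\\\ &= - \frac{1}{2} (\vw - \bar{\vw})^T \mat{M} (\vw - \bar{\vw}) + \frac{1}{2} \bar{\vw}^T \mat{M} \bar{\vw} - \frac{1}{2\sigma^2} \vy_\labeledset^T \vy_\labeledset \end{aligned}

The last equality holds when $ \mat{M} \bar{\vw} = \frac{1}{\sigma^2} \mX_\labeledset \vy_\labeledset $, which is exactly the expression for $ \bar{\vw} $ given above. The two trailing terms do not involve $ \vw $ and are therefore absorbed into the proportionality.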

Note the form of the last equation in the expansion of $ p(\vw | \labeledset) $. It is the unnormalized density of another multivariate Gaussian in $ \vw $, so that

$$ p(\vw | \labeledset) = \Gauss(\bar{\vw}, \inv{\mA}) $$

where the posterior covariance is $ \mA^{-1} $, the inverse of the precision matrix

$$ \mA = \frac{\mX_\labeledset \mX_\labeledset^T}{\sigma^2} + \mSigma^{-1} $$
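The posterior can be computed directly from these formulas. The sketch below is one way to do it with NumPy, assuming inputs are stored column-wise and that the matrix $ \frac{1}{\sigma^2} \mX_\labeledset \mX_\labeledset^T + \mSigma^{-1} $ is inverted to obtain the posterior covariance. The function name `posterior`, the synthetic data, and the chosen weights are hypothetical and serve only as an illustration.

```python
import numpy as np

def posterior(X, y, sigma, Sigma):
    """Posterior mean and covariance of w; X has shape (N, L), columns are inputs."""
    precision = X @ X.T / sigma**2 + np.linalg.inv(Sigma)   # X X^T / sigma^2 + Sigma^{-1}
    cov = np.linalg.inv(precision)                          # posterior covariance
    mean = cov @ X @ y / sigma**2                           # posterior mean w_bar
    return mean, cov

rng = np.random.default_rng(2)
N, L, sigma = 3, 200, 0.1                # illustrative sizes and noise level
w_true = np.array([1.0, -2.0, 0.5])      # weights used to simulate data (assumed)
X = rng.normal(size=(N, L))
y = X.T @ w_true + sigma * rng.normal(size=L)

w_bar, cov = posterior(X, y, sigma, Sigma=np.eye(N))
print(w_bar)
```

With many examples and low noise, the posterior mean should land close to the weights that generated the data, and the posterior covariance should shrink. For larger problems, solving the linear system with `np.linalg.solve` is usually preferable to forming an explicit inverse.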

The final predictive model

Now that we have inferred the posterior probability $ p(\vw | \labeledset) $, we can write the final predictive model of the Bayesian linear regression approach.

Starting from the predictive model defined in Equation \eqref{eqn:reg-pred}, we substitute the results for the relevant probabilities to arrive at the result.

\begin{aligned} \hat{y} &= \int p(f(\vx) | \vw) p(\vw | \labeledset) d\vw \\\\ &= \Gauss(\vx^T\bar{\vw}, \vx^T\mA^{-1}\vx) \\\\ \label{eqn:blr-pred-final} \end{aligned}
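The result is a Gaussian distribution over $ f(\vx) $ whose mean $ \vx^T\bar{\vw} $ can serve as the point prediction and whose variance $ \vx^T\mA^{-1}\vx $ quantifies the uncertainty due to the weights. The sketch below computes both in closed form and compares them against a Monte Carlo average over weights sampled from the posterior. The data, the test point `x_test`, and the sample count are illustrative assumptions. The variance shown is for the noise-free function value. A prediction of a noisy target would add $ \sigma^2 $.

```python
import numpy as np

rng = np.random.default_rng(3)
N, L, sigma = 2, 10, 0.5                 # illustrative sizes and noise level
X = rng.normal(size=(N, L))              # columns are the training inputs
y = X.T @ np.array([0.7, -1.2]) + sigma * rng.normal(size=L)
Sigma = np.eye(N)                        # prior covariance (assumed isotropic)

# Posterior over the weights: N(w_bar, A^{-1}).
A = X @ X.T / sigma**2 + np.linalg.inv(Sigma)
A_inv = np.linalg.inv(A)
w_bar = A_inv @ X @ y / sigma**2

# Closed-form predictive mean and variance at a test input.
x_test = np.array([1.0, 2.0])
pred_mean = x_test @ w_bar
pred_var = x_test @ A_inv @ x_test

# Monte Carlo version of the integral: average x^T w over posterior samples of w.
w_samples = rng.multivariate_normal(w_bar, A_inv, size=200_000)
f_samples = w_samples @ x_test

print(pred_mean, f_samples.mean())
print(pred_var, f_samples.var())
```

The sampled mean and variance should agree with the closed-form values up to Monte Carlo error, which illustrates that the predictive distribution is the posterior-weighted average over all parameter settings.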

Relationship to ridge regression

We have presented ridge regression in another article. It is a linear least squares model, with an additional $ L_2 $ regularization term on the weight parameters. The solution of the ridge regression model is $ \star{\vw}_{\text{ridge}} $, calculated as

$$ \star{\vw}_{\text{ridge}} = \left(\mX_\labeledset\mX_\labeledset^T + \lambda\mI\right)^{-1}\mX_\labeledset\vy_\labeledset $$

Although ridge regression is not a Bayesian approach, its solution is closely related to the maximum a posteriori (MAP) estimate, the mode of the posterior distribution over the weights that we found earlier. Because the posterior is Gaussian, its mode coincides with its mean. The MAP estimate of $ \vw $ is therefore $ \bar{\vw} $, which we calculated to be

$$ \star{\vw}_{\text{MAP}} = \bar{\vw} = \frac{1}{\sigma^2} \left( \frac{\mX_\labeledset \mX_\labeledset^T}{\sigma^2} + \mSigma^{-1} \right)^{-1} \mX_\labeledset \vy_\labeledset $$

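Apart from the terms that explicitly model the noise distribution ( $ \sigma^2 $ ), the ridge solution $ \star{\vw}_{\text{ridge}} $ has the same form as the MAP estimate $ \star{\vw}_{\text{MAP}} $ when the prior is an isotropic Gaussian. Specifically, multiplying the MAP expression through by $ \sigma^2 $ gives $ \star{\vw}_{\text{MAP}} = \left( \mX_\labeledset \mX_\labeledset^T + \sigma^2 \mSigma^{-1} \right)^{-1} \mX_\labeledset \vy_\labeledset $, so the two coincide exactly when $ \sigma^2 \mSigma^{-1} = \lambda \mI $, which reduces to $ \mSigma^{-1} = \lambda \mI $ for unit noise variance.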
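This correspondence can be checked numerically. The sketch below assumes an isotropic prior whose precision is scaled by the noise variance so that $ \sigma^2 \mSigma^{-1} = \lambda \mI $, and compares the ridge solution with the MAP estimate on random data. All numeric values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
N, L, sigma, lam = 3, 20, 0.4, 2.0       # illustrative sizes, noise level, ridge penalty
X = rng.normal(size=(N, L))              # columns are the training inputs
y = rng.normal(size=L)

# Ridge solution: (X X^T + lam I)^{-1} X y.
w_ridge = np.linalg.solve(X @ X.T + lam * np.eye(N), X @ y)

# MAP estimate with an isotropic prior chosen so that sigma^2 * Sigma^{-1} = lam * I.
Sigma_inv = (lam / sigma**2) * np.eye(N)
w_map = np.linalg.solve(X @ X.T / sigma**2 + Sigma_inv, X @ y) / sigma**2

print(np.allclose(w_ridge, w_map))
```

Under this reading, the ridge penalty $ \lambda $ plays the role of the ratio between the noise variance and the prior variance of each weight.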
