Norm-based regularization


Introduction

Regularization is a collection of strategies that enable a learning algorithm to generalize better to new inputs, often at the expense of performance on the training set. In this sense, it reduces the risk of overfitting the training data: it trades a small increase in bias for a reduction in the variance of the model.

Some model families, such as decision trees, use regularization strategies specifically designed for their structure. Deep neural networks offer many alternative regularization strategies, which we have explained in a comprehensive article focused on regularization in deep learning. Other models, especially parametric models with weight vectors, may be regularized using norm penalties on those weight vectors. In this article, we cover these norm-based regularization strategies.

Prerequisites

To understand norm-based regularization, we recommend first getting acquainted with the concepts in the prerequisite articles linked in this section.

Problem setting

In classification, the goal of the predictive model is to identify the class that generated a particular instance.

Consider such an instance \( \vx \in \real^N \), a vector consisting of \( N \) features, \(\vx = [x_1, x_2, \ldots, x_N] \).

We need to assign it to one of the \( M \) classes \( C_1, C_2, \ldots, C_M \), depending on the values of its \( N \) features.

Norm penalties

Several parametric machine learning models, such as logistic regression, support vector machines, least-squares regression, and deep neural networks, rely on weight parameters that are learned during the training phase.

One method of regularizing such parametric models is to constrain their parameter values. This can be achieved by applying a suitable norm as a penalty on the parameters, or weights, of the model. If \( \loss \) denotes the unregularized loss of the model, we incorporate a regularization term \( \Omega(\vw) \) on the parameter vector \( \vw \in \real^\ndim \), \( \vw = [w_1,\ldots,w_\ndim] \), as follows.

$$ \loss_{\text{regularized}} = \loss + \alpha \Omega(\vw) $$

where \( \alpha \ge 0 \) is a hyperparameter that controls the strength of the regularization term.
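To make this structure concrete, here is a minimal sketch in Python, assuming NumPy; the function name regularized_loss is illustrative, and the loss is taken as a precomputed scalar.

```python
import numpy as np

def regularized_loss(loss, w, alpha, omega):
    """Total objective: the unregularized loss plus alpha times the penalty Omega(w)."""
    return loss + alpha * omega(w)

# Example: use the Euclidean norm of the weights as the penalty.
w = np.array([0.5, -1.2, 3.0])
print(regularized_loss(loss=0.8, w=w, alpha=0.1, omega=np.linalg.norm))
# 0.8 + 0.1 * ||w||_2, approximately 1.127
```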

The norm penalty is typically a \( p \)-norm of the parameter vector.

$$ \Omega(\vw) = \norm{\vw}{p} $$

The \( p \)-norm is defined as

$$ \norm{\vw}{p} = \left[ \sum_{i=1}^\ndim |w_i|^p \right]^{\frac{1}{p}} $$
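The definition translates directly into code. A small sketch, assuming NumPy; p_norm is an illustrative name, and the result agrees with NumPy's built-in norm.

```python
import numpy as np

def p_norm(w, p):
    """The p-norm: (sum_i |w_i|^p)^(1/p)."""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

w = np.array([3.0, -4.0])
print(p_norm(w, p=1))            # 7.0
print(p_norm(w, p=2))            # 5.0
print(np.linalg.norm(w, ord=2))  # 5.0, numpy's built-in agrees
```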

\( L_2 \)-norm

A popular form of penalty on the weights is the \( L_2 \) norm, also known as weight decay in the context of neural networks, where it is applied to each weight parameter in the network.

The \( L_2 \) norm is defined as

$$ \norm{\vw}{2} = \sqrt{\sum_{i=1}^\ndim w_i^2} $$

It turns out that the \(L_2\)-norm is just the square root of the inner product of the vector with itself: \( \norm{\vw}{2} = \sqrt{\vw^T\vw} \).
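This identity is easy to verify numerically; a small sketch, assuming NumPy.

```python
import numpy as np

w = np.array([3.0, 4.0])
print(np.sqrt(np.sum(w ** 2)))  # 5.0, directly from the definition
print(np.sqrt(w @ w))           # 5.0, as the square root of the inner product
print(np.linalg.norm(w))        # 5.0, numpy's Euclidean norm
```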

The \( L_2 \) norm is also written as the \( L^2 \)-norm or \( \ell^2 \)-norm, and is also known as the Euclidean norm or \( 2 \)-norm.

Penalizing the Euclidean norm has the effect of constraining all the elements of the parameter vector to be close to zero.
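The name weight decay comes from the gradient-descent update. In practice, the squared norm \( \frac{1}{2}\norm{\vw}{2}^2 \) is often used as the penalty because its gradient is simply \( \vw \); scaled by \( \alpha \), it shrinks every weight a little on each step. A minimal sketch, assuming NumPy and a precomputed gradient grad_loss of the unregularized loss (all names here are illustrative):

```python
import numpy as np

def sgd_step_with_weight_decay(w, grad_loss, lr, alpha):
    """One gradient step on loss + (alpha/2) * ||w||_2^2.

    The penalty contributes the gradient alpha * w, so every step shrinks
    ("decays") each weight toward zero, on top of the loss gradient.
    """
    return w - lr * (grad_loss + alpha * w)

w = np.array([1.0, -2.0, 0.5])
zero_grad = np.zeros_like(w)  # with a zero loss gradient, only the decay acts
print(sgd_step_with_weight_decay(w, zero_grad, lr=0.1, alpha=0.01))
# [ 0.999 -1.998  0.4995] -- every weight shrunk by a factor of (1 - lr * alpha)
```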

\( L_1 \)-norm

Another popular norm-based regularization strategy uses the \(L_1\)-norm.

The \( L_1 \) norm is defined as the sum of the absolute values of the elements of the parameter vector.

$$ \norm{\vw}{1} = \sum_{i=1}^\ndim |w_i| $$

Compared to \(L_2\) regularization, \(L_1\) regularization enforces sparsity in the parameter vector: it results in parameter vectors with few nonzero entries. This gives the approach an inherent feature-selection capability, because weights that interact with irrelevant features are automatically driven to zero.
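One way to see where the sparsity comes from: the proximal operator of the \(L_1\) penalty is the soft-thresholding function, which shrinks every weight and sets small ones exactly to zero. A minimal sketch, assuming NumPy; soft_threshold is an illustrative name.

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of the L1 penalty: shrink each weight by t,
    and set any weight whose magnitude is below t exactly to zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

w = np.array([0.05, -0.8, 0.02, 1.5])
print(soft_threshold(w, t=0.1))
# [ 0.  -0.7  0.   1.4] -- the two small weights are zeroed out
```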

Max-norm

A recently popularized alternative is the max-norm, also known as the \(L_\infty\)-norm.

$$ \norm{\vw}{\infty} = \underset{i=1,\ldots,\ndim}{\max} |w_i| $$

Observe that the max-norm penalizes only the parameter element with the largest absolute value.
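A small numerical check, assuming NumPy:

```python
import numpy as np

w = np.array([0.5, -3.0, 2.0])
print(np.max(np.abs(w)))              # 3.0, the largest absolute entry
print(np.linalg.norm(w, ord=np.inf))  # 3.0, numpy's L-infinity norm agrees
```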
