# Regularization

## Introduction

Regularization is a collection of strategies that enable a learning algorithm to generalize better to new inputs, often at the expense of reduced performance on the training set. In this sense, it reduces the risk of overfitting the training data, typically lowering the variance of the model at the cost of some added bias. Some regularization approaches are shared with machine learning more broadly, but as we will see in this article, other strategies are designed specifically for regularizing deep learning models.

## Prerequisites

To understand regularization approaches for deep learning, we recommend familiarity with the basics of neural networks, loss functions, and gradient-based optimization. First get acquainted with these concepts before proceeding.

## Parameter constraints

One method of regularizing deep neural networks is to constrain the parameter values, for example, by applying a suitable norm as a penalty on the parameters or weights of the model. If $\loss$ denotes the unregularized loss of the neural network, then we incorporate the regularization term $\Omega(\mTheta)$ on the parameters $\mTheta$ of the model.

$$\loss_{\text{regularized}} = \loss + \alpha \Omega(\mTheta)$$

where $\alpha \in [0, \infty)$ is a hyperparameter that controls the impact of the regularization term.

A popular form of penalty on the weights is the squared $L_2$ norm, also known as weight decay in neural networks, which is applied to each weight vector in the network.

$$\Omega(\mTheta) = \sum_{\vw \in \mTheta} \norm{\vw}{2}^2$$

where the squared $L_2$ norm is defined as $\norm{\vw}{2}^2 = \vw^T\vw$.
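As a minimal sketch, this penalty can be computed directly; the weight vectors, the stand-in data loss, and the value of $\alpha$ below are hypothetical:

```python
import numpy as np

def weight_decay_penalty(weights, alpha):
    """Squared-L2 penalty: alpha * sum of w^T w over all weight arrays."""
    return alpha * sum(float(w @ w) for w in weights)

# Hypothetical weights of a small network, flattened for illustration.
w1 = np.array([0.5, -1.0, 2.0])
w2 = np.array([1.0, 1.0])

data_loss = 0.3   # stand-in for the unregularized loss
alpha = 0.01

total_loss = data_loss + weight_decay_penalty([w1, w2], alpha)
# The gradient of the penalty with respect to each weight vector is
# 2 * alpha * w, which shrinks ("decays") the weights at every update.
```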

Sparsity can be enforced in the model parameters by using an $L_1$ norm instead, just like in lasso regression.

$$\Omega(\mTheta) = \sum_{\vw \in \mTheta} \norm{\vw}{1}$$

where the $L_1$ norm is defined as $\norm{\vw}{1} = \sum_{i} |w_i|$.
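A similar sketch for the $L_1$ penalty, again with hypothetical weights, also hints at why it induces sparsity: its (sub)gradient has constant magnitude, pushing each parameter toward exactly zero rather than merely shrinking it:

```python
import numpy as np

def l1_penalty(weights, alpha):
    """L1 penalty: alpha * sum of absolute values over all weight arrays."""
    return alpha * sum(float(np.abs(w).sum()) for w in weights)

w = np.array([0.5, -1.0, 0.0, 2.0])  # hypothetical weight vector
penalty = l1_penalty([w], alpha=0.1)  # 0.1 * (0.5 + 1.0 + 0.0 + 2.0)

# The subgradient of |w_i| is sign(w_i): a constant-magnitude push
# toward zero, independent of the weight's magnitude.
subgrad = 0.1 * np.sign(w)
```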

For more details on the effect of $L_2$ and $L_1$ regularization in machine learning models, refer to our comprehensive article on norm-based regularization in machine learning.

## Parameter tying

An enhanced form of parameter norm penalties is parameter tying. Intuitively, if we believe that certain parameters in the model should be close to each other, we can enforce this belief by explicitly requiring them to be close in value. For example, if $\vw_a$ and $\vw_b$ are two weight vectors in a network that we expect to be similar (perhaps because they perform the same task or have similar input/output distributions), then we can pull them closer by penalizing a norm of their difference as an additional loss term.

$$\loss_{\text{tying}} = \loss + \alpha \norm{\vw_a - \vw_b}{2}$$

The example here shows the $L_2$ norm, but any other suitable norm may also be used.
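The tying penalty is straightforward to compute; the sketch below uses the squared $L_2$ norm of the difference, with hypothetical weight vectors:

```python
import numpy as np

def tying_penalty(w_a, w_b, alpha):
    """Squared-L2 penalty on the difference between two tied weight vectors."""
    d = w_a - w_b
    return alpha * float(d @ d)

# Hypothetical weight vectors we believe should be similar.
w_a = np.array([1.0, 2.0, 3.0])
w_b = np.array([1.5, 2.0, 2.0])

penalty = tying_penalty(w_a, w_b, alpha=0.1)
# difference is [-0.5, 0.0, 1.0]; squared norm 1.25, so penalty is 0.125
```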

Such parameter tying is particularly useful in multitask learning, which typically ends up with similar versions of the weight vectors for multiple related tasks. It has the added benefit of letting each task be learned with a limited amount of training data, because each task also receives additional supervision from the weight vectors of other related tasks.

## Parameter sharing

As an extreme variant of parameter tying, we may require that a set of parameters be not just close to each other, but exactly the same! This effect is achieved by parameter sharing: using the same set of parameters at all the relevant locations in the network. Parameter sharing results in models with fewer effective parameters overall. This reduces the capacity of the model, thereby limiting the possibility of overfitting the training set, a desirable goal of regularization.

The most common example of parameter sharing is the convolutional neural network (CNN). In a CNN, the same convolutional kernel is applied across the entire image, resulting in a significant reduction in the number of parameters needed to train the model.
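A one-dimensional sketch makes the sharing explicit: the same three kernel weights are reused at every input position (the input and kernel values below are illustrative):

```python
import numpy as np

def conv1d_valid(x, kernel):
    """1-D 'valid' convolution (cross-correlation, as in most deep learning
    libraries): the SAME kernel weights are applied at every position."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])  # 3 shared parameters in total

out = conv1d_valid(x, kernel)  # -> [-2., -2., -2.]
# A fully connected layer mapping 5 inputs to 3 outputs would need 15
# weights; sharing the kernel needs only 3.
```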

## Early stopping

The net effect of training longer on a dataset is overfitting that dataset, as the model parameters adapt too closely to the training data. Such models do not generalize well to new, unseen data.

An obvious solution to this is, well, training for fewer epochs. This is the main idea behind early stopping. The art here is to train for just the right number of epochs. How do we do that?

The early stopping approach works as follows. At the end of each epoch, save the corresponding validation error and a snapshot of the current model parameters as a checkpoint. After completing the planned maximum number of training epochs, find the best-performing epoch and use the model parameters from that epoch as the final trained model. It is also possible to terminate training early by checking whether the best validation error so far has failed to improve for a predefined number of epochs.
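The procedure above can be sketched as a generic training loop. Here `train_step` and `validate` are hypothetical callables standing in for one epoch of training and a validation pass; the simulated validation errors at the bottom are illustrative:

```python
import copy

def train_with_early_stopping(train_step, validate, max_epochs, patience):
    """Keep the parameters with the lowest validation error; stop once the
    error has not improved for `patience` consecutive epochs."""
    best_err, best_params, epochs_since_best = float("inf"), None, 0
    for epoch in range(max_epochs):
        params = train_step(epoch)
        err = validate(params)
        if err < best_err:
            best_err, best_params = err, copy.deepcopy(params)  # checkpoint
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break  # validation error stopped improving
    return best_params, best_err

# Simulated run: validation error falls, then rises (overfitting begins).
errors = [1.0, 0.8, 0.6, 0.7, 0.9, 1.1]
params, err = train_with_early_stopping(
    train_step=lambda e: {"epoch": e},
    validate=lambda p: errors[p["epoch"]],
    max_epochs=6, patience=2)
# Training halts after epoch 4; the returned snapshot is from epoch 2.
```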

An obvious benefit of early stopping is that it requires no change to the model or the loss function. This is particularly notable compared to weight decay, where restricting the weights too severely may lead to poor performance. That being said, early stopping may be used in conjunction with other regularization strategies such as parameter norm penalties.

One drawback of early stopping is the additional storage required to save a snapshot of the model parameters after each epoch. Another drawback, particularly in data-starved problems, is the need to hold out a validation set that is separate from the training set: a separate validation set means fewer examples available for training. This issue can be mitigated by retraining on the entire dataset (including the validation set) once the best number of epochs has been discovered. But whether to retrain for the same number of epochs, or the same number of parameter updates, is a subtle art!

## Sparse representations

An alternative approach to parameter norm penalties is that of sparse representations. Instead of penalizing the parameters, the sparse representation approach applies penalizing norms on activations of the layers in the model. Constraining the activations of a model layer indirectly implies a penalty on the model parameters, resulting in regularization.

If $\vh$ ranges over the activations of the layers in the network, then the updated loss can be written as

$$\loss_{\text{sparserepr}} = \loss + \alpha \sum_{\vh} \Omega(\vh)$$

We can use the now familiar norms on these activations; the $L_1$ norm is the typical choice because, as with parameters, it drives many activations to exactly zero, whereas an $L_2$ penalty merely shrinks them.
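A small sketch with a hypothetical ReLU layer, penalizing the activations $\vh$ rather than the weights (the weight matrix and input are randomly generated stand-ins):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hypothetical single hidden layer: the penalty is on h, not on W.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
x = np.array([1.0, -0.5, 2.0])

h = relu(W @ x)
activation_penalty = 0.1 * np.abs(h).sum()  # L1 penalty on activations
# Gradients of this term flow back through h into W, indirectly
# pressuring the network toward sparse (mostly zero) activations.
```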

## Dropout

You may remember bootstrap aggregating, also known as bagging, from our comprehensive article on random forests. Bagging is an effective strategy to reduce the variance of a predictor, without introducing additional bias, by building an ensemble of models trained on randomly sampled subsets of the training dataset. Building an ensemble using bagging is also likely to help reduce the variance of other high-capacity models, including deep neural networks. But unlike training a tree, neural network training is typically far more computationally demanding, and building an ensemble of such models is even more expensive. Can we approximate the ensemble somehow? Yes, with the dropout strategy.

The dropout strategy works by randomly dropping out some units in the network during minibatch training. A unit is dropped out of the network by zeroing its activation or output. This is typically achieved by multiplying the activations of all units (except the output units) in the network by a randomly sampled binary mask for each training example. For example, input layer units may be dropped with probability 0.2, and hidden layer units with probability 0.5.

The net effect of such random muting of units is an implicit ensemble of exponentially many networks: all the possible subnetworks that dropout can produce. At prediction time, the ensemble prediction can be approximated by averaging the outputs over many sampled dropout masks; typically 10 to 20 randomly sampled binary masks provide good variance reduction in the outputs. In practice, a cheaper alternative is the weight-scaling rule: run a single forward pass with no units dropped and each unit's output multiplied by its retention probability, which closely approximates the ensemble average.
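A minimal sketch of a dropout mask, using the common "inverted dropout" variant that rescales surviving units during training so that no rescaling is needed at test time (the activations below are illustrative):

```python
import numpy as np

def dropout(h, p_drop, rng):
    """Zero each unit with probability p_drop using a sampled binary mask.
    Dividing by (1 - p_drop) keeps the expected activation unchanged."""
    mask = rng.random(h.shape) >= p_drop
    return h * mask / (1.0 - p_drop)

rng = np.random.default_rng(42)
h = np.ones(10)                            # hypothetical layer activations
h_dropped = dropout(h, p_drop=0.5, rng=rng)
# In expectation half the units are zeroed; each surviving unit is
# scaled from 1.0 to 2.0, so the expected output per unit stays 1.0.
```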