# Activation functions

## Introduction

We motivated the need for activation functions in our comprehensive article on multilayer perceptrons. In that article, we provided the example of the ReLU activation function as a way of incorporating nonlinearity into the activations of the hidden layers of the deep feedforward network.

In this article, we list some other activation functions and provide a commentary on the appropriate scenario for their usage.
We have specifically included the activation functions that are commonly used in deep neural networks. For a more comprehensive listing of activation functions, refer to the corresponding Wikipedia article.

## Prerequisites

To understand the need for activation functions, we recommend familiarity with the concepts in

Follow the above links to first get acquainted with the corresponding concepts.

## Identity

The identity activation function returns its input as it is.

\begin{equation} \text{identity}(a) = a \label{eqn:identity} \end{equation}

It is the simplest of all activation functions but does not impart any particular characteristic to the input. It is mostly reserved for output layers, especially in the case of real-valued regression problems.

## Sigmoid: $\sigma$

The sigmoid activation, typically denoted as $\sigma(a)$, is a nonlinear activation function with the range $(0,1)$.

\begin{equation} \sigma(a) = \frac{1}{1 + \expe{-a}} \label{eqn:sigmoid} \end{equation}

It is commonly used for gates in LSTMs and GRUs. It can also be used for probabilistic outputs because it is always positive and less than 1.

It is also known as the logistic or soft-step activation function.

## Nature of the sigmoid function

The sigmoid function is an apt choice for predicting a probabilistic output. This is possible because the output of the sigmoid function is bounded in the range $(0,1)$.

The sigmoid function output is $0.5$ only when its input is $0$. For positive inputs, the sigmoid returns values in the range $(0.5,1)$. For negative inputs, the sigmoid returns values in the range $(0,0.5)$.
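The behavior described above can be checked with a minimal sketch in plain Python. The function name `sigmoid` and the two-branch evaluation are illustrative choices rather than part of the definition; the branches only avoid overflow in `math.exp` for inputs of large magnitude.

```python
import math


def sigmoid(a: float) -> float:
    """Logistic sigmoid 1 / (1 + e^{-a}), evaluated without overflow."""
    if a >= 0:
        # exp(-a) <= 1 here, so the direct formula is safe.
        return 1.0 / (1.0 + math.exp(-a))
    # For negative a, use the equivalent form e^{a} / (1 + e^{a}).
    z = math.exp(a)
    return z / (1.0 + z)


for a in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"sigmoid({a:+.1f}) = {sigmoid(a):.4f}")
```

Note that in floating-point arithmetic the result rounds to exactly `0.0` or `1.0` for inputs of very large magnitude, even though the mathematical function never reaches those values.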

## Hyperbolic tangent: $\text{tanh}$

The hyperbolic tangent activation, typically denoted as $\text{tanh}(a)$, is a nonlinear activation function with the range $(-1,1)$. It is quite similar to the sigmoid activation function, but allows for negative values.

\begin{equation} \text{tanh}(a) = \frac{\expe{a} - \expe{-a}}{\expe{a} + \expe{-a}} = \frac{\expe{2a} - 1}{\expe{2a} + 1} \label{eqn:tanh} \end{equation}

## Nature of the hyperbolic tangent function

Unlike the sigmoid function, the hyperbolic tangent function is centered at zero. Its output is bounded in the range $(-1,1)$.

The hyperbolic tangent function output is $0$ only when its input is $0$. For positive inputs, it returns values in the range $(0,1)$. For negative inputs, it returns values in the range $(-1,0)$.
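As a rough illustration, the sketch below evaluates the second form of the $\text{tanh}$ definition directly and compares it with Python's built-in `math.tanh`. It also checks the well-known identity $\text{tanh}(a) = 2\sigma(2a) - 1$, which makes the similarity to the sigmoid concrete. The helper names `tanh_from_exp` and `sigmoid` are hypothetical, and the direct formula is only suitable for moderate inputs because `math.exp` overflows for large arguments.

```python
import math


def sigmoid(a: float) -> float:
    """Plain logistic sigmoid; adequate for the moderate inputs used here."""
    return 1.0 / (1.0 + math.exp(-a))


def tanh_from_exp(a: float) -> float:
    """tanh via (e^{2a} - 1) / (e^{2a} + 1); a sketch for moderate |a| only."""
    e2a = math.exp(2.0 * a)
    return (e2a - 1.0) / (e2a + 1.0)


for a in (-2.0, -0.5, 0.0, 0.5, 2.0):
    direct = tanh_from_exp(a)
    via_sigmoid = 2.0 * sigmoid(2.0 * a) - 1.0
    print(f"a={a:+.1f}  direct={direct:+.6f}  "
          f"via sigmoid={via_sigmoid:+.6f}  math.tanh={math.tanh(a):+.6f}")
```

In practice, a library routine such as `math.tanh` is the safer choice, since it handles large inputs without overflow.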

## ReLU

The rectified linear unit (ReLU) is a piecewise linear function that maps negative inputs to zero and leaves positive inputs unchanged. It is typically denoted by its acronym, $\text{ReLU}$.

\begin{equation} \text{ReLU}(a) = \max\set{0,a} \label{eqn:relu} \end{equation}

ReLU is the default recommendation for all hidden layers in modern deep neural networks. Stacking multiple layers with ReLU activations lets a network approximate complex nonlinear functions, because composing piecewise linear functions yields a piecewise linear function with many more linear pieces.
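A minimal sketch of this idea in plain Python follows. The function `abs_via_relu` is a hypothetical example with hand-picked weights of $+1$ and $-1$: it combines two ReLU units to reproduce the absolute-value function, which neither unit could represent alone.

```python
def relu(a: float) -> float:
    """ReLU: zero for negative input, identity for positive input."""
    return max(0.0, a)


def abs_via_relu(x: float) -> float:
    """Two ReLU units with hand-picked weights +1 and -1, summed.

    relu(x) is active for x > 0 and relu(-x) for x < 0, so the sum
    is the piecewise linear function |x|.
    """
    return relu(x) + relu(-x)


for x in (-2.5, -1.0, 0.0, 1.0, 2.5):
    print(f"x={x:+.1f}  relu(x)={relu(x):.1f}  abs_via_relu(x)={abs_via_relu(x):.1f}")
```

Adding more units and layers adds more linear pieces in the same way, which is what allows a ReLU network to fit increasingly complex shapes.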

## Leaky ReLU

ReLU is harsh on negative inputs: it returns zero for all of them, and its gradient there is also zero. This rigidity can result in dead units, that is, units whose activation is always zero and which therefore stop learning.

A milder alternative is the leaky ReLU, defined as follows:

\begin{equation} \text{LeakyReLU}(a) = \begin{cases} & 0.01 a ~~~~& \text{ for }~ a < 0 \\\\ & a ~~~~& \text{ for }~ a \ge 0 \end{cases} \label{eqn:leaky-relu} \end{equation}

Thus, negative values are reduced in magnitude, but still manage to pass through, thereby preventing dead units.
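The difference from ReLU is easiest to see in the gradients. The sketch below is illustrative only: the function names are hypothetical, and the derivative at exactly zero is set to $1$ by convention, following the $a \ge 0$ branch of the definition above.

```python
def leaky_relu(a: float, slope: float = 0.01) -> float:
    """Leaky ReLU: identity for a >= 0, small linear slope for a < 0."""
    return a if a >= 0 else slope * a


def leaky_relu_grad(a: float, slope: float = 0.01) -> float:
    """Derivative of leaky ReLU (value at 0 chosen by convention)."""
    return 1.0 if a >= 0 else slope


def relu_grad(a: float) -> float:
    """Derivative of plain ReLU (value at 0 chosen by convention)."""
    return 1.0 if a >= 0 else 0.0


for a in (-5.0, -0.5, 0.5, 5.0):
    print(f"a={a:+.1f}  leaky_relu={leaky_relu(a):+.3f}  "
          f"leaky grad={leaky_relu_grad(a):.2f}  relu grad={relu_grad(a):.2f}")
```

For negative inputs the ReLU gradient is exactly zero, while the leaky variant still passes a small gradient, so the unit can continue to be updated.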

## Parametric ReLU: PReLU

The leaky ReLU discussed above makes an arbitrary choice of returning $0.01 a$ when $a < 0$. The multiplier $0.01$ can instead be replaced with a learnable parameter $\alpha$ that is adapted during training, just like any other parameter of the model.

\begin{equation} \text{PReLU}(a) = \begin{cases} & \alpha a ~~~~& \text{ for }~ a < 0 \\\\ & a ~~~~& \text{ for }~ a \ge 0 \end{cases} \label{eqn:prelu} \end{equation}
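To make the phrase "learnable parameter" concrete, here is a minimal sketch of a single gradient-descent update of $\alpha$. Everything in the toy setup is an assumption made for illustration: the pre-activation `a`, the `target`, the squared-error loss, and the learning rate are hypothetical values, not taken from any particular model.

```python
def prelu(a: float, alpha: float) -> float:
    """PReLU: identity for a >= 0, slope alpha for a < 0."""
    return a if a >= 0 else alpha * a


def prelu_grad_alpha(a: float) -> float:
    """Partial derivative of PReLU(a) with respect to alpha."""
    return 0.0 if a >= 0 else a


# Hypothetical toy setup: one pre-activation, one target, squared-error loss.
a, target = -2.0, -0.5
alpha, learning_rate = 0.01, 0.1


def loss(alpha_value: float) -> float:
    return 0.5 * (prelu(a, alpha_value) - target) ** 2


# Chain rule: dL/dalpha = (output - target) * d(output)/d(alpha)
grad = (prelu(a, alpha) - target) * prelu_grad_alpha(a)
new_alpha = alpha - learning_rate * grad

print(f"alpha: {alpha} -> {new_alpha:.3f}")
print(f"loss:  {loss(alpha):.4f} -> {loss(new_alpha):.4f}")
```

Fixing $\alpha = 0.01$ gives the leaky ReLU with the slope used above, and $\alpha = 0$ gives plain ReLU, so PReLU can be read as a generalization of both.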

## SoftPlus

ReLU, Leaky ReLU, and PReLU are not differentiable at zero. A softer alternative that is differentiable everywhere, but behaves roughly like ReLU, is the SoftPlus activation function.

\begin{equation} \text{SoftPlus}(a) = \ln \left(1 + \expe{a} \right) \label{eqn:softplus} \end{equation}

Although SoftPlus is differentiable everywhere, ReLU remains the preferred and default choice in neural networks. It often works well enough in practice and is very cheap to compute.
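The comparison between the two can be sketched numerically, taking SoftPlus in its standard form $\ln(1 + e^{a})$. The rearrangement used in the code, $\max(a, 0) + \ln(1 + e^{-|a|})$, is an algebraically equivalent form chosen here only to avoid overflow for large inputs; the function names are illustrative.

```python
import math


def relu(a: float) -> float:
    return max(0.0, a)


def softplus(a: float) -> float:
    """SoftPlus ln(1 + e^{a}), written as max(a, 0) + ln(1 + e^{-|a|}).

    The two forms are equal, but this one never calls exp on a large
    positive number, so it does not overflow.
    """
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


for a in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"a={a:+5.1f}  relu={relu(a):8.4f}  softplus={softplus(a):8.4f}")
```

The two functions differ most near zero, where SoftPlus equals $\ln 2$ while ReLU equals $0$, and become nearly indistinguishable as the input moves away from zero in either direction.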