## Introduction

The perceptron learning algorithm of Frank Rosenblatt is an important precursor to modern day neural networks. It is a linear discriminative binary classifier.

This learning module contains many interactive demos, which are easier to work with on a larger screen.
If you are currently on a small-screen device, consider bookmarking this page and revisiting later.

To understand the Perceptron classifier, we recommend familiarity with the concepts in

- Introduction to machine learning: An introduction to basic concepts in machine learning such as classification, training instances, features, and feature types.
- Stochastic gradient descent.

Follow the above links to first get acquainted with the corresponding concepts.

In classification, the goal of the predictive model is to identify the class that generated a particular instance.

Consider such an instance \( \vx \in \real^N \), a vector consisting of \( N \) features, \(\vx = [x_1, x_2, \ldots, x_N] \).

We need to assign it to one of the \( 2 \) classes depending on the values of the \( N \) features.

The predictive model of the perceptron is

\begin{equation} \yhat = f \left(\vw^T \vx + b \right) \label{eqn:class-pred} \end{equation}

where

- \( \vw \in \real^{N} \) is the parameter vector, the so-called **weights** of the model,
- \( b \) is the bias of the model,
- \( f \) is known as the **activation function**. It is a step function of the form

\begin{equation} f(a) = \begin{cases} +1, ~~~~a \ge 0, \\\\ -1, ~~~~a < 0. \end{cases} \end{equation}

It is common in the literature to include the bias term in the weights and concatenate a corresponding element \( 1 \) to the input \( \vx \). For simplicity, we will follow that convention and henceforth write \( \vw^T \vx \) to mean the full expanded form \( \vw^T \vx + b \).

Owing to this form of the step activation function, without loss of generality, the target binary class labels for the perceptron model are encoded as \( +1 \) and \( -1 \), instead of the more typical \( 1 \) and \( 0 \).
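The predictive model above can be sketched in a few lines of NumPy. This is a minimal illustration of \( \yhat = f(\vw^T \vx + b) \) with the step activation; the function name and the example values are our own, not part of the original module.

```python
import numpy as np

def perceptron_predict(w, x, b=0.0):
    """Predict the class label (+1 or -1) for a single instance x.

    Implements yhat = f(w^T x + b), where f is the step activation:
    f(a) = +1 if a >= 0, else -1.
    """
    a = np.dot(w, x) + b            # pre-activation: w^T x + b
    return 1 if a >= 0 else -1      # step activation

# Example with N = 2 features
w = np.array([1.0, -2.0])
x = np.array([3.0, 1.0])
print(perceptron_predict(w, x))     # w.x = 1.0 >= 0, so the prediction is +1
```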

Now, let us try to understand the effect of changing the weight vector \( \vw \) and the bias \( b \) on the predictive model of the Perceptron classifier.

If you have tried the interactive demo of the predictive model in the previous section, you should note a few things.

- The weight vector \( \vw \) is always perpendicular to the decision boundary, the so-called **separating hyperplane** between the orange and blue classes. The separating hyperplane is indicated with a red line. Thus, a rotation of the weight vector results in a corresponding rotation of the decision boundary.
- The weight vector \( \vw \) points in the direction of increasing value of the function \( \vw^T\vx + b \).
- Increasing the magnitude of the vector (dragging it in the same direction away from the origin) does not change the decision boundary. This intuition plays a key role in regularization, as we shall see with **norm-based regularization** in machine learning.
- The bias term \( b \) has the net effect of sliding the decision boundary away from the origin. When \( b = 0 \), the decision boundary passes through the origin.
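The scale-invariance observation can be verified numerically. The sketch below (our own, not from the module) scales the weight vector and bias by the same positive constant and checks that the prediction never changes, since \( \operatorname{sign}(c\,(\vw^T\vx + b)) = \operatorname{sign}(\vw^T\vx + b) \) for any \( c > 0 \).

```python
import numpy as np

def predict(w, x, b=0.0):
    """Step-activation perceptron prediction: +1 if w^T x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

x = np.array([0.5, -1.5])
w = np.array([2.0, 1.0])
b = 0.3

# Scaling w and b together by any positive constant leaves every prediction,
# and hence the decision boundary, unchanged.
for c in [0.1, 1.0, 10.0]:
    assert predict(c * w, x, c * b) == predict(w, x, b)
print("predictions unchanged under positive scaling")
```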

As with most machine learning models, the weights are fit to the data by loss minimization. Note that the perceptron is a precursor to the more evolved neural networks and deep learning models of recent times; the popular loss functions of today, such as those based on negative log-likelihood or cross-entropy, were not well-studied at the time.

A potential **loss function** in the case of the perceptron is the *total number of misclassified examples*.
Although simple to calculate, this poses a major challenge for optimization: the loss is a piecewise constant function of \( \vw \).
It has discontinuities wherever a minor change in \( \vw \) moves the decision boundary across an instance.
Gradient-based optimization methods therefore cannot be applied, because the gradient is zero almost everywhere.

Instead, Rosenblatt came up with a *criterion* that the weights should obey.
Note that for a correctly classified positive instance \( p \), it is the case that \( y_p = +1 \) and \( f(\vw^T \vx_p) \ge 0 \).
Similarly, for a correctly classified negative instance \( n \), it is the case that \( y_n = -1 \) and \( f(\vw^T \vx_n) < 0 \).

This means, for any correctly classified instance \( i \), it is the case that

$$ f(\vw^T \vx_i) y_i \ge 0 $$

Moreover, since the activation function is effectively just the *signum* function, it is the case that

$$ \vw^T \vx_i y_i \ge 0 $$

Thus, if the prediction \( f(\vw^T \vx_i) \) does not match the true label \( y_i \), the above inequality will be violated.
Rosenblatt identified that and defined the **perceptron criterion** as

$$ \mathcal{L}_{\text{perceptron}} (\vw) = -\sum_{i \in \mathcal{M}_{\vw}} \vw^T \vx_i y_i $$

where \( \mathcal{M}_{\vw} \) denotes the set of all instances that are misclassified with the weight vector \( \vw \). Note the negative sign: each misclassified instance has \( \vw^T \vx_i y_i < 0 \), so every term contributes a positive amount. The more instances that are misclassified, and the further they lie from the boundary, the higher the loss, making this a suitable loss function to minimize.
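The perceptron criterion is straightforward to compute over a dataset. The sketch below (our own illustration) assumes the bias is already folded into \( \vw \), as the module's convention prescribes, and that labels are in \( \{+1, -1\} \).

```python
import numpy as np

def perceptron_criterion(w, X, y):
    """Perceptron criterion: L(w) = -sum over misclassified i of (w^T x_i) * y_i.

    X has shape (num_instances, num_features); y holds labels in {+1, -1}.
    Assumes the bias is folded into w (X carries a constant-1 feature if needed).
    """
    margins = (X @ w) * y            # w^T x_i * y_i for every instance
    misclassified = margins < 0      # the violated instances form the set M_w
    return -np.sum(margins[misclassified])

X = np.array([[1.0, 2.0], [2.0, -1.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
print(perceptron_criterion(np.array([1.0, 0.0]), X, y))   # separates perfectly: 0.0
print(perceptron_criterion(np.array([-1.0, 0.0]), X, y))  # all misclassified: 4.0
```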

Using this perceptron criterion as the loss function, learning proceeds via stochastic gradient descent, one instance at a time. If a training instance \( (\vx_i, y_i) \) is misclassified, the weight vector is updated to rectify that misclassification. The update rule is

\begin{align} \vw_{t+1} &= \vw_{t} - \eta \nabla \mathcal{L}_{\text{perceptron}} (\vw) \\\\ &= \vw_{t} + \eta \vx_i y_i \end{align}

That is a simple update rule!

Let us fit a Perceptron model to some training data.

Being a linear classifier, the trained model comfortably separates linearly separable classes. However, the model is ineffective when the classes are not linearly separable.
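The fitting procedure can be sketched end to end. This is a minimal implementation of the update rule above, using \( \eta = 1 \) and a hypothetical linearly separable toy dataset of our own; the bias is folded into the weights by appending a constant-1 feature, per the module's convention.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    """Fit perceptron weights with the SGD update w <- w + eta * x_i * y_i,
    applied whenever instance i is misclassified. The bias is folded into w
    by appending a constant-1 feature to every instance."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias feature
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            if (1 if xi @ w >= 0 else -1) != yi:    # misclassified?
                w += eta * xi * yi                  # Rosenblatt's update
                mistakes += 1
        if mistakes == 0:                           # converged: all separated
            break
    return w

# Hypothetical toy data: class +1 above the line x2 = x1, class -1 below it.
X = np.array([[0.0, 1.0], [1.0, 2.0], [1.0, 0.0], [2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
preds = [1 if np.dot(np.append(xi, 1.0), w) >= 0 else -1 for xi in X]
print(preds)  # [1, 1, -1, -1] -- matches y on this separable data
```

On linearly separable data such as this, the loop halts as soon as an epoch passes with no mistakes, which the convergence theorem discussed below guarantees will happen in finitely many steps.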

Although simple, the update rule for learning the perceptron has performance guarantees in the linear case.
The **perceptron convergence theorem** proves that if the data is linearly separable,

- the perceptron learning algorithm is guaranteed to find an exact solution, and
- it will do so in a finite number of steps.

That being said, the *finite* number of steps could still be a large number.
Moreover, it will find *an* exact solution: any weight vector that perfectly separates the training examples of the two classes, not necessarily an optimal one, as we will see in the case of more evolved regularized models such as the support vector machine.

Note that the predictive model involves a simple dot product between the weight vector \( \vw \) and the instance \( \vx \). This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, a direct dot product with the weight vector is not meaningful.
Therefore, we need to first preprocess the categorical variables using **one-hot encoding** to arrive at a binary feature representation.
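One-hot encoding can be sketched as follows; the helper name and the color example are our own illustration. Each category becomes one binary column, so every row is a real-valued vector that plugs directly into the dot product \( \vw^T \vx \).

```python
import numpy as np

def one_hot(values, categories):
    """Encode a list of categorical values as binary one-hot feature vectors.

    `categories` fixes the column order of the encoding.
    """
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        out[row, index[v]] = 1.0
    return out

colors = ["red", "blue", "red", "green"]
encoded = one_hot(colors, ["red", "green", "blue"])
print(encoded)
# Each row has exactly one 1.0, in the column of its category.
```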

Help us create more engaging and effective content and keep it free of paywalls and advertisements!

Please share your comments, questions, encouragement, and feedback.