# Rosenblatt's perceptron algorithm

## Introduction

The perceptron learning algorithm of Frank Rosenblatt is an important precursor to modern-day neural networks. It is a linear discriminative binary classifier.

## Prerequisites

To understand the Perceptron classifier, we recommend familiarity with the prerequisite concepts linked above.

Follow the above links to first get acquainted with the corresponding concepts.

## Problem setting

In classification, the goal of the predictive model is to identify the class that generated a particular instance.

Consider such an instance $\vx \in \real^N$, a vector consisting of $N$ features, $\vx = [x_1, x_2, \ldots, x_N]$.

We need to assign it to one of two classes depending on the values of the $N$ features.

## Predictive model

The predictive model of the perceptron is

$$\yhat = f \left(\vw^T \vx + b \right) \label{eqn:class-pred}$$

where,

• $\vw \in \real^{N}$ is the parameter vector, the so-called weights of the model,
• $b$ is the bias of the model,
• $f$ is known as the activation function. It is a step function of the form

$$f(a) = \begin{cases} +1, ~~~~a \ge 0, \\\\ -1, ~~~~a < 0. \end{cases}$$

It is common in the literature to include the bias term in the weights and concatenate a corresponding element $1$ to the input $\vx$. For simplicity, we will follow that convention and henceforth write $\vw^T \vx$ to mean the full expanded form $\vw^T \vx + b$.

Owing to this form of the step activation function, without loss of generality, the target binary class labels for the perceptron model are encoded as $+1$ and $-1$, instead of the more typical $1$ and $0$.
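To make the model concrete, here is a minimal sketch of the prediction in NumPy, following the convention above of folding the bias into the weights. The function name `predict` and the toy numbers are illustrative, not from any particular library:

```python
import numpy as np

def predict(w, x):
    # Pre-activation a = w^T x (bias assumed folded into w,
    # with a constant 1 appended to x, as in the text).
    a = np.dot(w, x)
    # Step activation f(a): +1 if a >= 0, else -1.
    return 1 if a >= 0 else -1

# Toy example with hypothetical weights; the last input element
# is the constant 1 appended for the bias.
w = np.array([1.0, -2.0, 0.5])
x = np.array([2.0, 0.5, 1.0])
print(predict(w, x))  # w^T x = 2 - 1 + 0.5 = 1.5 >= 0, so prints 1
```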

## Nature of the predictive model

Now, let us try to understand the effect of changing the weight vector $\vw$ and the bias $b$ on the predictive model of the Perceptron classifier.

## Observations about the predictive model

If you have tried the interactive demo of the predictive model in the previous section, you should have noticed a few things.

• The weight vector $\vw$ is always perpendicular to the decision boundary, the so-called separating hyperplane between the orange and blue classes. The separating hyperplane is indicated with a red line. Thus, a rotation of the weight vector results in a corresponding rotation of the decision boundary.
• The weight vector $\vw$ points in the direction of increasing value of the function $\vw^T\vx + b$.
• Increasing the magnitude of the weight vector (dragging it in the same direction away from the origin) does not change the decision boundary. This intuition plays a key role in regularization, as we shall see with norm-based regularization in machine learning.
• The bias term $b$ has the net effect of sliding the decision boundary away from the origin. When $b = 0$, the decision boundary passes through the origin.
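The scale-invariance observation above is easy to verify numerically: multiplying the (bias-augmented) weight vector by a positive constant never flips the sign of $\vw^T \vx$, so every prediction, and hence the decision boundary, is unchanged. A small sketch with hypothetical numbers:

```python
import numpy as np

def predict(w, x):
    # Step activation on w^T x, bias folded into w.
    return 1 if np.dot(w, x) >= 0 else -1

w = np.array([2.0, -1.0, 0.5])            # hypothetical augmented weights
points = [np.array([1.0, 3.0, 1.0]),
          np.array([-2.0, 0.5, 1.0]),
          np.array([0.3, -1.0, 1.0])]

# Scaling w by a positive constant leaves every predicted class unchanged.
for x in points:
    assert predict(w, x) == predict(10.0 * w, x)
```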

## The perceptron criterion

As with most machine learning models, the weights are fit by loss minimization. Note that the perceptron is a precursor to the more evolved neural networks and deep learning models of recent times; the popular loss functions of today, such as those based on negative log-likelihood or cross-entropy, had not yet been well studied at the time.

A potential loss function in the case of the perceptron is the total number of misclassified examples. Although simple to calculate, this poses a major challenge for optimization: the loss is a piecewise constant function of $\vw$, with discontinuities wherever a minor change in $\vw$ changes the decision boundary. Gradient-based optimization methods therefore cannot be applied, because the gradient is zero almost everywhere.

Instead, Rosenblatt came up with a criterion that the weights should obey. Note that for a correctly classified positive instance $p$, it is the case that $y_p = +1$ and $f(\vw^T \vx_p) = +1$, so that $\vw^T \vx_p \ge 0$. Similarly, for a correctly classified negative instance $n$, it is the case that $y_n = -1$ and $f(\vw^T \vx_n) = -1$, so that $\vw^T \vx_n < 0$.

This means, for any correctly classified instance $i$, it is the case that

$$f(\vw^T \vx_i) y_i \ge 0$$

Moreover, since the activation function is effectively just the signum function, $f(\vw^T \vx_i)$ has the same sign as $\vw^T \vx_i$, so it is also the case that

$$\vw^T \vx_i y_i \ge 0$$

Thus, if the prediction $f(\vw^T \vx_i)$ does not match the true label $y_i$, the above inequality is violated. Rosenblatt used this observation to define the perceptron criterion as

$$\mathcal{L}_{\text{perceptron}} (\vw) = -\sum_{i \in \mathcal{M}_{\vw}} \vw^T \vx_i y_i$$

where $\mathcal{M}_{\vw}$ denotes the set of all instances that are misclassified with the weight vector $\vw$. Note the negative sign: each misclassified instance contributes a positive amount, so the more labels that are violated, the higher the loss. This makes it a suitable loss function to minimize.
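The criterion is straightforward to compute. A minimal NumPy sketch (the function name and toy data are illustrative): it evaluates the margins $y_i \, \vw^T \vx_i$ for all instances at once and sums the violated ones with a negative sign.

```python
import numpy as np

def perceptron_loss(w, X, y):
    # Margins y_i * w^T x_i for every instance; X has one
    # (bias-augmented) instance per row, y holds +/-1 labels.
    margins = (X @ w) * y
    # M_w: the misclassified instances, i.e. the violated margins.
    misclassified = margins < 0
    # Negative sign makes each violation a positive contribution.
    return -np.sum(margins[misclassified])

# Hypothetical toy data: with w = [1, 0], the first two instances
# are classified correctly and the third is misclassified.
w = np.array([1.0, 0.0])
X = np.array([[2.0, 1.0], [-1.0, 1.0], [-0.5, 1.0]])
y = np.array([1, -1, 1])
print(perceptron_loss(w, X, y))  # only the third contributes: 0.5
```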

## Training the perceptron

Using this perceptron criterion as the loss function, learning proceeds by stochastic gradient descent, one instance at a time. If a training instance $(\vx_i, y_i)$ is misclassified, the weight vector is updated to rectify that misclassification. The update rule is

\begin{align} \vw_{(t+1)} &= \vw_{(t)} - \eta \nabla \mathcal{L}_{\text{perceptron}} (\vw_{(t)}) \\\\ &= \vw_{(t)} + \eta \vx_i y_i \end{align}

That is a simple update rule!
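The full training loop is just this update applied one instance at a time, until an epoch passes with no mistakes. A minimal NumPy sketch under the bias-folding convention (the function name, stopping rule, and toy data are illustrative):

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=100):
    # X: instances as rows, with a constant 1 appended for the bias.
    # y: +/-1 labels. eta: learning rate.
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # Treat a zero margin as a mistake so the all-zero start updates.
            if y_i * np.dot(w, x_i) <= 0:
                w = w + eta * y_i * x_i   # update: w <- w + eta * y_i * x_i
                mistakes += 1
        if mistakes == 0:
            break                          # every training point separated
    return w

# Hypothetical linearly separable toy data (last column is the bias input).
X = np.array([[2.0, 2.0, 1.0], [1.5, 3.0, 1.0],
              [-1.0, -1.5, 1.0], [-2.0, -0.5, 1.0]])
y = np.array([1, 1, -1, -1])
w = train_perceptron(X, y)
assert all(np.sign(X @ w) == y)  # all four points correctly classified
```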

## Training demo

Let us fit a Perceptron model to some training data.

Being a linear classifier, the trained model comfortably separates linearly separable classes. However, the model is ineffective when the classes are not linearly separable.

## Perceptron convergence theorem

Although simple, the update rule for learning the perceptron comes with performance guarantees in the linearly separable case. The perceptron convergence theorem proves that if the data is linearly separable,

• the perceptron learning algorithm is guaranteed to find an exact solution,
• the learning algorithm will require only a finite number of steps to find it.

That being said, the finite number of steps could still be large. Moreover, an exact solution means any solution that perfectly separates the training examples of the two classes, not necessarily an optimal one, as we will see in the case of more evolved regularized models such as the support vector machine.

## Dealing with feature types

Note that the predictive model involves a simple dot product between the weight vector $\vw$ and the instance $\vx$. This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, a direct dot product with the weight vector is not meaningful. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
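As a sketch, one-hot encoding maps each categorical value to a binary indicator vector over the fixed set of categories. The helper below is illustrative; in practice, libraries such as scikit-learn's `OneHotEncoder` handle this:

```python
def one_hot(value, categories):
    # One indicator per category: 1.0 at the matching position, 0.0 elsewhere.
    return [1.0 if value == c else 0.0 for c in categories]

colors = ["red", "green", "blue"]   # hypothetical categorical feature
print(one_hot("green", colors))     # prints [0.0, 1.0, 0.0]
```

The resulting indicators are ordinary real-valued inputs, so the dot product $\vw^T \vx$ applies unchanged.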