Introduction
The perceptron learning algorithm of Frank Rosenblatt is an important precursor to modern-day neural networks. It is a linear discriminative binary classifier.
To understand the Perceptron classifier, we recommend familiarity with the prerequisite concepts covered in the linked articles. Follow those links first to get acquainted with the corresponding concepts.
In classification, the goal of the predictive model is to identify the class that generated a particular instance.
Consider such an instance \( \vx \in \real^N \), a vector consisting of \( N \) features, \(\vx = [x_1, x_2, \ldots, x_N] \).
We need to assign it to one of the \( 2 \) classes depending on the values of the \( N \) features.
The predictive model of the perceptron is
\begin{equation} \yhat = f \left(\vw^T \vx + b \right) \label{eqn:class-pred} \end{equation}
where \( \vw \in \real^N \) is the weight vector, \( b \) is the bias, and \( f \) is the step activation function defined below.
It is common in the literature to include the bias term in the weights and concatenate a corresponding element \( 1 \) to the input \( \vx \). For simplicity, we will follow that convention and henceforth, write \( \vw^T \vx \) to mean the full expanded form \( \vw^T \vx + b \).
\begin{equation} f(a) = \begin{cases} +1, ~~~~a \ge 0, \\\\ -1, ~~~~a < 0. \end{cases} \end{equation}
Owing to this form of the step activation function, without loss of generality, the target binary class labels for the perceptron model are encoded as \( +1 \) and \( -1 \), instead of the more typical \( 1 \) and \( 0 \).
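To make the prediction rule and the \( \pm 1 \) encoding concrete, here is a minimal NumPy sketch; the function names and example values are our own and not part of the original formulation.

```python
import numpy as np

def step(a):
    """Step activation: +1 if a >= 0, else -1."""
    return 1 if a >= 0 else -1

def predict(w, x):
    """Perceptron prediction f(w^T x), with the bias absorbed into w.

    w: weight vector of length N + 1, whose first element plays the role of b
    x: raw feature vector of length N
    """
    x_aug = np.concatenate(([1.0], x))  # prepend the constant 1 for the bias term
    return step(w @ x_aug)

# A hypothetical 2-feature instance.
w = np.array([-0.5, 1.0, 2.0])   # [b, w_1, w_2]
x = np.array([0.3, 0.4])
print(predict(w, x))             # +1, since -0.5 + 1.0*0.3 + 2.0*0.4 = 0.6 >= 0
```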
Now, let us try to understand the effect of changing the weight vector \( \vw \) and the bias \( b \) on the predictive model of the Perceptron classifier.
If you have tried the interactive demo of the predictive model in the previous section, you should have noticed a few things: the weight vector \( \vw \) controls the orientation of the decision boundary, since it is normal to the separating hyperplane, while the bias \( b \) shifts the boundary away from the origin.
As with most machine learning models, the weights are fit by minimizing a loss function. Recall, however, that the perceptron is a precursor to the more evolved neural networks and deep learning models of recent times; the popular loss functions of today, such as those based on negative log-likelihood or cross-entropy, were not yet well studied at the time.
A potential loss function for the perceptron is the total number of misclassified examples. Although simple to compute, it poses a major challenge for optimization: the loss is a piecewise constant function of \( \vw \), with discontinuities wherever a minor change in \( \vw \) moves the decision boundary across an instance. Gradient-based optimization methods therefore cannot be applied, because the gradient is zero almost everywhere.
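To see why this count-based loss resists gradient methods, consider a direct sketch of it; the array shapes and names here are assumptions on our part.

```python
import numpy as np

def misclassification_count(w, X, y):
    """Number of misclassified examples for weight vector w.

    X: (M, N+1) matrix of bias-augmented instances; y: (M,) labels in {+1, -1}.
    This loss is piecewise constant in w: small perturbations of w usually
    change nothing, so its gradient is zero almost everywhere.
    """
    preds = np.where(X @ w >= 0, 1, -1)
    return int(np.sum(preds != y))
```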
Instead, Rosenblatt came up with a criterion that the weights should obey. Note that for a correctly classified positive instance \( p \), it is the case that \( y_p = +1 \) and \( f(\vw^T \vx_p) \ge 0 \). Similarly, for a correctly classified negative instance \( n \), it is the case that \( y_n = -1 \) and \( f(\vw^T \vx_n) < 0 \).
This means that for any correctly classified instance \( i \),
$$ f(\vw^T \vx_i) y_i \ge 0 $$
Moreover, since the activation function is effectively just the signum function, so that \( f(a) \) always has the same sign as \( a \), it is also the case that
$$ \vw^T \vx_i y_i \ge 0 $$
Thus, if the prediction \( f(\vw^T \vx_i) \) does not match the true label \( y_i \), the above inequality is violated, that is, \( \vw^T \vx_i y_i < 0 \). Rosenblatt used this observation to define the perceptron criterion as
$$ \mathcal{L}_{\text{perceptron}} (\vw) = -\sum_{i \in \mathcal{M}_{\vw}} \vw^T \vx_i y_i $$
where \( \mathcal{M}_{\vw} \) denotes the set of all instances that are misclassified with the weight vector \( \vw \). Note the negative sign: since \( \vw^T \vx_i y_i < 0 \) for every misclassified instance, each term contributes a positive amount, and the worse the misclassifications, the higher the loss. This makes it a suitable loss function to minimize.
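A direct translation of the perceptron criterion into code might look like the following sketch; the array layout is an assumption.

```python
import numpy as np

def perceptron_loss(w, X, y):
    """Perceptron criterion: -sum over misclassified i of (w^T x_i) y_i.

    X: (M, N+1) bias-augmented instances; y: (M,) labels in {+1, -1}.
    """
    scores = X @ w                          # w^T x_i for every instance
    preds = np.where(scores >= 0, 1, -1)    # f(w^T x_i)
    misclassified = preds != y              # the set M_w
    return -np.sum(scores[misclassified] * y[misclassified])
```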
Using this perceptron criterion as the loss function, learning proceeds by stochastic gradient descent, one instance at a time. If a training instance \( (\vx_i, y_i) \) is misclassified, the weight vector is updated to rectify that misclassification. The update rule is
\begin{align} \vw_{(t+1)} &= \vw_{(t)} - \eta \nabla \mathcal{L}_{\text{perceptron}} (\vw) \\\\ &= \vw_{(t)} + \eta \vx_i y_i \end{align}
where \( \eta \) is the learning rate. That is a simple update rule!
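Putting this update rule into a one-instance-at-a-time training loop gives something like the sketch below; the learning rate, epoch count, and function name are our own choices.

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, epochs=100, seed=0):
    """Train a perceptron with the Rosenblatt update rule.

    X: (M, N) raw instances; y: (M,) labels in {+1, -1}.
    Returns a bias-augmented weight vector of length N + 1.
    """
    rng = np.random.default_rng(seed)
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend 1 for the bias
    w = np.zeros(X_aug.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(X_aug.shape[0]):
            pred = 1 if w @ X_aug[i] >= 0 else -1
            if pred != y[i]:
                w = w + eta * y[i] * X_aug[i]           # the perceptron update
    return w
```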
Let us fit a Perceptron model to some training data.
Being a linear classifier, the trained model comfortably separates linearly separable classes. However, it is ineffective when the classes are not linearly separable.
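As one way to see this in practice, the sketch below fits scikit-learn's `Perceptron` on a linearly separable dataset and on an XOR-like dataset; the data and hyperparameters are illustrative choices, not from the original text.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import Perceptron

# Linearly separable case: two well-separated Gaussian blobs.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=1.0, random_state=0)
clf = Perceptron(max_iter=1000, random_state=0).fit(X, y)
print("separable accuracy:", clf.score(X, y))      # typically 1.0

# Non-linearly separable case (XOR-like): no single hyperplane works.
rng = np.random.default_rng(0)
X2 = rng.uniform(-1, 1, size=(200, 2))
y2 = np.where(X2[:, 0] * X2[:, 1] > 0, 1, -1)
clf2 = Perceptron(max_iter=1000, random_state=0).fit(X2, y2)
print("XOR-like accuracy:", clf2.score(X2, y2))    # well below 1.0
```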
Although simple, the update rule for learning the perceptron has performance guarantees in the linearly separable case. The perceptron convergence theorem proved that if the training data is linearly separable, the algorithm is guaranteed to find a weight vector that separates the two classes, in a finite number of steps.
That being said, the finite number of steps could still be very large. Moreover, the algorithm finds an exact solution, meaning any solution that perfectly separates the training examples of the two classes, not necessarily the optimal one, as we will see in the case of more evolved regularized models such as the support vector machine.
Note that the predictive model involves a simple dot product between the weight vector \( \vw \) and the instance \( \vx \). This is easy for binary and continuous features since both can be treated as real-valued features.
In the case of categorical features, a direct dot product is not meaningful. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation.
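A minimal NumPy sketch of one-hot encoding a categorical feature follows; the feature values here are hypothetical.

```python
import numpy as np

# A hypothetical categorical feature with three levels.
colors = np.array(["red", "green", "blue", "green"])

# Map each category to an index, then to a binary indicator (one-hot) vector.
categories, indices = np.unique(colors, return_inverse=True)
X_binary = np.eye(len(categories))[indices]

print(categories)   # ['blue' 'green' 'red']
print(X_binary)     # one row per instance, one binary column per category
```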