Logistic regression


Introduction

The logistic regression classifier is a discriminative model for binary classification. Before the advent of deep learning and its easy-to-use libraries, the logistic regression classifier was one of the most widely deployed classifiers for machine learning applications. It is still widely used for linear classification, especially in the medical domain.

Prerequisites

To understand the logistic regression classifier, we recommend familiarity with the concepts of probability, maximum likelihood estimation, and gradient descent.

Please get acquainted with these prerequisite concepts first.

Problem setting

In classification, the goal of the predictive model is to identify the class that generated a particular instance.

Consider such an instance \( \vx \in \real^N \), a vector consisting of \( N \) features, \(\vx = [x_1, x_2, \ldots, x_N] \).

We need to assign it to one of the two classes \( C_1 \) and \( C_2 \) depending on the values of its \( N \) features.

Predictive model

Logistic regression uses a special form for the predictive model. It defines the probability of the first class as

$$ P(C_1|\vx) = \sigma(\vw^T \vx + b) $$

where \( \vw \in \real^N \) and \( b \in \real \) are the parameters of the logistic regression classifier. The parameter \( \vw \) is also known as the weights of the model. In the above equation, \( \sigma \) is a special function known as the logistic sigmoid function. We will describe its nature shortly.

Since this is a binary classification setting, the probability of belonging to the other class is

$$ P(C_2|\vx) = 1 - P(C_1|\vx) $$

Thus, the predicted class label \( \yhat \) is the class with the higher probability.

$$ \yhat = \argmax_{c \in \set{C_1,C_2}} P(C=c | \vx) $$
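To make the predictive model concrete, here is a minimal sketch, assuming NumPy; the weight vector, bias, and instance values below are made up purely for illustration.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid function."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(x, w, b):
    """P(C1 | x) = sigma(w^T x + b)."""
    return sigmoid(np.dot(w, x) + b)

def predict_class(x, w, b):
    """Return the class with the higher probability."""
    p_c1 = predict_proba(x, w, b)
    return "C1" if p_c1 >= 0.5 else "C2"   # P(C2|x) = 1 - P(C1|x)

# Made-up parameter values and instance, for illustration only
w = np.array([0.5, -1.2])
b = 0.1
x = np.array([2.0, 1.0])
print(predict_proba(x, w, b))   # P(C1 | x)
print(predict_class(x, w, b))
```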

Logistic sigmoid function

The logistic regression name comes from the logistic sigmoid function used in the predictive model. The sigmoid function is also quite commonly used as an activation function in neural networks.

The sigmoid function is defined as

$$ \sigma(a) = \frac{1}{1 + \expe{-a}} $$

In the case of logistic regression, it would be

$$ \sigma(\vw^T\vx + b) = \frac{1}{1 + \expe{-(\vw^T \vx + b)}} $$
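A direct evaluation of \( \expe{-a} \) can overflow for inputs of large magnitude. The sketch below, assuming NumPy, evaluates the sigmoid in a numerically stable way; the branching on the sign of the input is purely an implementation detail, not part of the model.

```python
import numpy as np

def sigmoid_stable(a):
    """Numerically stable logistic sigmoid."""
    a = np.asarray(a, dtype=float)
    out = np.empty_like(a)
    pos = a >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-a[pos]))
    # For negative inputs, rewrite sigma(a) = exp(a) / (1 + exp(a))
    exp_a = np.exp(a[~pos])
    out[~pos] = exp_a / (1.0 + exp_a)
    return out

# Large-magnitude inputs do not overflow: prints approximately [0. 0.5 1.]
print(sigmoid_stable([-1000.0, 0.0, 1000.0]))
```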

Nature of the sigmoid function

Note that the predictive model uses the sigmoid function to compute a probability. This is possible because the output of the sigmoid function lies in the range \( (0,1) \).

The sigmoid function output is \( 0.5 \) only when its input is \( 0 \). For positive inputs, the sigmoid returns values in the range \( (0.5, 1) \). For negative inputs, the sigmoid returns values in the range \( (0, 0.5) \).

Nature of the predictive model

Now, let us try to understand the effect of changing the weight vector \( \vw \) and the bias \( b \) on the predictive model.

Interactive demo: the weight vector is \( \vw = [w_1, w_2] \). Drag the circle to change the vector and observe the effect on the classification model \( \sigma(\vw^T\vx + b) \).

Observations about the predictive model

If you have tried the interactive demo of the predictive model in the previous section, you should note a few things.

  • The weight vector \( \vw \) is always perpendicular to the decision boundary, the so-called separating hyperplane between the orange and blue classes. Thus, a rotation of the weight vector results in a corresponding rotation of the decision boundary.
  • The weight vector \( \vw \) points in the direction of increasing value of the function \( \sigma (\vw^T\vx + b) \).
  • Scaling up the magnitude of the weight vector \( \vw \) and the bias \( b \) by the same positive factor (dragging the vector in the same direction, away from the origin) does not change the decision boundary; it only makes the sigmoid transition across the boundary sharper (see the sketch after this list). This intuition plays a key role in regularization, as we shall see in norm-based regularization in machine learning.
  • The bias term \( b \) has the net effect of sliding the decision boundary away from the origin. When \( b = 0 \), the decision boundary passes through the origin.
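The scaling observation above can be verified numerically. Below is a small sketch, assuming NumPy and made-up values for \( \vw \), \( b \), and \( \vx \): scaling both parameters by the same positive factor leaves the predicted class unchanged and only pushes the probability towards \( 0 \) or \( 1 \).

```python
import numpy as np

def proba(x, w, b):
    """P(C1 | x) under the logistic regression model."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

w, b = np.array([1.0, -2.0]), 0.5     # made-up parameters
x = np.array([0.3, 0.7])              # an arbitrary test point

for k in [1.0, 3.0, 10.0]:
    p = proba(x, k * w, k * b)
    # The predicted class (p >= 0.5 or not) is the same for every k > 0,
    # because sign(k * (w^T x + b)) = sign(w^T x + b).
    print(f"scale {k:4.1f}: P(C1|x) = {p:.4f}")
```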

Training logistic regression

Training a logistic regression classifier involves discovering suitable values for the parameters — \( \vw \) and \( b \).

The parameters are optimized to maximize the likelihood of the observed data, the training set, via maximum likelihood estimation (MLE).

Suppose \( \labeledset = \set{(\vx_1, y_1), \ldots, (\vx_\nlabeled, y_\nlabeled)} \) denotes the training set consisting of \( \nlabeled \) training instances. Assume that \( y_\nlabeledsmall = 1 \) if \( \vx_\nlabeledsmall \) belongs to the class \( C_1 \), and zero if it belongs to the class \( C_2 \). The likelihood in the case of logistic regression is

$$ P(\labeledset|\vw) = \prod_{\nlabeledsmall=1}^\nlabeled P(C_1|\vx_\nlabeledsmall)^{y_\nlabeledsmall} \left(1 - P(C_1|\vx_\nlabeledsmall)\right)^{1 - y_\nlabeledsmall} $$

Note that the term \(P(C_1|\vx_\nlabeledsmall)^{y_\nlabeledsmall} \) is activated if the instance \( \vx_\nlabeledsmall \) belongs to the class \( C_1 \), since \(y_\nlabeledsmall = 1 \) in that case.

The latter term \( \left(1 - P(C_1|\vx_\nlabeledsmall)\right)^{1 - y_\nlabeledsmall} \) is activated when the training instance \( \vx_\nlabeledsmall \) belongs to the class \(C_2\), since \(y_\nlabeledsmall = 0 \) in that case.

With the likelihood function defined this way, the training proceeds by MLE using standard optimization approaches. In particular, for logistic regression there is no closed-form solution to the MLE optimization problem. Hence, iterative approaches such as minibatch stochastic gradient descent may be used.
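In practice, the optimization minimizes the negative log-likelihood, whose gradient has a simple closed form even though its minimizer does not. The sketch below, assuming NumPy, computes both; X and y are hypothetical names for the stacked training instances (an \( \nlabeled \times N \) matrix) and their \( 0/1 \) labels.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def negative_log_likelihood(w, b, X, y):
    """Negative log of the likelihood defined above."""
    p = sigmoid(X @ w + b)          # P(C1 | x_l) for every training instance
    eps = 1e-12                     # guards against log(0)
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradients(w, b, X, y):
    """Gradient of the negative log-likelihood w.r.t. w and b."""
    p = sigmoid(X @ w + b)
    err = p - y                     # per-instance error signal
    return X.T @ err, np.sum(err)
```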

Training demo

Let us fit a logistic regression model to some training data. Since the model parameters do not have a closed-form solution, we use an iterative approach, minibatch SGD, to minimize the negative log-likelihood.

Interactive demo: fitting logistic regression to training data, with the training accuracy plotted per training epoch.
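The demo can be approximated with a short self-contained script. The sketch below generates made-up synthetic data (one Gaussian blob per class, a linearly separable setting), runs minibatch SGD on the negative log-likelihood with hand-picked hyperparameters, and prints the training accuracy after every epoch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set: class C1 (label 1) and class C2 (label 0)
X = np.vstack([rng.normal([2.0, 2.0], 1.0, size=(100, 2)),
               rng.normal([-2.0, -2.0], 1.0, size=(100, 2))])
y = np.concatenate([np.ones(100), np.zeros(100)])

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w, b = np.zeros(2), 0.0
lr, batch_size, epochs = 0.1, 20, 10    # hand-picked hyperparameters

for epoch in range(epochs):
    order = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        err = sigmoid(Xb @ w + b) - yb        # per-instance gradient signal of the NLL
        w -= lr * Xb.T @ err / len(idx)
        b -= lr * np.sum(err) / len(idx)
    accuracy = np.mean((sigmoid(X @ w + b) >= 0.5) == y)
    print(f"epoch {epoch + 1:2d}: training accuracy = {accuracy:.3f}")
```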

Being a linear classifier, the trained model comfortably separates linearly separable classes. However, it is ineffective when the classes are not linearly separable.

Dealing with feature types

Note that the predictive model involves a dot product of the weight vector \( \vw \) and the instance vector \( \vx \). This is easy for binary and continuous features since both can be treated as real-valued features.

In the case of categorical features, a direct dot product with the weight vector is not meaningful. Therefore, we need to first preprocess the categorical variables using one-hot encoding to arrive at a binary feature representation, as shown in the sketch below.
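Here is a minimal sketch of that preprocessing step, assuming plain NumPy; the category names are made up for illustration.

```python
import numpy as np

categories = ["red", "green", "blue"]            # made-up category values
index = {c: i for i, c in enumerate(categories)}

def one_hot(value):
    """Map a categorical value to a binary indicator vector."""
    vec = np.zeros(len(categories))
    vec[index[value]] = 1.0
    return vec

# "green" becomes [0., 1., 0.]; the indicator features can then be
# concatenated with the continuous features before the dot product with w.
print(one_hot("green"))
```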
