Logistic Regression

3 minute read

Author: Varun Sundar

Introduction

A particular case of the discriminative approach to building a probabilistic classifier (fitting $p(y \mid x)$ directly, where $x$ is the feature vector and $y$ is the prediction), logistic regression (the name is a misnomer; it is a classification method) assumes the model is linear in its parameters.

Model Specification

Logistic regression corresponds to the following binary classification model:

$$ p(y = 1 \mid x; \theta, b) = \mathrm{sigm}(\theta^T x + b) $$

often simplified to

$$ p(y = 1 \mid x; \theta) = h_\theta(x) = \mathrm{sigm}(\theta^T x) $$

by absorbing the bias into $\theta$ (using a constant feature $x_0 = 1$).

Here $h_\theta(x) = \mathrm{sigm}(\theta^T x)$ is the hypothesis function.

For instance, in image classification, $x$ could be the vector of pixel intensities of an image and $y$ the class label.

Call

$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$

the sigmoid (logistic) function. Notice that it is a bounded, one-one function, mapping $\mathbb{R}$ onto $(0, 1)$.

import matplotlib.pyplot as plt
import numpy as np

def sigmoid(z):
    return 1/(1+np.exp(-z))

x=np.arange(-10,10,0.010)

plt.plot(x,sigmoid(x),color='red',lw=2)
plt.grid()
plt.show()

[Figure: plot of the sigmoid function over (-10, 10)]

The derivative of the sigmoid is $\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$.
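
As a quick sanity check (my addition, reusing the numpy import and sigmoid defined above), this identity can be verified numerically with a central finite difference:

# Compare the analytic derivative sigma(z)*(1 - sigma(z)) against a finite-difference estimate
z = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))
print(np.allclose(numeric, analytic, atol=1e-6))   # True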

Cost Function

Just as in linear regression, where the least-squares cost function arises (among other derivations) from maximum likelihood under a Gaussian noise assumption, here too maximum likelihood can be used to fit the parameters.

Let us assume that

$$ p(y = 1 \mid x; \theta) = h_\theta(x), \qquad p(y = 0 \mid x; \theta) = 1 - h_\theta(x) $$

Or, in effect,

$$ p(y \mid x; \theta) = \left( h_\theta(x) \right)^{y} \left( 1 - h_\theta(x) \right)^{1 - y} $$

So the likelihood (under the Independently and Identically Distributed, or IID, assumption) is

$$ L(\theta) = \prod_{i=1}^{m} \left( h_\theta(x^{(i)}) \right)^{y^{(i)}} \left( 1 - h_\theta(x^{(i)}) \right)^{1 - y^{(i)}} $$

which we want to maximise.

Equivalently, we may maximise the log-likelihood,

$$ \ell(\theta) = \log L(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] $$

In standard practice, to reframe this as a convex optimisation problem (exercise: prove that $\ell(\theta)$ is concave with respect to $\theta$), we take the negative of the log-likelihood as the cost function.
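
Written out explicitly (a reconstruction consistent with the definitions above, not the original rendered equation), the resulting cost function to be minimised is

$$ J(\theta) = -\ell(\theta) = -\sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left( 1 - y^{(i)} \right) \log\left( 1 - h_\theta(x^{(i)}) \right) \right] $$

and minimising $J(\theta)$ is equivalent to maximising $\ell(\theta)$.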

To arrive at the maximiser $\hat{\theta} = \arg\max_\theta \ell(\theta)$, we may use gradient ascent,

$$ \theta := \theta + \alpha \nabla_\theta \ell(\theta) $$

Notice that, for a single training example $(x, y)$, using $\sigma'(z) = \sigma(z)(1 - \sigma(z))$,

$$ \frac{\partial}{\partial \theta_j} \ell(\theta) = \left( y - h_\theta(x) \right) x_j $$

This implies the (stochastic) gradient ascent rule

$$ \theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)} $$

which looks similar in form to the LMS update rule for linear regression.

To obtain the perceptron learning algorithm from the above, we simply replace the sigmoid with a hard threshold at zero, i.e. $g(z) = 1$ if $z \geq 0$ and $0$ otherwise, while keeping the same update rule.

Also, instead of gradient ascent, the generalised Newton-Raphson method may be used.

Note that despite faster convergence per iteration, each iteration can be computationally costly, owing to the step of inverting (or solving a linear system with) the Hessian matrix. Applied to logistic regression, this method is also known as Fisher scoring.
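
As an illustration (my own sketch, not from the original post), one Newton-Raphson step for this objective could be written as follows, where X is the design matrix, y the vector of 0/1 labels, theta the current parameter vector, and sigmoid is as defined above:

def newton_step(X, y, theta):
    p = sigmoid(np.dot(X, theta))        # predicted probabilities
    gradient = np.dot(X.T, y - p)        # gradient of the log-likelihood
    W = p * (1 - p)                      # diagonal of the weight matrix
    hessian = -np.dot(X.T * W, X)        # Hessian of the log-likelihood, -X^T W X
    # theta_new = theta - H^{-1} gradient; solve the linear system rather than inverting H
    return theta - np.linalg.solve(hessian, gradient)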

Implementation in Scikit

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X, y = iris.data[:-1, :], iris.target[:-1]   # hold out the last sample

logistic = LogisticRegression()
logistic.fit(X, y)
print('Predicted class %s, real class %s' % (
    logistic.predict(iris.data[-1, :].reshape(1, -1)), iris.target[-1]))
print('Probabilities for each class from 0 to 2: %s'
    % logistic.predict_proba(iris.data[-1, :].reshape(1, -1)))
Predicted class [2], real class 2
Probabilities for each class from 0 to 2: [[ 0.00168787  0.28720074  0.71111138]]
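
As an aside (my addition, not part of the original snippet), a held-out estimate of accuracy could be obtained with scikit-learn's train_test_split; a minimal sketch:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
# Hold out 20% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print('Held-out accuracy: %.3f' % clf.score(X_test, y_test))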

Implementation in Raw Python: an Example

Random Data

I shall draw the two classes from multivariate normal distributions, plot them, and then run logistic regression on the result.

np.random.seed(10)
num_observations = 5000

#numpy.random.multivariate_normal(mean, cov[, size, check_valid, tol])

x1 = np.random.multivariate_normal([0, 0], [[1, .75],[.75, 1]], num_observations)
x2 = np.random.multivariate_normal([1, 5], [[1, .75],[.75, 1]], num_observations)

simulated_separableish_features = np.vstack((x1, x2)).astype(np.float32)
simulated_labels = np.hstack((np.zeros(num_observations),
                              np.ones(num_observations)))

plt.figure(figsize=(12,12))
plt.scatter(simulated_separableish_features[:, 0], simulated_separableish_features[:, 1],
            c = simulated_labels, alpha = .4)
plt.show()

[Figure: scatter plot of the two simulated classes]

Since the sigmoid function was already defined above, we can continue with the log-likelihood definition. The gradient update can be done in-line, owing to its simplicity.

One immediately apparent drawback is the speed, which is poor. Scikit-learn's implementation is highly optimised and certainly advisable in practice.

def log_likelihood(features, target, weights): ## x is features, theta is weights
    scores = np.dot(features, weights)  ##theta^T x
    likelihood = np.sum( target*scores - np.log(1 + np.exp(scores)) )
    return likelihood

def logistic_regression(features, target, num_steps, learning_rate, add_intercept = False):
    if add_intercept:
        intercept = np.ones((features.shape[0], 1))
        features = np.hstack((intercept, features))

    weights = np.zeros(features.shape[1])

    for step in range(num_steps):
        scores = np.dot(features, weights)
        predictions = sigmoid(scores)

        # Update weights with gradient
        output_error_signal = target - predictions
        gradient = np.dot(features.T, output_error_signal)
        weights += learning_rate * gradient

        # Print log-likelihood every 10000 steps
        if step % 10000 == 0:
            print (log_likelihood(features, target, weights))

    return weights

weights = logistic_regression(simulated_separableish_features, simulated_labels,
                     num_steps = 300000, learning_rate = 5e-5, add_intercept=True)

weights

-3998.81900443
-23.5309877516
-17.1770614174
-14.5269037733
-12.9842268935
-11.9427274266
-11.1774889491
-10.583569443
-10.104598614
-9.70723840557
-9.37033855207
-9.07974239698
-8.82556511211
-8.60066424819
-8.39973077754
-8.21872348452
-8.05450321516
-7.90458851753
-7.76698773522
-7.64008078293
-7.52253410072
-7.41323830466
-7.31126169798
-7.21581507865
-7.1262247319
-7.04191144547
-6.96237401947
-6.88717617369
-6.81593605338
-6.74831774398

array([-21.14270339,  -5.6457276 ,   9.6399825 ])
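
As a cross-check (my addition), these weights can be compared with scikit-learn fit on the same data; with very weak regularisation (a large C), its solution approximates plain maximum likelihood, so the intercept and coefficients should come out roughly comparable:

from sklearn.linear_model import LogisticRegression

# Large C => negligible regularisation, approximating unregularised maximum likelihood
clf = LogisticRegression(C=1e15, max_iter=10000)
clf.fit(simulated_separableish_features, simulated_labels)

print(clf.intercept_, clf.coef_)   # compare with the weights found above
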
# Predictions

# append the intercept (bias) column to the X (design) matrix
data = np.hstack((np.ones((simulated_separableish_features.shape[0], 1)),
                                 simulated_separableish_features))
final_scores = np.dot(data, weights)
preds = np.round(sigmoid(final_scores))

preds

array([ 0.,  0.,  0., ...,  1.,  1.,  1.])

Since I haven't partitioned the data into a test set (bad practice, admittedly), and since the aim here was just to separate the two classes linearly and view the algorithm's classification, we'll evaluate on the training data.

The figure below colour-codes each point by whether its prediction was correct; only a couple of points are misclassified.

# Accuracy on training set

print ('Accuracy : {0}'.format((preds == simulated_labels).sum().astype(float) / len(preds)))

Accuracy : 0.9998
plt.figure(figsize = (12, 8))
plt.scatter(simulated_separableish_features[:, 0], simulated_separableish_features[:, 1],
            c = (preds == simulated_labels), alpha = .8, s = 50)

plt.show()

[Figure: training points coloured by whether they were classified correctly]

References

  1. CS229, Stanford Machine Learning: http://cs229.stanford.edu/
  2. CS231n, Convolutional Neural Networks for Visual Recognition: http://cs231n.github.io/
  3. Machine Learning: A Probabilistic Perspective, Kevin P. Murphy, 1st Ed.
