Logistic Regression

Math behind the Algorithm

Logistic Regression is used to predict the probability that an instance belongs to a certain class. It is a binary classifier (two classes), although it can be extended with the softmax function to classify any number of classes.

Just like Linear Regression, it computes a weighted sum of the input features plus a bias term. The difference from Linear Regression is that this sum is passed through the sigmoid function to produce a probability between 0 and 1. Below is the Logistic Regression equation:

$$ \hat{p} = h_{\mathbf{\theta}}(\mathbf{x}) = \sigma(\mathbf{\theta}^T \cdot \mathbf{x}) $$


$ \sigma $ is called the logistic function; it is a sigmoid function that squashes any real-valued input into a number between 0 and 1.

Logistic Function : $ \sigma(t) = \dfrac{1}{1 + \exp(-t)} $
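As a minimal sketch (not taken from the book), the logistic function can be written directly in NumPy; the name `sigmoid` and the sample inputs are just illustrative:

```python
import numpy as np

def sigmoid(t):
    # Logistic (sigmoid) function: maps any real-valued score into (0, 1)
    return 1.0 / (1.0 + np.exp(-t))

print(sigmoid(0.0))   # 0.5 -- the decision threshold
print(sigmoid(4.0))   # ~0.982: large positive scores push the output toward 1
print(sigmoid(-4.0))  # ~0.018: large negative scores push the output toward 0
```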


[Figure: the sigmoid (logistic) function curve. Source: Wikipedia]

The prediction is then made based on $\hat{p}$. If $\hat{p} \geq 0.5$ the model predicts class 1 (also called the positive class); otherwise it predicts class 0. Note that the sigmoid output is greater than 0.5 exactly when the linear score $\mathbf{\theta}^T \cdot \mathbf{x}$ is positive, and less than 0.5 when it is negative. That is why class 1 is called the positive class.
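A sketch of the full prediction step, assuming `theta` already includes the bias term and `x` has a leading 1 to match it (both are hypothetical values, not from the book):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def predict(theta, x):
    # p_hat = sigma(theta^T x); predict the positive class when p_hat >= 0.5,
    # i.e. exactly when the linear score theta^T x is non-negative
    p_hat = sigmoid(theta @ x)
    return 1 if p_hat >= 0.5 else 0

theta = np.array([-1.0, 2.0, 0.5])   # hypothetical weights (first entry = bias)
x = np.array([1.0, 0.8, -0.3])       # instance with a leading 1 for the bias
print(predict(theta, x))             # 1, since theta @ x = 0.45 >= 0
```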

Cost Function

The cost function used for Logistic Regression is called the log loss. The goal is to have a high $ \hat{p} $ when y = 1 (positive class) and a low $ \hat{p} $ when y = 0. Looking at each case (y = 1 and y = 0) shows how the log terms achieve this goal.

[Figure: log loss cost curves for y = 1 and y = 0. Source: cognitree.com]

For the positive case (y=1) the second term of the log loss cancels out, leaving $ -\log\left(\hat{p}^{(i)}\right) $. If $ \hat{p} $ is close to 0 (a wrong guess) the log term is very large, leading to a large cost and a bigger penalty. On the other hand, if $ \hat{p} $ is close to 1, the log term approaches zero, leading to a smaller cost. The same thing happens for the negative case (y=0): the remaining term is $ -\log\left(1 - \hat{p}^{(i)}\right) $, so when $ \hat{p} $ is close to 1 (a wrong guess) the cost is large, and when $ \hat{p} $ is close to 0 the cost is reduced.

$$ J(\mathbf{\theta}) = -\dfrac{1}{m} \sum\limits_{i=1}^{m}{\left[ y^{(i)} \log\left(\hat{p}^{(i)}\right) + (1 - y^{(i)}) \log\left(1 - \hat{p}^{(i)}\right)\right]} $$
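A vectorized sketch of the log loss over a small hypothetical batch; `y` and `p_hat` are made-up values, and a small epsilon guards against taking log(0):

```python
import numpy as np

def log_loss(y, p_hat, eps=1e-12):
    # J(theta) = -(1/m) * sum[ y*log(p_hat) + (1-y)*log(1-p_hat) ]
    p_hat = np.clip(p_hat, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y * np.log(p_hat) + (1 - y) * np.log(1 - p_hat))

y = np.array([1, 0, 1, 0])
good = np.array([0.9, 0.1, 0.8, 0.3])   # mostly confident and correct -> low cost
bad = np.array([0.1, 0.9, 0.2, 0.7])    # confident but wrong -> high cost
print(log_loss(y, good))                # ~0.20
print(log_loss(y, bad))                 # ~1.85
```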

Minimizing the Cost Function

Gradient descent is usually used to minimize the log loss. The partial derivative of the cost function has the same form as the gradient of Linear Regression's MSE cost, except that the prediction is passed through the sigmoid:

$$ \dfrac{\partial}{\partial \theta_j} \text{J}(\mathbf{\theta}) = \dfrac{1}{m}\sum\limits_{i=1}^{m}\left(\mathbf{\sigma(\theta}^T \cdot \mathbf{x}^{(i)}) - y^{(i)}\right)\, x_j^{(i)} $$
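The gradient above can be computed for all parameters at once. A vectorized sketch, assuming `X` is an (m, n) matrix with a leading column of 1s for the bias term:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def gradient(theta, X, y):
    # (1/m) * X^T (sigma(X theta) - y): one partial derivative per parameter
    m = X.shape[0]
    errors = sigmoid(X @ theta) - y
    return (X.T @ errors) / m
```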

The derivation follows from the chain rule together with the identity $ \sigma'(t) = \sigma(t)\left(1 - \sigma(t)\right) $, which makes the derivatives of the log and sigmoid terms cancel into this simple form.

From here, any of the Gradient Descent variants (Batch, Mini-Batch, Stochastic) can be used to minimize the cost function and fit the model.
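Putting the pieces together, here is a sketch of plain Batch Gradient Descent on the log loss. The learning rate, iteration count, and toy data are arbitrary illustrative choices, and the gradient is repeated inline so the snippet is self-contained:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.1, n_iterations=1000):
    # X is (m, n) with a leading column of 1s for the bias term
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iterations):
        gradients = X.T @ (sigmoid(X @ theta) - y) / m  # batch gradient
        theta -= eta * gradients                        # gradient descent step
    return theta

# Toy 1-D problem: instances with larger feature values belong to the positive class
X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, 2.5], [1.0, 3.5]])
y = np.array([0, 0, 1, 1])
theta = fit_logistic(X, y)
print(sigmoid(X @ theta))   # predicted probabilities increase with the feature value
```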

Softmax

The Logistic Regression model can be generalized so that it classifies K classes directly. It produces a probability for each class, where the probabilities sum to 1, and the class with the highest probability is the prediction for that instance.

The first step is that for every instance $ \mathbf{x} $ a softmax score $ s_k(\mathbf{x}) $ is computed for each class k:

$$ s_k(\mathbf{x}) = ({\mathbf{\theta}^{(k)}})^T \cdot \mathbf{x} $$


The next step is to run the softmax scores through the softmax function, which gives the probability that the instance belongs to class k:

$$ \hat{p}_k = \sigma\left(\mathbf{s}(\mathbf{x})\right)_k = \dfrac{\exp\left(s_k(\mathbf{x})\right)}{\sum\limits_{j=1}^{K}{\exp\left(s_j(\mathbf{x})\right)}} $$
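A sketch of both steps: per-class linear scores followed by the softmax function. `Theta` here is a hypothetical (K, n) matrix with one parameter row per class, and the input again carries a leading 1 for the bias:

```python
import numpy as np

def softmax_proba(Theta, x):
    # Step 1: one linear score per class, s_k(x) = (theta^(k))^T x
    scores = Theta @ x
    # Step 2: exponentiate and normalize so the K probabilities sum to 1
    exp_scores = np.exp(scores - scores.max())   # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

Theta = np.array([[ 0.2, -1.0,  0.5],    # class 0 parameters (first entry = bias)
                  [-0.3,  0.8,  0.1],    # class 1 parameters
                  [ 0.1,  0.2, -0.6]])   # class 2 parameters
x = np.array([1.0, 2.0, -1.0])           # instance with a leading 1 for the bias
p = softmax_proba(Theta, x)
print(p, p.sum(), p.argmax())            # probabilities, 1.0, predicted class (1)
```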

Cost Function

Softmax Regression uses the cross entropy cost function. It is the same concept as the log loss used for Logistic Regression, and the two are actually the same equation when K = 2.

$$ J(\mathbf{\Theta}) = - \dfrac{1}{m}\sum\limits_{i=1}^{m}\sum\limits_{k=1}^{K}{y_k^{(i)}\log\left(\hat{p}_k^{(i)}\right)} $$
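A sketch of the cross entropy cost for a small batch of one-hot targets; `Y` and `P_hat` are made-up arrays used only to illustrate the double sum:

```python
import numpy as np

def cross_entropy(Y, P_hat, eps=1e-12):
    # J(Theta) = -(1/m) * sum_i sum_k y_k^(i) * log(p_hat_k^(i))
    # Y is (m, K) one-hot targets, P_hat is (m, K) predicted probabilities
    return -np.mean(np.sum(Y * np.log(np.clip(P_hat, eps, 1.0)), axis=1))

Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1]])
P_hat = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5]])
print(cross_entropy(Y, P_hat))   # ~0.42
```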

Resources:

  1. Hands-On Machine Learning with Scikit-Learn & TensorFlow by Aurélien Géron