Gradient Descent


Gradient Descent is an optimization function that is commonly used to minimize the error of a cost function. It does this by finding weights that best reduce the given error function. A brief overview of the process goes as followed:

  1) Weights are randomly initialized.  

  2) The gradient (gradient points to the direction of steepest
  ascent) of the cost function in relation to each feature weight
  is computed.  This is the gradient vector.  

  3) Weights are updated by substracting the weights by an alpha
  value (a small step towards minimizing the cost function) * the
  gradient vector.

  4) Steps 2 and 3 are repeated until convergence.  

Learning Rate

Alpha is the value that controls how big of a "step" is taken for each iteration. This value is important because if the value is too small then it will take too long to converge but if it is too big then it will never converge as it will jump back and forth over the minimum.

Global and Local Minimum

The MSE cost function has what is called a convex shape (see image below). This means that it only has a global minimum. The starting point of the weights is not that important as the graph shows that it will always make it down the bowl and to the global minimum.

For other problems the gradient shape can be much more complicated with local minimums or saddle ridges. Local minimums are "valleys" in the function that gradient descent may converge to based on where it is initialized. It leads to not having the best weights for the algorithm. Saddle ridges are plataues on the graph that can cause the problem to never converge. Stochastic and Mini Batch Gradient Descent are better at jumping out of these minimums compared to Batch Gradient Descent.

Gradient Descent for Linear Regression

Take linear regression for example. The goal would be to minimize the error of mean squared error (MSE).

$$ MSE(\textbf{X}, h_\theta) = \frac{1}{m} \sum_{i=1}^m (\theta^T \cdot \textbf{x}^{(i)} - y^{(i)})^2 $$

In order to get the gradient of this function some calculus will have to be used. Partial derivatives are what allows us to get the gradient vector and the equation looks as follows:

$$ \dfrac{\partial}{\partial \theta_j} \text{MSE}(\mathbf{\theta}) = \dfrac{2}{m}\sum\limits_{i=1}^{m}(\mathbf{\theta}^T \cdot \mathbf{x}^{(i)} - y^{(i)})\, x_j^{(i)} $$

This is the partial derivative with respect to each feature $\theta_j$. This equation if derived from the MSE equation above using the chain and power rule and it allows to get the gradient of all feature values $\theta_1$ to $\theta_j$.

Another way to write this is with an equation that computes for all of the gradient vectors at once. This is the Batch Gradient Descent implementation as it uses the full training set for its calculations ($\mathbf{X}$ represents the full training set):

$ \nabla_{\mathbf{\theta}}\, \text{MSE}(\mathbf{\theta}) = \begin{pmatrix} \frac{\partial}{\partial \theta_0} \text{MSE}(\mathbf{\theta}) \\ \frac{\partial}{\partial \theta_1} \text{MSE}(\mathbf{\theta}) \\ \vdots \\ \frac{\partial}{\partial \theta_n} \text{MSE}(\mathbf{\theta}) \end{pmatrix} = \dfrac{2}{m} \mathbf{X}^T \cdot (\mathbf{X} \cdot \mathbf{\theta} - \mathbf{y}) $

It should be noted that the only difference between this equation and the above is this equation computes every $\theta_j$ value at once while the equation before each $\theta_j$ is done separately.

The next step is updating the gradients by going the opposite direction of the gradient. This is because the gradient points uphill and the desired direction is downhill. Here is step 3 from the overview section:

$ \mathbf{\theta}^{(\text{next step})} = \mathbf{\theta} - \eta \nabla_{\mathbf{\theta}}\, \text{MSE}(\mathbf{\theta}) $

$\theta$ is all of the feature values which get substracted by the gradient vector $ \nabla_{\mathbf{\theta}}\, \text{MSE}(\mathbf{\theta}) $ multiplied by the size of the step $ \eta $.

Batch Gradient Descent

The above implementation for linear regression is the Batch Gradient Descent algorithm. One problem with it is that because it uses the whole training set it can be very slow for large datasets.

Mini Batch Gradient Descent

Instead of using the whole training Mini Batch uses small random sets of instances in order to compute the gradients which are called mini-batches. Because only a subset of instances are used the gradient doesn't slower get reduced until it reaches the minimum. It will bounce around a bit descreasing on average. This helps with problems when there are local minimums that need to get "jumped" out of and is quicker than Batch Gradient Descent as well. It does not actually stop at the minimum though as it will bounce around it instead. One way around this is to slowly reduce the learning rate so that it will stop bouncing around.

Stochastic Gradient Descent

Stochastic Gradient Descent goes a step further than Mini Batch and will only use one random instance at each step in order to compute the gradients. Just like Mini Batch this leads to jumping around with it decreasing towards the minimum on average. Because it bounces around more than Mini Batch it is even better at jumping out of local minimums and can be used on very large datasets unlike Batch. Reducing there learning slowly is also used for Stochastic Gradient Descent to try and get the optimal minimum.


  1. Hands-On Machine Learning with Scikit-Learn & Tensorflow by Aurlion Geron