### Overview

Gradient Descent is an optimization function that is commonly used to minimize the error of a cost function. It does this by finding weights that best reduce the given error function. A brief overview of the process goes as followed:

  1) Weights are randomly initialized.

ascent) of the cost function in relation to each feature weight
is computed.  This is the gradient vector.

3) Weights are updated by substracting the weights by an alpha
value (a small step towards minimizing the cost function) * the

4) Steps 2 and 3 are repeated until convergence.



### Learning Rate

Alpha is the value that controls how big of a "step" is taken for each iteration. This value is important because if the value is too small then it will take too long to converge but if it is too big then it will never converge as it will jump back and forth over the minimum.

### Global and Local Minimum

The MSE cost function has what is called a convex shape (see image below). This means that it only has a global minimum. The starting point of the weights is not that important as the graph shows that it will always make it down the bowl and to the global minimum. For other problems the gradient shape can be much more complicated with local minimums or saddle ridges. Local minimums are "valleys" in the function that gradient descent may converge to based on where it is initialized. It leads to not having the best weights for the algorithm. Saddle ridges are plataues on the graph that can cause the problem to never converge. Stochastic and Mini Batch Gradient Descent are better at jumping out of these minimums compared to Batch Gradient Descent. ### Gradient Descent for Linear Regression

Take linear regression for example. The goal would be to minimize the error of mean squared error (MSE).

$$MSE(\textbf{X}, h_\theta) = \frac{1}{m} \sum_{i=1}^m (\theta^T \cdot \textbf{x}^{(i)} - y^{(i)})^2$$

In order to get the gradient of this function some calculus will have to be used. Partial derivatives are what allows us to get the gradient vector and the equation looks as follows:

$$\dfrac{\partial}{\partial \theta_j} \text{MSE}(\mathbf{\theta}) = \dfrac{2}{m}\sum\limits_{i=1}^{m}(\mathbf{\theta}^T \cdot \mathbf{x}^{(i)} - y^{(i)})\, x_j^{(i)}$$

This is the partial derivative with respect to each feature $\theta_j$. This equation if derived from the MSE equation above using the chain and power rule and it allows to get the gradient of all feature values $\theta_1$ to $\theta_j$.

Another way to write this is with an equation that computes for all of the gradient vectors at once. This is the Batch Gradient Descent implementation as it uses the full training set for its calculations ($\mathbf{X}$ represents the full training set):

$\nabla_{\mathbf{\theta}}\, \text{MSE}(\mathbf{\theta}) = \begin{pmatrix} \frac{\partial}{\partial \theta_0} \text{MSE}(\mathbf{\theta}) \\ \frac{\partial}{\partial \theta_1} \text{MSE}(\mathbf{\theta}) \\ \vdots \\ \frac{\partial}{\partial \theta_n} \text{MSE}(\mathbf{\theta}) \end{pmatrix} = \dfrac{2}{m} \mathbf{X}^T \cdot (\mathbf{X} \cdot \mathbf{\theta} - \mathbf{y})$

It should be noted that the only difference between this equation and the above is this equation computes every $\theta_j$ value at once while the equation before each $\theta_j$ is done separately.

The next step is updating the gradients by going the opposite direction of the gradient. This is because the gradient points uphill and the desired direction is downhill. Here is step 3 from the overview section:

$\mathbf{\theta}^{(\text{next step})} = \mathbf{\theta} - \eta \nabla_{\mathbf{\theta}}\, \text{MSE}(\mathbf{\theta})$

$\theta$ is all of the feature values which get substracted by the gradient vector $\nabla_{\mathbf{\theta}}\, \text{MSE}(\mathbf{\theta})$ multiplied by the size of the step $\eta$.

The above implementation for linear regression is the Batch Gradient Descent algorithm. One problem with it is that because it uses the whole training set it can be very slow for large datasets.