Hello everyone! Welcome back to the 3rd article on basics of Machine Learning. In this article, we are going to first recap the pre-requisite to Gradient Descent Algorithm(i.e. what we did in the previous article) and then introduce one of the most important algorithm in Machine Learning – Gradient Descent. This algorithm is one of the most used and one of the most powerful algorithm used as practical applications of Machine Learning, such as Stock Prices prediction, or detection of Malignant Breast Cancer.
So, in our previous article we defined our notations for our cost function model by setting up our hypothesis [h(x)=θ+ϕ(x)] and then plugging in the value for the R squared error function,
Where, J (θ0, θ1) is the cost function and m are the number of training examples.
Now, we are going to define first algorithm in Machine Learning- Gradient Descent.
Gradient Descent is a Machine Learning algorithm to minimize the cost function J (θ0, θ1 ).
In layman terms, we minimize the above squared error function by using the basic calculus methods to calculate the parameters:
Okay, what is the above? Let us break down the above into different parts (Divide and Conquer 😊 )
- θj is the general value of the parameter, where j can be 0 for 1 (for the above hypothesis function)
- : = is the assignment operator, which is often used in pseudo-codes in various high-level languages/
- α – This symbol(alpha) is defined as the learning rate in the above equation. This parameter determines how fast or slow we will move towards the optimal weights. We will discuss more about the learning rate and its optimal values to be plugged in later in this article.
- What’s this term? Well, to understand this we need to introduce a little bit of calculus here.
The term involves partial derivative of the cost function J() , which means first differentiating it with respect to θ0 and after that differentiating w.r.t θ1 . Differentiating w.r.t one variable means treating the other variables as constant. More info on partial derivatives can be found here.
But then the question arises how this works? The thing is given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.
Let us quote this with the help of an example using Linear Regression. We had already discussed what regression is. Linear regression deals with the hypothesis function being linear (i.e of the form y = mx + c).
So let us come back to the problem where we want the computer to calculate the best-fit line given a line y = mx + c.
A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how “good” a given line is. This function will take in a (m,c) pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each (x,y) point in our data set and sum the square distances between each point’s y value and the candidate line’s y value (computed at mx + c). It’s conventional to square this distance to ensure that it is positive and to make our error function differentiable. In python, computing the error for a given line will look like (following python code):
Okay, now we start the Gradient Descent. So, now we answer the question – If we minimize this function, we will get the best line for our data, and the best way to minimize is to differentiate.
So, first we compute something called the “Gradient”, which is actually the partial derivative term. This term will always help us moving closer to the optimal location(minimizing the function).
So, now we take our hypothesis function and plug in the data and after that partiuially differentiate w.r.t both and , we repeat this iteration until we find the optimal location.
The code in python is as below:
Now comes the final and the most important part of this algorithm- deciding on the learning rate,α.
Case 1 :If we take alpha very large, Gradient Descent can overshoot the minima and won’t even converge to a local minimum, instead it would diverge, as shown in the following graph.
Case 2: If we take alpha to be very small, then gradient descent will work, but it will proceed very slowly.
Case3 : If we keep the learning rate alpha fixed, then even gradient descent would work with a good amount of accuracy, as it will automatically take smaller steps to move towards the minimal location.
No fixed value of learning rate is generalized, but we always try to keep it between 0 and 1 as far as possible. This isn’t a hard and fast rule and completely depends on the training set and the data.
So i hope now you all have the basic knowledge of the working of the algorithm. Feel free to comment out your doubts. Happy learning 🙂
Latest posts by Ayushmaan Seth (see all)
- The Gradient Descent Algorithm (With Codes) - October 25, 2017
- Prerequisite to Gradient Descent: Model Representation - August 15, 2017
- First step towards – Supervised Learning - July 30, 2017