In this article we will try and cover the basics of supervised learning and develop an intuition of what it really is and how it can be used.
In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
Supervised learning problems are categorized into “regression” and “classification” problems.
In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories.
To make things easier, let us see an easy example. Here is a graph of given data about the size of houses on the real estate market and we try to predict their price.
Now, see that this RAW DATA has already been provided to us. Now a friend of us calls and asks about the price he want to set to sell his house which has size of about 750 feet (as highlighted in green).
So, the RIGHT ANSWERS (that is the raw data corresponding to previous prices against their sizes) is already given and now we want to predict the model for any general X size(X is any positive number).
So now, let’s go ahead and try to make a curve which tries to satisfy the given points on the above graph.
There are many possibilities. Some of them being,
Well, these are just rough sketches, you should feel free to try on your own and guess any curve. Later, we will be learning how to find these curves using efficient and optimized algorithms.
The curve in the blue seems more promising than in the red, but that won’t be the case always!
The point is that we are predicting continuous valued output, like in this case, both our graphs are continuous in their respective domains. Thus, this model of predicting outputs based on previous data in a continuous valued output is called the Regression model of Supervised Learning.
Another model of Supervised Learning is Classification.
In a classification problem, we are instead trying to predict results in a discrete output.
In layman’s language, the data set which is already provided to us is generally discrete, i.e. in form of Yes/No, Numbers, Binary digits, etc.
An example to resolve the terrified above theory would be like this. Consider the following example of people suffering from Breast Cancer Tumours, and we divide them if they are Malignant or not.
The X axis shows the size of the tumour, while the Y axis shows the output.
Here, the red negatives are Non-malignant while the blue positive ones are malignant. The yellow boundary (which is technically called the DECISION BOIUNDARY) shows the division between the two subsets and thus decided that if the person has malignant tumour or Non-Malignant Tumour.
Thus, here we see that those which will be on the left side of the decision boundary will be having Non-malignant tumour and on the right-hand side will be having a malignant tumour (assuming the correctively of the data set and assuming the tumour becomes malignant as the size grows.)
Hence, given a patient with a tumour, predicting whether the tumour is malignant or benign is a problem of Classification model of Supervised Learning, where we are trying to map input variables into discrete categories, or in other words, creating the decision boundary between YES and NO.
Feel free to reach us, post your comments below.
Happy learning 🙂
Latest posts by Ayushmaan Seth (see all)
- The Gradient Descent Algorithm (With Codes) - October 25, 2017
- Prerequisite to Gradient Descent: Model Representation - August 15, 2017
- First step towards – Supervised Learning - July 30, 2017