Hola Data Scientists! Hope you’re having a wonderful day.
Welcome to Part 1 of the “ML Algos from Scratch” series. This series aims to build intuition about machine learning algorithms: how they work, what happens under the hood, and how to implement them in Python.
In this part, we will learn about a very popular method of Supervised Learning called K-Nearest Neighbors (KNN). KNN is an algorithm mainly used for pattern recognition and classification. It is popular because it requires no explicit training step and its logic is easy to understand.
Let’s suppose we have two different groups of people, named Red and Blue, living in the same society.
Now, a new person (let’s name him “Green”) joins the society, and he has to join whichever group is in the majority in his neighborhood.
To decide which group to join, Green calculates the distance between his home (his coordinates) and every other person’s. He then finds the 3 (‘k’) nearest neighbors and checks which group the majority of them belong to.
You can see that out of the 3 nearest neighbors, 2 are “red” and 1 is “blue”. So, “green” joins the “red” group. However, had we chosen the value of k to be 5, there would be 3 “blue” and 2 “red” neighbors, and “green” would have joined the “blue” group.
This is the fundamental idea of KNN: assigning a new data point to one of several previously defined groups on the basis of its proximity to the members of each group. Here, k stands for the number of nearest neighbors that we check. It can be any number, but for two-group problems we usually take ‘k’ as an odd number to avoid a conflict in case both groups get the same number of nearest neighbors!
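The neighbor-voting idea above can be sketched in a few lines of from-scratch Python. The coordinates below are made up purely to mirror the Red/Blue/Green story:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new point to every known point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Majority vote among those k neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy version of the story: each row is a person's home coordinates
X_train = np.array([[1, 1], [2, 1], [1, 2], [6, 6], [7, 7], [6, 7]])
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])

print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> red
```

Green’s home at (2, 2) sits closest to the three “red” homes, so the vote is unanimous and he joins the “red” group.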
Now, let’s learn how to implement this in Python.
First, we import all the necessary modules in our Jupyter (IPython) notebook. Note that sklearn’s old cross_validation module has been replaced by model_selection in recent versions:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors, preprocessing
from sklearn.model_selection import train_test_split
Now, we import our dataset into the notebook. The dataset we are using is the Iris dataset, which is available at https://archive.ics.uci.edu/ml/datasets/Iris (just download the iris.data file from the Data Folder and rename it to iris.data.txt).
NOTE: Open iris.data.txt in a text editor and add the line “sepal_length,sepal_width,petal_length,petal_width,class” (without the quotes) at the beginning of the file. This line will serve as the header for the different columns!
To learn what this dataset is about, I suggest going through this short article before proceeding.
df = pd.read_csv('/Users/mananmanwani/Python Projects/iris.data.txt')  # use the path of your iris.data.txt file
df.head()
After importing, we separate out the features and the labels of our data. We store the features in a NumPy array named X, and the labels in another NumPy array named y.
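A minimal sketch of this separation step; the two rows below are hypothetical stand-ins for the full DataFrame loaded from iris.data.txt:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row stand-in for the Iris DataFrame
df = pd.DataFrame({
    'sepal_length': [5.1, 7.0], 'sepal_width': [3.5, 3.2],
    'petal_length': [1.4, 4.7], 'petal_width': [0.2, 1.4],
    'class': ['Iris-setosa', 'Iris-versicolor'],
})

# X holds the four measurement columns; y holds the species labels
X = np.array(df.drop(columns=['class']))
y = np.array(df['class'])
print(X.shape, y.shape)  # -> (2, 4) (2,)
```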
We use the train_test_split function from sklearn’s model_selection module (formerly cross_validation) to split our data into two parts: the training set, on which our model will be trained, and the testing set, on which our model will be checked for accuracy.
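The split looks roughly like this; the arrays here are placeholders standing in for the X and y built above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(10, 4)   # placeholder features (10 samples, 4 columns)
y = np.array(['a', 'b'] * 5)       # placeholder labels

# Hold out 20% of the samples for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)  # -> (8, 4) (2, 4)
```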
We construct a classifier using the KNN implementation already provided in the sklearn module. The ‘fit’ method trains the classifier on the training data, and the ‘score’ method calculates the accuracy of our model.
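A sketch of this step, using sklearn’s bundled copy of the Iris data (load_iris) so it runs without the CSV file:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = KNeighborsClassifier(n_neighbors=5)
clf.fit(X_train, y_train)             # train on the training split
accuracy = clf.score(X_test, y_test)  # fraction of correct test predictions
print(accuracy)
```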
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform')
Our model is constructed, and now it’s time to use it to differentiate between different flowers on the basis of some new raw data.
We construct a new example that contains the measurements of a flower that is likely to be “Iris virginica” and use the ‘predict’ method to analyse it and find the result.
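A sketch of the prediction step; the four measurements below are hypothetical values chosen to fall in the typical Iris virginica range:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=5).fit(iris.data, iris.target)

# Hypothetical measurements: sepal length/width, petal length/width (cm)
new_flower = np.array([[6.5, 3.0, 5.8, 2.2]])
prediction = iris.target_names[clf.predict(new_flower)][0]
print(prediction)  # -> virginica
```

The long, wide petals (5.8 cm by 2.2 cm) place this sample well inside the virginica cluster, so its nearest neighbors vote accordingly.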
The model provides us with the desired result, and hence we were successfully able to predict (accuracy ~96%) the type of flower!
I hope that you have built an intuition about the working of this algorithm. If you have any doubts, feel free to leave a comment below.
More algorithms coming soon, stay tuned 😀