
ML Algos From Scratch – K-Nearest Neighbors


Hola Data Scientists! Hope you’re having a wonderful day.

Welcome to Part 1 of the “ML Algos from Scratch” series. This series aims to build an intuition for machine learning algorithms: how they work, what happens under the hood, and how to implement them in Python.

In this part, we will learn about a very popular supervised learning method called K-Nearest Neighbors (KNN). KNN is mainly used for pattern recognition and as a classification technique, and it is popular for its low computation time and easy-to-understand logic.

Let’s suppose we have two different groups of people, named Red and Blue, living on a coordinate plane.

Now, a new person (let’s name him “Green”) joins the society, and he has to join whichever group is in the majority in his neighborhood.

To decide which group to join, Green calculates the distance between his home (his coordinates) and every other person’s. He then finds the 3 (‘k’) nearest neighbors and checks which group the majority of those 3 belong to.

You can see that out of the 3 nearest neighbors, 2 are “red” and 1 is “blue”, so “green” joins the “red” group. However, if we had chosen k to be 5, there would be 3 “blue” and 2 “red” neighbors, and “green” would have joined the “blue” group.

This is the fundamental idea of KNN: assigning new data to one of some previously defined groups on the basis of its proximity to each group. Here, k stands for the number of nearest neighbors we check. It can be any number, but we usually take k to be odd to avoid a tie in case two groups get the same number of nearest neighbors! The distance itself is typically the plain Euclidean (straight-line) distance between the two points.
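To make this concrete before we touch any library, here is a minimal from-scratch sketch of the idea (the function names and the toy Red/Blue coordinates are illustrative, not taken from the notebook below):

import numpy as np
from collections import Counter

def euclidean_distance(a, b):
    # Straight-line distance: square root of the sum of squared differences
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def knn_predict(points, groups, query, k=3):
    # Distance from the query point to every known point
    distances = [euclidean_distance(p, query) for p in points]
    # Labels of the k closest points, found by sorting the distances
    nearest_labels = [groups[i] for i in np.argsort(distances)[:k]]
    # Majority vote decides the group
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy version of the Red/Blue story: where does Green at (2, 2) belong?
points = [(1, 1), (2, 3), (3, 3), (6, 6), (7, 7)]
groups = ['red', 'red', 'blue', 'blue', 'blue']
print(knn_predict(points, groups, (2, 2), k=3))  # -> red
print(knn_predict(points, groups, (2, 2), k=5))  # -> blue

Notice how the same query point flips from “red” to “blue” when k changes from 3 to 5, exactly as in the story above.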

Now, let’s learn how to do this with scikit-learn.

First, we import all the necessary modules into our IPython Notebook file!

In [56]:
import numpy as np
import pandas as pd
# Note: sklearn.cross_validation was removed in newer scikit-learn releases;
# train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn import neighbors

Now, we import our dataset into the notebook. The dataset we are using is the Iris dataset, which is available at https://archive.ics.uci.edu/ml/datasets/Iris (just download the iris.data file from the Data Folder and rename it to iris.data.txt).

NOTE: Open iris.data.txt in a text editor and add the line “sepal_length,sepal_width,petal_length,petal_width,class” (without the quotes) at the beginning of the file. This line will act as the header for the columns!
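Alternatively, if you’d rather not edit the file by hand, pandas can attach the column names at load time (a small sketch, assuming the unmodified iris.data file):

import pandas as pd

# header=None tells pandas the file has no header row; names supplies one
columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']
df = pd.read_csv('iris.data.txt', header=None, names=columns)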

To learn what this dataset is about, I suggest you go through this short article before proceeding.

In [57]:
df = pd.read_csv('/Users/mananmanwani/Python Projects/iris.data.txt')  # Replace with the path to your iris.data.txt file!
df.head()
Out[57]:
sepal_length sepal_width petal_length petal_width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa

After importing, we separate out the features and the labels of our data. We store the features in a NumPy array named X, and the labels in another NumPy array named y.

In [58]:
X = np.array(df.drop(['class'], axis=1))  # features: the four measurements
y = np.array(df['class'])                 # labels: the flower type
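As a quick sanity check, the Iris dataset has 150 observations with 4 measurements each, so the shapes should come out as follows:

print(X.shape)  # (150, 4): 150 flowers, 4 features each
print(y.shape)  # (150,): one class label per flower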

We use train_test_split from sklearn’s model_selection module to split our data into two parts: the training part, on which our model will be trained, and the testing part, where our model will be checked for accuracy.

In [59]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
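Note that the split is random, so the accuracy you get below may vary slightly between runs. If you want reproducible results, you can pass a random_state (an optional tweak, not in the original notebook):

# Fixing the seed makes the split (and hence the accuracy) reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)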

We construct a classifier using the KNN implementation already provided in the sklearn module. The ‘fit’ method trains the classifier on the training data, and the ‘score’ method calculates the accuracy of our model.

In [60]:
clf = neighbors.KNeighborsClassifier()  # k defaults to 5 (n_neighbors=5)
clf.fit(X_train, y_train)
Out[60]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')
In [61]:
accuracy = clf.score(X_test, y_test)
print(accuracy)
0.966666666667
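As the Out[60] line shows, sklearn’s classifier defaults to k = 5 (n_neighbors=5). If you want to see how the choice of k affects the result, a quick illustrative experiment is to sweep a few odd values and compare test accuracies:

# Try a few odd values of k and compare accuracy on the same test split
for k in [1, 3, 5, 7, 9]:
    clf_k = neighbors.KNeighborsClassifier(n_neighbors=k)
    clf_k.fit(X_train, y_train)
    print(k, clf_k.score(X_test, y_test))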

Our model is constructed, and now it’s time to use it to differentiate between different flowers on the basis of some new raw data.

In [62]:
example = np.array([6.5, 3.4, 5.6, 1.5])
example = example.reshape(1, -1)  # sklearn expects a 2D array: (n_samples, n_features)

We construct a new example containing the measurements of a flower that is likely to be “Iris virginica”, and use the ‘predict’ method to analyse it and find the result.

In [63]:
prediction = clf.predict(example)
print(prediction)
['Iris-virginica']
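If you want to peek at the vote behind this prediction, the classifier’s kneighbors method returns the distances to, and the indices of, the k nearest training points (indices refer to rows of X_train):

distances, indices = clf.kneighbors(example)
print(distances)            # distances to the 5 nearest training points
print(y_train[indices[0]])  # their class labels; the majority wins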

The model gives us the desired result, and hence we were able to successfully predict the type of flower (with roughly 97% test accuracy)!

I hope you have built an intuition for how this algorithm works. If you have any doubts, feel free to leave a comment below.

More algorithms coming soon, stay tuned 😀


Manan Manwani

2nd year BTech Information Technology student at Netaji Subhas Institute of Technology, Delhi. iOS app developer and currently studying Data Science and Machine Learning. Finance enthusiast!

Github: github.com/manan904
Linkedin: in.linkedin.com/in/manan-manwani-208664133
Quora: www.quora.com/profile/Manan-Manwani?share=49105fc4&srid=oaqt

5 Comments

  1. Really well written article. One thing it lacked, in my opinion, was that it did not mention how KNN finds the distance. The concept of Euclidean distance could have been added and explained with the help of code.

    • Hi Archit. Deciding whether a particular attribute is a label or a feature depends on the task which you need to carry out.
      Labels are the final answers that you want as output from your code. Here we need to find out the type of flower and hence the flower type column becomes the label.

      All other attributes that actually contribute to the final decision are known as features.
      If the final result depends on the attribute, then it is a feature.
      E.g., sepal width is a feature; the serial number (S.No) of the observation is not.
