
Understanding the Math behind K-Nearest Neighbors Algorithm using Python

The K-Nearest Neighbors algorithm (KNN) is an elementary but important machine learning algorithm. KNN can be used for both classification and regression problems. The popularity of KNN can be attributed to its easy interpretation and the fact that it involves no real training phase.

Let us begin our understanding of this technique by diving deeper into the mathematics behind it.

Suppose we have a set of data points, some blue and some red, and we want to classify a new black-colored point as either red or blue.

What the algorithm does is find the 3 nearest neighbors (if K=3) and check their colors. Since the majority of these three nearest neighbors are red, the point is classified as red. The parameter K decides how many of the nearest neighbors are considered when determining the property of our unknown point.

You might be wondering how these nearest neighbors are found. This is where the mathematics comes in.

KNN uses a similarity metric to determine the nearest neighbors. More often than not, this metric is the Euclidean distance between our unknown point and the other points in the dataset. The general formula for the Euclidean distance between two observations p and q with n attributes is:

d(p, q) = √[(q1 − p1)² + (q2 − p2)² + … + (qn − pn)²]

where q1 to qn represent the attribute values for one observation and p1 to pn represent the attribute values for the other observation.
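
To make this concrete, here is a small self-contained sketch (with made-up points, not taken from the dataset used later) that classifies one unknown point by a majority vote among its K=3 nearest neighbors:

import numpy as np
from collections import Counter

# A few made-up 2-D points with known colors (labels)
points = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.5]])
labels = ['red', 'red', 'blue', 'blue', 'red']

unknown = np.array([1.4, 1.5])   # the new "black" point we want to classify
k = 3

# Euclidean distance from the unknown point to every known point
distances = np.sqrt(((points - unknown) ** 2).sum(axis=1))

# Indices of the K closest points, then a majority vote on their labels
nearest = np.argsort(distances)[:k]
print(Counter(labels[i] for i in nearest).most_common(1)[0][0])   # -> 'red'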

Let us understand this better with an example.

The AirBnb dataset

Airbnb is a marketplace for short-term rentals that allows you to list part or all of your living space for others to rent. You can rent out everything from a room in an apartment to your entire house on Airbnb. Over the years, Airbnb has grown to become a popular alternative to hotels.

One challenge that hosts looking to rent out their living space face is determining the optimal nightly rent price. This is where a dataset of all the other listings in a particular place helps. The dataset for Washington D.C. can be downloaded from this link:
http://data.insideairbnb.com/united-states/dc/washington-dc/2015-10-03/data/listings.csv.gz

In this problem, we will assume that we have a house in Washington D.C. that we want to put up on Airbnb, but we are facing a problem in deciding its nightly rent. If we set the rent too high, renters will find a cheaper alternative, and if we set it too low, we might miss out on profit. K-Nearest Neighbors can help us here.

We can find a few listings that are similar to ours, average the listed prices of the ones most similar to ours, and set our listing price to this calculated average.

Let’s begin.

Reading and cleaning the dataset

In [1]:
import numpy as np
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

Let’s read the dataset into a pandas dataframe and display its columns.

In [2]:
dc_listings = pd.read_csv('listings.csv.gz')
dc_listings.columns
Out[2]:
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')

Looks like there are a lot of columns. Many of them contain URLs of the various pictures on a host’s page, which are of no use to us. Columns such as city, country_code, country, latitude, longitude, etc. also don’t provide any useful information if we want to predict prices.

When predicting the rent, what are the most common factors that come to mind?

  1. The number of people the house can accommodate.
  2. The number of bedrooms it has.
  3. The number of bathrooms it has.
  4. The number of reviews, which reflects the credibility of the host.

These 4 columns, along with the ‘price’ column that we need to predict, are kept in the dataframe and the rest are omitted.
In [3]:
dc_listings = dc_listings[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','price']]

Let’s have a look at our dataset.

In [4]:
dc_listings.head()
Out[4]:
   accommodates  bedrooms  bathrooms  number_of_reviews    price
0             4       1.0        1.0                  0  $160.00
1             6       3.0        3.0                 65  $350.00
2             1       1.0        2.0                  1   $50.00
3             2       1.0        1.0                  0   $95.00
4             4       1.0        1.0                  0   $50.00

Let’s check out the price column.

In [5]:
dc_listings['price'].head()
Out[5]:
0    $160.00
1    $350.00
2     $50.00
3     $95.00
4     $50.00
Name: price, dtype: object

It looks like the prices are stored as strings (dtype ‘object’). To convert them to a numeric format, we strip the $ sign from each price.
If we look further, we will find that some prices are written like $1,000, so we also remove the ‘,’ from all the prices before finally converting all the values in the column to ‘float’ type.

In [6]:
dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace('$',''))
dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace(',',''))
dc_listings['price'] = dc_listings['price'].astype('float')
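
As a side note, the same cleaning can be written more compactly with pandas string methods; this is just an equivalent alternative to cell [6], not the code used for the results below:

# Equivalent one-liner: strip '$' and ',' with a regex, then cast to float
dc_listings['price'] = (dc_listings['price']
                        .str.replace(r'[$,]', '', regex=True)
                        .astype('float'))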

Once the price column is fixed, let us check whether there are any missing values in the dataframe.

In [7]:
dc_listings.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3723 entries, 0 to 3722
Data columns (total 5 columns):
accommodates         3723 non-null int64
bedrooms             3702 non-null float64
bathrooms            3696 non-null float64
number_of_reviews    3723 non-null int64
price                3723 non-null float64
dtypes: float64(3), int64(2)
memory usage: 145.5 KB

There are indeed some missing values, in the ‘bedrooms’ and ‘bathrooms’ columns. We drop the rows in the dataframe that have any missing values.

In [8]:
dc_listings = dc_listings.dropna(axis=0)
dc_listings.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3675 entries, 0 to 3722
Data columns (total 5 columns):
accommodates         3675 non-null int64
bedrooms             3675 non-null float64
bathrooms            3675 non-null float64
number_of_reviews    3675 non-null int64
price                3675 non-null float64
dtypes: float64(3), int64(2)
memory usage: 172.3 KB

Let us split the dataset into training and test sets. We apply the common 70-30 split, taking the first 70 percent of the rows as the training set and the remaining 30 percent as the test set.

In [9]:
train_rows  =  int(dc_listings.shape[0] * 0.7)
train = dc_listings.iloc[0:train_rows]
test = dc_listings.iloc[train_rows:]
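
Note that slicing off the first 70 percent of rows assumes the listings are not stored in any meaningful order. If that were a concern, a common variation (not used for the results in this article) would be to shuffle the rows once before taking the slices, for example:

# Optional: shuffle the rows once (before the slicing above) so the
# train/test split does not depend on the original row order.
dc_listings = dc_listings.sample(frac=1, random_state=1).reset_index(drop=True)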

Now let us remove the ‘price’ column from our list of columns and use the remaining columns for training.

In [10]:
features = list(train.columns)
features.remove('price')
features
Out[10]:
['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']

For each entry in our test set, we will iterate over all the entries in the training set and calculate the Euclidean distance between them. In other words, we are calculating the Euclidean distance between each entry in our test set and every entry in our training set. Once the distances have been calculated, we store them in a ‘distance’ column in the training set, shuffle the resulting set, and sort it in ascending order of distance. We then choose the first K entries of the sorted set (the first 5 for K=5) and take the mean of their ‘price’ column. So, in simpler terms, we are:

  1. Taking each entry in the test set.
  2. Finding its Euclidean distance from each entry in the training set.
  3. Appending the calculated distances to a new column ‘distance’ in the training set.
  4. Randomly shuffling the resulting set (this only matters for breaking ties between equally distant neighbors).
  5. Sorting the set in ascending order of distance.
  6. Choosing the first 5 entries (for K=5), i.e. the five nearest neighbors.
  7. Calculating the mean of their ‘price’ column, which is the predicted price.
  8. Appending the resulting predicted price to a new column ‘predicted_price’ in the test set.

Concretely, for two listings in our dataset the Euclidean distance works out to:

distance = √[(accommodates1 − accommodates2)² + (bedrooms1 − bedrooms2)² + (bathrooms1 − bathrooms2)² + (number_of_reviews1 − number_of_reviews2)²]

Let us write a function that performs the tasks from step 2 to step 7.
This function, predict_price, takes in a row of the test set and the dataframe that is our training set.
It subtracts the row from all the rows of the training set, column by column. Once the differences are obtained, it squares them, sums them across the columns and takes the square root, which is exactly the formula for the Euclidean distance. It adds the calculated distances to a new ‘distance’ column and sorts the set by that column (the shuffle from step 4 only matters for breaking ties between equally distant neighbors, so it is left out of the code below). It then takes the mean of the ‘price’ column of the first 5 entries, i.e. the entries with the smallest distances, and returns this mean as the predicted price for that particular entry of the test set.

In [11]:
def predict_price(row, temp_df):
    # Work on a copy so the training set passed in is not modified
    temp_df = temp_df.copy()
    # Subtract the test row from every training row, column by column
    distances = temp_df[features] - row
    # Square the differences, sum them across the columns and take the square
    # root: the Euclidean distance of each training row from the test row
    squared = np.square(distances)
    dist = squared.sum(axis=1)
    temp_df['distance'] = dist**0.5
    # Sort by distance and average the prices of the 5 nearest neighbors (K=5)
    new_df = temp_df.sort_values('distance')
    nearest_neighbor_prices = new_df['price'].iloc[0:5]
    predicted_price = nearest_neighbor_prices.mean()
    return predicted_price

Now, let us iterate over each row of the test set and get the predicted price for each entry. We then append the predicted price to a new column in the test set.

In [12]:
for i,row in test[features].iterrows():
    predicted_price = predict_price(row,train)
    test.loc[i,'predicted_price'] = predicted_price
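
As an aside, the loop above recomputes the distances row by row. Essentially the same predictions can be produced in one shot with a vectorized sketch; this uses SciPy, which is an extra dependency not otherwise needed in this article, and ties between equally distant neighbors may be broken differently:

# Vectorized alternative (a sketch): all pairwise Euclidean distances at once
from scipy.spatial.distance import cdist

dists = cdist(test[features].values, train[features].values)    # shape: (n_test, n_train)
nearest_idx = np.argsort(dists, axis=1)[:, :5]                   # 5 nearest training rows per test row
vectorized_predictions = train['price'].values[nearest_idx].mean(axis=1)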

Let us look at our test set now.

In [13]:
test.head(10)
Out[13]:
      accommodates  bedrooms  bathrooms  number_of_reviews  price  predicted_price
2615             2       0.0        1.0                  0  500.0            133.4
2616             5       2.0        1.0                  6  400.0            175.4
2617             8       4.0        1.5                 12  285.0            455.0
2618             4       1.0        1.5                  2  110.0            126.6
2619             4       2.0        1.0                 23   99.0            199.6
2620             4       2.0        1.5                  2  125.0            210.6
2621             2       1.0        1.0                  3   55.0             94.8
2622             3       1.0        1.0                  1  120.0            141.6
2623             2       1.0        1.0                  1  130.0            108.8
2624             2       1.0        1.0                 16   71.0            119.8

We have made the predictions on our test set; it is time to calculate the error. We check the Mean Absolute Error (MAE) instead of the Root Mean Square Error (RMSE) because RMSE squares the errors before averaging them, so large errors are penalised far more heavily than small ones. MAE weights every error equally and tells us directly how many dollars off our price predictions are on average, which makes it a truer representation of our error here.

In [14]:
mae = np.absolute(test['price'] - test['predicted_price']).sum()/test.shape[0]
print(mae)
53.762828649138704
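
If scikit-learn is available, the same number can be cross-checked with its metrics module; this is an optional sketch, and the manual calculation above is what produced the value shown:

# Cross-check with scikit-learn: MAE should match the value above, while the
# RMSE will come out larger because squaring inflates the big errors.
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae_check = mean_absolute_error(test['price'], test['predicted_price'])
rmse_check = np.sqrt(mean_squared_error(test['price'], test['predicted_price']))
print(mae_check, rmse_check)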

Making Predictions

Let us assume we have a flat which can accommodate 4 people and has 2 bedrooms and 2 bathrooms. Also, since it is a new flat, it does not have any reviews yet, so number_of_reviews is 0.
Let us store this data in a pandas Series, in the same order as the corresponding columns of our dataset.
Then we call the predict_price function on it.

In [15]:
our_flat = pd.Series([4,2,2,0])
our_flat.index = pd.Index(features)
predicted_price = predict_price(our_flat,train)
print("We should list our flat at a nightly rate of $%d."%(predicted_price))
We should list our flat at a nightly rate of $234.
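
For comparison, scikit-learn ships a ready-made version of the same idea in KNeighborsRegressor. Here is a minimal sketch, assuming scikit-learn is installed; it may suggest a slightly different price because ties between equally distant neighbors can be broken differently:

# Optional cross-check with scikit-learn's KNN regressor (K=5, Euclidean distance)
from sklearn.neighbors import KNeighborsRegressor

knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(train[features], train['price'])
sklearn_price = knn.predict(pd.DataFrame([our_flat]))[0]
print("scikit-learn suggests a nightly rate of about $%d." % sklearn_price)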

Thus we have learned the inner workings of this algorithm using code in Python. If you have any doubts, feel free to comment below.

Happy Learning 🙂

Saksham Malhotra

After learning python 2 years ago and dabbling in web development, I encountered data science and felt 'Yes, this is what I want to do.' I strongly believe that the world can be changed through the power of data. Among my other interests, I love reading fiction and owning more books than I can possibly read at once.

Linkedin - www.linkedin.com/in/saksham-malhotra-9bb69513b

5 Comments

  1. Awesome tutorial. What if I want to predict something which is not numeric? Let’s say that in this particular Airbnb dataset I am someone who is looking for the best house in the area I want, given how much I can afford and how many rooms I need. How would we predict that particular house using KNN?

    • It would work in the same way; we would just have to tweak the code a little bit.
      When you enter your requirements (including price), KNN will find the nearest neighbors to your attributes/requirements, and instead of performing any calculations on them, we can simply display them. Those 5 will be the most suitable houses according to your requirements; a rough sketch of that tweak is shown below. I hope I answered your question.
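
      A minimal sketch of the idea, assuming numpy/pandas as in the article; 'requirements' is a hypothetical Series holding the desired accommodates, bedrooms, bathrooms, number_of_reviews and price:

      import numpy as np

      def nearest_listings(requirements, df, k=5):
          # Euclidean distance from the requirements to every listing, using
          # only the columns named in the requirements Series
          cols = list(requirements.index)
          dists = np.sqrt(((df[cols] - requirements) ** 2).sum(axis=1))
          # Return the K most similar listings themselves instead of a mean price
          return df.loc[dists.sort_values().index[:k]]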

  2. Hi, I am a complete newbie to coding. Can you tell me why the warnings package is needed?
