Understanding the Math behind K-Nearest Neighbors Algorithm using Python
The K-Nearest Neighbor algorithm (KNN) is an elementary but important machine learning algorithm. KNN can be used for both classification and regression predictive problems. The reason for the popularity of KNN can be attributed to its easy interpretation and low calculation time.
Let us begin our understanding of this technique by diving deeper into the mathematics behind it.
What the algorithm does, is find the 3 nearest neighbors(if K=3), and check the color of these three nearest neighbors. As majority of the three nearest neighbors are red the point is classified as red. The K parameter decides how many of the nearest neighbors have to be considered to determine the property of our unknown point.
You might be wondering how are these nearest neighbors found. This is where the mathematics comes in.
KNN uses a similarity metric to determine the nearest neighbors. This similarity metric is more often than not the Euclidean distance between our unknown point and the other points in the dataset. The general formula for Euclidean distance is:
where q1 to qn represent the attribute values for one observation and p1 to pn represent the attriubute values for the other observation.
Let us understand this better with an example.
The AirBnb dataset
AirBnB is a marketplace for short term rentals that allows you to list part or all of your living space for others to rent. You can rent everything from a room in an apartment to your entire house on AirBnB. Over the years, Airbnb has grown to become a popular alternative to hotels.
One challenge that hosts looking to rent their living space face is determining the optimal nightly rent price. This is where the dataset on all other listings of a particular place helps.The dataset for
Washington D.C. can be downloaded from this link.
In this problem, we will assume that we have a house in Washington D.C. that we want to put up on Airbnb but we are facing a problem as to deciding its nightly rent. If we choose the rent as too high, then renters will find a cheaper alternative and if we rent it too low we might miss out on making a profit. K-Nearest Neighbors can help us here.
We can find a few listings that are similar to ours, average the listed price for the ones most similar to ours,
and set our listing price to this calculated average price.
Reading and cleaning the dataset
import numpy as np import pandas as pd import warnings warnings.filterwarnings('ignore')
Let’s read in the dataset into a pandas dataframe and display the columns
dc_listings = pd.read_csv('listings.csv.gz') dc_listings.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary', 'space', 'description', 'experiences_offered', 'neighborhood_overview', 'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url', 'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since', 'host_location', 'host_about', 'host_response_time', 'host_response_rate', 'host_acceptance_rate', 'host_is_superhost', 'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood', 'host_listings_count', 'host_total_listings_count', 'host_verifications', 'host_has_profile_pic', 'host_identity_verified', 'street', 'neighbourhood', 'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market', 'smart_location', 'country_code', 'country', 'latitude', 'longitude', 'is_location_exact', 'property_type', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet', 'price', 'weekly_price', 'monthly_price', 'security_deposit', 'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights', 'maximum_nights', 'calendar_updated', 'has_availability', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'calendar_last_scraped', 'number_of_reviews', 'first_review', 'last_review', 'review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'requires_license', 'license', 'jurisdiction_names', 'instant_bookable', 'cancellation_policy', 'require_guest_profile_picture', 'require_guest_phone_verification', 'calculated_host_listings_count', 'reviews_per_month'], dtype='object')
Looks like there are a lot of columns. Many of the columns contain urls of the various pictures on a host’s page which are of no use to us. Columns such as city,city_code,country,latitude, longitude etc. don’t serve any useful information to us if we want to predict prices.
For predicting the rent what are the most common factors that come to head.
- The no. of people the house can accommodate.
- The no. of bedrooms it has.
- The no. of bathrooms it has.
- The no. of reviews which shows the credibility of the host.
These 4 columns along with the column ‘price’ which we need to predict are kept in the dataframe and the rest are ommited.
dc_listings = dc_listings[['accommodates', 'bedrooms', 'bathrooms','number_of_reviews','price']]
Let’s have a look at our dataset.
Let’s checkout the price column.
0 $160.00 1 $350.00 2 $50.00 3 $95.00 4 $50.00 Name: price, dtype: object
It looks like the prices are in ‘string’ or ‘object’ format. To convert it into the numeric format, we strip the \$ sign from each price.
If we look further, we will find some prices are like $1,000. So we also remove the ‘,’ from all the prices before finally converting the all the values in the column to ‘float’ type.
dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace('$','')) dc_listings['price'] = dc_listings['price'].apply(lambda x:x.replace(',','')) dc_listings['price'] = dc_listings['price'].astype('float')
Once, the price column is fixed let us check if there are any missing values in the dataframe.
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3723 entries, 0 to 3722 Data columns (total 5 columns): accommodates 3723 non-null int64 bedrooms 3702 non-null float64 bathrooms 3696 non-null float64 number_of_reviews 3723 non-null int64 price 3723 non-null float64 dtypes: float64(3), int64(2) memory usage: 145.5 KB
There are indeed some missing values, in the ‘bedrooms’,’bathrooms’, ‘review_scores_rating’ and ‘number_of_reviews’ columns. We drop the rows in the dataframe with any missing values.
dc_listings = dc_listings.dropna(axis=0) dc_listings.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 3675 entries, 0 to 3722 Data columns (total 5 columns): accommodates 3675 non-null int64 bedrooms 3675 non-null float64 bathrooms 3675 non-null float64 number_of_reviews 3675 non-null int64 price 3675 non-null float64 dtypes: float64(3), int64(2) memory usage: 172.3 KB
Let us split the dataset into training and test sets. We apply the most common 70-30 rule taking 70 percent of the rows in the training set and the remaining in the test set.
train_rows = int(dc_listings.shape * 0.7) train = dc_listings.iloc[0:train_rows] test = dc_listings.iloc[train_rows:]
Now let us remove the ‘price’ column from our list of columns and use the remaining columns for training.
features = list(train.columns) features.remove('price') features
['accommodates', 'bedrooms', 'bathrooms', 'number_of_reviews']
For each entry in our test set we will iterate over all the entries in the training set and calculate the euclidean distance. Thus, we are calculating the euclidean distances between each entry in our test set and all entries in our training set. After the euclidean distance has been calculated, we make a ‘distance’ column in our training set, shuffle the resulting set and sort them in ascending order of distance. Choose the first 3(if K=3) or first 5(if K=5) entries of the sorted dataset and find the mean of their ‘price’ column. So in simpler terms we are-
- Taking each entry in the test set.
- Finding its euclidean distance from each entry in the training set.
- Appending the calculated distance to a new column ‘distance’ in the training set.
- Randomnly shuffling the resulting set.
- Sorting the set in ascending order of distance.
- Choosing the first 5 entries(let K=5) i.e. the five nearest neighbors.
- Calculating the mean of their ‘price’ column which is the predicted price.
- Appending the resulting predicted price to a new column ‘predicted_price’ in the test set.
This is how the euclidean distance is calculated between two observations:
Let us write a function that does all the does the tasks from step 2 to step 7.
This function predict_price takes in a row of the test set and the dataframe which is our training set.
It subtracts the row from all the rows of training set according to the corresponding column. After the difference are obtained it squares the results, sums it and takes it square root. Thus, coplying with the formula of Euclidean distance. It adds the calculated distance to a new column, shuffles the dataset, sorts it and finds mean of the first 5 entries with the minimumn distance. It then returns the calculated minimum distance for the particular entry of the test set.
def predict_price(row,temp_df): distances = temp_df[features] - row squared = np.square(distances) dist = squared.sum(axis=1) temp_df['distance'] = dist**0.5 new_df = temp_df.sort_values('distance') nearest_neighbor_prices = new_df['price'].iloc[0:5] predicted_price = nearest_neighbor_prices.mean() return predicted_price
Now, let us iterate over each row of test set and get the predicted_price for each entry. Then we append the predicted price to a column in the test set.
for i,row in test[features].iterrows(): predicted_price = predict_price(row,train) test.loc[i,'predicted_price'] = predicted_price
Let us look at our test set now.
We have made the predictions on our test set. It is time to calculate the error now. The reason we check the Mean Absolute Error(MAE) instead of the root Mean Square Error(RMSE) is that in RMSE, even small error of \$10 is penalised to a larger extent. Thus MAE is a more true representation of our error
mae = np.absolute(test['price'] - test['predicted_price']).sum()/test.shape print(mae)
Let us assume we have a flat which can accommodate 4 people, has 2 bedrooms, and 2 bathrooms. Also, since it is a new flat, it does not have any number_of_reviews.
Let us store all our data into a pandas series in the same corresponding order as our dataset.
Then we call the predicted function on it.
our_flat = pd.Series([4,2,2,0]) our_flat.index = pd.Index(features) predicted_price = predict_price(our_flat,train) print("We should list our flat at a nightly rate of $%d."%(predicted_price))
We should list our flat at a nightly rate of $234.
Thus we learned the inner workings of this algorithm using codes in python. If you have any doubt, feel free to comment it out below.
Happy Learning 🙂
Linkedin - www.linkedin.com/in/saksham-malhotra-9bb69513b
Latest posts by Saksham Malhotra (see all)
- ML Algos from Scratch – KMeans Clustering - June 22, 2018
- Compressing Images using K-Means Clustering - April 20, 2018
- Introduction to SoftMax Regression (with codes in Python) - September 24, 2017