In this tutorial, we will go ahead and make our predictive model for a student’s performance.
Let’s quickly go through what we have done till now in the previous sections.
We read the data set into a Pandas DataFrame, and then dropped the rows in which the final grade (the G3 column) was 0.
import pandas as pd
df = pd.read_csv("Student_math.csv", index_col=0)
df.drop(df[df.G3 == 0].index, inplace=True)
df.head()
[Output: the first five rows of the DataFrame; 5 rows × 33 columns]
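The same filtering step can also be written with a boolean mask, which some readers find easier to follow. The sketch below uses a small made-up DataFrame (the name toy and its values are illustrative and are not taken from Student_math.csv); only the G3 column name comes from the tutorial.

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({"G1": [10, 0, 14, 8], "G3": [11, 0, 15, 0]})

# Boolean mask: True for rows whose final grade is non-zero.
mask = toy["G3"] != 0

# Indexing with the mask keeps only the rows where it is True
# and returns a new DataFrame, leaving toy unchanged.
filtered = toy[mask]

print(filtered)
print(len(toy), "rows before,", len(filtered), "rows after")
```

Unlike drop(..., inplace=True), this form does not modify the original DataFrame, so you would assign the result back (for example df = df[df.G3 != 0]) to get the same effect.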
Then we made a few effective visualizations using matplotlib and seaborn.
Now we are in a position to move ahead and build our model.
You may have noticed that most of our data is numerical: numbers are used to denote categories such as parents’ education level, students’ health status, and alcohol consumption. However, a few columns, such as school, sex, and address, still contain strings. These must be converted into numerical values before we feed them into the machine learning model. This is easily done with scikit-learn’s LabelEncoder class. Let’s see how it works. We will first import it into our notebook and create an instance of it.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Now we will transform all those columns which have string data.
These columns have the dtype object.
Hence, we can use this condition to identify them from a list of all the columns.
for i in df.columns.values:  # iterate through the list of columns
    if df[i].dtypes == object:  # condition to identify desired columns
        df[i] = le.fit_transform(df[i])  # transforming string values to numerical values
df.columns.values returns an array of all the columns.
df[i].dtypes == object returns True if the contents of that column are of object type.
The fit_transform method learns the distinct values in a column and replaces each one with an integer code.
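As a side note, pandas can list the object columns directly with select_dtypes, which gives the same set of columns as the dtype check in the loop. This is only a sketch on a made-up DataFrame (toy and its values are assumptions; the column names are borrowed from the tutorial), with the string columns explicitly created as object dtype.

```python
import pandas as pd

# Made-up rows; string columns are created with dtype=object explicitly.
toy = pd.DataFrame({
    "school": pd.Series(["GP", "MS", "GP"], dtype=object),
    "sex": pd.Series(["F", "M", "F"], dtype=object),
    "age": [18, 17, 15],
})

# Approach used in the tutorial: check each column's dtype in a loop.
loop_cols = [c for c in toy.columns.values if toy[c].dtypes == object]

# Alternative: let pandas select the object columns in one call.
object_cols = toy.select_dtypes(include=["object"]).columns.tolist()

print(loop_cols)
print(object_cols)
```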
For example, in the column “school”, there are two possible values: GP and MS.
After running the above code, GP will be replaced with 0, and MS with 1. This will be done for every such column.
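To check which number each string was mapped to, you can inspect the encoder’s classes_ attribute. The sketch below runs LabelEncoder on a short made-up list of school codes (the list stands in for the real column); LabelEncoder sorts the labels, which is why GP gets 0 and MS gets 1.

```python
from sklearn.preprocessing import LabelEncoder

# A short made-up list standing in for the "school" column.
school = ["GP", "MS", "GP", "GP", "MS"]

le = LabelEncoder()
encoded = le.fit_transform(school)

# classes_ holds the original labels in sorted order; a label's position
# in that array is the integer code it was assigned.
mapping = {str(label): code for code, label in enumerate(le.classes_)}

print(encoded)   # [0 1 0 0 1]
print(mapping)   # {'GP': 0, 'MS': 1}

# inverse_transform converts the codes back into the original strings.
restored = le.inverse_transform(encoded)
print(restored)
```

Keep in mind that the loop above reuses a single encoder, so after it finishes, le.classes_ reflects only the last column that was fitted. If you need the mapping for every column, one option is to create a separate LabelEncoder per column.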
Let’s confirm this by printing the head of the DataFrame.
[Output of df.head(): 5 rows × 33 columns]
So, there are only numerical values now.
Let’s move forward.
Now, we will divide our DataFrame into 2 components: Features and Label.
Features are those columns which will help us in predicting the Target column, which is the Label.
X = df.drop("G3", axis=1)
y = df["G3"]
By convention, we use a capital X to denote features and a small y for label.
Now, we will again divide it into 2 parts: Training data and Testing data.
Training data will be used to train the model, or we can say that our model will learn from the training data. We will then check how well it performs on the testing data.
This splitting can be done using the train_test_split function.
# train_test_split lives in sklearn.model_selection; the older
# sklearn.cross_validation module was deprecated in version 0.18.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
We have kept test_size = 0.3, which means that 30% of the data will be our test data, and 70% will be training data.
X_train and y_train contain those 70% of the features and corresponding labels.
X_test and y_test contain the 30% features and labels.
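A quick way to confirm the 70/30 proportions is to look at the shapes of the returned pieces. The sketch below uses made-up NumPy arrays with 10 samples instead of the student data, so the sizes are easy to verify by hand.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each (values are arbitrary).
# Row i of X is [2*i, 2*i + 1] and its label is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# 30% of 10 samples go to the test set, the remaining 70% to training.
print(X_train.shape, X_test.shape)   # (7, 2) (3, 2)
print(y_train.shape, y_test.shape)   # (7,) (3,)

# Features and labels are shuffled together, so each row keeps its label.
rows_match = bool(np.all(X_train[:, 0] == 2 * y_train))
print(rows_match)
```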
Now, let’s go ahead and make our model.
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
We are using the Linear Regression method.
We have created an instance of it, clf.
Now we just need to fit the model with training data and use it on the test data to make predictions.
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
pred
array([ 12.51393401, 18.18987207, 4.42768241, 9.16059729, 14.28567866, 7.58571906, 7.38527528, 14.07841796, 8.33873292, 12.13458235, 10.08520185, 13.43587516, 5.2193941 , 18.2171288 , 16.24361584, 17.37211706, 10.39896643, 11.59616875, 13.00031178, 13.4581169 , 6.2929912 , 9.33604262, 12.22283708, 12.14459509, 5.75231993, 18.20057129, 8.76621951, 8.37920238, 11.64465319, 4.86504009, 11.0255172 , 13.35430825, 11.28166246, 11.42017383, 16.60248803, 11.22994202, 8.6686866 , 9.74217097, 9.97131691, 9.11305365, 5.71361428, 13.29647228, 10.42822387, 12.08377734, 13.77801183, 11.73454716, 15.20652575, 7.72007646, 9.77128068, 5.77796792, 15.16576655, 10.1267426 , 11.42509943, 14.35414888, 14.25009222, 13.11393108, 8.56354847, 10.17406577, 9.74141498, 14.8384602 , 12.23533563, 14.8832286 , 9.62973959, 9.6217026 , 15.28115066, 15.97316659, 12.5636143 , 6.85127777, 10.35713125, 14.67568653, 7.16041255, 13.54747098, 5.67877571, 14.31102608, 13.32593698, 12.96826199, 11.78757012, 9.24669248, 13.03635794, 15.23564163, 8.80465866, 16.25420385, 16.44748341, 18.3018279 , 13.92067759, 14.29603112, 12.21256048, 10.4457754 , 6.24656748, 8.42218708, 4.9585227 , 15.79362031, 10.27867987, 10.36950758, 10.9766083 , 12.13776921, 11.01180076, 9.19893073, 15.1127327 , 14.92453175, 12.21173077, 14.28582502, 8.81775043, 13.96594846, 8.46486789, 8.22979967, 13.64501045, 18.00085067])
We trained our model using the fit() method, and then made predictions from the test features using the predict() method.
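To get a feel for what fit() and predict() do, it can help to try them on data where the answer is known in advance. The sketch below is not the student model: it uses made-up points that lie exactly on the line y = 2x + 1 (the name model and the data are assumptions), so the fitted slope and intercept can be checked by eye.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data lying exactly on the line y = 2*x + 1 (no noise).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
model.fit(X, y)  # learns one coefficient per feature plus an intercept

# The learned parameters should be very close to slope 2 and intercept 1.
print(model.coef_, model.intercept_)

# predict() applies the learned line to unseen inputs.
pred = model.predict(np.array([[4.0], [5.0]]))
print(pred)
```

With the student data the idea is the same, except that the model learns one coefficient for each of the feature columns in X_train.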
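The only thing left is to find out how well the model performs, which is easy to do with the score() method. This is often loosely called accuracy, but for LinearRegression, score() returns the coefficient of determination (R²) of the predictions, where 1.0 means a perfect fit.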
acc = clf.score(X_test, y_test)
acc
That’s a pretty good score!
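For a regression model, the number returned by score() is the same quantity that sklearn.metrics.r2_score computes. The sketch below checks this on synthetic data (the data, the seed, and the variable names are assumptions made for illustration) and also computes the mean absolute error, which is expressed in the same units as the target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic data: a linear signal plus some random noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.5, size=100)

# Simple 70/30 split by position, just for this illustration.
X_train, X_test = X[:70], X[70:]
y_train, y_test = y[:70], y[70:]

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

score = model.score(X_test, y_test)       # R^2 computed by the model
r2 = r2_score(y_test, pred)               # R^2 computed from the predictions
mae = mean_absolute_error(y_test, pred)   # average absolute error

print(score, r2, mae)
```

Applied to the tutorial’s model, mean_absolute_error(y_test, pred) would tell you by how many grade points the predictions are off on average.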
Note that your result may differ somewhat because train_test_split() randomly chooses the training and testing sets.
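If you want the same split, and therefore the same score, on every run, train_test_split accepts a random_state parameter. The sketch below demonstrates this on made-up arrays; the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with 2 features each (values are arbitrary).
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Two calls with the same random_state produce exactly the same split.
first = train_test_split(X, y, test_size=0.3, random_state=42)
second = train_test_split(X, y, test_size=0.3, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(first, second))
print(same)  # True
```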
This brings us to the end of this series.
Congratulations! You are now ready to solve your own regression problem and take on the world!
As always, leave your doubts and suggestions in the comments. Happy learning. 🙂
I am an avid learner and never shy away from working hard or working late. I am also a passionate reader and love thriller novels, with Jeffrey Archer being my favorite writer.