
Comprehensive Regression Series – Predicting Student Performance – Part 4 – Making the Predictive Model


This is the fourth and final part of the Comprehensive Regression Series.
We strongly recommend going through the previous parts before starting this one.

 

In this tutorial, we will go ahead and make our predictive model for a student’s performance.

Let's quickly recap what we did in the previous sections.
We read the dataset into a pandas DataFrame, and then dropped the rows where the final grade (G3) was 0.

In [1]:
import pandas as pd
df = pd.read_csv("Student_math.csv", index_col=0)
df.drop(df[df.G3 == 0].index, inplace=True)
df.head()
Out[1]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob … famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher … 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other … 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other … 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services … 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other … 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

Then we made a few effective visualizations using matplotlib and seaborn.

Now we are in a position to move ahead and build our model.

Notice that most of our data is numerical: we have used numbers to denote various categories like parents' education level, students' health status, alcohol consumption, and so on. However, we still have a few columns with string entries, like school, sex, address, etc. We need to convert these into numerical values before feeding them into the machine learning model. This can be done easily with scikit-learn's LabelEncoder. Let's see how it works. We will first import it into our notebook and create an instance of it.

In [2]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

Now we will transform all the columns that contain string data.
These columns have dtype object, so we can use this condition to pick them out of the list of all columns.

In [3]:
for i in df.columns.values:               #iterate through the list of columns
    if df[i].dtypes == object:            #condition to identify desired columns
        df[i] = le.fit_transform(df[i])   #transforming string values to numerical values

df.columns.values returns an array of all the column names.

df[i].dtypes == object returns True if the contents of that column are of object type.

The fit_transform method fits the encoder on the data and transforms it into numerical values.

For example, the school column has two possible values: GP and MS.
After running the above code, GP will be replaced with 0, and MS with 1. This will be done for every such column.
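If you want to double-check which string maps to which number for a given column, you can fit a fresh encoder on the raw column and inspect its classes_ attribute. Here is a minimal sketch, assuming you re-read the raw CSV; raw and le_school are names we introduce just for this check:

raw = pd.read_csv("Student_math.csv", index_col=0)   # raw, pre-encoding data
le_school = LabelEncoder()
le_school.fit(raw["school"])
print(le_school.classes_)   # ['GP' 'MS'] -> encoded as 0 and 1, in that order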

Let’s confirm this by printing the head of the DataFrame.

In [4]:
df.head()
Out[4]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob … famrel freetime goout Dalc Walc health absences G1 G2 G3
0 0 0 18 1 0 0 4 4 0 4 … 4 3 4 1 1 3 6 5 6 6
1 0 0 17 1 0 1 1 1 0 2 … 5 3 3 1 1 3 4 5 5 6
2 0 0 15 1 1 1 1 1 0 2 … 4 3 2 2 3 3 10 7 8 10
3 0 0 15 1 0 1 4 2 1 3 … 3 2 2 1 1 5 2 15 14 15
4 0 0 16 1 0 1 3 3 2 2 … 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

So, there are only numerical values now.
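If you prefer a programmatic check over eyeballing the head, here is a quick one-liner sketch that lists any remaining string columns (it should print an empty list now):

print(df.select_dtypes(include="object").columns.tolist())   # [] -> no string columns left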

Let’s move forward.

Now, we will divide our DataFrame into 2 components: Features and Label.

Features are the columns that will help us predict the Target column, which is the Label.

In [5]:
X = df.drop("G3", axis=1)
y = df["G3"]

By convention, we use a capital X to denote the features and a lowercase y for the label.

Now, we will again divide the data into 2 parts: training data and testing data.
The training data will be used to train the model; in other words, our model will learn from the training data. We will then test its accuracy on the testing data.

This splitting can be done using the train_test_split function.

In [6]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

(In older versions of scikit-learn this function lived in sklearn.cross_validation, a module that was deprecated in 0.18 and removed in 0.20.)

We have kept test_size = 0.3, which means that 30% of the data will be our test data, and 70% will be training data.

X_train and y_train contain that 70% of the features and the corresponding labels.

X_test and y_test contain the remaining 30% of the features and labels.
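As a quick sanity check on the split, you can print the shapes of the four pieces; a minimal sketch (the exact row counts depend on your data):

print(X_train.shape, X_test.shape)   # roughly 70% and 30% of the rows
print(y_train.shape, y_test.shape)   # labels are split the same way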

Now, let’s go ahead and make our model.

In [7]:
from sklearn.linear_model import LinearRegression
clf = LinearRegression()

We are using the Linear Regression method.
We have created an instance of it, clf.

Now we just need to fit the model with training data and use it on the test data to make predictions.

In [8]:
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
pred
Out[8]:
array([ 12.51393401,  18.18987207,   4.42768241,   9.16059729,
        14.28567866,   7.58571906,   7.38527528,  14.07841796,
         8.33873292,  12.13458235,  10.08520185,  13.43587516,
         5.2193941 ,  18.2171288 ,  16.24361584,  17.37211706,
        10.39896643,  11.59616875,  13.00031178,  13.4581169 ,
         6.2929912 ,   9.33604262,  12.22283708,  12.14459509,
         5.75231993,  18.20057129,   8.76621951,   8.37920238,
        11.64465319,   4.86504009,  11.0255172 ,  13.35430825,
        11.28166246,  11.42017383,  16.60248803,  11.22994202,
         8.6686866 ,   9.74217097,   9.97131691,   9.11305365,
         5.71361428,  13.29647228,  10.42822387,  12.08377734,
        13.77801183,  11.73454716,  15.20652575,   7.72007646,
         9.77128068,   5.77796792,  15.16576655,  10.1267426 ,
        11.42509943,  14.35414888,  14.25009222,  13.11393108,
         8.56354847,  10.17406577,   9.74141498,  14.8384602 ,
        12.23533563,  14.8832286 ,   9.62973959,   9.6217026 ,
        15.28115066,  15.97316659,  12.5636143 ,   6.85127777,
        10.35713125,  14.67568653,   7.16041255,  13.54747098,
         5.67877571,  14.31102608,  13.32593698,  12.96826199,
        11.78757012,   9.24669248,  13.03635794,  15.23564163,
         8.80465866,  16.25420385,  16.44748341,  18.3018279 ,
        13.92067759,  14.29603112,  12.21256048,  10.4457754 ,
         6.24656748,   8.42218708,   4.9585227 ,  15.79362031,
        10.27867987,  10.36950758,  10.9766083 ,  12.13776921,
        11.01180076,   9.19893073,  15.1127327 ,  14.92453175,
        12.21173077,  14.28582502,   8.81775043,  13.96594846,
         8.46486789,   8.22979967,  13.64501045,  18.00085067])

We have trained our model using the fit() method, and then made predictions from the test features using the predict() method.
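To eyeball a few predictions against the true grades, here is a minimal sketch (comparison is just a name we introduce here):

comparison = pd.DataFrame({"actual": y_test.values, "predicted": pred})
print(comparison.head())   # side-by-side view of true vs. predicted G3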

The only thing left is to evaluate the model, which can be done easily using the score() function. For a regression model like this one, score() returns the R² coefficient of determination rather than a classification accuracy.

In [9]:
acc = clf.score(X_test, y_test)
acc
Out[9]:
0.9251908725048571

That's a pretty good R² score!
Notice that your result may differ somewhat because train_test_split() randomly chooses the training and testing sets.
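Since score() reports R² rather than classification accuracy, you may also want a conventional error metric such as RMSE. Here is a minimal sketch using sklearn.metrics.mean_squared_error (and if you want these numbers to be reproducible, pass a fixed random_state, with any arbitrary seed, to train_test_split):

import numpy as np
from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, pred))   # error in grade points
print(rmse)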

This brings us to the end of this series.

Congratulations! You are now ready to solve your own regression problems and take on the world!

As always, post your doubts and suggestions in the comments. Happy learning. 🙂


Pranav Gupta

Co-Founder at DataScribble
An always cheerful and optimistic guy, with a knack for achieving the set target at any cost.
I am an avid learner and never shy off from working hard or working till late. I am also a passionate reader, and love to read thriller novels, Jeffrey Archer being the favorite writer.
LinkedIn: https://www.linkedin.com/in/pranav-gupta-284141126/

Comments

  1. Really nice post.
    In this tutorial we use regression to predict a numerical variable.
    It is not classification, so how can you get the accuracy?
    Usually, wouldn't we use MSE or RMSE?

  2. This 4-part series is pure gold for beginners. Clean code and easy explanations, what else can one ask for.

  3. When you are dividing the DataFrame into two parts, can you explain how you are selecting the features and labels, and what criteria you use to select them in general?

    • The features represent the characteristics of the data that are used to predict a value, and the label represents what we want to predict.
      In this example, we want to predict the final grade of the student, i.e., G3. Hence, it is the Label. All the other columns will be used to predict the value of G3. Hence, they are the Features. Hope I was able to clear it up for you. 🙂
