In this tutorial, we will go ahead and make our predictive model for a student’s performance.
Let’s quickly go through what we have done till now in the previous sections.
We read the data set into a Pandas DataFrame, and then got rid of those instances for which the value of Grades was 0.
import pandas as pd
df = pd.read_csv("Student_math.csv", index_col=0)
df.drop(df[df.G3 == 0].index, inplace=True)
df.head()
Then we made a few effective visualizations using matplotlib and seaborn.
Now we are in a position to move ahead and build our model.
You can notice that most of our data is numerical, where we have used numbers to denote various categories like parents’ education level, students’ health satus, alcohol consumption amount, etc. However, we still have a few columns with string entries like school, sex, address, etc. It is required to convert these into numerical values before feeding them into the machine learning model. This can be easily done using the LabelEncoder function. Let’s see how it works. We will first import it into our notebook and make an instance of it.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
Now we will transform all those columns which have string data.
These columns have dtypes = object.
Hence, we can use this condition to identify them from a list of all the columns.
for i in df.columns.values: #iterate through the list of columns
if df[i].dtypes == object: #condition to identify desired columns
df[i] = le.fit_transform(df[i]) #transforming string values to numerical values
df.columns.values returns an array of all the columns.
df[i].dtypes == object returns true if the contents of that column are object type.
fit_transform method fits the data and transforms it into numerical values.
For example, in the column “School”, there are two possible values: GP and MS.
After running the above code, GP will be replaced with 0, and MS with 1. This will be done for every such column.
Let’s confirm this by printing the head of the DataFrame.
df.head()
So, there are only numerical values now.
Let’s move forward.
Now, we will divide our DataFrame into 2 componets: Features and Label.
Features are those columns which will help us in predicting the Target column, which is the Label.
X = df.drop("G3", axis=1)
y = df["G3"]
By convention, we use a capital X to denote features and a small y for label.
Now, we will again divide it into 2 parts: Training data and Testing data.
Training data will be used to train the model, or we can say that our model will learn from he training data. We will then test it out for accuracy on the testing data.
This spilliting can be done by using train_test_split function.
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
We have kept test_size = 0.3, which means that 30% of the data will be our test data, and 70% will be training data.
X_train and y_train contain those 70% of the features and corresponding labels.
X_test and y_test contain the 30% features and labels.
Now, let’s go ahead and make our model.
from sklearn.linear_model import LinearRegression
clf = LinearRegression()
We are using the Linear Regression method.
We have created an instance of it, clf.
Now we just need to fit the model with training data and use it on the test data to make predictions.
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
pred
We have trained our model using fit() function, and then made predictions from test features using predict() function.
The only thing left to be done is getting to know the accuracy, which can be done easily using the score() function.
acc = clf.score(X_test, y_test)
acc
That’s pretty good accuracy!
Notice that your result may differ somewhat beacause train_test_split() randomly chooses training and testing sets.
This brings us to the end of this series.
Congatulations! You are now ready to solve your own Regression Problem and take on the world!
As always, comment out your doubts and suggestions. Happy learning. 🙂
Pranav Gupta
I am an avid learner and never shy off from working hard or working till late. I am also a passionate reader, and love to read thriller novels, Jeffrey Archer being the favorite writer.
LinkedIn: https://www.linkedin.com/in/prnvg/
Latest posts by Pranav Gupta (see all)
- Data Science… Where to START ?! - May 25, 2019
- The Subtle Differences among Data Science, Machine Learning, and Artificial Intelligence - May 4, 2019
- Introductory Guide to NumPy - July 25, 2018
really nice post.
in this tutorial, we use regression for predicting numerical variable.
it is not classification, but how can you get the accuracy?
usually , could we get MSE or RMSE?
I have been surfing online more than 3 hours today, yet I never found any interesting article like yours. It is pretty worth enough for me. In my view, if all web owners and bloggers made good content as you did, the net will be a lot more useful than ever before.
nice post.
Thanks ! Glad you liked it.
This 4 part series is pure gold for beginners. Clean code and easy explanations, what else can one ask for.
Thanks mate !
when you are diving the dataframe in two parts,can you explain how you are selecting the features and labels,what criteria you are using to select it in general.
The features represent the characteristics of the data which are used to predict a value, and label represents what we want to predict.
In this example, we want to predict the final grade of the student, i.e., G3. Hence, it is the Label. All other columns will be used to predict the value of G3. Hence, they are the Features. Hope I was able to clear it for you. 🙂
please drop me the data to train.bi@gmail.com