
Comprehensive Regression Series – Predicting Student Performance – Part 3 – Visualizing the Data


This is the third part of the Comprehensive Regression Series.
We strongly recommend going through the previous parts before starting with this one.
The series is as follows:
Part 3 – Visualizing the Data (this article)

 

In this part, we will make beautiful visualizations from our data set. Visualizations make analyzing the data set much easier.
Let’s pick up from where we left off.
We will read the data set into a pandas DataFrame and get rid of the entries for which “G3” = 0, since a zero indicates that the student was absent from the exam or that the data is missing.

In [127]:
import pandas as pd
df = pd.read_csv("Student_math.csv", index_col=0)
df.drop(df[df.G3 == 0].index, inplace=True)
df.head()
Out[127]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob … famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher … 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other … 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other … 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services … 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other … 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

If any line of the above code is not clear, please refer to Part 1 and Part 2 of this series.

Now, we will start off by importing a couple of libraries for visualizing the data.

In [128]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Matplotlib is the foundational library for visualizations in Python.
Seaborn is a more modern library used to make attractive plots; it is itself built on top of Matplotlib.
It is standard convention to import them as plt and sns respectively.

%matplotlib inline renders the plots inside the Jupyter notebook itself, rather than in a separate window.

There is a large number of plots you can make using these libraries. We encourage you to go through their documentation to see their capabilities.

Let’s start straight away by analyzing our target attribute, i.e., “G3”.
We will quickly make a histogram to check its distribution. This can be done by calling seaborn’s distplot() function and passing in the column and the number of bins we want. (Note: recent versions of seaborn have deprecated distplot() in favor of histplot() and displot(), but the idea is the same.)

In [129]:
sns.distplot(df["G3"], bins = 15)
Out[129]:
<matplotlib.axes._subplots.AxesSubplot at 0xd3239fd0>

We can see that the most frequent grades lie around 10. This makes sense, because most students perform around average.

The curve overlaid on the histogram is the kernel density estimate (the “kde line”); it shows that the grades follow a more or less normal distribution. If you just want the histogram, pass the additional parameter kde=False, as shown below.
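For example, to draw the histogram alone (a minimal sketch; the bin count of 15 simply matches the plot above):

# Histogram only, without the KDE curve
sns.distplot(df["G3"], bins=15, kde=False)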

distplot allows us to analyze only one attribute at a time. To analyze two attributes together, we can use jointplot. With it, we specify two attributes, one for each axis, and it draws a scatter plot of the two along with the distribution of each on the margins.
Let us check it out by plotting the First Period Grade (“G1”) against the Final Grade (“G3”).

In [130]:
sns.jointplot(x="G1", y="G3", data = df, kind = "scatter")
Out[130]:
<seaborn.axisgrid.JointGrid at 0xcfd125c0>

Here, we can see the correlation between the two grades: if a student scores well in the first period exam, they tend to perform well in the finals as well. Makes sense!
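Since this is a regression series, it is worth mentioning that jointplot accepts other values for the kind parameter too. For instance, kind="reg" overlays a linear regression fit on the scatter plot (a minimal sketch, using the same df as above):

# Scatter plot of "G1" vs "G3" with a fitted regression line on top
sns.jointplot(x="G1", y="G3", data=df, kind="reg")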

Let’s check the average final grade for males and females. This can be done by simply creating a bar plot; by default, seaborn’s barplot shows the mean of the numerical attribute for each category.

In [131]:
sns.barplot(x="sex", y="G3", data=df)
Out[131]:
<matplotlib.axes._subplots.AxesSubplot at 0xd5e096d8>

We see that, on average, males perform slightly better than females.

We can also do this using pandas’ built-in visualization tools.
Let’s see how the grades are affected by the area where the student lives.
The attribute “address” is “U” if the area is urban, and “R” if it is rural.

In [132]:
df.groupby("address")["G3"].mean().plot.bar()
Out[132]:
<matplotlib.axes._subplots.AxesSubplot at 0xd6213cf8>

So, students living in urban areas tend to perform better.

We can do the same for various attributes. Also, we can specify a different aggregate function like max or min, and the color of the bars as follows:

In [133]:
df.groupby("health")["G3"].max().plot.bar(color = "yellow")
Out[133]:
<matplotlib.axes._subplots.AxesSubplot at 0xd6368b38>

So, according to the data, the least healthy students somehow achieve a higher maximum grade than the healthier ones!
We can suspect that this attribute was not recorded correctly, or that some errors are present. Hence, we will not consider it in our list of features for predicting the grades.
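For reference, excluding such an attribute later is a one-liner (a minimal sketch; the variable name features is our own, not part of the series code):

# Copy of the data without the unreliable "health" column
features = df.drop("health", axis=1)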

Let’s see what effect alcohol consumption has. The attribute “Walc” records weekend alcohol consumption on a scale from 1 (very low) to 5 (very high).

In [134]:
df.groupby("Walc")["G3"].min().plot.bar(color = "green")
Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0xd6663ac8>

We can see that students who consume less alcohol tend to perform better. Makes sense!

Let’s see the performance of the different schools, based on sex.

In [135]:
df.groupby(["school","sex"])["G3"].mean().plot.bar(color="r")
Out[135]:
<matplotlib.axes._subplots.AxesSubplot at 0xd5cf4588>

We notice that in the Gabriel Pereira school, males perform better, whereas in the Mousinho da Silveira school, females perform better. Interesting!

We can also make scatter plots for categorical attributes like “address” by using stripplot.

In [136]:
sns.stripplot(x="address", y="G3", data=df, hue="sex", jitter=True)
Out[136]:
<matplotlib.axes._subplots.AxesSubplot at 0xd69b4b38>

The parameter hue colors the scatter points based on another categorical attribute (“sex” in this case).
jitter=True simply prevents the scatter points from being stacked exactly on top of each other, so that we can see them individually.
We can notice that in rural areas, girls tend to perform better, while the opposite holds in urban areas.

We can also set the color scheme using the parameter palette.

In [142]:
sns.stripplot(x="address", y="G3", data=df, hue="sex", palette = "rainbow", jitter=True)
Out[142]:
<matplotlib.axes._subplots.AxesSubplot at 0xf13f3b70>

You can try passing various strings to this parameter.

One of the most popular plots is the pairplot.
It draws a scatter plot of every numerical attribute against every other numerical attribute, with the distribution of each attribute along the diagonal.

In [143]:
sns.pairplot(df, hue="sex", palette = "Set1")
Out[143]:
<seaborn.axisgrid.PairGrid at 0xf6a36c18>

This does not give us much information for this data set. However, you can notice the high correlation among “G1”, “G2”, and “G3” in the bottom right corner.
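If you want to zoom in on that corner without redrawing the whole grid, you can pass just the grade columns (a minimal sketch):

# Pairplot restricted to the three grade columns only
sns.pairplot(df[["G1", "G2", "G3"]])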

A better way to see which columns are correlated with which is to use the corr() method.

In [144]:
df.corr()
Out[144]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3
age 1.000000 -0.139999 -0.138532 0.106723 0.000447 0.271748 0.066234 0.002889 0.128041 0.142015 0.120844 -0.049694 0.215578 -0.030706 -0.158273 -0.140372
Medu -0.139999 1.000000 0.608327 -0.177805 0.055764 -0.214681 -0.006585 0.017753 0.078049 0.006122 -0.049314 -0.043790 0.075924 0.172444 0.203288 0.190308
Fedu -0.138532 0.608327 1.000000 -0.185481 -0.028631 -0.262197 -0.009537 -0.023222 0.042474 -0.018816 -0.018914 0.009127 0.008948 0.162752 0.178706 0.158811
traveltime 0.106723 -0.177805 -0.185481 1.000000 -0.095827 0.128950 -0.023566 -0.007936 0.037167 0.154209 0.139424 0.001316 0.004628 -0.086438 -0.109559 -0.099785
studytime 0.000447 0.055764 -0.028631 -0.095827 1.000000 -0.131072 0.052122 -0.152533 -0.047891 -0.199821 -0.247601 -0.072786 -0.074541 0.140638 0.119759 0.126728
failures 0.271748 -0.214681 -0.262197 0.128950 -0.131072 1.000000 -0.007802 0.103712 0.128388 0.167774 0.174172 0.046940 0.148261 -0.302071 -0.301316 -0.293831
famrel 0.066234 -0.006585 -0.009537 -0.023566 0.052122 -0.007802 1.000000 0.134631 0.030728 -0.079527 -0.126642 0.108042 -0.058076 0.010083 -0.005304 0.037711
freetime 0.002889 0.017753 -0.023222 -0.007936 -0.152533 0.103712 0.134631 1.000000 0.283519 0.209400 0.132759 0.086485 -0.070492 0.005429 -0.015486 -0.021589
goout 0.128041 0.078049 0.042474 0.037167 -0.047891 0.128388 0.030728 0.283519 1.000000 0.281761 0.444320 -0.009576 0.056590 -0.150527 -0.155100 -0.177383
Dalc 0.142015 0.006122 -0.018816 0.154209 -0.199821 0.167774 -0.079527 0.209400 0.281761 1.000000 0.644920 0.088875 0.104791 -0.128721 -0.127554 -0.140690
Walc 0.120844 -0.049314 -0.018914 0.139424 -0.247601 0.174172 -0.126642 0.132759 0.444320 0.644920 1.000000 0.111680 0.123100 -0.176541 -0.176656 -0.190054
health -0.049694 -0.043790 0.009127 0.001316 -0.072786 0.046940 0.108042 0.086485 -0.009576 0.088875 0.111680 1.000000 -0.029116 -0.072126 -0.059990 -0.081691
absences 0.215578 0.075924 0.008948 0.004628 -0.074541 0.148261 -0.058076 -0.070492 0.056590 0.104791 0.123100 -0.029116 1.000000 -0.120313 -0.199546 -0.213129
G1 -0.030706 0.172444 0.162752 -0.086438 0.140638 -0.302071 0.010083 0.005429 -0.150527 -0.128721 -0.176541 -0.072126 -0.120313 1.000000 0.901940 0.891805
G2 -0.158273 0.203288 0.178706 -0.109559 0.119759 -0.301316 -0.005304 -0.015486 -0.155100 -0.127554 -0.176656 -0.059990 -0.199546 0.901940 1.000000 0.965583
G3 -0.140372 0.190308 0.158811 -0.099785 0.126728 -0.293831 0.037711 -0.021589 -0.177383 -0.140690 -0.190054 -0.081691 -0.213129 0.891805 0.965583 1.000000

corr() computes the pairwise correlation between all the numerical columns and returns the result as a matrix.

You can notice that the diagonal elements are all 1’s. Makes sense, because each column is perfectly correlated with itself.

We can visualize it by making a heatmap.

In [145]:
sns.heatmap(df.corr())
Out[145]:
<matplotlib.axes._subplots.AxesSubplot at 0x1086e8be0>

Notice how the color of each cell encodes the strength of the correlation; the color bar on the right maps the colors to values.
To read the map more easily, we can increase the figure size.

We can also display the correlation values by using annot=True.

In [146]:
fig, ax = plt.subplots(figsize=(20,20)) 
sns.heatmap(df.corr(), annot=True, ax=ax)
Out[146]:
<matplotlib.axes._subplots.AxesSubplot at 0x10f7546d8>
 

Notice how the different attributes are correlated with one another. These correlations can help us determine the features for our machine learning model, which is what we will do in the next part of the series.
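As a head start, one quick way to shortlist candidate features is to sort the attributes by their correlation with “G3” (a minimal sketch; the variable name corr_with_target is our own):

# Rank the numerical attributes by their correlation with the final grade
corr_with_target = df.corr()["G3"].drop("G3").sort_values(ascending=False)
print(corr_with_target)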

You can play around with these plots, passing different arguments and generating more insights.

Stay tuned for more. Happy learning 🙂


Pranav Gupta

Co-Founder at DataScribble
An always cheerful and optimistic guy, with a knack for achieving the set target at any cost.
I am an avid learner and never shy away from working hard or working till late. I am also a passionate reader, and love to read thriller novels, Jeffrey Archer being my favorite writer.
LinkedIn: https://www.linkedin.com/in/prnvg/
