
Comprehensive Regression Series – Predicting Student Performance – Part 2 – Exploratory Data Analysis


This is the second part of the Comprehensive Regression Series.
We strongly recommend going through the previous parts before starting this one.
The series is as follows:
Part 2 – Exploratory Data Analysis (this article)

 

In this tutorial, we will use pandas to explore our data set.
Pandas is one of the most popular Python libraries for data science.

Install it by running the following command at your command prompt (Windows) or terminal:
pip install pandas
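Once installed, you can quickly confirm that pandas imports correctly and check which version you have:

```python
import pandas as pd

# Print the installed pandas version to confirm the installation worked
print(pd.__version__)
```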

Alright then, let’s start our data exploration.

It is common practice to import pandas as pd in your Python file.

In [82]:
import pandas as pd

Pandas has functions to read various types of files. As our file is in .csv format, we will use the read_csv() function.
It converts the file into a tabular representation, much like MS Excel, called a DataFrame.

In [83]:
df = pd.read_csv("Student_math.csv", index_col=0)

We can pass just the name of the file, or the complete path if your Python file is in a different directory than the .csv file.
The second argument tells pandas to use the first column as the index of the DataFrame.
We do this because the first column is nothing but numbering from 0 onwards.
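To see what index_col=0 does, here is a minimal sketch using a tiny, made-up two-row CSV (held in memory via io.StringIO) whose first, unnamed column is just row numbering, like in our file:

```python
import io

import pandas as pd

# A tiny stand-in for Student_math.csv: the first (unnamed) column
# is just numbering from 0 onwards, exactly like our data file
csv_text = ",school,sex,G3\n0,GP,F,6\n1,GP,F,6\n"

# index_col=0 turns that numbering column into the DataFrame index
df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_demo.columns.tolist())  # ['school', 'sex', 'G3']
print(list(df_demo.index))       # [0, 1]
```

Without index_col=0, that numbering would show up as an extra data column instead of being used as the index.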

We have stored our DataFrame in a variable called df.
Let’s print it out.

In [84]:
df
Out[84]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other 4 3 2 1 2 5 4 6 10 10
5 GP M 16 U LE3 T 4 3 services other 5 4 2 1 2 5 10 15 15 15
6 GP M 16 U LE3 T 2 2 other other 4 4 4 1 1 3 0 12 12 11
7 GP F 17 U GT3 A 4 4 other teacher 4 1 4 1 1 1 6 6 5 6
8 GP M 15 U LE3 A 3 2 services other 4 2 2 1 1 1 0 16 18 19
9 GP M 15 U GT3 T 3 4 other other 5 5 1 1 1 5 0 14 15 15
10 GP F 15 U GT3 T 4 4 teacher health 3 3 3 1 2 2 0 10 8 9
11 GP F 15 U GT3 T 2 1 services other 5 2 2 1 1 4 4 10 12 12
12 GP M 15 U LE3 T 4 4 health services 4 3 3 1 3 5 2 14 14 14
13 GP M 15 U GT3 T 4 3 teacher other 5 4 3 1 2 3 2 10 10 11
14 GP M 15 U GT3 A 2 2 other other 4 5 2 1 1 3 0 14 16 16
15 GP F 16 U GT3 T 4 4 health other 4 4 4 1 2 2 4 14 14 14
16 GP F 16 U GT3 T 4 4 services services 3 2 3 1 2 2 6 13 14 14
17 GP F 16 U GT3 T 3 3 other other 5 3 2 1 1 4 4 8 10 10
18 GP M 17 U GT3 T 3 2 services services 5 5 5 2 4 5 16 6 5 5
19 GP M 16 U LE3 T 4 3 health other 3 1 3 1 3 5 4 8 10 10
20 GP M 15 U GT3 T 4 3 teacher other 4 4 1 1 1 1 0 13 14 15
21 GP M 15 U GT3 T 4 4 health health 5 4 2 1 1 5 0 12 15 15
22 GP M 16 U LE3 T 4 2 teacher other 4 5 1 1 3 5 2 15 15 16
23 GP M 16 U LE3 T 2 2 other other 5 4 4 2 4 5 0 13 13 12
24 GP F 15 R GT3 T 2 4 services health 4 3 2 1 1 5 2 10 9 8
25 GP F 16 U GT3 T 2 2 services services 1 2 2 1 3 5 14 6 9 8
26 GP M 15 U GT3 T 2 2 other other 4 2 2 1 2 5 2 12 12 11
27 GP M 15 U GT3 T 4 2 health services 2 2 4 2 4 1 4 15 16 15
28 GP M 16 U LE3 A 3 4 services other 5 3 3 1 1 5 4 11 11 11
29 GP M 16 U GT3 T 4 4 teacher teacher 4 4 5 5 5 5 16 10 12 11
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
365 MS M 18 R GT3 T 1 3 at_home other 3 3 4 2 4 3 4 10 10 10
366 MS M 18 U LE3 T 4 4 teacher services 4 2 2 2 2 5 0 13 13 13
367 MS F 17 R GT3 T 1 1 other services 5 2 1 1 2 1 0 7 6 0
368 MS F 18 U GT3 T 2 3 at_home services 5 2 3 1 2 4 0 11 10 10
369 MS F 18 R GT3 T 4 4 other teacher 3 2 2 4 2 5 10 14 12 11
370 MS F 19 U LE3 T 3 2 services services 3 2 2 1 1 3 4 7 7 9
371 MS M 18 R LE3 T 1 2 at_home services 4 3 3 2 3 3 3 14 12 12
372 MS F 17 U GT3 T 2 2 other at_home 3 4 3 1 1 3 8 13 11 11
373 MS F 17 R GT3 T 1 2 other other 3 5 5 1 3 1 14 6 5 5
374 MS F 18 R LE3 T 4 4 other other 5 4 4 1 1 1 0 19 18 19
375 MS F 18 R GT3 T 1 1 other other 4 3 2 1 2 4 2 8 8 10
376 MS F 20 U GT3 T 4 2 health other 5 4 3 1 1 3 4 15 14 15
377 MS F 18 R LE3 T 4 4 teacher services 5 4 3 3 4 2 4 8 9 10
378 MS F 18 U GT3 T 3 3 other other 4 1 3 1 2 1 0 15 15 15
379 MS F 17 R GT3 T 3 1 at_home other 4 5 4 2 3 1 17 10 10 10
380 MS M 18 U GT3 T 4 4 teacher teacher 3 2 4 1 4 2 4 15 14 14
381 MS M 18 R GT3 T 2 1 other other 4 4 3 1 3 5 5 7 6 7
382 MS M 17 U GT3 T 2 3 other services 4 4 3 1 1 3 2 11 11 10
383 MS M 19 R GT3 T 1 1 other services 4 3 2 1 3 5 0 6 5 0
384 MS M 18 R GT3 T 4 2 other other 5 4 3 4 3 3 14 6 5 5
385 MS F 18 R GT3 T 2 2 at_home other 5 3 3 1 3 4 2 10 9 10
386 MS F 18 R GT3 T 4 4 teacher at_home 4 4 3 2 2 5 7 6 5 6
387 MS F 19 R GT3 T 2 3 services other 5 4 2 1 2 5 0 7 5 0
388 MS F 18 U LE3 T 3 1 teacher services 4 3 4 1 1 1 0 7 9 8
389 MS F 18 U GT3 T 1 1 other other 1 1 1 1 1 5 0 6 5 0
390 MS M 20 U LE3 A 2 2 services services 5 5 4 4 5 4 11 9 9 9
391 MS M 17 U LE3 T 3 1 services services 2 4 5 3 4 2 3 14 16 16
392 MS M 21 R GT3 T 1 1 other other 5 5 3 3 3 3 3 10 8 7
393 MS M 18 R LE3 T 3 2 services other 4 4 1 3 4 5 0 11 12 10
394 MS M 19 U LE3 T 1 1 other at_home 3 2 3 3 3 5 5 8 9 9

395 rows × 33 columns

This displays the whole DataFrame, which can get unwieldy if the data is large.
Instead, we can print just the top few rows using df.head()

In [85]:
df.head()
Out[85]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other 4 3 2 1 2 5 4 6 10 10

5 rows × 33 columns

By default, it displays the first 5 rows, but we can pass any integer to change that.

In [86]:
df.head(10)
Out[86]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob famrel freetime goout Dalc Walc health absences G1 G2 G3
0 GP F 18 U GT3 A 4 4 at_home teacher 4 3 4 1 1 3 6 5 6 6
1 GP F 17 U GT3 T 1 1 at_home other 5 3 3 1 1 3 4 5 5 6
2 GP F 15 U LE3 T 1 1 at_home other 4 3 2 2 3 3 10 7 8 10
3 GP F 15 U GT3 T 4 2 health services 3 2 2 1 1 5 2 15 14 15
4 GP F 16 U GT3 T 3 3 other other 4 3 2 1 2 5 4 6 10 10
5 GP M 16 U LE3 T 4 3 services other 5 4 2 1 2 5 10 15 15 15
6 GP M 16 U LE3 T 2 2 other other 4 4 4 1 1 3 0 12 12 11
7 GP F 17 U GT3 A 4 4 other teacher 4 1 4 1 1 1 6 6 5 6
8 GP M 15 U LE3 A 3 2 services other 4 2 2 1 1 1 0 16 18 19
9 GP M 15 U GT3 T 3 4 other other 5 5 1 1 1 5 0 14 15 15

10 rows × 33 columns

Similarly, we can check the last rows using the tail() function.

In [87]:
df.tail()
Out[87]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob famrel freetime goout Dalc Walc health absences G1 G2 G3
390 MS M 20 U LE3 A 2 2 services services 5 5 4 4 5 4 11 9 9 9
391 MS M 17 U LE3 T 3 1 services services 2 4 5 3 4 2 3 14 16 16
392 MS M 21 R GT3 T 1 1 other other 5 5 3 3 3 3 3 10 8 7
393 MS M 18 R LE3 T 3 2 services other 4 4 1 3 4 5 0 11 12 10
394 MS M 19 U LE3 T 1 1 other at_home 3 2 3 3 3 5 5 8 9 9

5 rows × 33 columns

To check rows in the middle, or anywhere else, we can use index slicing as well.
We specify the starting index (inclusive) and the ending index (exclusive) of the required rows, separated by a colon.

In [88]:
df[10:15]
Out[88]:
school sex age address famsize Pstatus Medu Fedu Mjob Fjob famrel freetime goout Dalc Walc health absences G1 G2 G3
10 GP F 15 U GT3 T 4 4 teacher health 3 3 3 1 2 2 0 10 8 9
11 GP F 15 U GT3 T 2 1 services other 5 2 2 1 1 4 4 10 12 12
12 GP M 15 U LE3 T 4 4 health services 4 3 3 1 3 5 2 14 14 14
13 GP M 15 U GT3 T 4 3 teacher other 5 4 3 1 2 3 2 10 10 11
14 GP M 15 U GT3 A 2 2 other other 4 5 2 1 1 3 0 14 16 16

5 rows × 33 columns

You can skip the starting index to begin from the first row, and the ending index to go all the way to the last row.
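As a quick sketch on a small, made-up DataFrame (slicing behaves the same way on our df):

```python
import pandas as pd

# A small stand-in DataFrame just to demonstrate slicing
demo = pd.DataFrame({"G3": [6, 6, 10, 15, 10, 15]})

print(demo[:3])  # starting index skipped: rows 0, 1, 2
print(demo[4:])  # ending index skipped: rows 4 and 5
print(demo[:])   # both skipped: all rows
```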

However, since the number of columns is greater than what can be displayed, “…” appears in the middle, indicating the presence of more columns.
To get the list of all the columns, we can simply do the following:

In [89]:
df.columns
Out[89]:
Index([u'school', u'sex', u'age', u'address', u'famsize', u'Pstatus', u'Medu',
       u'Fedu', u'Mjob', u'Fjob', u'reason', u'guardian', u'traveltime',
       u'studytime', u'failures', u'schoolsup', u'famsup', u'paid',
       u'activities', u'nursery', u'higher', u'internet', u'romantic',
       u'famrel', u'freetime', u'goout', u'Dalc', u'Walc', u'health',
       u'absences', u'G1', u'G2', u'G3'],
      dtype='object')

The “u” prefix just indicates that the strings are Unicode (it appears in Python 2 output; in Python 3 all strings are Unicode and it is not shown), so don’t worry much about it.
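If you prefer a plain Python list over the Index object, pandas can convert it; here is a sketch on a small stand-in frame with a few of our columns:

```python
import pandas as pd

# Stand-in frame with a few of our columns
demo = pd.DataFrame({"school": ["GP"], "sex": ["F"], "G3": [6]})

# .tolist() turns the Index object into an ordinary list of strings
cols = demo.columns.tolist()
print(cols)  # ['school', 'sex', 'G3']
```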
To get a specific column, we can simply write df.column_name or df["column_name"].
Let’s check the last column, for example.

In [91]:
df["G3"]
Out[91]:
0       6
1       6
2      10
3      15
4      10
5      15
6      11
7       6
8      19
9      15
10      9
11     12
12     14
13     11
14     16
15     14
16     14
17     10
18      5
19     10
20     15
21     15
22     16
23     12
24      8
25      8
26     11
27     15
28     11
29     11
       ..
365    10
366    13
367     0
368    10
369    11
370     9
371    12
372    11
373     5
374    19
375    10
376    15
377    10
378    15
379    10
380    14
381     7
382    10
383     0
384     5
385    10
386     6
387     0
388     8
389     0
390     9
391    16
392     7
393    10
394     9
Name: G3, dtype: int64

Our data can sometimes have missing values or NaN values. To get the number of values in each column, we can call the count() function.

In [92]:
df.count()
Out[92]:
school        395
sex           395
age           395
address       395
famsize       395
Pstatus       395
Medu          395
Fedu          395
Mjob          395
Fjob          395
reason        395
guardian      395
traveltime    395
studytime     395
failures      395
schoolsup     395
famsup        395
paid          395
activities    395
nursery       395
higher        395
internet      395
romantic      395
famrel        395
freetime      395
goout         395
Dalc          395
Walc          395
health        395
absences      395
G1            395
G2            395
G3            395
dtype: int64

We see that each column has 395 values, which means there are no missing values.
We can get a little more information about each column using the info() function. It tells us the type of values in each column in addition to the count.

In [93]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 395 entries, 0 to 394
Data columns (total 33 columns):
school        395 non-null object
sex           395 non-null object
age           395 non-null int64
address       395 non-null object
famsize       395 non-null object
Pstatus       395 non-null object
Medu          395 non-null int64
Fedu          395 non-null int64
Mjob          395 non-null object
Fjob          395 non-null object
reason        395 non-null object
guardian      395 non-null object
traveltime    395 non-null int64
studytime     395 non-null int64
failures      395 non-null int64
schoolsup     395 non-null object
famsup        395 non-null object
paid          395 non-null object
activities    395 non-null object
nursery       395 non-null object
higher        395 non-null object
internet      395 non-null object
romantic      395 non-null object
famrel        395 non-null int64
freetime      395 non-null int64
goout         395 non-null int64
Dalc          395 non-null int64
Walc          395 non-null int64
health        395 non-null int64
absences      395 non-null int64
G1            395 non-null int64
G2            395 non-null int64
G3            395 non-null int64
dtypes: int64(16), object(17)
memory usage: 104.9+ KB

Pandas also provides the describe() method to summarize the numerical columns of the DataFrame, including the min, max, mean, standard deviation, etc. for each.

In [94]:
df.describe()
Out[94]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3
count 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000 395.000000
mean 16.696203 2.749367 2.521519 1.448101 2.035443 0.334177 3.944304 3.235443 3.108861 1.481013 2.291139 3.554430 5.708861 10.908861 10.713924 10.415190
std 1.276043 1.094735 1.088201 0.697505 0.839240 0.743651 0.896659 0.998862 1.113278 0.890741 1.287897 1.390303 8.003096 3.319195 3.761505 4.581443
min 15.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 3.000000 0.000000 0.000000
25% 16.000000 2.000000 2.000000 1.000000 1.000000 0.000000 4.000000 3.000000 2.000000 1.000000 1.000000 3.000000 0.000000 8.000000 9.000000 8.000000
50% 17.000000 3.000000 2.000000 1.000000 2.000000 0.000000 4.000000 3.000000 3.000000 1.000000 2.000000 4.000000 4.000000 11.000000 11.000000 11.000000
75% 18.000000 4.000000 3.000000 2.000000 2.000000 0.000000 5.000000 4.000000 4.000000 2.000000 3.000000 5.000000 8.000000 13.000000 13.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 4.000000 3.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 75.000000 19.000000 19.000000 20.000000

This gives you a very comprehensive picture of the numerical data in your data set. Notice that the minimum value for G2 and G3 is 0, which likely means that the student did not appear for the exam or that the data is missing.
Hence, it is a good idea to get rid of those rows.
We can use the drop() function for that.

In [95]:
df.drop(df[df.G3 == 0].index, inplace=True)

The drop() function deletes rows based on the condition we give; here we delete the rows that have G3 = 0.
inplace=True tells pandas to make these changes in the original DataFrame. Without it, drop() would return a new DataFrame and leave the original one unchanged.
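An equivalent idiom you will often see is boolean masking: keep the rows you want and reassign, which avoids inplace=True altogether. A sketch on a stand-in frame:

```python
import pandas as pd

# Stand-in frame with a couple of zero grades mixed in
demo = pd.DataFrame({"G3": [6, 0, 10, 0, 15]})

# Keep only the rows where G3 is non-zero; reassigning the variable
# replaces the old DataFrame, so no inplace=True is needed
demo = demo[demo.G3 != 0]
print(demo["G3"].tolist())  # [6, 10, 15]
```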

Let’s check the description again.

In [96]:
df.describe()
Out[96]:
age Medu Fedu traveltime studytime failures famrel freetime goout Dalc Walc health absences G1 G2 G3
count 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000 357.000000
mean 16.655462 2.795518 2.546218 1.431373 2.042017 0.271709 3.955182 3.246499 3.098039 1.495798 2.330532 3.549020 6.316527 11.268908 11.358543 11.523810
std 1.268262 1.093999 1.084217 0.686075 0.831895 0.671750 0.885721 1.011601 1.090779 0.919886 1.294974 1.402638 8.187623 3.240450 3.147188 3.227797
min 15.000000 0.000000 0.000000 1.000000 1.000000 0.000000 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 0.000000 3.000000 5.000000 4.000000
25% 16.000000 2.000000 2.000000 1.000000 1.000000 0.000000 4.000000 3.000000 2.000000 1.000000 1.000000 3.000000 2.000000 9.000000 9.000000 9.000000
50% 17.000000 3.000000 3.000000 1.000000 2.000000 0.000000 4.000000 3.000000 3.000000 1.000000 2.000000 4.000000 4.000000 11.000000 11.000000 11.000000
75% 18.000000 4.000000 3.000000 2.000000 2.000000 0.000000 5.000000 4.000000 4.000000 2.000000 3.000000 5.000000 8.000000 14.000000 14.000000 14.000000
max 22.000000 4.000000 4.000000 4.000000 4.000000 3.000000 5.000000 5.000000 5.000000 5.000000 5.000000 5.000000 75.000000 19.000000 19.000000 20.000000

Notice that the minimum value of G3, as well as G2, is no longer 0, so we don’t have to repeat the previous step for G2.
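You can verify this claim directly instead of scanning the describe() table. A sketch on a stand-in frame where, as in our data, every G2 = 0 row also had G3 = 0:

```python
import pandas as pd

# Stand-in frame: the zero-G2 rows are exactly the zero-G3 rows
demo = pd.DataFrame({"G2": [6, 0, 8, 0], "G3": [6, 0, 10, 0]})

# Drop the G3 == 0 rows, as we did on the real data
demo = demo[demo.G3 != 0]

# No G2 zeros remain either, so no second pass is needed
print((demo["G2"] == 0).any())  # False
```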

Now, let’s check how many rows we have for each “sex”. This can be done with the value_counts() function.

In [97]:
df["sex"].value_counts()
Out[97]:
F    185
M    172
Name: sex, dtype: int64

Similarly, we can check the number of instances for any other attribute, like “age”.

In [98]:
df["age"].value_counts()
Out[98]:
16    97
17    90
15    76
18    70
19    19
20     3
22     1
21     1
Name: age, dtype: int64

What if we want to see which sex performs better than the other?
This can be done with the groupby() function, followed by taking the mean of the final grade (G3).

In [99]:
df.groupby("sex")["G3"].mean()
Out[99]:
sex
F    11.205405
M    11.866279
Name: G3, dtype: float64

So, we can conclude that there is not much of a difference, although males perform slightly better than females on average.
We can do the same with any other attribute, like age.
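For instance, grouping by “age” works the same way. Here is a sketch with a few made-up rows (the actual averages depend on the full data set); on the real data you would write df.groupby("age")["G3"].mean():

```python
import pandas as pd

# A few made-up rows standing in for the real data
demo = pd.DataFrame({"age": [15, 15, 16, 16, 17],
                     "G3":  [10, 14, 8, 12, 11]})

# Average final grade per age group
means = demo.groupby("age")["G3"].mean()
print(means)
```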

So now, go ahead and play with these functions to explore the data in more detail, and generate insights from it.

Next, we will visualize the data to make it easier to understand.
Stay tuned!


Pranav Gupta

Co-Founder at DataScribble
An always cheerful and optimistic guy, with a knack for achieving the set target at any cost.
I am an avid learner and never shy off from working hard or working till late. I am also a passionate reader, and love to read thriller novels, Jeffrey Archer being the favorite writer.
LinkedIn: https://www.linkedin.com/in/prnvg/
