# Understanding The Data and Exploratory Data Analysis with Visualizations

This is the second part of the Comprehensive Classification Series.

We strongly recommend that you go through the previous parts before starting this one.

The series is as follows:

Part 1 – Introduction to Kaggle

Part 2 – Understanding The Data and Exploratory Data Analysis (this article)

The Titanic challenge on Kaggle is a competition in which the goal is to predict whether a given passenger survived or died, based on a set of variables describing that passenger, such as age, sex, or passenger class on the ship.

Through this part of the tutorial our aim is to understand the data set and the problem.

I hope you are familiar with the term EDA (Exploratory Data Analysis); you can find more details in our other post on EDA.

So let's begin by importing the libraries we will be using.

A very easy way to install these packages is to download and install the Anaconda distribution, which bundles them all. This distribution is available on all major platforms (Windows, Linux and macOS).

```
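import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib

# Use the ggplot look for all matplotlib figures
matplotlib.style.use('ggplot')

# Jupyter/IPython magic: render plots inside the notebook (remove it in a plain .py script)
%matplotlib inline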
```

Two data sets are available: a training set and a test set. We’ll be using the training set to build our predictive model and the testing set to score it and generate an output file to submit on the Kaggle evaluation system.

We’ll see how this procedure is done at the end of this series.

Now let's start by loading the training set. pandas provides many ways to load files and supports many file formats; for a CSV file we use read_csv.

```
data = pd.read_csv("train.csv")
```
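To see what read_csv does without needing the Kaggle file, here is a minimal sketch that parses a small in-memory CSV. The three rows and the name `sample` are made up for illustration and only mimic a few of the train.csv columns.

```python
import io
import pandas as pd

# Hypothetical three-row sample that mimics a few train.csv columns
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age,Fare\n"
    "1,0,3,male,22,7.25\n"
    "2,1,1,female,38,71.2833\n"
    "3,1,3,female,,7.925\n"
)

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(sample_csv)

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # Age becomes a float column because one value is blank
```

The blank Age field is read as NaN, which is how pandas marks a missing value.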

Yes, a single line of code reads the data in. Now let's use the head method to peek into our data.

```
data.head()
```

This shows the first five rows of the data set. You can similarly use tail to check out the last five.

**Data Dictionary**

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

**Variable Notes**

pclass: A proxy for socio-economic status (SES)

- 1st = Upper
- 2nd = Middle
- 3rd = Lower

age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5

sibsp: The data set defines family relations in this way…

- Sibling = brother, sister, stepbrother, stepsister
- Spouse = husband, wife (mistresses and fiancés were ignored)

parch: The data set defines family relations in this way…

- Parent = mother, father
- Child = daughter, son, stepdaughter, stepson

Some children traveled only with a nanny, therefore parch=0 for them.

Let's check the shape of the data, that is, the number of rows and the number of columns. This gives us a rough idea of the size of our data.

```
data.shape
```

Pandas allows you to statistically describe numerical features using the describe method.

```
data.describe()
```

The count row shows fewer entries for Age than for the other columns: 177 values are missing in the Age column.
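Missing values can also be counted directly rather than read off the describe output. The sketch below uses a small made-up frame called `toy` (hypothetical values, not the real data) to show the idea; on the real data you would call the same methods on `data`.

```python
import numpy as np
import pandas as pd

# Hypothetical mini frame; the real train.csv has many more rows and columns
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 35.0],
    "Fare": [7.25, 8.05, 71.28, 8.46, 53.10],
    "Cabin": [None, None, "C85", None, "C123"],
})

missing = toy.isnull().sum()          # number of missing values per column
missing_share = toy.isnull().mean()   # fraction of missing values per column

print(missing)
print(missing_share)
```

Unlike describe, which by default summarizes only the numerical columns, this also covers text columns such as Cabin.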

One solution is to replace the null values with the median age, which is more robust to outliers than the mean.

```
# Assign the result back; fillna(inplace=True) on a selected column is deprecated in recent pandas
data['Age'] = data['Age'].fillna(data['Age'].median())
```
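To illustrate why the median is the more robust choice, here is a minimal sketch with made-up ages (the series `ages` is hypothetical). A single very old passenger pulls the mean upward, while the median stays with the bulk of the values.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: four young adults, one outlier and one missing value
ages = pd.Series([20.0, 22.0, 24.0, 26.0, 80.0, np.nan])

mean_age = ages.mean()      # NaN is skipped; the outlier drags the mean up
median_age = ages.median()  # middle value of the five known ages

# Fill the gap with the median and keep the result in a new series
filled = ages.fillna(median_age)

print(mean_age, median_age)
print(filled.tolist())
```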

Let's run describe again to confirm that the Age count now matches the other columns.

```
data.describe()
```

Perfect. Now we have a fair idea of what the data set looks like and what each column represents. Let's dive in with some visualizations and analyze the data.

**Let's take gender as our first criterion:**

```
sns.countplot(x='Survived',data=data , hue ='Sex' )
```

The Sex variable seems to be a decisive feature. Women are more likely to survive.
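A count plot shows absolute numbers, so it helps to back it up with survival rates per group. Because Survived is coded as 0/1, its mean within a group is that group's survival rate. The sketch below uses a small made-up frame (`toy`, hypothetical values) and would work the same way on `data`.

```python
import pandas as pd

# Hypothetical six-passenger sample
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 0],
})

# Mean of a 0/1 column = share of ones = survival rate per group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

print(rate_by_sex)
```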

The next thing that comes to mind is the age of the survivors, so let's look at how age relates to survival.

```
fig =plt.figure(figsize=(12,9))
plt.hist( [data[data['Survived']==1]['Age'] , data[data['Survived'] ==0]['Age']] ,stacked=True , color = ['g','r'],
bins = 30,label = ['Survived','Dead'])
plt.xlabel('Age')
plt.ylabel('Number of passengers')
plt.legend()
```

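If you go through this plot, you will see that the share of survivors is highest in roughly the 0-12 age group, which suggests that children were given priority. Age alone says nothing about sex; that effect was visible in the previous plot. Also keep in mind that the tall bar around the median age is partly an artifact of filling the missing ages with the median.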
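The same reading can be checked numerically by cutting age into bands and computing the survival rate per band. This is only a sketch: the frame `toy`, the band edges and the labels are assumptions chosen for illustration, not values from the original analysis.

```python
import pandas as pd

# Hypothetical eight-passenger sample
toy = pd.DataFrame({
    "Age": [4, 9, 15, 28, 35, 47, 62, 71],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0],
})

# Assumed band edges; intervals are right-closed, e.g. (0, 12] is "child"
bins = [0, 12, 18, 60, 100]
labels = ["child", "teen", "adult", "senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Survival rate within each age band
rate_by_band = toy.groupby("AgeBand", observed=True)["Survived"].mean()

print(rate_by_band)
```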

Another feature we might consider here is Fare. Is the amount of money paid for the ticket correlated with the survival of passengers? Let's check this hypothesis with another intuitive graph.

```
figure = plt.figure(figsize=(15,8))
plt.hist([data[data['Survived']==1]['Fare'],data[data['Survived']==0]['Fare']], stacked=True, color = ['g','r'],
bins = 50,label = ['Survived','Dead'])
plt.xlabel('Fare')
plt.ylabel('Number of passengers')
plt.legend()
```

Passengers with cheaper tickets are more likely to die. Put differently, passengers with more expensive tickets, and therefore a higher social status, seem to have been rescued first.

Let’s now combine the age, the fare and the survival on a single chart.

```
plt.figure(figsize = (15,8))
ax = plt.subplot()
ax.scatter(data[data['Survived']==1]['Age'] , data[data['Survived']==1]['Fare'] , c='green' , s=40)
ax.scatter(data[data['Survived']==0]['Age'] , data[data['Survived']==0]['Fare'] , c='red' , s=40)
ax.set_xlabel('Age')
ax.set_ylabel('Fare')
```

The red points are the passengers who died and the green points are the survivors. You can spot a small green cluster near the origin (children on low fares), and the green points become more frequent as the fare increases. Fare is therefore an interesting feature to work with.

Moving on, the passenger class column (Pclass) also looks interesting. Let's plot the average fare paid in each class.

```
ax = plt.subplot()
ax.set_ylabel('Average fare')
# Select Fare before averaging so non-numeric columns are never passed to mean()
data.groupby('Pclass')['Fare'].mean().plot(kind='bar', figsize=(15,8), ax=ax)
```
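The bar chart can be complemented with a small summary table per class. The sketch below is an illustration on a made-up frame (`toy`, hypothetical values); the output column names `mean_fare`, `survival_rate` and `passengers` are my own choices, not part of the data set.

```python
import pandas as pd

# Hypothetical seven-passenger sample
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3],
    "Fare":     [80.0, 60.0, 20.0, 30.0, 8.0, 7.0, 9.0],
    "Survived": [1, 1, 1, 0, 0, 0, 1],
})

# Named aggregation: one output column per (input column, function) pair
summary = toy.groupby("Pclass").agg(
    mean_fare=("Fare", "mean"),
    survival_rate=("Survived", "mean"),
    passengers=("Survived", "count"),
)

print(summary)
```

Putting the group size next to each rate makes it easier to judge how much weight a rate deserves.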

Let's check whether the port of embarkation has any effect on survival. Try to predict the answer before checking out the graph.

```
sns.countplot(x='Survived', data=data, hue='Embarked')
```

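The plot shows no obvious relationship between the port of embarkation and survival. Keep in mind that it compares raw counts, and most passengers boarded at Southampton.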

To compare the numerical features of the data we use seaborn’s pair plot and see if we can find something interesting.

```
sns.pairplot(data)
```
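A pair plot can be dense to read, so a correlation matrix of the numerical columns is a useful companion. The sketch below uses a small made-up frame (`toy`, hypothetical values) and selects the numeric columns explicitly, which is an assumption about how you would want to treat text columns such as Sex.

```python
import pandas as pd

# Hypothetical six-passenger sample with one text column
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3],
    "Fare": [80.0, 60.0, 30.0, 20.0, 9.0, 7.0],
    "Survived": [1, 1, 1, 0, 0, 0],
    "Sex": ["female", "male", "female", "male", "male", "female"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()

print(corr.round(2))
```

In this toy sample Pclass and Fare are negatively correlated, since class 1 is the most expensive; on the real data you would run the same two lines on `data`.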

Now your task is to produce more visualizations from the data set and find out whether any other variable is related to survival.

# Until Next Time

Our next task is to add features to our data, a step known as Feature Engineering. Then we will move on to building our model.

We hope you liked this part of the series. Feel free to leave your questions in the comments.

Stay Tuned !!

### Tanishk Sachdeva
