
Data Exploration and Visualization


Data Exploration and Visualization with 911 Calls

 

For this project we will be analyzing 911 call data from Kaggle, focusing on data exploration and visualization as promised.

Download the data set here

We will be using one of my favorite libraries, Seaborn. The main aim of this project is to sample the data set, plot visualizations, and draw insights from it. So first, let's understand the data set. The data contains the following fields:

  • lat: Float variable, Latitude
  • lng: Float variable, Longitude
  • desc: String variable, Description of the Emergency Call
  • zip: Float variable, Zipcode
  • title: String variable, Title
  • timeStamp: String variable, YYYY-MM-DD HH:MM:SS
  • twp: String variable, Township
  • addr: String variable, Address
  • e: Integer variable, Dummy variable (always 1)

Data and Setup


Here we will import the necessary libraries

Import numpy and pandas

In [1]:
import numpy as np 
import pandas as pd

Import visualization libraries and set %matplotlib inline.

In [2]:
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline 

Read the .csv file as a DataFrame called df. NOTE: You may have to provide the full path of your csv file here, or save your notebook and the csv file in the same directory to use the commands below.
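If your csv lives somewhere else, here is a minimal sketch with a hypothetical path (adjust it to your machine):

# hypothetical location -- replace with wherever you saved 911.csv
df = pd.read_csv(r'C:\Users\you\Downloads\911.csv')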

In [3]:
df = pd.read_csv('911.csv')
df.head()  # prints the first 5 rows of your DataFrame
Out[3]:
lat lng desc zip title timeStamp twp addr e
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station … 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP… 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St… 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;… 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S… NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1

In every data science problem, it is essential to thoroughly understand each column.

If you are still unsure about any column name and what it represents, scroll to the top of the page, where you will find the details you need to get going.

Check the info() of df.
This provides a concise summary of your data frame; read more here

In [4]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99492 entries, 0 to 99491
Data columns (total 9 columns):
lat          99492 non-null float64
lng          99492 non-null float64
desc         99492 non-null object
zip          86637 non-null float64
title        99492 non-null object
timeStamp    99492 non-null object
twp          99449 non-null object
addr         98973 non-null object
e            99492 non-null int64
dtypes: float64(3), int64(1), object(5)
memory usage: 6.8+ MB
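The non-null counts above already reveal missing values in zip, twp, and addr. If you want those gaps as explicit per-column counts, a quick sketch:

# number of missing values in each column
df.isnull().sum()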

Check the head of df

Before going a step further, I would like you to go through the data set once more and get accustomed to it. Go through the columns, or the data set description on Kaggle, where you will get a clear understanding of what this data set is all about.

In [5]:
df.head()
Out[5]:
lat lng desc zip title timeStamp twp addr e
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station … 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP… 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St… 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;… 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S… NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1

I hope you all have a rough idea about the data set now, so let's move forward.

Before getting into this, you should understand why this part of your data science expedition is important. There are no shortcuts for data exploration. If you believe that machine learning can sail you away from every data storm, trust me, it won't. At some point you will find yourself struggling to improve your model's accuracy, and in such situations data exploration techniques will come to your rescue.

You should realize that the quality of your output depends heavily on the input you provide to your model.

While exploring data you should have the eye of a detective; a question-and-answer approach will certainly help. You should be sharp enough to ask questions of your data and adept enough to code the answers.

Here I will be asking some questions that intrigue me and will try my best to answer them.

What are the top 5 zip-codes for 911 calls?

Extracting the column from the DataFrame and chaining value_counts() with .head() gives us exactly the top 5 values. Easy enough!

In [6]:
df['zip'].value_counts().head()
Out[6]:
19401.0    6979
19464.0    6643
19403.0    4854
19446.0    4748
19406.0    3174
Name: zip, dtype: int64

What are the top 5 townships (twp) for 911 calls?

In [7]:
df['twp'].value_counts().head()
Out[7]:
LOWER MERION    8443
ABINGTON        5977
NORRISTOWN      5890
UPPER MERION    5227
CHELTENHAM      4575
Name: twp, dtype: int64

Here we'll get the number of unique titles from the title column of the data frame.

In [8]:
df['title'].nunique()
Out[8]:
110

Creating new features – feature engineering

We just got 110 unique values in the title column. I would like you to scroll through the title column in the data set once …

In the title column there are "Reasons/Departments" specified before the title code. These are EMS, Fire, and Traffic. To make this more readable we will create a new column called "Reason" that contains this string value.

For example, if the title column value is EMS: BACK PAINS/INJURY, the Reason column value would be EMS.

In [9]:
df['Reason'] = df['title'].apply(lambda x: x.split(':')[0])
df['Reason'].head()
Out[9]:
0     EMS
1     EMS
2    Fire
3     EMS
4     EMS
Name: Reason, dtype: object

Here I have used a lambda function. Python supports the creation of anonymous functions (i.e. functions that are not bound to a name) at run time, using a construct called "lambda".
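As a sketch, the lambda above is equivalent to defining a named helper function first:

def get_reason(title):
    # everything before the first ':' is the department, e.g. 'EMS'
    return title.split(':')[0]

df['Reason'] = df['title'].apply(get_reason)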

Here you can see the new column 'Reason' has been added to the DataFrame:

In [10]:
df.columns.values 
Out[10]:
array(['lat', 'lng', 'desc', 'zip', 'title', 'timeStamp', 'twp', 'addr',
       'e', 'Reason'], dtype=object)

What is the most common Reason for a 911 call based off of this new column?

In [11]:
df['Reason'].value_counts().head()
Out[11]:
EMS        48877
Traffic    35695
Fire       14920
Name: Reason, dtype: int64

Let's dive into visualization and plot the same. We'll use Seaborn to create a countplot of 911 calls by Reason, and also check which township has the most calls, plotting the top 10.

Read more on CountPlot

In [12]:
df.groupby('twp').count().sort_values('e', ascending=False)[0:10]
Out[12]:
lat lng desc zip title timeStamp addr e Reason
twp
LOWER MERION 8443 8443 8443 7202 8443 8443 8424 8443 8443
ABINGTON 5977 5977 5977 5675 5977 5977 5959 5977 5977
NORRISTOWN 5890 5890 5890 5610 5890 5890 5877 5890 5890
UPPER MERION 5227 5227 5227 3582 5227 5227 5203 5227 5227
CHELTENHAM 4575 4575 4575 3942 4575 4575 4549 4575 4575
POTTSTOWN 4146 4146 4146 4030 4146 4146 4123 4146 4146
UPPER MORELAND 3434 3434 3434 3123 3434 3434 3422 3434 3434
LOWER PROVIDENCE 3225 3225 3225 2970 3225 3225 3211 3225 3225
PLYMOUTH 3158 3158 3158 2578 3158 3158 3154 3158 3158
HORSHAM 3003 3003 3003 2764 3003 3003 2980 3003 3003
In [13]:
twp = df.groupby('twp').count().sort_values('e', ascending=False)[0:10]
plt.figure(figsize=(11,8))
sns.barplot(x='twp', y='e', data=twp.reset_index())
plt.xticks(rotation=45)
Out[13]:
(array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]), <a list of 10 Text xticklabel objects>)
In [14]:
sns.countplot(x='Reason', data=df, palette='viridis', alpha=0.9)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x1999ade8470>

That is a neat visual, clearly indicating that EMS accounts for the highest number of 911 calls.


Now let us begin to focus on time information. What is the data type of the objects in the timeStamp column?

In [15]:
type(df['timeStamp'].iloc[0]) #do read more on iloc and loc
Out[15]:
str

You should have seen that these timestamps are still strings. We want to convert this column to pandas datetime objects using pd.to_datetime.

In [16]:
df['timeStamp'] = pd.to_datetime(df['timeStamp'])
type(df['timeStamp'].iloc[0])
Out[16]:
pandas._libs.tslib.Timestamp

You can now grab specific attributes from a Datetime object by calling them. For example:

time = df['timeStamp'].iloc[0]
time.hour

You can use Jupyter's tab completion to explore the various attributes you can call. Now that the timeStamp column contains actual DateTime objects, we'll use .apply() to create 3 new columns called Hour, Month, and Day of Week, based off of the timeStamp column. This is more of a feature engineering approach.

In [17]:
df['Hour'] = df['timeStamp'].apply(lambda x: x.hour)
df['Month'] = df['timeStamp'].apply(lambda x: x.month)
df['Day of Week'] = df['timeStamp'].apply(lambda x: x.dayofweek)

You can now check the values of the added columns

Note: this method does not alter the 'timeStamp' column; rather, it derives data from it and places it in the new columns.
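As an aside, pandas also exposes these attributes directly on datetime columns through the .dt accessor, which is usually faster than .apply(); a minimal sketch producing the same three columns:

# vectorized alternative to the .apply() calls above
df['Hour'] = df['timeStamp'].dt.hour
df['Month'] = df['timeStamp'].dt.month
df['Day of Week'] = df['timeStamp'].dt.dayofweek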

In [18]:
df.columns.values
Out[18]:
array(['lat', 'lng', 'desc', 'zip', 'title', 'timeStamp', 'twp', 'addr',
       'e', 'Reason', 'Hour', 'Month', 'Day of Week'], dtype=object)

Notice how the Day of Week is an integer 0-6. We can use .map() with this dictionary to map the actual string names to the day of the week:

dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
In [19]:
dmap = {0:'Mon',1:'Tue',2:'Wed',3:'Thu',4:'Fri',5:'Sat',6:'Sun'}
df['Day of Week']= df['Day of Week'].map(dmap)
df.head()
Out[19]:
lat lng desc zip title timeStamp twp addr e Reason Hour Month Day of Week
0 40.297876 -75.581294 REINDEER CT & DEAD END; NEW HANOVER; Station … 19525.0 EMS: BACK PAINS/INJURY 2015-12-10 17:40:00 NEW HANOVER REINDEER CT & DEAD END 1 EMS 17 12 Thu
1 40.258061 -75.264680 BRIAR PATH & WHITEMARSH LN; HATFIELD TOWNSHIP… 19446.0 EMS: DIABETIC EMERGENCY 2015-12-10 17:40:00 HATFIELD TOWNSHIP BRIAR PATH & WHITEMARSH LN 1 EMS 17 12 Thu
2 40.121182 -75.351975 HAWS AVE; NORRISTOWN; 2015-12-10 @ 14:39:21-St… 19401.0 Fire: GAS-ODOR/LEAK 2015-12-10 17:40:00 NORRISTOWN HAWS AVE 1 Fire 17 12 Thu
3 40.116153 -75.343513 AIRY ST & SWEDE ST; NORRISTOWN; Station 308A;… 19401.0 EMS: CARDIAC EMERGENCY 2015-12-10 17:40:01 NORRISTOWN AIRY ST & SWEDE ST 1 EMS 17 12 Thu
4 40.251492 -75.603350 CHERRYWOOD CT & DEAD END; LOWER POTTSGROVE; S… NaN EMS: DIZZINESS 2015-12-10 17:40:01 LOWER POTTSGROVE CHERRYWOOD CT & DEAD END 1 EMS 17 12 Thu

Now we will again use seaborn to create countplots of the Hour, Day of Week, and Month columns, with the hue based off of the Reason column.

Let's do this by hour first.

In [20]:
sns.countplot(x='Hour', data=df, hue='Reason')
plt.legend(loc=(1.05,.8))
Out[20]:
<matplotlib.legend.Legend at 0x1999b476668>

We can easily conclude that in the early hours of the day, 12 am – 6 am, the calls are very few; they rise as the day passes and diminish again toward the end.
EMS calls lead at every hour of the day, but traffic calls take over for a couple of hours in the evening. Fire calls also peak when traffic peaks; is this a coincidence? Comment down below!

Try removing the hue argument yourself and see what functionality it provides.

Day of the week

In [21]:
sns.countplot(x='Day of Week', data=df, hue='Reason')
plt.legend(loc=(1.05,.8))
Out[21]:
<matplotlib.legend.Legend at 0x1999b476d68>

The fire calls stay almost the same and the EMS calls vary, but the most interesting thing is that traffic calls are quite low on weekends, which is exactly what we would expect.

Now doing the same for Month:

In [22]:
sns.countplot(x='Month', data=df, hue='Reason', palette='viridis')
plt.legend(loc=(1.05,.8))
Out[22]:
<matplotlib.legend.Legend at 0x1999b272438>

Did you notice something strange about the Plot?


You should have noticed it was missing some months. Let's see if we can fill in this information by plotting it another way, possibly a simple line plot that fills in the missing months. To do this, we'll need to do some work with pandas…

Now we will create a groupby object called byMonth, where we group the DataFrame by the Month column and use the count() method for aggregation, then call head() on the returned DataFrame.

In [23]:
byMonth = df.groupby('Month').count()
byMonth.head()
Out[23]:
lat lng desc zip title timeStamp twp addr e Reason Hour Day of Week
Month
1 13205 13205 13205 11527 13205 13205 13203 13096 13205 13205 13205 13205
2 11467 11467 11467 9930 11467 11467 11465 11396 11467 11467 11467 11467
3 11101 11101 11101 9755 11101 11101 11092 11059 11101 11101 11101 11101
4 11326 11326 11326 9895 11326 11326 11323 11283 11326 11326 11326 11326
5 11423 11423 11423 9946 11423 11423 11420 11378 11423 11423 11423 11423

Now we will create a simple plot off of the DataFrame indicating the count of calls per month, to compensate for the missing months' data.

In [24]:
byMonth['twp'].plot()
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x1999b37e1d0>

Now we will use Seaborn's lmplot() to create a linear fit on the number of calls per month. Keep in mind we will need to reset the index to a column.

Read more here

In [25]:
sns.lmplot(x='Month',y='twp',data=byMonth.reset_index())
Out[25]:
<seaborn.axisgrid.FacetGrid at 0x1999b3e1ef0>

The fitted line indicates that the number of calls goes down as the month number increases. Make sure you understand this plot: it gives you a linear trend estimate, not the actual values themselves. The scatter points show the actual monthly counts, and the shaded region around the line indicates the confidence interval of the fit.

Creating a new column called ‘Date’ that contains the date from the timeStamp column. We’ll use apply along with the .date() method.

In [26]:
df['Date'] = df['timeStamp'].apply(lambda t: t.date())
df['Date'].head()
Out[26]:
0    2015-12-10
1    2015-12-10
2    2015-12-10
3    2015-12-10
4    2015-12-10
Name: Date, dtype: object

Now we can groupby this Date column with the count() aggregate and create a plot of counts of 911 calls.

In [27]:
df.groupby('Date').count()['twp'].plot()
plt.tight_layout()

Now we will recreate this plot, but as 3 separate plots, each representing one Reason for the 911 calls.
We will also plot a distplot to analyze the distribution of each Reason.


1. Traffic

In [28]:
bytraffic = df[df['Reason'] == 'Traffic'].groupby('Date').count()['twp']
bytraffic.head()
sns.distplot(bytraffic)
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x1999e2db400>
In [29]:
df[df['Reason']=='Traffic'].groupby('Date').count()['twp'].plot()
plt.title('Traffic')
plt.tight_layout()

The first plot, the distplot, shows that the daily number of traffic calls mostly falls between 100 and 200.

The second plot shows the line graph of calls over time. At the end of the second month there is a sharp spike in the number of calls; why not google this, you'll surely find the details of the storm that took place.


2. EMS

In [30]:
byEMS = df[df['Reason'] == 'EMS'].groupby('Date').count()['twp']
sns.distplot(byEMS)
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x1999b3acd30>
In [31]:
df[df['Reason']== 'EMS'].groupby('Date').count()['twp'].plot()
plt.title('EMS')
plt.tight_layout()

The first plot, the distplot, shows that the daily number of EMS calls is about 200 on average.

The second plot, the line graph, shows a sharp drop three times at roughly regular intervals over the period.


3. Fire

In [32]:
byfire = df[df['Reason'] == 'Fire'].groupby('Date').count()['twp']
sns.distplot(byfire)
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x1999ba2b4e0>
In [33]:
df[df['Reason']=='Fire'].groupby('Date').count()['twp'].plot()
plt.title('Fire')
plt.tight_layout()

The first plot, the distplot, shows that daily fire calls are quite low compared to the other two, as we had seen before, at around 50.

The second plot shows the line graph of calls over time. As expected, this data has a steady average rate punctuated by sharp spikes: a spike could be a forest fire, a factory or building fire, or any other incident that affected a lot of people. Google the dates and comment the places down below.
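Since the three blocks above differ only in the Reason value, you could also generate all three line plots with a single loop; a minimal sketch:

for reason in ['Traffic', 'EMS', 'Fire']:
    df[df['Reason'] == reason].groupby('Date').count()['twp'].plot()
    plt.title(reason)
    plt.tight_layout()
    plt.show()  # render each figure before starting the next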


Now let’s move on to creating heatmaps with seaborn and our data. We’ll first need to restructure the dataframe so that the columns become the Hours and the Index becomes the Day of the Week. There are lots of ways to do this, but I would recommend trying to combine groupby with an unstack method.

In [34]:
dayHour = df.groupby(by=['Day of Week','Hour']).count()['Reason'].unstack()
dayHour.head()
Out[34]:
Hour 0 1 2 3 4 5 6 7 8 9 … 14 15 16 17 18 19 20 21 22 23
Day of Week
Fri 275 235 191 175 201 194 372 598 742 752 … 932 980 1039 980 820 696 667 559 514 474
Mon 282 221 201 194 204 267 397 653 819 786 … 869 913 989 997 885 746 613 497 472 325
Sat 375 301 263 260 224 231 257 391 459 640 … 789 796 848 757 778 696 628 572 506 467
Sun 383 306 286 268 242 240 300 402 483 620 … 684 691 663 714 670 655 537 461 415 330
Thu 278 202 233 159 182 203 362 570 777 828 … 876 969 935 1013 810 698 617 553 424 354

5 rows × 24 columns
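As noted above, there is more than one way to restructure the DataFrame. For example, pivot_table produces the same day-by-hour counts in a single call; a sketch that relies on 'e' being a dummy column of 1s:

# equivalent day-by-hour matrix built with pivot_table
dayHour_alt = df.pivot_table(index='Day of Week', columns='Hour',
                             values='e', aggfunc='count')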

Now creating a HeatMap using this new DataFrame.

Heatmaps have many more options than we can cover here, so I would encourage you to refer to the documentation for in-depth knowledge.
HeatMaps

In [35]:
plt.figure(figsize=(12,6))
sns.heatmap(dayHour, cmap='magma', lw=1)
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x1999bb7c9e8>

This again helps us analyze the number of calls by hour, but in a more useful manner, by attaching the respective day to each hour.

Sunday, overall, seems to be the day with the fewest calls, followed by Saturday.
Weekdays have the most calls, but let's go a step further: although Saturdays and Sundays show the fewest calls overall, both exceed weekdays during the hours 12:00 am – 3:00 am, the late-night calls.
Irrespective of the day, 4:00 pm – 6:00 pm shows an increase in the number of calls, which we can attribute to the rise in traffic calls.
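If you would rather compare the shape of each day's curve than the raw volumes, one optional variation (not part of the original analysis) is to normalize each row by its daily total before plotting:

# each cell becomes that hour's share of the day's total calls
dayHour_norm = dayHour.div(dayHour.sum(axis=1), axis=0)
plt.figure(figsize=(12,6))
sns.heatmap(dayHour_norm, cmap='magma', lw=1)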


Now creating a clustermap using this DataFrame.

In [36]:
sns.clustermap(dayHour, cmap='viridis', lw=1)
Out[36]:
<seaborn.matrix.ClusterGrid at 0x1999b3ec080>

Now we repeat the same plots and operations for a DataFrame that uses the Month as the column.

In [37]:
dayMonth = df.groupby(by=['Day of Week','Month']).count()['Reason'].unstack()
dayMonth.head()
Out[37]:
Month 1 2 3 4 5 6 7 8 12
Day of Week
Fri 1970 1581 1525 1958 1730 1649 2045 1310 1065
Mon 1727 1964 1535 1598 1779 1617 1692 1511 1257
Sat 2291 1441 1266 1734 1444 1388 1695 1099 978
Sun 1960 1229 1102 1488 1424 1333 1672 1021 907
Thu 1584 1596 1900 1601 1590 2065 1646 1230 1266
In [38]:
plt.figure(figsize=(12,6))
sns.heatmap(dayMonth, cmap='viridis', lw=1)
Out[38]:
<matplotlib.axes._subplots.AxesSubplot at 0x1999ba2b710>
In [39]:
sns.clustermap(dayMonth, cmap='viridis', lw=1)
Out[39]:
<seaborn.matrix.ClusterGrid at 0x1999bb73128>

The monthly counts go up and down, but the calls drop off sharply toward the year end, that is, from the 8th month to the 12th month (keep in mind that months 9–11 are missing from this data).
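To make that gap explicit rather than implicit, you could reindex the columns over all twelve months so the absent ones show up as NaN; a quick sketch:

# add NaN columns for the months missing from the data (9, 10, 11)
dayMonth.reindex(columns=range(1, 13))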


Great Job!

Congratulations on your first data exploration and visualization project!

Your job doesn't end here; keep exploring this data set for endless visualizations and more conclusive insights.

Feel free to comment your doubts below. Happy learning 🙂


Tanishk Sachdeva
