Exploratory Data Analysis (EDA)
Exploratory Data Analysis (EDA) is an approach to analyzing data that relies mostly on graphical techniques. It’s where the researcher takes a bird’s-eye view of the data and tries to make sense of it. It’s often the first step in data analysis, carried out before any formal statistical technique or machine learning model is applied.
This is the first part of the Exploratory Data Analysis Series.
In this part, we will focus on the types of EDA and on understanding the data set.
The purpose of exploratory data analysis is to:
- Check for missing data and other mistakes.
- Gain maximum insight into the data set and its underlying structure.
- Determine the number of response variables (quality, in the example below).
- Check assumptions associated with any model fitting or hypothesis test.
- Observe outliers or other anomalies.
- Most importantly, identify the most influential variables.
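Two of the checks above, missing data and outliers, can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses R's built-in `airquality` data set rather than the wine data, and it flags outliers with the 1.5 × IQR box plot rule, which is one common convention among several.

```r
# Illustrative sketch on R's built-in airquality data (not the wine data).
data(airquality)

# Missing data: count the NA values in each column.
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Outliers: boxplot.stats() returns, in $out, the points that lie more than
# 1.5 times the interquartile range beyond the box (the box plot whisker rule).
ozone_outliers <- boxplot.stats(airquality$Ozone)$out
print(ozone_outliers)

# A quick numeric summary also helps expose impossible values and odd ranges.
print(summary(airquality))
```

The same three calls should work unchanged on any data frame, including the wine data once it has been loaded.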
Analysis can be classified into univariate, bivariate, and multivariate:
Univariate data is used for the simplest form of analysis, in which the analysis is based on only one variable. For example, suppose there are sixty students in class VII and the variable of interest is the marks obtained in math. The analysis would then be based on the number of students who fall into each defined category of marks.
Bivariate data is used for slightly more complex analysis than univariate data. Here the analysis is based on two variables per observation simultaneously.
Multivariate data is data in which the analysis is based on more than two variables per observation. Multivariate data is usually used for explanatory purposes.
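The three kinds of analysis can be sketched in base R. This is a minimal illustration that assumes R's built-in `mtcars` data set as a stand-in (the wine data is introduced later); the band boundaries and the choice of columns are arbitrary.

```r
# Illustrative sketch on R's built-in mtcars data (not the wine data).
data(mtcars)

# Univariate: one variable at a time, e.g. how many cars fall into each
# fuel-economy band (the same idea as counting students per marks category).
mpg_bands <- cut(mtcars$mpg, breaks = c(10, 15, 20, 25, 30, 35))
mpg_counts <- table(mpg_bands)
print(mpg_counts)

# Bivariate: two variables per observation, e.g. the correlation between
# fuel economy and weight.
mpg_wt_cor <- cor(mtcars$mpg, mtcars$wt)
print(mpg_wt_cor)

# Multivariate: more than two variables per observation, e.g. a
# correlation matrix over three variables.
cor_matrix <- cor(mtcars[, c("mpg", "wt", "hp")])
print(round(cor_matrix, 2))
```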
The most popular and basic plots used in EDA are:
- Scatter plots
- Heat maps
- Box plots
- Pie charts
- Bar charts
- Line charts
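As a rough sketch of how a few of these plots look in ggplot2, the snippet below draws a scatter plot, a box plot, and a bar chart. It assumes the ggplot2 package is installed and uses R's built-in `mtcars` data set purely as a placeholder; the variable choices are illustrative.

```r
library(ggplot2)

# Illustrative sketch on R's built-in mtcars data (not the wine data).
data(mtcars)

# Scatter plot: relationship between two numeric variables.
p_scatter <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Box plot: distribution of a numeric variable within each group.
p_box <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# Bar chart: number of observations in each category.
p_bar <- ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()

# Printing a ggplot object draws it on the active graphics device.
print(p_scatter)
print(p_box)
print(p_bar)
```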
Let’s learn by performing Exploratory Data Analysis (EDA) on a famous wine data set! We will be using R in RStudio and the white wine data set.
About the Data Set:
This data set is related to the white variants of the Portuguese “Vinho Verde” wine. There are 11 input attributes of white wine and one output attribute, quality. There are 4898 observations with a total of 12 variables.
Description of attributes:
1 – fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
2 – volatile acidity: the amount of acetic acid in wine, which at too high a level can lead to an unpleasant, vinegar taste
3 – citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 – residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet
5 – chlorides: the amount of salt in the wine
6 – free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 – total sulfur dioxide: the amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 – density: the density of wine is close to that of water, depending on the percent alcohol and sugar content
9 – pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 – sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 – alcohol: the percent alcohol content of the wine
12 – quality (score between 0 and 10)
White Wine Quality Exploration
    library(ggplot2)
    library(dplyr)
    library(gridExtra)
    # install.packages("corrplot")
    library(corrplot)
    # install.packages('GGally')
    library(GGally)
About these libraries:
1) ggplot2: a plotting package for R, based on the grammar of graphics.
2) dplyr: helps with the manipulation of our data frames.
3) gridExtra: provides a number of user-level functions to work with “grid” graphics, notably to arrange multiple grid-based plots on a page.
4) corrplot: visualization of a correlation matrix.
5) GGally: visualization of a scatterplot matrix.
    # Load the data
    df <- read.csv("wineQualityWhites.csv", row.names = 1)
    dfcopy <- df
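Once the file is loaded, a first look at its structure might go along the following lines. This is a hedged sketch, not part of the original analysis: `first_look` is a hypothetical helper name introduced here for illustration, and the block assumes `wineQualityWhites.csv` sits in the working directory (it is skipped if the file is absent).

```r
# Hypothetical helper (not from the original post): collect basic facts
# about a data frame in one list.
first_look <- function(data) {
  list(
    n_obs     = nrow(data),                       # number of observations
    n_vars    = ncol(data),                       # number of variables
    types     = vapply(data, function(col) class(col)[1], character(1)),
    n_missing = colSums(is.na(data))              # NA count per column
  )
}

# Assumed file location: the working directory.
path <- "wineQualityWhites.csv"

if (file.exists(path)) {
  df <- read.csv(path, row.names = 1)
  str(df)               # column names, types, and the first few values
  print(summary(df))    # min, quartiles, mean, and max for each column
  print(first_look(df))
}
```

If the file matches the description above, the helper should report 4898 observations and 12 variables, since `row.names = 1` turns the first column into row names instead of a variable.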
Comment below if you have any questions. Next up is univariate analysis!
About the author: Akshay Chaudhary is a senior-year engineering student on his learning adventure to become a Data Scientist. This post, Part 1 (Introduction) of the series, is dated August 30, 2018.