Exploratory Data Analysis – White Wines Data – Part 4 – Multivariate Analysis

Multivariate Analysis

Multivariate Data Analysis refers to any statistical technique used to analyze data that arises from more than one variable.

This is the fourth part of the Exploratory Data Analysis Series.

The series is as follows:
Part 1 – Introduction to EDA
Part 2 – Univariate Analysis
Part 3 – Bivariate Analysis
Part 4 – Multivariate Analysis (this article)

Go through the previous parts to understand the data set (highly recommended).
In this section, we will be performing multivariate analysis on the white wines data set.
We will use color and point size to bring additional variables into each plot.
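library(ggplot2)   # plotting
library(gridExtra) # grid.arrange(), used further below

# Alcohol vs. density, colored by quality bucket; y-axis limited to the 5th-95th percentile of density
ggplot(aes(x = alcohol, y = density), data = df)+
  ylim(c(quantile(df$density,0.05),quantile(df$density,0.95)))+
  geom_point(aes(color = quality.bucket))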

 

It is evident that the majority of Good white wines lie at the lower right of the plot, i.e. high alcohol and low density.
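The ylim() call in the plot code keeps only the middle 90% of the density values, and points outside those limits are dropped from the plot. Here is a minimal base-R sketch of what that trimming does. The vector `density` below is simulated for illustration and is not the wine data.

```r
# Simulated stand-in for df$density (assumption: not the real wine data)
set.seed(1)
density <- rnorm(1000, mean = 0.994, sd = 0.003)

# Same limits as in the ylim() call: 5th and 95th percentiles
limits <- quantile(density, c(0.05, 0.95))

# Points outside the limits are the ones ggplot2 would drop (with a warning)
kept <- density >= limits[1] & density <= limits[2]

cat("limits:", round(limits, 4), "\n")
cat("share of points kept:", mean(kept), "\n")
```

If the goal is only to zoom in without discarding rows, coord_cartesian(ylim = ...) is an alternative worth considering.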

ggplot(aes(x=density_mean, y = alcohol_mean), data = df_by_quality)+
  geom_line(color = "Orange")+
  geom_point(aes(size = quality, color= quality))

 

Both multivariate plots of density and alcohol make it clear that Good wines mostly have high alcohol and low density.
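The data frame df_by_quality comes from an earlier part of the series. As a rough sketch, a summary like it could be built by averaging each variable per quality score. The construction below is an assumption; only the column names density_mean and alcohol_mean are taken from the plot code, and the sample rows are made up.

```r
# Tiny made-up sample standing in for the wine data frame `df`
df_demo <- data.frame(
  quality = c(5, 5, 6, 6, 7, 7),
  alcohol = c(9.5, 10.1, 10.4, 10.8, 11.6, 12.2),
  density = c(0.997, 0.996, 0.995, 0.994, 0.992, 0.991)
)

# Mean alcohol and mean density for each quality score
by_quality <- aggregate(cbind(alcohol, density) ~ quality,
                        data = df_demo, FUN = mean)
names(by_quality) <- c("quality", "alcohol_mean", "density_mean")

print(by_quality)
```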

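df$alcohol_floor <- floor(df$alcohol)
# cut() assigns labels in the order of the intervals, lowest first:
# (7,10] -> 'Low', (10,12] -> 'Average', (12,15] -> 'High'
df$alcohol.bucket <- cut(df$alcohol_floor, breaks = c(7,10,12,15), labels = c('Low','Average','High'))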

ggplot(aes(x = residual.sugar, y = density), data = df)+
  ylim(c(quantile(df$density,0.05),quantile(df$density,0.95)))+
  xlim(c(quantile(df$residual.sugar,0.05),quantile(df$residual.sugar,0.95)))+
  geom_point(aes(size = alcohol.bucket, color = quality.bucket))

 

By plotting density against residual sugar, with point size mapped to alcohol.bucket and color to quality.bucket, we can observe that the lower left of the plot holds circles that are both large and blue. That is, high alcohol and high quality go with low density and low residual sugar.
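Since the size of the circles depends on the alcohol buckets, it helps to check how cut() assigns them. cut() hands out labels in the order of the intervals, so the first label goes to the lowest interval. A small sketch with a few hand-picked alcohol values (made up for illustration):

```r
# Hand-picked alcohol values (made up for illustration)
alcohol <- c(8.4, 9.9, 10.2, 11.5, 12.7, 13.1, 14.0)
alcohol_floor <- floor(alcohol)

# Intervals are (7,10], (10,12], (12,15]; labels are matched lowest first
bucket <- cut(alcohol_floor, breaks = c(7, 10, 12, 15),
              labels = c("Low", "Average", "High"))

print(data.frame(alcohol, alcohol_floor, bucket))
```

Because the values are floored first, "Low" ends up covering alcohol below 11, "Average" covers 11 up to just under 13, and "High" covers 13 and above.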

p_d_ts1 <- ggplot(aes(x = density , y = total.sulfur.dioxide), data = df)+
  geom_point(aes(size = alcohol.bucket, color = alcohol.bucket))+
  xlim(c(quantile(df$density,0.05),quantile(df$density,0.95)))
  

p_d_ts2 <- ggplot(aes(x = density , y = total.sulfur.dioxide), 
       data = subset(df,df$alcohol.bucket == 'High'| df$alcohol.bucket=='Low'))+
  geom_point(aes(size = alcohol.bucket, color = alcohol.bucket))+
  xlim(c(quantile(df$density,0.05),quantile(df$density,0.95)))

grid.arrange(p_d_ts1, p_d_ts2, ncol = 2)

# fun replaces the deprecated fun.y argument (ggplot2 3.3.0 and later)
ggplot(aes(x = bound.sulfur.dioxide, y = alcohol ), data = df)+
  geom_line(aes(color = quality.bucket), stat = 'summary', fun = mean)

 

In this plot of mean alcohol against bound.SO2, we can see that wines with low bound SO2 have higher mean alcohol and better quality, but once SO2 rises beyond a certain level, quality falls.

This may be because of the increased density, or because SO2 becomes evident in the smell and taste of the wine.
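The line plot uses stat = 'summary' with the mean, which reduces the data to one mean alcohol value per bound SO2 level and quality bucket. A minimal base-R sketch of that reduction on made-up rows (the column names bound_so2 and quality_bucket are placeholders for this example):

```r
# Made-up rows standing in for the wine data
demo <- data.frame(
  bound_so2      = c(60, 60, 90, 90, 60, 60, 90, 90),
  quality_bucket = rep(c("Good", "Average"), each = 4),
  alcohol        = c(12.0, 12.4, 11.0, 11.4, 10.8, 11.0, 9.8, 10.0)
)

# One mean alcohol value per (bound SO2, quality bucket) pair,
# i.e. the points that each colored line connects
line_points <- aggregate(alcohol ~ bound_so2 + quality_bucket,
                         data = demo, FUN = mean)

print(line_points)
```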

ggplot(aes(x = citric.acid, y = pH), data = df)+
  geom_point(aes(color = quality.bucket))

ggplot(aes(x = citric.acid, y = pH), data = subset(df, (df$quality.bucket=='Good') | (df$quality.bucket=='Poor')))+
  geom_point(aes(color = quality.bucket))

 

As mentioned earlier, citric acid adds freshness to the wine, so I thought of exploring this variable in case it has some effect on quality. In the first plot the relationship is not clear, because Average wines are so abundant that they hide the other two groups.

So in the next plot the Average quality wines are removed, and we can see that Good quality wines are mostly clustered in the middle. This indicates that Good wines tend to have a moderate amount of citric acid, neither too little nor too much.
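Removing the Average wines is a plain row filter. One compact way to write it is with %in%, sketched here on made-up rows. The bucket names match the ones used in the plots; everything else is illustrative.

```r
# Made-up rows standing in for the wine data frame `df`
demo <- data.frame(
  citric.acid    = c(0.10, 0.32, 0.35, 0.30, 0.74, 0.28),
  quality.bucket = factor(c("Poor", "Average", "Good", "Good", "Poor", "Average"),
                          levels = c("Poor", "Average", "Good"))
)

# Keep only the two extreme buckets, then drop the now-empty "Average" level
extremes <- subset(demo, quality.bucket %in% c("Good", "Poor"))
extremes$quality.bucket <- droplevels(extremes$quality.bucket)

print(table(extremes$quality.bucket))
```

Dropping the unused level keeps "Average" out of the legend when the subset is plotted.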

ggplot(aes(x = alcohol, y = chlorides), data = df)+
  ylim(c(quantile(df$chlorides,0.05),quantile(df$chlorides,0.95)))+
  geom_point(aes(color=quality.bucket),alpha = 1/2)+
  # fun replaces the deprecated fun.y argument (ggplot2 3.3.0 and later)
  geom_smooth(stat = "summary", fun = mean, linetype=2, color = 'red')

 

The relationship between chlorides and alcohol was already somewhat clear to us: as the salt in the wine increases, the alcohol tends to decrease. The plot now shows that Good quality wines have lower chloride content and higher alcohol, as the majority of Good wines lie at the lower right of the plot.

The red line shows the mean chloride value, which decreases as the amount of alcohol in the wine increases.
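A downward trend like this can also be summarised with a correlation coefficient. The sketch below uses a handful of made-up values, not the wine data, so the numbers only illustrate the calls.

```r
# Made-up values: chlorides fall as alcohol rises
alcohol   <- c(9.0, 9.5, 10.0, 11.0, 12.0, 13.0)
chlorides <- c(0.055, 0.050, 0.048, 0.040, 0.035, 0.030)

# Pearson measures linear association, Spearman measures monotonic association
r   <- cor(alcohol, chlorides)
rho <- cor(alcohol, chlorides, method = "spearman")

cat("Pearson r:", round(r, 3), "\n")
cat("Spearman rho:", round(rho, 3), "\n")
```

On the real data, the same calls would take df$alcohol and df$chlorides.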

Let’s now summarise our above analysis by asking ourselves some important questions!

Let’s talk about some of the relationships we observed in this part of the investigation. Were there features that strengthened each other in terms of looking at our feature(s) of interest?

In this analysis, we looked at how quality relates to variables beyond those with a direct relationship to it, such as alcohol, density, and pH.

The relationship between residual.sugar and quality, along with density and alcohol, was observed. Most of the Good quality wines have low sugar, high alcohol, and low density.

Citric acid had not been explored in the earlier parts and was explored in this section.

Other relationships observed were chlorides – quality and bound.SO2 – quality.

Were there any interesting or surprising interactions between features?

Citric acid came as a surprise: Good quality wines have a moderate amount of it. Too little or too much of it can degrade the quality.


Final Plots and Summary

Plot One

p1 <- ggplot(aes(x = quality.bucket, y= alcohol), data = df) +
  geom_jitter( alpha = 1/4)  +
  geom_boxplot( alpha = .5,color = '#3F9B5D')+
  stat_summary(fun = "mean",geom = "point",color = "red", shape = 8,size = 4)+
  ggtitle("Alcohol in White Wines")+
  labs(x = "Quality", y = "Alcohol (% by Volume)")+
  theme(plot.title = element_text( color="#9B1807", face="bold", hjust=0)) +
  theme(axis.title = element_text( color="#9B1807")) 

p2 <- ggplot(aes(x=quality, y = alcohol_mean), data = df_by_quality)+
  geom_line(color = '#3F9B5D')+
  scale_x_continuous(breaks = seq(3,10,1))+
  ggtitle("Alcohol Mean with Quality")+
  labs(x = "Quality", y = "Mean Alcohol (% by Volume)")+
  theme(plot.title = element_text(color="#9B1807",  face="bold", hjust=0)) +
  theme(axis.title = element_text(color="#9B1807")) 

  

grid.arrange(p1,p2,ncol=2)

Description One

The most definite and direct relationship is the one between Alcohol and Quality. Good quality wines have a higher alcohol content than lower quality wines.

There seems to be an odd drop in alcohol at quality = 5, but otherwise both plots show clearly that Good wines have a higher alcohol content.

Plot Two

ggplot(aes(x = residual.sugar, y = density), data = df)+
  ylim(c(quantile(df$density,0.05),quantile(df$density,0.95)))+
  xlim(c(quantile(df$residual.sugar,0.05),quantile(df$residual.sugar,0.95)))+
  geom_point(aes(size = alcohol.bucket, color = quality.bucket))+
  ggtitle("Density and Residual Sugar in White Wine")+
  labs(x = "Residual Sugar (g/dm^3)", y = "Density (g/cm^3)")+
  theme(plot.title = element_text(color="#9B1807",  face="bold", hjust=0)) +
  theme(axis.title = element_text(color="#9B1807")) 

Description Two

This Multivariate plot represents the relationship between four attributes:

Density (y-axis)

Residual Sugar (x-axis)

Alcohol (size of circle points)

and Quality (color of circle points)

The relationship between them is evident. The majority of Good quality wines have low density, low residual sugar, and high alcohol content.
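A visual claim like this can be backed up with a cross-tabulation of the two buckets. The sketch below uses made-up rows, so the shares it prints say nothing about the real wines; it only shows the calls.

```r
# Made-up rows standing in for the wine data frame `df`
demo <- data.frame(
  quality.bucket = c("Good", "Good", "Good", "Average",
                     "Average", "Poor", "Poor", "Average"),
  alcohol.bucket = c("High", "High", "Average", "Average",
                     "Low", "Low", "Low", "Average")
)

# Counts per combination, then row-wise shares (each quality bucket sums to 1)
tab    <- table(demo$quality.bucket, demo$alcohol.bucket)
shares <- prop.table(tab, margin = 1)

print(round(shares, 2))
```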

Plot Three

p_d_ts1 <- ggplot(aes(x = density , y = total.sulfur.dioxide), data = df)+
  geom_point(aes(size = alcohol.bucket, color = alcohol.bucket))+
  xlim(c(quantile(df$density,0.05),quantile(df$density,0.95)))+
  ggtitle("Density and Sulphur-dioxide in White Wine")+
  labs(x = "Density (g/cm^3)", y = "Total Sulphur-dioxide (mg / dm^3)")+
  theme(plot.title = element_text(color="#9B1807",  face="bold", hjust=0)) +
  theme(axis.title = element_text(color="#9B1807")) 

  

p_d_ts2 <- ggplot(aes(x = density , y = total.sulfur.dioxide), 
       data = subset(df,df$alcohol.bucket == 'High'| df$alcohol.bucket=='Low'))+
  geom_point(aes(size = alcohol.bucket, color = alcohol.bucket))+
  xlim(c(quantile(df$density,0.05),quantile(df$density,0.95)))+
  ggtitle("Density and Sulphur-dioxide in White Wine")+
  labs(x = "Density (g/cm^3)", y = "Total Sulphur-dioxide (mg / dm^3)")+
  theme(plot.title = element_text(color="#9B1807",  face="bold", hjust=0)) +
  theme(axis.title = element_text(color="#9B1807")) 

grid.arrange(p_d_ts1, p_d_ts2, ncol = 2)

Description Three

Sulphur dioxide in the wine seems to play an important role in determining the quality of wine, as it is related to both density and alcohol.

More SO2 in wine goes with:

  1. Higher density
  2. A more evident taste and smell of SO2
  3. Lower alcohol (bound.SO2 is negatively correlated with alcohol)

and hence lower quality of the wine.


Reflection

Exploring the White wines data set turned out to be pretty interesting.

With 12 attributes, it was intriguing to find which attributes affect the quality of the wines and which do not. We came to the conclusion that almost all of them contribute to the quality of wine, directly or indirectly, slightly or strongly.

Alcohol was the attribute with the most potential to determine the quality of the wine. Chlorides, fixed acidity, pH, bound SO2, etc. were related to alcohol in some way, which in turn contributed to quality.

Density and SO2 were the other promising attributes. Total.residual sugar, free SO2, citric acid, etc. were linked to them.

In the end, we came to the conclusion that the majority of Good wines have:

high alcohol content, low density, low residual sugar, low SO2 content of any kind, low chlorides, and are slightly more basic than other wines. They also have a moderate amount of citric acid.

We still didn’t find any substantial relationship of sulfates and volatile acidity with other variables.

The most surprising attribute was bound.SO2, which was created from total.SO2. Surprisingly, it correlates well with alcohol and density.
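The construction of bound.SO2 is covered in an earlier part of the series. As a rough sketch, and assuming it is simply total SO2 minus free SO2, a derived column like it could be computed as follows. The column name free.sulfur.dioxide and the values are assumptions made for this example.

```r
# Made-up SO2 values (mg / dm^3) standing in for the wine data
demo <- data.frame(
  free.sulfur.dioxide  = c(30, 45, 20, 50),
  total.sulfur.dioxide = c(120, 170, 95, 186)
)

# Assumption: bound SO2 is whatever part of the total is not free
demo$bound.sulfur.dioxide <- demo$total.sulfur.dioxide - demo$free.sulfur.dioxide

print(demo)
```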

Citric acid was both odd and surprising. It is used to add freshness to the wine, but too much or too little of it degrades the quality of the wine. It should be present in just the right amount.

As future work, logistic regression or another classifier model could be applied to this data set.
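To give an idea of what that future work might look like, here is a minimal logistic regression sketch with glm(). The data is simulated so that higher alcohol makes a "Good" label more likely; it is not fitted on the wine data, and the coefficients used to generate it are arbitrary.

```r
# Simulated data (assumption: not the wine data set)
set.seed(42)
n <- 200
alcohol <- runif(n, 8, 14)

# Simulated label: 1 = Good, with probability rising with alcohol
is_good <- rbinom(n, 1, plogis(-11 + 1.0 * alcohol))

# Logistic regression of the Good label on alcohol
fit <- glm(is_good ~ alcohol, family = binomial)
print(coef(fit))

# Predicted probability of "Good" at 9% and 13% alcohol
p <- predict(fit, newdata = data.frame(alcohol = c(9, 13)), type = "response")
print(round(p, 3))
```

On the real data, the response could be an indicator such as quality.bucket == 'Good', with more predictors added to the formula.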

I hope you found this article informative. Leave a comment if you find something wrong. Good luck performing EDA in your project! Share the article if you liked it!

Akshay Chaudhary

"If Information is the oil of the 21st century then analytics is the combustion engine"
A senior year engineering student on his learning adventure to become a Data Scientist.