Exploratory Data Analysis – White Wines Data – Part 3 – Bivariate Analysis

Bivariate Analysis

Bivariate analysis is the analysis of two variables at a time. It is one of the simplest forms of statistical analysis and is used to find out whether a relationship exists between two sets of values, usually denoted X and Y.

This is the third part of the Exploratory Data Analysis Series.

The series is as follows:
Part 3 – Bivariate Analysis (this article)
Go through the previous parts to understand the data set (highly recommended).
In this section, we will be performing bivariate analysis on the white wines data set.

Techniques for Bivariate Analysis

  1. Scatter Plots
  2. Line Plots
  3. Regression
  4. Correlation Coefficients

**Suggested Reading**: http://www.saedsayad.com/bivariate_analysis.htm

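library(corrplot)

# Drop the categorical quality bucket; cor() needs numeric columns only
drops <- c("quality.bucket")
dfm <- df[ , !(names(df) %in% drops)]
dfm <- as.matrix(dfm)
# Compute the correlation matrix and plot the coefficients as numbers
dfm <- cor(dfm)
corrplot(dfm, method="number")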

The correlation plot helps to identify meaningful relationships between the variables.

Any two variables with a correlation coefficient whose absolute value is greater than 0.4 can be meaningful in our analysis.
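To make the 0.4 cut-off concrete, here is a minimal sketch that lists the variable pairs whose correlation passes the threshold. The helper name `strong_pairs` and the synthetic `toy` data frame are assumptions made for illustration only; with the real data you would pass in the numeric columns of `df` instead.

```r
# Hypothetical helper: list variable pairs whose |correlation| exceeds a threshold.
strong_pairs <- function(data, threshold = 0.4) {
  cm <- cor(data)  # Pearson correlation matrix
  # Look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  out[order(-abs(out$r)), ]  # strongest pairs first
}

# Synthetic stand-in data (NOT the wine data set)
set.seed(42)
n <- 200
sugar <- runif(n, 1, 20)
toy <- data.frame(
  residual.sugar = sugar,
  density        = 0.99 + 0.0004 * sugar + rnorm(n, sd = 0.0005),
  pH             = rnorm(n, 3.2, 0.15)
)
res <- strong_pairs(toy)
print(res)
```

Using the absolute value matters here because strong negative correlations, such as the one between density and alcohol, are just as informative as positive ones.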

As I expected, the newly created variable “bound.sulfur.dioxide” has a good correlation with density and alcohol. Other notable correlations are between:

density – residual.sugar

alcohol – quality

pH – fixed.acidity

total.SO2 – density

density – alcohol

chlorides – alcohol

total.SO2 – residual.sugar

Sulphates, citric acid, and volatile acidity do not yet show any notable correlation with any of the other variables.

Reference: http://www.sthda.com/english/wiki/visualize-correlation-matrix-using-correlogram


This is a matrix of plots between every pair of variables in the data set.

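library(ggplot2)

# Boxplot of alcohol per quality bucket, with jittered points and the mean overlaid
# (`fun` replaces the deprecated `fun.y` in ggplot2 >= 3.3.0)
ggplot(aes(x = quality.bucket, y = alcohol), data = df) +
  geom_jitter(alpha = 1/4) +
  geom_boxplot(alpha = .5, color = '#3F9B5D') +
  stat_summary(fun = "mean", geom = "point", color = "red", shape = 8, size = 4)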

This boxplot of alcohol across the quality categories clearly indicates that Good quality wines have a higher alcohol content than Poor and Average quality wines.

The red asterisk marks the mean.

Another thing to observe is that the mean alcohol of Poor quality wines is slightly higher than that of Average quality wines.

Moreover, alcohol and quality are positively correlated.

# Refer to the column by name inside aes(); df$density bypasses the data argument
ggplot(aes(x = quality.bucket, y = density), data = df)+
  geom_boxplot(color = '#3F9B5D')+
  geom_point(stat = 'summary', fun = mean, color = 'red', pch = 4)

Good quality wines have a lower density than both Poor and Average quality wines, with some outliers near 1.

The red x marks the mean.

Poor and Average quality wines have a similar median density, but Average quality wines have an outlier at the far end, which might be the wine with residual sugar around 60.

library(dplyr)

# Group by quality score and summarise alcohol and density per group
df_byquality <- group_by(df,quality)
df_by_quality <- summarise(df_byquality,alcohol_mean = mean(alcohol),
                           alcohol_median = median(alcohol),
                           density_mean = mean(density), 
                           density_median = median(density),
                           count = n())

# Line plot of mean alcohol against quality score
p2 <- ggplot(aes(x=quality, y = alcohol_mean), data = df_by_quality)+
  geom_line(color = '#3F9B5D')+
  scale_x_continuous(breaks = seq(3,10,1))
p2

To understand more about the relationship between alcohol and quality, I grouped the data set by quality and plotted the mean alcohol on the y-axis.

It’s surprising to see a dip in alcohol content at quality level 5.

This plot reinforces the relationship between alcohol and quality.

ggplot(aes(x = density, y = alcohol), data = df)+
  geom_point(alpha = 1/3 , color = I('#3F9B5D'))+
  geom_smooth(method = 'lm' , color= '#AB4234')

With a strong correlation of -0.78, the relationship between alcohol and density is one of the strongest in the data set. Outliers of density are removed from the plot to show the relationship more clearly.
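The code that removed the density outliers is not shown, so the following is only a sketch of one possible approach: keep the rows that fall inside a quantile range, in the same spirit as the `quantile()` limits used for fixed acidity later in this article. The helper name `trim_by_quantile`, the 99% cut-off, and the synthetic `toy` data are all assumptions.

```r
# Hypothetical helper: keep rows whose value in `column` lies within the
# given lower and upper quantiles (both cut-offs are assumptions).
trim_by_quantile <- function(data, column, lower = 0, upper = 0.99) {
  limits <- quantile(data[[column]], probs = c(lower, upper))
  keep <- data[[column]] >= limits[1] & data[[column]] <= limits[2]
  data[keep, , drop = FALSE]
}

# Synthetic stand-in data (NOT the wine data set) with one extreme density value
set.seed(1)
toy <- data.frame(density = c(rnorm(199, mean = 0.994, sd = 0.003), 1.039))
toy$alcohol <- 10.5 - 300 * (toy$density - 0.994) + rnorm(200, sd = 0.5)

trimmed <- trim_by_quantile(toy, "density")
r_trimmed <- cor(trimmed$density, trimmed$alcohol)
print(nrow(trimmed))
print(r_trimmed)  # negative for this synthetic data
```

The trimmed data frame could then be passed as `data` to the scatter plot. Note that filtering the data also changes the fitted line, whereas zooming with `coord_cartesian()` would only change the view.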

ggplot(aes(x = residual.sugar, y = density), data = df)+
  geom_point(color = '#3F9B5D')+
  geom_smooth(method = 'lm' , color= 'red')

This relationship is not surprising: it is basic science that dissolving sugar in a liquid increases its density.

Hence, residual sugar is strongly and positively correlated with density (cor = 0.84). Density in turn is negatively correlated with alcohol, which can further affect quality.
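The red line drawn by `geom_smooth(method = 'lm')` is a simple linear regression, and for a regression with a single predictor the R-squared value equals the square of the Pearson correlation coefficient. The sketch below checks this on synthetic data; the vectors `sugar` and `density` are made-up stand-ins, not the wine measurements.

```r
# Synthetic stand-in data (NOT the wine data set)
set.seed(3)
sugar   <- runif(150, 1, 20)
density <- 0.99 + 0.0004 * sugar + rnorm(150, sd = 0.001)

# Fit the same kind of model that geom_smooth(method = 'lm') draws
fit <- lm(density ~ sugar)
slope     <- unname(coef(fit)["sugar"])
r         <- cor(sugar, density)
r_squared <- summary(fit)$r.squared

print(slope > 0)                   # positive slope: density rises with sugar
print(all.equal(r^2, r_squared))   # R-squared equals r squared
```

By the same identity, the reported correlation of 0.84 would correspond to residual sugar accounting for roughly 0.84^2, or about 71%, of the variance in density in a simple linear fit.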

# Scatter plot with the mean alcohol per sugar value (blue) and a linear fit (red)
ggplot(aes(x = residual.sugar, y = alcohol), data = df)+
  geom_point(alpha = 1/2, color = '#3F9B5D')+
  geom_line(stat = "summary", fun = mean, color='blue')+
  geom_smooth(method = 'lm' , color= 'red')

This plot represents the relationship between alcohol and residual.sugar. The blue line shows the mean alcohol, which declines as residual sugar increases.

ggplot(aes(x = quality.bucket, y = residual.sugar), data = df)+
  geom_boxplot(color = '#3F9B5D')+
  geom_point(stat = 'summary', fun = mean, pch = 9, color = 'red')

This is not what we expected. There does not seem to be a strong relationship between residual.sugar and quality.

ggplot(aes(x = alcohol , y = bound.sulfur.dioxide), data = df)+
  geom_point(color = '#3F9B5D')+
  geom_smooth(method = 'lm' , color='red')

Bound.SO2 and alcohol seem to have a slight but noticeable negative relationship.

ggplot(aes(x = density , y = total.sulfur.dioxide), data = df)+
  geom_jitter(color = '#3F9B5D')+
  geom_smooth(method = 'lm' , color='red')

There is a strong positive correlation between total.SO2 and density, which is not good for the quality of the wine because good wines tend to have lower density.

Also, at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of the wine, and free SO2 is a component of total.SO2.

So, we can conclude that a large amount of SO2 has three degrading effects on quality:

  1. Increases density
  2. Evident taste and smell
  3. Bound.SO2 has a negative correlation with alcohol

ggplot(aes(x = pH, y = fixed.acidity), data = df)+
  geom_point(alpha = 1/2, color = '#3F9B5D')+
  ylim(c(quantile(df$fixed.acidity,0.02), quantile(df$fixed.acidity,0.99)))+
  geom_smooth(method = 'lm' , color='red')

As pH is a scale for measuring acidity, it is unsurprising that fixed acidity is negatively correlated with pH.

The pH of the wine may depend on fixed acidity, volatile acidity, and citric acid, but it does not correlate with volatile acidity and citric acid as strongly as it does with fixed acidity.
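One way to check a comparison like this is to compute the correlation of pH with each acidity variable side by side. The sketch below does so on synthetic data in which only fixed acidity drives pH; the column names `volatile.acidity` and `citric.acid` are assumed to match the data set, and the numbers it prints say nothing about the real wines.

```r
# Synthetic stand-in data (NOT the wine data set): only fixed acidity drives pH
set.seed(7)
n <- 300
toy <- data.frame(
  fixed.acidity    = rnorm(n, 6.9, 0.8),
  volatile.acidity = rnorm(n, 0.28, 0.1),
  citric.acid      = rnorm(n, 0.33, 0.12)
)
toy$pH <- 3.9 - 0.1 * toy$fixed.acidity + rnorm(n, sd = 0.1)

# Correlation of pH with each acidity variable, returned as a named vector
acids <- c("fixed.acidity", "volatile.acidity", "citric.acid")
r_with_pH <- sapply(acids, function(v) cor(toy[[v]], toy$pH))
print(round(r_with_pH, 2))
```

Running the same `sapply()` call on `df` would show how much weaker the volatile acidity and citric acid correlations are in the actual data.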

ggplot(aes(x = quality.bucket, y = pH), data = df)+
  geom_boxplot(color = '#3F9B5D')+
  geom_point(stat = 'summary', fun = mean, pch = 9, color = 'red')

It's evident from the boxplot that Good wines are a little more basic (less acidic) than Poor and Average wines.

ggplot(aes(x = chlorides, y = alcohol), data = df)+
  geom_point(alpha = 1/2, color = '#3F9B5D')+
  geom_smooth(method = 'lm' , color='red')

ggplot(aes(x = quality.bucket,y = chlorides), data = df)+
  geom_boxplot(color = '#3F9B5D')

We can conclude that the amounts of salt (chloride) and alcohol in wine are negatively correlated, and that the amount of chloride slightly affects the quality of the wine.

Good wines have slightly less chloride than Average and Poor wines.

Let’s now summarise the analysis above by asking ourselves some important questions!

Let’s talk about some of the relationships we observed in this part of the analysis. How did the feature(s) of interest vary with other features in the dataset?

We observed various relationships between alcohol, quality, density, residual sugar, SO2, pH, and chlorides. Quality is correlated mainly with alcohol and density, and the other variables are correlated with either density or alcohol, which indirectly affects the quality of the wine.

Good wines have more alcohol and a lower density. They have less chloride and SO2, and are a little more basic than Average and Poor quality wines.

Did we observe any interesting relationships between the other features (not the main feature(s) of interest)?


Residual sugar and density have a great positive correlation.

Bound.SO2 and alcohol have a negative correlation.

As fixed acidity increases, pH value decreases.

total.SO2 has a positive correlation with density.

Chlorides, residual sugar, and density all have a negative correlation with alcohol.

What was the strongest relationship we found?

The strongest relationships were found between:

  1. Alcohol and Quality
  2. Density and residual sugar
  3. Chlorides and alcohol
  4. total.SO2 and density
  5. bound.SO2 and alcohol

Hope you guys liked playing with the data set and performing bivariate analysis. Next up is Multivariate Analysis!

Akshay Chaudhary

“If Information is the oil of the 21st century then analytics is the combustion engine”
A senior year engineering student on his learning adventure to become a Data Scientist.