Red Wine Dataset with R

Frankie Jay
Oct 21, 2021

This was an assignment for my Real World Analytics subject, which involved working with the red wine component of the Wine Quality Dataset. We had a simplified segment containing only five input variables and the quality score:

X1 - Citric Acid
X2 - Chlorides
X3 - Total Sulfur Dioxide
X4 - pH
X5 - Alcohol
Y - Quality Score

Our task was to use #RStudio to evaluate scatter plots and investigate the general relationships between the variables, choose four of them, and transform them towards a more normal distribution so we could build some models using a custom library for the subject.

We were to report on models using a weighted arithmetic mean (WAM), a weighted power mean (WPM), an ordered weighted averaging function (OWA) and a Choquet integral. (Obviously this is a live assignment, so I have to keep this post somewhat vague.)
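For readers who have not met these four aggregation functions, here is a minimal sketch of how each one combines a vector of inputs. It is written in plain Python for illustration only; it is not the subject's custom R library, and the function names, the example weights and the fuzzy measure are all made up. It assumes inputs scaled to [0, 1], weights that sum to 1, and an OWA that applies its weights to inputs sorted in ascending order (some texts sort descending).

```python
def wam(x, w):
    """Weighted arithmetic mean: sum of w_i * x_i (weights assumed to sum to 1)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def wpm(x, w, p):
    """Weighted power mean with exponent p (p != 0, inputs non-negative)."""
    return sum(wi * xi ** p for wi, xi in zip(w, x)) ** (1.0 / p)


def owa(x, w):
    """Ordered weighted average: weights attach to positions in the
    ascending-sorted inputs (an assumed convention), not to variables."""
    return sum(wi * xi for wi, xi in zip(w, sorted(x)))


def choquet(x, mu):
    """Discrete Choquet integral.

    mu maps a frozenset of input indices to that subset's fuzzy measure,
    with mu[all indices] == 1. Only the subsets reached by the sort
    order of x are looked up.
    """
    order = sorted(range(len(x)), key=lambda i: x[i])
    total, previous = 0.0, 0.0
    for k, i in enumerate(order):
        # Indices whose inputs are at least as large as x[i].
        coalition = frozenset(order[k:])
        total += (x[i] - previous) * mu[coalition]
        previous = x[i]
    return total


# Hypothetical two-input example (values are not from the assignment).
x = [0.2, 0.8]
mu = {frozenset({0, 1}): 1.0, frozenset({0}): 0.3, frozenset({1}): 0.6}
print(wam(x, [0.5, 0.5]), wpm(x, [0.5, 0.5], 2), owa(x, [0.3, 0.7]), choquet(x, mu))
```

When the fuzzy measure is additive, the Choquet integral reduces to the WAM, which is one way to see why it can only fit as well or better on the training data: it has extra freedom to model interactions between inputs.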

The original histograms are shown below. Note that we only had a segment of 500 instances.

We know that there is a correlation between pH and citric acid, and that an overly acidic wine would be unpleasant and so receive a lower quality score. It is also interesting to see from the correlation plot below that alcohol appears to be the variable most strongly correlated with quality.

From this we chose the four variables to investigate further, and then applied transformations to them to produce more normal distributions. In my case I used Best Normalize, Standardisation and Negation.
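As a rough illustration of two of these transformations, the sketch below shows standardisation (z-scores) and negation in plain Python. It is not the code used in the assignment. It assumes that negation is applied to values already scaled to [0, 1] and means 1 - x, so that a variable where higher is worse points the same way as the others; the helper names and the sample numbers are hypothetical. Best Normalize is not reproduced here.

```python
import statistics


def standardise(values):
    """Z-score each value: subtract the mean, divide by the sample standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def min_max(values):
    """Rescale values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def negate(unit_values):
    """Flip the direction of a variable that is already on [0, 1]."""
    return [1.0 - v for v in unit_values]


# Made-up readings, not rows from the wine dataset.
readings = [2.0, 4.0, 6.0]
print(standardise(readings))
print(negate(min_max(readings)))
```

Standardising changes the location and scale of a variable but not the shape of its distribution, so a skewed histogram still needs a shape-changing transformation first.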

We were then required to create our 4 models, with some given parameters.

The Choquet integral was the best performing model, with the lowest RMSE and the highest correlation scores, so it seems to be the best fitting function. However, the Pearson and Spearman correlations for all of the functions are unimpressive. The WAM also performs better than the weighted power means.
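The three scores used to compare the models can be sketched as follows. This is a plain Python illustration with hypothetical function names, not the output of the subject's library: RMSE measures the typical size of the prediction error, Pearson measures linear association, and Spearman is Pearson computed on ranks (with tied values sharing their average rank).

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def pearson(x, y):
    """Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks, with tied values given the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Because quality scores are integers with many ties, the Spearman value depends on how ties are ranked, which is worth keeping in mind when comparing numbers across tools.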

The sample was then fed into the Choquet integral model, using the weightings and fuzzy measures fitted from the provided data to make a prediction.

This score seems reasonable as it sits close to the mean quality values of the original dataset; however, the modest correlation values of the model give us some hesitation.

The next task was to use linear regression to predict a quality score, which gave a similar prediction to the Choquet integral. However, the correlation value was still modest, which raises some doubt.
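To make the regression step concrete, here is a minimal sketch of ordinary least squares with several predictors, solved through the normal equations in plain Python. It is an illustration under assumptions, not the assignment's R code: the function names are hypothetical, the toy data is invented, and it assumes the predictors are not perfectly collinear (otherwise the elimination divides by zero).

```python
def fit_linear_regression(rows, targets):
    """Least-squares fit; returns [intercept, coef_1, ..., coef_k].

    Solves (X'X) b = X'y by Gauss-Jordan elimination with partial pivoting.
    """
    design = [[1.0] + list(row) for row in rows]  # prepend the intercept column
    k = len(design[0])
    # Augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(design, targets))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(k):
            if r != col:
                factor = aug[r][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][k] for i in range(k)]


def predict(coefs, row):
    """Predicted target for one row of predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Invented data generated from y = 1 + 2*a + 3*b, not wine measurements.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
targets = [1, 3, 4, 6, 8]
coefs = fit_linear_regression(rows, targets)
print(coefs, predict(coefs, (3, 2)))
```

Unlike the Choquet integral, this model treats each predictor's contribution as independent and additive, so similar predictions from both suggest that interactions between the inputs are not doing much work for this sample.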

Finally, because I was initially a bit concerned about the low correlation values, I plotted the predicted values from the linear regression model in blue, the Choquet integral predictions in green and the actual Quality scores in red.

Although not perfect, they did seem to be clustered around the same range where the actual Quality scores occurred.
