Correlation measures the extent to which two variables change in a coordinated fashion.

This does not imply that they are functionally linked or causally related.

Example Data

library( tidyverse )

tibble( Year = 1999:2009, 
            `Nicolas Cage Movies` = c( 2, 2, 2, 3, 
                                       1, 1, 2, 3, 
                                       4, 1, 4),
            `Drowning Deaths in Pools` = c( 109, 102, 102, 
                                            98, 85, 95, 
                                            96, 98, 123, 
                                            94, 102 )) -> df
head(df)
# A tibble: 6 × 3
   Year `Nicolas Cage Movies` `Drowning Deaths in Pools`
  <int>                 <dbl>                      <dbl>
1  1999                     2                        109
2  2000                     2                        102
3  2001                     2                        102
4  2002                     3                         98
5  2003                     1                         85
6  2004                     1                         95

 

 

The Correlation Test

cor.test( df$`Nicolas Cage Movies`, df$`Drowning Deaths in Pools` )

    Pearson's product-moment correlation

data:  df$`Nicolas Cage Movies` and df$`Drowning Deaths in Pools`
t = 2.6785, df = 9, p-value = 0.02527
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.1101273 0.9045101
sample estimates:
      cor 
0.6660043 

Beer Judge Certification Program

Recognized Beer Styles

  • 100 Distinct Styles (not just IPAs & that yellow American Corn Lager!)

  • Global & Regional Styles

  • Quantitative Characteristics

    • IBU, SRM, ABV, OG, FG

  • Qualitative Characteristics

    • Overall Impression, Aroma, Appearance, Flavor, & Mouthfeel

 

The BJCP Style Guidelines exist for beer, mead, and cider.

Basic Yeast Types

Beer yeasts fall into two basic types: ale yeasts (top fermenting) and lager yeasts (bottom fermenting).

Not including sour beers, which use mixtures of yeast and bacteria.

Dissolved Sugars - OG & FG

The more sugar in the wort, the more food for the yeast to work on, and the more alcohol that may be produced.

The difference between the gravities before and after fermentation can be used to estimate ABV.
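
A minimal sketch of that estimate, assuming the common homebrewing approximation \(ABV \approx (OG - FG) \times 131.25\):

estimate_abv <- function( og, fg ) {
  # Approximate ABV from original and final specific gravities
  ( og - fg ) * 131.25
}
estimate_abv( og = 1.050, fg = 1.010 )   # ~5.25% ABV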

 

Bitterness

Bitterness comes primarily from the addition of hops and is quantified on the International Bitterness Units (IBU) scale.

Color

The color of the beer is quantified using the Standard Reference Method (SRM) scale.

The Data

Here are the raw characteristic data for the different styles.

    Styles             Yeast       ABV_Min         ABV_Max      
 Length:100         Ale   :69   Min.   :2.400   Min.   : 3.200  
 Class :character   Either: 4   1st Qu.:4.200   1st Qu.: 5.475  
 Mode  :character   Lager :27   Median :4.600   Median : 6.000  
                                Mean   :4.947   Mean   : 6.768  
                                3rd Qu.:5.500   3rd Qu.: 8.000  
                                Max.   :9.000   Max.   :14.000  
    IBU_Min         IBU_Max          SRM_Min         SRM_Max     
 Min.   : 0.00   Min.   :  8.00   Min.   : 2.00   Min.   : 3.00  
 1st Qu.:15.00   1st Qu.: 25.00   1st Qu.: 3.50   1st Qu.: 7.00  
 Median :20.00   Median : 35.00   Median : 8.00   Median :17.00  
 Mean   :21.97   Mean   : 38.98   Mean   : 9.82   Mean   :17.76  
 3rd Qu.:25.00   3rd Qu.: 45.00   3rd Qu.:14.00   3rd Qu.:22.00  
 Max.   :60.00   Max.   :120.00   Max.   :30.00   Max.   :40.00  
     OG_Min          OG_Max          FG_Min          FG_Max     
 Min.   :1.026   Min.   :1.032   Min.   :0.998   Min.   :1.006  
 1st Qu.:1.040   1st Qu.:1.052   1st Qu.:1.008   1st Qu.:1.012  
 Median :1.046   Median :1.060   Median :1.010   Median :1.015  
 Mean   :1.049   Mean   :1.065   Mean   :1.009   Mean   :1.016  
 3rd Qu.:1.056   3rd Qu.:1.075   3rd Qu.:1.010   3rd Qu.:1.018  
 Max.   :1.080   Max.   :1.130   Max.   :1.020   Max.   :1.040  

Field Trip!

Parameter

The true value derived from the entire population of entities.

  • Mean of the real population, \(\mu\).

  • Variance of the real population, \(\sigma^2\)

Estimators

The values we obtain by sampling from the much larger population in order to make inferences.

  • Sample mean, \(\bar{x}\)

  • Sample variance, \(s^2\)
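
A minimal sketch contrasting the two, using a hypothetical simulated population with known \(\mu = 10\) and \(\sigma^2 = 4\):

population <- rnorm( 1e6, mean = 10, sd = 2 )   # "population" with mu = 10, sigma = 2
x <- sample( population, size = 50 )            # one sample of n = 50
mean( x )   # sample mean, the estimator of mu
var( x )    # sample variance, the estimator of sigma^2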

Parametric Assumptions

Much of the way we determine the significance of a model is based upon assumptions of the underlying data.

  • Normality

  • Independence

  • Homoscedasticity

Normality

In general, data we work with are assumed to follow a Normal distribution with parameters \(\mu\) and \(\sigma\), often denoted as \(N(\mu,\sigma)\), whose density is:

 

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} \]
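
A minimal sketch plotting this density for the standard normal, \(N(0, 1)\), using base R:

curve( dnorm( x, mean = 0, sd = 1 ), from = -4, to = 4,
       ylab = "f(x)", main = "Density of N(0, 1)" )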

Visualizing Normality

We can visualize the ‘normality’ of the data using built-in functions.

qqnorm( beer$ABV_Min )
qqline( beer$ABV_Min, col="red")

Testing for Normality

The Shapiro-Wilk test for normality has the null hypothesis \(H_O: Data\;are\;normal\), and is based on the W statistic.

 

\(W = \frac{\left(\sum_{i=1}^Na_ix_{(i)}\right)^2}{\sum_{i=1}^N(x_i - \bar{x})^2}\)

 

where \(N\) is the number of samples, \(a_i\) is a standardizing coefficient, \(x_{(i)}\) is the \(i^{th}\) ordered (smallest to largest) value of \(x\), and \(\bar{x}\) is the mean of the observed values.

Shapiro-Wilk Test

The default test for this in R is performed by the shapiro.test() function. Here, we will look at the minimum ABV value from the beer dataset.

shapiro.test( beer$ABV_Min )

    Shapiro-Wilk normality test

data:  beer$ABV_Min
W = 0.94595, p-value = 0.0004532

\(H_O: Data\;are\;normal\)

Is Minimum ABV Normal?

Data Transformations

ArcSin Square Root

Fractions and percentages are known to behave poorly, particularly around the edges (e.g., close to 0 or 1). It is not uncommon to use a simple ArcSin square root transformation to help fractional data better approximate normality.

abv <- beer$ABV_Min / 100.0
asin( sqrt( abv ) ) -> abv.1
shapiro.test( abv.1)

    Shapiro-Wilk normality test

data:  abv.1
W = 0.96746, p-value = 0.01418

Conclusion? Are these data normal?

Data Transformations - Box Cox

There is a family of transformations that can be used to see if we can nudge data sets towards normality for parametric analyses.

\[ \tilde{x} = \frac{x^\lambda - 1}{\lambda} \]

for \(\lambda \ne 0\); as \(\lambda \to 0\), the transformation approaches \(ln(x)\).

Data Transformations - Box Cox

test_boxcox <- function( x, lambdas = seq(-1.1, 1.1, by = 0.015) ) {
  # Grid of candidate lambda values, with Shapiro-Wilk results to fill in
  ret <- data.frame( Lambda = lambdas,
                     W = NA,
                     P = NA)
  
  for( lambda in lambdas ) {
    x.tilde <- (x^lambda - 1) / lambda   # Box-Cox transform at this lambda
    w <- shapiro.test( x.tilde )         # test the transformed data
    ret$W[ ret$Lambda == lambda ] <- w$statistic
    ret$P[ ret$Lambda == lambda ] <- w$p.value
  }
  
  return( ret )
}

Data Transformations - Box Cox

vals <- test_boxcox( beer$ABV_Min ) 
vals |> ggplot( aes(Lambda, P) ) + geom_line()
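
From the resulting grid, a minimal sketch pulling out the \(\lambda\) that maximizes the Shapiro-Wilk p-value:

vals[ which.max( vals$P ), ]   # the lambda whose transformed data look most normal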

Equality of Variance

It is assumed that the variance of the data is constant across groups and across the range of the data (homoscedasticity).
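
A minimal sketch, assuming we compare the variance of minimum ABV across yeast types in the beer data using Bartlett's test from base R:

# H_O: the variances of ABV_Min are equal across Yeast groups
bartlett.test( ABV_Min ~ Yeast, data = beer )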

Independence of Data

The way you collect samples and design your experiments is the most important factor in ensuring that your data are individually independent. Think about this very carefully as you design your experiments.

Parametric Correlation

The Pearson Product Moment Correlation

 

\(\rho = \frac{\sum_{i=1}^N(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^N(x_i - \bar{x})^2}\sqrt{\sum_{i=1}^N(y_i - \bar{y})^2}}\)

 

whose values are confined to be within the range \(-1.0 \le \rho \le +1.0\)

 

The significance of \(\rho\) is tested using a variant of the t-test:

\(t = r\sqrt{\frac{N-2}{1-r^2}}\)

which is compared against a t distribution with \(N - 2\) degrees of freedom.
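
As a quick check, a minimal sketch reproducing the t statistic from the Nicolas Cage example above:

r <- 0.6660043   # estimate from cor.test() above
N <- 11          # years 1999 through 2009
r * sqrt( (N - 2) / (1 - r^2) )   # ~2.6785, with df = N - 2 = 9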

Visual Examples

Figure 1: Data and associated correlation statistics.

The Correlation Test

As we’ve used several times so far, the cor.test() function performs simple correlation analysis, defaulting to the Pearson Product Moment analysis.

cor.test( beer$OG_Max, beer$FG_Max ) -> OG.FG.pearson
OG.FG.pearson

    Pearson's product-moment correlation

data:  beer$OG_Max and beer$FG_Max
t = 15.168, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7671910 0.8878064
sample estimates:
      cor 
0.8374184 

The Correlation Test

Again, the object that is returned is a list and has these components.

names( OG.FG.pearson )
[1] "statistic"   "parameter"   "p.value"     "estimate"    "null.value" 
[6] "alternative" "method"      "data.name"   "conf.int"   

Non-Parametric Correlation - Spearman’s Rho

To alleviate some of the underlying parametric assumptions, we can use ranks of the data instead of the raw data directly.

\(\rho_{Spearman} = \frac{ \sum_{i=1}^N(R_{x_i} - \bar{R_{x}})(R_{y_i} - \bar{R_{y}})}{\sqrt{\sum_{i=1}^N(R_{x_i} - \bar{R_{x}})^2}\sqrt{\sum_{i=1}^N(R_{y_i} - \bar{R_{y}})^2}}\)
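
Equivalently, Spearman's rho is the Pearson correlation computed on the ranks of the data; a minimal sketch:

cor( rank( beer$OG_Max ), rank( beer$FG_Max ) )   # ~0.76; with ties this may differ slightly from cor.test()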

Non-Parametric Correlation - Spearman’s Rho

OG.FG.spearman <- cor.test( beer$OG_Max, beer$FG_Max, 
                            method = "spearman" )
OG.FG.spearman

    Spearman's rank correlation rho

data:  beer$OG_Max and beer$FG_Max
S = 39257, p-value < 2.2e-16
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho 
0.7644328 

Important

Compare these P-Values to those from the Pearson Product Moment analysis.

Permutation for Significance

As another way to circumvent some of the constraints imposed by the form of the data, we can use permutation to test significance.

\(H_O: \rho = 0\)

Setting up Permutations

Permutation requires that we do some simple simulation work, permuting the data assuming the Null Hypothesis is TRUE.

 

x <- beer$OG_Max
y <- beer$FG_Max
df <- data.frame( Estimate = factor( c( "Original",
                                        rep("Permuted", 999))), 
                  rho =  c( cor.test( x, y )$estimate,
                            rep(NA, 999)) )

Setting Up the Permutation

 

head( df )
  Estimate       rho
1 Original 0.8374184
2 Permuted        NA
3 Permuted        NA
4 Permuted        NA
5 Permuted        NA
6 Permuted        NA

Permuting “Under the Null”

Now, we can go through the 999 NA values we put into that data frame and:

  1. Permute one of the variables.
  2. Run the analysis.
  3. Store the statistic.

Tip

\(H_O:\) There is no correlation.

If \(H_O\) is correct, then permuted values as large as the original one should be common.

Permuting “Under the Null”

Shuffle the y variable and recalculate the test statistic 999 times.

for( i in 2:1000) {
  yhat <- sample( y,   # this shuffles the data in y
                  size = length(y), 
                  replace = FALSE)
  model <- cor.test( x, yhat )
  df$rho[i] <- model$estimate 
}

Visualizing the NULL

The probability of a permuted value as extreme as, or greater than, the original estimate gives the P-value.

ggplot( df ) + 
  geom_histogram( aes(rho, fill=Estimate ) )
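
A minimal sketch of that calculation, counting the original estimate as one member of the null distribution:

# Fraction of estimates at least as extreme as the original (two-sided)
mean( abs( df$rho ) >= abs( df$rho[1] ) )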

Conclusions

Measures of correlation quantify the co-movement of two variables without making any statement regarding causation.

  1. Parametric methods depend upon normality.
  2. Transformations are available to improve conformance.
  3. Non-parametric & permutation approaches exist for more difficult data.

Questions

If you have any questions, please feel free to either post them as an “Issue” on your copy of this GitHub Repository, post to the Canvas discussion board for the class, or drop me an email.

(Image: Peter Sellers looking bored)