Correlation is one of the most widely used and best-known measures of the (linear) association of two variables.
Perhaps less well-known is that the correlation is in principle identical to the covariance.
To see this, consider the formula for the covariance of two empirical datasets, $X$ and $Y$:

$$COV(X,Y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
In other words, the covariance of $X$ and $Y$, $COV(X,Y)$, is the average of the products of the deviations of the $X$ and $Y$ values from their respective means.
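As a quick sanity check, here is a minimal sketch that computes this average "by hand" for `hp` and `mpg` from `mtcars`. The variable names are chosen for illustration only. Note that R's built-in `cov()` divides by $n - 1$ rather than $n$, so its result is rescaled before comparing.

```r
x <- mtcars$hp
y <- mtcars$mpg
n <- length(x)

# covariance as the average product of deviations from the means (divisor n)
cov_n <- sum((x - mean(x)) * (y - mean(y))) / n

# cov() uses the divisor n - 1, so rescale it to the 1/n form
cov_builtin <- cov(x, y) * (n - 1) / n

all.equal(cov_n, cov_builtin)
```

The sign of `cov_n` is negative here: cars with more horsepower tend to get fewer miles per gallon.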
This idea is conveyed by this picture:
The covariance is identical to the correlation (?)
What does it mean to say the (coefficient of) correlation is “identical” to the covariance?
If we “feed” z-scaled values to the covariance, we will get back the correlation.
In other words, the correlation equals the covariance if the data are z-scaled.
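Before deriving this, we can check the claim numerically. The following is a small sketch using `hp` and `mpg` from `mtcars`; the object names are illustrative. It relies on `scale()` and `cov()` both using the $n - 1$ divisor, so the two are consistent with each other.

```r
x <- mtcars$hp
y <- mtcars$mpg

# scale() centers each variable and divides by its standard deviation;
# as.numeric() drops the matrix attributes that scale() returns
z_x <- as.numeric(scale(x))
z_y <- as.numeric(scale(y))

cov_z <- cov(z_x, z_y)  # covariance of the z-scaled data
r_xy  <- cor(x, y)      # correlation of the raw data

all.equal(cov_z, r_xy)
```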
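So, let’s see. We replace $x_i$ by $z_{x_i}$ and $y_i$ by $z_{y_i}$ in the covariance formula and see what happens:

$$COV(z_X, z_Y) = \frac{1}{n} \sum_{i=1}^{n} (z_{x_i} - \bar{z}_X)(z_{y_i} - \bar{z}_Y)$$

However, $\bar{z}_X = 0$, and by analogy, $\bar{z}_Y = 0$. So the equation simplifies to

$$COV(z_X, z_Y) = \frac{1}{n} \sum_{i=1}^{n} z_{x_i} z_{y_i}$$

Now, $z_{x_i}$ can be expressed as

$$z_{x_i} = \frac{x_i - \bar{x}}{s_X}$$

where $s_X$ is the standard deviation of $X$ (computed with the same $1/n$ divisor as the covariance). The same rule applies for $z_{y_i}$, by analogy.

Now, let’s insert the previous equation in the equation of $COV(z_X, z_Y)$:

$$COV(z_X, z_Y) = \frac{1}{n} \sum_{i=1}^{n} \frac{x_i - \bar{x}}{s_X} \cdot \frac{y_i - \bar{y}}{s_Y}$$

$\frac{1}{s_X s_Y}$ can be pulled out of the sum, right at the front of the equation, leaving us with

$$COV(z_X, z_Y) = \frac{1}{s_X s_Y} \cdot \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

And that’s the definition of the correlation of $X$ and $Y$, more frequently put this way:

$$r_{XY} = \frac{COV(X,Y)}{s_X s_Y}$$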
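The end result, the covariance divided by the product of the two standard deviations, can be verified in a few lines. This is only a sketch: `cov_n()` and `sd_n()` are hypothetical helper functions defined here for the $1/n$ versions, not base R functions.

```r
x <- mtcars$hp
y <- mtcars$mpg

# 1/n versions of the covariance and the standard deviation
# (hypothetical helpers, written out for this example)
cov_n <- function(a, b) mean((a - mean(a)) * (b - mean(b)))
sd_n  <- function(a) sqrt(mean((a - mean(a))^2))

# correlation as covariance divided by the product of the standard deviations
r_manual <- cov_n(x, y) / (sd_n(x) * sd_n(y))

all.equal(r_manual, cor(x, y))
```

Whether we divide by $n$ or by $n - 1$ does not matter for the correlation, as long as the same divisor is used throughout: it cancels in the ratio.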
It is helpful to consider an example.
This is a scatterplot of two variables, i.e., the “raw data” that is “fed in” for the calculation of the (empirical) covariance:
library(tidyverse)

# scatterplot of the raw data
mtcars %>%
  ggplot() +
  aes(x = hp, y = mpg) +
  geom_point()
And now, let’s z-scale the two variables and draw the same diagram again:
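# z-scale both variables, then draw the same scatterplot
# (across() replaces the deprecated mutate_all(funs(...)) idiom)
mtcars %>%
  select(hp, mpg) %>%
  mutate(across(everything(), ~ as.numeric(scale(.x)))) %>%
  ggplot() +
  aes(x = hp, y = mpg) +
  geom_point()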
Now, what’s the difference? Apart from the axis scales: nada, no difference. The pattern of the points is exactly the same. That’s reassuring, because we just derived that the association of the variables is the same, no matter whether we use the raw data or the z-scaled data as input. The diagrams confirm this in a more intuitive way.
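The same comparison can be made with numbers instead of pictures. The sketch below (object names are illustrative) computes the correlation for the raw and for the z-scaled data, and, for contrast, the covariance of each.

```r
library(dplyr)

raw <- mtcars %>% select(hp, mpg)
z   <- raw %>% mutate(across(everything(), ~ as.numeric(scale(.x))))

# the correlation does not change under z-scaling
cor_raw <- cor(raw$hp, raw$mpg)
cor_z   <- cor(z$hp, z$mpg)
all.equal(cor_raw, cor_z)

# the covariance, by contrast, depends on the scale of the data
c(raw = cov(raw$hp, raw$mpg), z = cov(z$hp, z$mpg))
```

The covariance of the raw data is expressed in units of horsepower times miles per gallon, whereas the covariance of the z-scaled data is unit-free and coincides with the correlation.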
The correlation is a “special case” of the covariance; it is the case when we feed z-scaled data to the covariance.
Happy data analyzing!