Gentle intro to 'R-squared equals squared r'

It comes as no surprise that \(R^2\) (“coefficient of determination”) equals \(r^2\) in simple regression (predictor X, criterion Y), where \(r(X,Y)\) is Pearson’s correlation coefficient. \(R^2\) equals the fraction of explained variance in a simple regression. However, the statistical (mathematical) background is often less clear or buried in less-intuitive formula.

The goal of this post is to offer a gentle explanantion why

\(R^2 = r^2\),

where \(r\) is \(r(Y,\hat{Y})\) and \(\hat{Y}\) are the predicted values.

Some more notation: \(Y_i\) are the individual values of the criterion, and \(\bar{Y}\) is the mean of the criterion.

As an example, suppose we wanted to predict the grades (\(Y\)) of some students based on their studying time (\(X\)). Of course, our prediction won’t be perfect most of the time, so there will be gaps (errors or residuals) between what we predict and what the actual grade in some particular \(Y_i\) case will be. So

\(Y_i - \hat{Y_i} \ne 0\), or, similarly \((Y_i - \hat{Y_i})^2 > 0\).

Let’s start with the idea that the variance is additive, see here or here for details. The additivity of variance implies that the whole variation in the students’ grades is the sum of the variance of our prediction errors and the improvement of our prediction/ regression model:

\(\sum_i{(Y_i - \hat{Y})}^2 = \sum_i{(Y_i - \hat{Y_i})}^2 + \sum_i{(\hat{Y_i} - \bar{Y})}^2\).

In words:

Total = Error^2 + Improvement^2.

Or:

Total variance in simple regression equals error variance plus variance explained by model.

This expression can be rearranged to

1 = Error^2 / Total + Improvement^2 / Total.

As formula:

\(1 = \frac{\sum_i{(Y_i - \hat{Y_i})}^2}{\sum_i{(Y_i - \bar{Y})}^2} + \frac{\sum_i{(\hat{Y_i} - \bar{Y})}^2}{\sum_i{(Y_i - \bar{Y})}^2}\).

Puh, quite some stuff. Now we apply a property of least square regression stating that there is no correlation between the residuals (e) and the predicted value (we don’t discuss that property here further). More formally:

\(r \{ (Y_i - \hat{Y_i}, \hat{Y}) \} = 0\).

A bit more readable:

\(r(e, \hat{Y}) = 0\).

Now we write the (sample) correlation between actual (observed) and predicted (fitted) values as follows:

\[r(Y, \hat{Y}) = \frac{n^{-1} \sum_i({Y_i - \bar{Y}) \cdot (\hat{Y_i} - \bar{Y})}}{\sqrt{n^{-1} \sum_i{(Y_i - \bar{Y})^2} \cdot n^{-1} \sum_i{(\hat{Y_i} - \bar{Y})^2}}}\]

In other words,

\[r(Y, \hat{Y}) = \frac{Cov(Y, \hat{Y})} {\sqrt{Var(Y) \cdot Var(\hat{Y})}}\]

Note that \(n^{-1}\) cancel out, leaving

\[r(Y, \hat{Y}) = \frac{\sum_i({Y_i - \bar{Y}) \cdot (\hat{Y_i} - \bar{Y})}}{\sqrt{ \sum_i{(Y_i - \bar{Y})^2} \cdot \sum_i{(\hat{Y_i} - \bar{Y})^2}}}\]

The variance can be seen as the average of the squared sums (SS) of some delta (d): \(Var(X) = n^{-1}\sum{d}\). If we on our way “lose” the \(n^{-1}\) part (as it happened in the line above), we are left with the sum of squares (SS), as they are called. Now, as we are concerned with the variance of the error, variance of the model improvement and the total variance, we can in the same way speak of the according SS-terms. We will come back to the SS-terms later on.

Now a little trick. We add the term \(-\hat{Y_i} + \hat{Y_i}\), which is zero, so no harm caused. We get:

\[r(Y, \hat{Y}) = \frac{\sum_i({Y_i - \hat{Y_i}) + (\hat{Y_i} - \bar{Y})) \cdot (\hat{Y_i} - \bar{Y})}}{\sqrt{ \sum_i{(Y_i - \bar{Y})^2} \cdot \sum_i{(\hat{Y_i} - \bar{Y})^2}}}\]

Now stay close. Observe that we have three terms in the numerator, let’s call them \(a\), \(b\), and \(c\). We can multiplicate that out resulting in \((a+b)c = ac + bc\). So:

\[= \frac{\sum{(Y_i - \hat{Y_i})(\hat{Y_i} - \bar{Y}) + (\hat{Y_i} - \bar{Y})(\hat{Y_i} - \bar{Y})}}{D}\]

As the denominator did not change, we just abbreviated it in the obvious way. Note that the numerator now is of the form \(ac + cc\) (as b = c). Thus we can further simplify to

\[= \frac{\sum{\{ (Y_i - \hat{Y_i})(\hat{Y_i} - \bar{Y}) + (\hat{Y_i} + \bar{Y})^2 \}}}{D}\]

Referring back to our notion of the two parts of the total variance in regression we can think of the numerator above as \(\sum{(e \cdot m + m^2)}\), where e indicates the error or residual term, and m the model improvement. However, as the correlation between m and e (which is zero, see above) is nothing but average product of m and e, their sum also is zero. Thus we are left with

\[= \frac{\sum_i{(\hat{Y_i} - \bar{Y})^2}}{\sqrt{ \sum_i{(Y_i - \bar{Y})^2} \cdot \sum_i{(\hat{Y_i} - \bar{Y})^2}}}\]

Think of the numerator as a and then \(a = \sqrt{a^2}\), thus

\[= \sqrt{ \frac{\sum_i{(\hat{Y_i} - \bar{Y})^2 \cdot \sum_i(\hat{Y_i} - \bar{Y})^2}}{ \sum_i{(Y_i - \bar{Y})^2} \cdot \sum_i{(\hat{Y_i} - \bar{Y})^2}}}\]

After canceling out the obvious we are left with

\[= \sqrt{ \frac{\sum_i{(\hat{Y_i} - \bar{Y})^2}}{ \sum_i{(Y_i - \bar{Y})^2}}}\]

This term is nothing but

Model variance / Total variance.

And we are there! To summarise

\[r(Y, \hat{Y})^2 = \frac{SS_M}{SS_T} = R^2\]

OK, quite some ride. At least we arrived where we headed to. Some algebra kungfu turned out to be handy. It took me some time to decypher the algebra without any explanation. So I hope me comments make the voyage a bit shorter for you.

Without some expertise in algebra (multiplicating out, adding zero-terms, …) and some knowledge of regression properties we would not have arrived at the end. The good news are that each of the steps is not really difficult to understand (I am not saying it is easy to discover these steps).

Gentle intro to 'R-squared equals squared r'

January 20, 2017

This blog has moved

Wie gut schätzt eine Stichprobe die Grundgesamtheit?

Some thoughts on tidyveal and environments in R