R-Square Value Demystified
I want to take a few steps back to clear the fog around the calculation of the R-Square statistic and kill the confusion surrounding it.
As we all know, in today's world of quick results and insights, nobody wants to spend time understanding the core concepts behind the statistical terms they use in an analytical routine. One term that is talked about a lot but whose mechanics are not well understood is the R-Square statistic (AKA the coefficient of determination). This statistic measures how close the data are to the fitted regression line.
It is also worth mentioning that, for simple linear regression, squaring the correlation coefficient gives the R-Square value. However, I want to take a few steps back to clear the fog with regard to the calculation of this statistic and kill the confusion around it (I know this is quite an extreme statement). As they say, the devil is in the details...
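That relationship between the correlation coefficient and R-Square is easy to check yourself. Here is a minimal sketch using NumPy and hypothetical data (the numbers are illustrative, not the article's Excel sample):

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.6, 4.4, 5.2])

# Pearson correlation coefficient between x and y
r = np.corrcoef(x, y)[0, 1]

# Fit a least-squares line and compute R-Square from sums of squares
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept
rss = np.sum((y - y_hat) ** 2)      # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_square = 1 - rss / tss

# For one-predictor linear regression, r squared equals R-Square
# (up to floating-point noise)
```

Note that this shortcut only holds for simple (one-predictor) regression with an intercept; with multiple predictors, you must compute R-Square from the sums of squares directly.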
So, let's get started.
What Is an R-Square Value?
To put it simply, it is:

(Total sum of squares - Residual sum of squares) / Total sum of squares
Quite a mouthful. Let's make it a bit simpler. It can also be defined as:
Explained sum of squares (ESS) / total sum of squares
Some of you might even chuckle and say, "But we don't even know how to calculate the above mumbo-jumbo!" And my answer to you all would be, "No worries; let me explain it with the help of mathematical notations!" Yes, mathematical notations!
Let's make the above calculations more obvious by looking at an example:
- Step 1: Calculating the mean value of the Y variable and total sum of squares
- Step 2: Calculating the residual sum of squares
- Step 3: Calculating the explained sum of squares (ESS = total sum of squares - residual sum of squares)
Once all the important elements are calculated, you are ready to compute the R-Square value.
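The three steps above can be sketched in Python. The data and fitted values below are hypothetical stand-ins for the article's Excel sample:

```python
import numpy as np

# Hypothetical observed values and fitted values from a regression line
y = np.array([3.0, 5.0, 4.0, 6.0, 7.0])
y_hat = np.array([3.4, 4.6, 4.8, 5.8, 6.4])

# Step 1: mean of Y and total sum of squares (TSS)
y_bar = y.mean()
tss = np.sum((y - y_bar) ** 2)

# Step 2: residual sum of squares (RSS)
rss = np.sum((y - y_hat) ** 2)

# Step 3: explained sum of squares (ESS = TSS - RSS)
ess = tss - rss

# R-Square: share of total variation explained by the model
r_square = ess / tss  # equivalently, 1 - rss / tss
```

With these numbers, TSS = 10, RSS = 1.36, so R-Square works out to 0.864.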
Short-form R-Square calculation: R-Square = ESS / TSS (explained sum of squares divided by the total sum of squares)
To prove the above calculations, I created a scatter plot chart in Excel and also cross-validated the results by running a regression analysis with Excel's Analysis ToolPak.
The next topic that might be a good candidate for further discussion related to the R-Square value is Adjusted R-Square. We usually use this statistic when we have multiple predictor variables. The standard R-Square value tends to increase every time a predictor variable is added, even when that variable does not improve the model, which makes it a misleading measure of performance. Therefore, when working with multiple predictor variables, statisticians and analysts prefer to use the Adjusted R-Square.
Calculating this value is pretty simple and straightforward. Let's look at its components to understand the mechanics of this statistic.
- N = Number of points in the data sample
- P = Number of independent predictor variables or regressors, i.e. the number of variables in your model excluding the constant
How does this work in our example?
It's pretty simple.
1 - (1 - 0.04783 [R-Square value]) * (16 [number of Y data points] - 1) / (16 - 1 [number of predictor variables; if you have more than one predictor, use that number here] - 1) = -0.02018
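The same arithmetic can be wrapped in a small helper (the function name is my own, not from the article):

```python
def adjusted_r_square(r_square, n, p):
    """Adjusted R-Square = 1 - (1 - R^2) * (N - 1) / (N - P - 1).

    n = number of points in the data sample
    p = number of predictor variables (excluding the constant)
    """
    return 1 - (1 - r_square) * (n - 1) / (n - p - 1)

# The article's worked example: R-Square = 0.04783, N = 16, P = 1
result = round(adjusted_r_square(0.04783, 16, 1), 5)
print(result)  # -0.02018
```

A negative Adjusted R-Square, as here, simply signals that the model explains less variation than penalizing for its predictor count can justify.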
Check the Adjusted R-Square value against the Analysis ToolPak's regression analysis results.
Published at DZone with permission of Sunil Kappal , DZone MVB. See the original article here.