
How R2 Error Is Calculated in Generalized Linear Models


Learn how the R2 error is calculated for an H2O GLM (generalized linear model), which uses the same math as any other statistical model.


According to this article:

R-squared is a statistical measure of how close the data are to a fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. [...] "100%" indicates that the model explains all the variability of the response data around its mean. 

You can get the full working Jupyter notebook for this article here directly from my GitHub.

This article explains how the R2 error is calculated for an H2O GLM (generalized linear model). The same math applies to any other statistical model, so you can reuse the calculation wherever you need it.
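
To make the "same math for any model" point concrete, here is a minimal sketch of the calculation in plain Python, with no H2O involved. The function name `r_squared` and the sample numbers are made up for illustration; they are not part of H2O or of the notebook.

```python
def r_squared(actual, predicted):
    """Return 1 - SSE/SST for two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    mean_actual = sum(actual) / len(actual)
    # SSE: squared differences between actual and predicted values
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # SST: squared differences between actual values and their mean
    sst = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - sse / sst


# Illustrative data only
actual = [1.0, 2.0, 3.0, 4.0]
print(r_squared(actual, actual))                # perfect predictions -> 1.0
print(r_squared(actual, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 0.0
```

Note that this sketch raises a ZeroDivisionError when every actual value is identical, because SST is then zero.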

Let's build an H2O GLM model first:

import h2o
from h2o.estimators.glm import H2OGeneralizedLinearEstimator

h2o.init()

local_url = "https://raw.githubusercontent.com/h2oai/sparkling-water/master/examples/smalldata/prostate.csv"
df = h2o.import_file(local_url)

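# Response column; every other column is used as a predictor
y = "CAPSULE"
feature_names = df.col_names
feature_names.remove(y)

# Split into roughly 80% training, 10% validation, and 10% test rows
df_train, df_valid, df_test = df.split_frame(ratios=[0.8, 0.1])
print(df_train.shape)
print(df_valid.shape)
print(df_test.shape)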

prostate_glm = H2OGeneralizedLinearEstimator(model_id = "prostate_glm")

prostate_glm.train(x = feature_names, y = y, training_frame=df_train, validation_frame=df_valid)
prostate_glm

Calculate model performance based on training, validation, and test data:

train_performance = prostate_glm.model_performance(df_train)
valid_performance = prostate_glm.model_performance(df_valid)
test_performance = prostate_glm.model_performance(df_test)

Check the default R2 metrics for training, validation, and test data:

print(train_performance.r2())
print(valid_performance.r2())
print(test_performance.r2())
# Called on the model with no arguments, r2() reports the training metric
print(prostate_glm.r2())

Get the prediction for the test data, which we kept separate:

predictions = prostate_glm.predict(df_test)

Here is the math used to calculate the R2 metric for the test dataset:

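# Sum of squared errors between the predictions and the actual response
SSE = ((predictions - df_test[y])**2).sum()
# Mean of the actual response; mean() returns a list, so take the first entry
y_mean = df_test[y].mean()[0]
# Total sum of squares of the actual response around its mean
SST = ((df_test[y] - y_mean)**2).sum()
print(1 - SSE/SST)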
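
Written as an equation, the lines above compute the following. The symbols are my own notation rather than anything from H2O: y_i are the actual response values in the test data, \hat{y}_i the model's predictions, \bar{y} the mean of the actual values, and n the number of test rows.

```latex
\mathrm{SSE} = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2, \qquad
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}
```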

For comparison, print the R2 that H2O's model performance reports for the same test data:

print(test_performance.r2())

Above we can see that both values, the one reported by model performance for the test data and the one we calculated, are the same.
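
If you want to check the match in code rather than by eye, the two numbers may differ in their last few digits because of floating-point rounding, so a tolerance-based comparison is safer than `==`. Below is a small sketch using the standard library; the helper name `r2_values_agree`, the tolerances, and the sample numbers are all assumptions for illustration.

```python
import math


def r2_values_agree(reported, computed, rel_tol=1e-6):
    """Return True when two R2 values match within a relative tolerance."""
    return math.isclose(reported, computed, rel_tol=rel_tol, abs_tol=1e-12)


# Placeholder numbers for illustration only, not output from the model above
print(r2_values_agree(0.25, 0.25 + 1e-12))  # True
print(r2_values_agree(0.25, 0.26))          # False
```

In the notebook you would pass `test_performance.r2()` and the hand-calculated value as the two arguments.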

That's it; enjoy!


Topics:
h2o, python, glm, big data, r2, tutorial

Published at DZone with permission of
