Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

R: Linear models with the lm function, NA values and Collinearity

DZone's Guide to

R: Linear models with the lm function, NA values and Collinearity

· Big Data Zone
Free Resource

Need to build an application around your data? Learn more about dataflow programming for rapid development and greater creativity. 

In my continued playing around with R I’ve sometimes noticed ‘NA’ values in the linear regression models I created but hadn’t really thought about what that meant.

On the advice of Peter Huber I recently started working my way through Coursera’s Regression Models which has a whole slide explaining its meaning:

What if we include an unnecessary variable? (.png image)

So in this case ‘z’ doesn’t help us in predicting Fertility since it doesn’t give us any more information that we can’t already get from ‘Agriculture’ and ‘Education’.

Although in this case we know why ‘z’ doesn’t have a coefficient sometimes it may not be clear which other variable the NA one is highly correlated with.

Multicollinearity (also collinearity) is a statistical phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a non-trivial degree of accuracy.

In that situation we can make use of the alias function to explain the collinearity as suggested in this StackOverflow post:

library(datasets); data(swiss); require(stats); require(graphics)
z <- swiss$Agriculture + swiss$Education
fit = lm(Fertility ~ . + z, data = swiss)
> alias(fit)
Model :
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality + z
 
Complete :
  (Intercept) Agriculture Examination Education Catholic Infant.Mortality
z 0           1           0           1         0        0

In this case we can see that ‘z’ is highly correlated with both Agriculture and Education which makes sense given its the sum of those two variables.

When we notice that there’s an NA coefficient in our model we can choose to exclude that variable and the model will still have the same coefficients as before:

> require(dplyr)
> summary(lm(Fertility ~ . + z, data = swiss))$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03
> summary(lm(Fertility ~ ., data = swiss))$coefficients
                   Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)      66.9151817 10.70603759  6.250229 1.906051e-07
Agriculture      -0.1721140  0.07030392 -2.448142 1.872715e-02
Examination      -0.2580082  0.25387820 -1.016268 3.154617e-01
Education        -0.8709401  0.18302860 -4.758492 2.430605e-05
Catholic          0.1041153  0.03525785  2.952969 5.190079e-03
Infant.Mortality  1.0770481  0.38171965  2.821568 7.335715e-03

If we call alias now we won’t see any output:

> alias(lm(Fertility ~ ., data = swiss))
Model :
Fertility ~ Agriculture + Examination + Education + Catholic + 
    Infant.Mortality

Check out the Exaptive data application Studio. Technology agnostic. No glue code. Use what you know and rely on the community for what you don't. Try the community version.

Topics:

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}