DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports Events Over 2 million developers have joined DZone. Join Today! Thanks for visiting DZone today,
Edit Profile Manage Email Subscriptions Moderation Admin Console How to Post to DZone Article Submission Guidelines
View Profile
Sign Out
Refcards
Trend Reports
Events
Zones
Culture and Methodologies Agile Career Development Methodologies Team Management
Data Engineering AI/ML Big Data Data Databases IoT
Software Design and Architecture Cloud Architecture Containers Integration Microservices Performance Security
Coding Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks
Culture and Methodologies
Agile Career Development Methodologies Team Management
Data Engineering
AI/ML Big Data Data Databases IoT
Software Design and Architecture
Cloud Architecture Containers Integration Microservices Performance Security
Coding
Frameworks Java JavaScript Languages Tools
Testing, Deployment, and Maintenance
Deployment DevOps and CI/CD Maintenance Monitoring and Observability Testing, Tools, and Frameworks

GLMs: Link vs. Distribution

Are you a math nerd and lover of all things data science? Then this post if for you! Read on for the insights of a data expert.

Arthur Charpentier user avatar by
Arthur Charpentier
·
Nov. 30, 18 · Tutorial
Like (3)
Save
Tweet
Share
6.23K Views

Join the DZone community and get the full member experience.

Join For Free

Usually, when I give a course on GLMs, I try to insist on the fact that the link function is probably more important than the distribution. In order to illustrate, consider the following dataset, with 5 observations

x = c(1,2,3,4,5)
y = c(1,2,4,2,6)
base = data.frame(x,y)

Then consider several model, with various distributions, and either an identity link; and in that case

Image title

or a log link function, so that:

Image title

regNId = glm(y~x,family=gaussian(link="identity"),data=base)
regNlog = glm(y~x,family=gaussian(link="log"),data=base)
regPId = glm(y~x,family=poisson(link="identity"),data=base)
regPlog = glm(y~x,family=poisson(link="log"),data=base)
regGId = glm(y~x,family=Gamma(link="identity"),data=base)
regGlog = glm(y~x,family=Gamma(link="log"),data=base)
regIGId = glm(y~x,family=inverse.gaussian(link="identity"),data=base)
regIGlog = glm(y~x,family=inverse.gaussian(link="log"),data=base

One can also consider some Tweedie distribution, to be even more general:

library(statmod)
regTwId = glm(y~x,family=tweedie(var.power=1.5,link.power=1),data=base)
regTwlog = glm(y~x,family=tweedie(var.power=1.5,link.power=0),data=base)

Consider the prediction obtained in the first case, with the linear link function:

library(RColorBrewer)
darkcols = brewer.pal(8, "Dark2")
plot(x,y,pch=19)
abline(regNId,col=darkcols[1])
abline(regPId,col=darkcols[2])
abline(regGId,col=darkcols[3])
abline(regIGId,col=darkcols[4])
abline(regTwId,lty=2)

The predictions are very very close, aren't they? In the case of the exponential prediction, we obtain:

plot(x,y,pch=19)
u=seq(.8,5.2,by=.01)
lines(u,predict(regNlog,newdata=data.frame(x=u),type="response"),col=darkcols[1])
lines(u,predict(regPlog,newdata=data.frame(x=u),type="response"),col=darkcols[2])
lines(u,predict(regGlog,newdata=data.frame(x=u),type="response"),col=darkcols[3])
lines(u,predict(regIGlog,newdata=data.frame(x=u),type="response"),col=darkcols[4])
lines(u,predict(regTwlog,newdata=data.frame(x=u),type="response"),lty=2)

We can actually look closer. For instance, in the linear case, consider the slope obtained with a Tweedie model (that will include all the parametric familes mentioned here, actually).

pente=function(gamma) summary(glm(y~x,family=tweedie(var.power=gamma,link.power=1),data=base))$coefficients[2,1:2]
Vgamma = seq(-.5,3.5,by=.05)
Vpente = Vectorize(pente)(Vgamma)
plot(Vgamma,Vpente[1,],type="l",lwd=3,ylim=c(.965,1.03),xlab="power",ylab="slope")

The slope here is always very very close to one! Even more if we add a confidence interval:

plot(Vgamma,Vpente[1,])
lines(Vgamma,Vpente[1,]+1.96*Vpente[2,],lty=2)
lines(Vgamma,Vpente[1,]-1.96*Vpente[2,],lty=2)

Heuristically, for the Gamma regression, or the Inverse Gaussian one, because the variance is a power of the prediction, if the prediction is small (here on the left), the variance should be small. So, on the left of the graph, the error should be small with a higher power for the variance function. And that's indeed what we observe here.

erreur=function(gamma) predict(glm(y~x,family=tweedie(var.power=gamma,link.power=1),data=base),newdata=data.frame(x=1),type="response")-y[x==1] 
Verreur = Vectorize(erreur)(Vgamma)
plot(Vgamma,Verreur,type="l",lwd=3,ylim=c(-.1,.04),xlab="power",ylab="error")
abline(h=0,lty=2)

Of course, we can do the same with the exponential models:

pente=function(gamma) summary(glm(y~x,family=tweedie(var.power=gamma,link.power=0),data=base))$coefficients[2,1:2]
Vpente = Vectorize(pente)(Vgamma)
plot(Vgamma,Vpente[1,],type="l",lwd=3)

Or, if we add the confidence bands, we obtain:

plot(Vgamma,Vpente[1,],ylim=c(0,.8),type="l",lwd=3,xlab="power",ylab="slope")
lines(Vgamma,Vpente[1,]+1.96*Vpente[2,],lty=2)
lines(Vgamma,Vpente[1,]-1.96*Vpente[2,],lty=2)

So here also, the "slope" is rather similar... And if we look at the error we make on the left part of the graph, we obtain:

erreur=function(gamma) predict(glm(y~x,family=tweedie(var.power=gamma,link.power=0),data=base),newdata=data.frame(x=1),type="response")-y[x==1] 
Verreur = Vectorize(erreur)(Vgamma)
plot(Vgamma,Verreur,type="l",lwd=3,ylim=c(.001,.32),xlab="power",ylab="error")

So my point is that the distribution is usually not the most important point on GLMs, even if chapters of books on GLMs are distribution based. But, as mentioned in an another post, if you consider a nonlinear transformation, like we have with GAMs, the story is more complicated.

Distribution (differential geometry) Links

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

Popular on DZone

  • 5 Tips for Optimizing Your React App’s Performance
  • The Enterprise, the Database, the Problem, and the Solution
  • Internal Components of Apache ZooKeeper and Their Importance
  • Application Assessment Questions for Migration Projects

Comments

Partner Resources

X

ABOUT US

  • About DZone
  • Send feedback
  • Careers
  • Sitemap

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 600 Park Offices Drive
  • Suite 300
  • Durham, NC 27709
  • support@dzone.com
  • +1 (919) 678-0300

Let's be friends: