What to Do with Residuals from a Logistic Regression


I always claim that graphs are important in econometrics and statistics! Of course, it is usually not that simple. Let me come back to a recent experience. Yesterday, I got an email from Sami, who sent me a graph of residuals and asked what could be done with residuals obtained from a logistic regression. To get a better understanding, let us consider the following dataset (the data are simulated, but let us assume, as in practice, that we do not know the true model; this is why I decided to embed the code in an R source file).

> source("http://freakonometrics.free.fr/probit.R")
> reg=glm(Y~X1+X2,family=binomial)
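
The source file deliberately hides the data-generating process. In case that URL is no longer reachable, here is a purely hypothetical way to simulate data with the same structure (a binary Y and two continuous covariates X1 and X2); it is not the model actually used here.

> # purely illustrative fallback, NOT the hidden data-generating process of the post
> set.seed(1)
> n=1000
> X1=rnorm(n)
> X2=rnorm(n)
> Y=rbinom(n,size=1,prob=plogis(X1+X2))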

If we use R’s diagnostic plot, the first one is the scatterplot of the residuals against predicted values (the score, actually):

> plot(reg,which=1)

which is simply:

> plot(predict(reg),residuals(reg))
> abline(h=0,lty=2,col="grey")
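
Note that predict() on a glm object returns, by default, the linear predictor (the score on the link scale), not a probability; fitted probabilities are obtained with type="response". For instance:

> # predict() on a glm defaults to type="link", i.e. the linear predictor (the score);
> # predicted probabilities are obtained with type="response"
> head(cbind(score=predict(reg),proba=predict(reg,type="response")))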

Why do we have those two lines of points? Because we predict a probability for a variable taking values 0 or 1. If the true value is 0, then we always predict more, and residuals have to be negative (the blue points); if the true value is 1, then we underestimate, and residuals have to be positive (the red points). And of course, there is a monotone relationship... We can see more clearly what is going on when we use colors:

> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> abline(h=0,lty=2,col="grey")

Within each class, the points lie exactly on a smooth curve, as a function of the predicted value.
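
A quick way to convince ourselves (assuming the default deviance residuals returned by residuals() on a glm) is to recompute the residuals directly from the fitted probabilities: within each class, the residual is a deterministic, monotone function of the score, hence the two smooth curves.

> # deviance residuals are a deterministic function of the fitted probability,
> # with one curve per observed class
> eta=predict(reg)        # the score (linear predictor)
> p=plogis(eta)           # fitted probability
> d=ifelse(Y==1,1,-1)*sqrt(-2*(Y*log(p)+(1-Y)*log(1-p)))
> range(d-residuals(reg)) # essentially zero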

Now, there is not much we can get from this graph alone. If we want to understand what is going on, we have to run a local regression:

> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)

This is exactly the smooth line produced by the first diagnostic plot. But with this local regression, we do not get a confidence interval. Can't we claim that the solid dark line is very close to the dotted horizontal line?

> library(splines)
> rl=lm(residuals(reg)~bs(predict(reg),8))
> #rl=loess(residuals(reg)~predict(reg))
> y=predict(rl,se=TRUE)
> segments(predict(reg),y$fit+2*y$se.fit,predict(reg),y$fit-2*y$se.fit,col="green")
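
As an aside, the commented-out loess alternative would also work here: predict() on a loess fit can return pointwise standard errors as well, so a similar band can be drawn without the spline trick (a small sketch, with hypothetical object names):

> # alternative sketch: a loess fit also provides pointwise standard errors
> rl2=loess(residuals(reg)~predict(reg))
> y2=predict(rl2,se=TRUE)
> segments(predict(reg),y2$fit+2*y2$se.fit,predict(reg),y2$fit-2*y2$se.fit,col="orange")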

Yes, we can. But even if we suspect that something could be improved in the model, what exactly would this graph suggest?
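
If we want to be slightly more formal than eyeballing the band (this is only a suggestion, not part of the original analysis), we can test whether the spline terms fitted above add anything to a constant, that is, whether the smooth curve differs significantly from a flat line:

> # heuristic check: compare the spline fit of the residuals to a constant;
> # a large p-value suggests the smooth curve is indistinguishable from a flat line
> anova(lm(residuals(reg)~1),rl)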

Actually, that graph is probably not the only way to look at the residuals. What about plotting them against the two explanatory variables? For instance, if we plot the residuals against the second one, we get:

> plot(X2,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X2,residuals(reg)),col="black",lwd=2)
> lines(lowess(X2[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X2[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")

The graph is similar to the one we had earlier and, again, there is not much to say.

If we now look at the relationship with the first variable, things get more interesting:

> plot(X1,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X1,residuals(reg)),col="black",lwd=2)
> lines(lowess(X1[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X1[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")

We can clearly identify a quadratic effect here. The graph suggests that we should include the square of the first variable in the regression, and this effect indeed turns out to be significant (a quick check is sketched below).
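
Here is one hypothetical way to check that (not shown in the original post): fit the model with the squared term under a temporary name and look at the Wald test for the corresponding coefficient.

> # hypothetical check of the quadratic effect suggested by the graph
> reg2=glm(Y~X1+I(X1^2)+X2,family=binomial)
> summary(reg2)$coefficients["I(X1^2)",]  # estimate, std. error, z value, p-value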

Now, if we run a regression including this quadratic effect, what do we have?

> reg=glm(Y~X1+I(X1^2)+X2,family=binomial)
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(predict(reg)[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(predict(reg)[Y==1],residuals(reg)[Y==1]),col="red")
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
> abline(h=0,lty=2,col="grey")

Actually, it looks like we're back where we were initially ... so what is my point? My point is that:

  • graphs (yes, plural) can be used to see what might go wrong and to get more intuition about possible non-linear transformations;
  • graphs are not everything, and they will never be perfect! Here, in theory, the solid line should be a straight horizontal line. But we also want a model that is as simple as possible, so at some stage we should probably stop and rely on statistical tests and confidence intervals (see the sketch below). Yes, an almost-flat line can be interpreted as flat.
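
For instance (a minimal sketch, not in the original post, assuming reg now holds the model with the quadratic term), we can compare it with the simpler initial model and keep the extra term only if it is clearly worth the added complexity:

> # hypothetical illustration of "rely on tests rather than graphs":
> # compare the simple model with the extended one
> reg0=glm(Y~X1+X2,family=binomial)
> AIC(reg0,reg)
> anova(reg0,reg,test="Chisq")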
