
What to Do with Residuals from a Logistic Regression


I always claim that graphs are important in econometrics and statistics! Of course, it is usually not that simple. Let me come back to a recent experience. I got an email from Sami yesterday, sending me a graph of residuals and asking what could be done with a graph of residuals obtained from a logistic regression. To get a better understanding, let us consider the following dataset. The data are simulated, but let us assume, as in practice, that we do not know the true model (this is why I decided to embed the code in an R source file).

> source("http://freakonometrics.free.fr/probit.R")
> reg=glm(Y~X1+X2,family=binomial)

If we use R’s diagnostic plot, the first one is the scatterplot of the residuals against predicted values (the score, actually):

> plot(reg,which=1)

which is simply:

> plot(predict(reg),residuals(reg))
> abline(h=0,lty=2,col="grey")
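A note on "the score": for a binomial glm, predict(reg) returns the linear predictor (log-odds), not a probability. The sketch below illustrates this on simulated data. The data-generating model, the coefficients and the seed are assumptions made for illustration only; they are not the contents of probit.R.

```r
# Hypothetical simulated data (NOT the dataset from probit.R)
set.seed(1)
n  <- 500
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(-0.5 + X1 - 0.5 * X2))

reg <- glm(Y ~ X1 + X2, family = binomial)

score <- predict(reg)                     # linear predictor (the score)
prob  <- predict(reg, type = "response")  # predicted probability

# The logistic function maps the score to the probability
same <- isTRUE(all.equal(plogis(score), prob))
print(same)
```

So the horizontal axis of the residual plot is on the log-odds scale, which is why it is not restricted to the interval [0,1].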

Why do we have those two lines of points? Because we predict a probability for a variable taking values 0 or 1. If the true value is 0, then we always predict more, and residuals have to be negative (the blue points); if the true value is 1, then we underestimate, and residuals have to be positive (the red points). And of course, there is a monotone relationship. We can see more clearly what’s going on when we use colors.

> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> abline(h=0,lty=2,col="grey")

Points lie exactly on two smooth curves, one for each value of Y, as a function of the predicted value.
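To see why the points fall on two curves, one can rebuild the residuals by hand. By default, residuals() on a glm object returns deviance residuals, and for a 0/1 response these depend only on the observed value and the predicted probability. A minimal sketch on simulated data (the model, coefficients and seed are illustrative assumptions, not the original dataset):

```r
# Hypothetical simulated data (NOT the dataset from probit.R)
set.seed(1)
n  <- 500
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(-0.5 + X1 - 0.5 * X2))

reg <- glm(Y ~ X1 + X2, family = binomial)
p   <- plogis(predict(reg))   # predicted probability from the score

# Deviance residual for a binary response:
#   Y = 1:  +sqrt(-2 * log(p))      (always positive)
#   Y = 0:  -sqrt(-2 * log(1 - p))  (always negative)
d <- ifelse(Y == 1, sqrt(-2 * log(p)), -sqrt(-2 * log(1 - p)))

matches <- isTRUE(all.equal(unname(residuals(reg)), unname(d)))
print(matches)
```

Given Y, the residual is a deterministic, monotone function of the score, so each color traces one curve.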

Now, there is not much to learn from this graph as it stands. If we want to understand, we have to run a local regression to see what’s going on.

> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)

This is exactly what we get with the first function, plot(reg,which=1). But with this local regression, we do not get a confidence interval. Can’t we pretend that the plain dark line is very close to the dotted line?

> library(splines) # bs() comes from the splines package
> rl=lm(residuals(reg)~bs(predict(reg),8))
> #rl=loess(residuals(reg)~predict(reg))
> y=predict(rl,se=TRUE)
> segments(predict(reg),y$fit+2*y$se.fit,predict(reg),y$fit-2*y$se.fit,col="green")

Yes, we can. And even if we have a guess that something can be done, what would this graph suggest?
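The same question can be asked numerically: at how many points does the band of plus or minus two standard errors around the spline fit contain zero? The sketch below uses simulated data (the model, coefficients and seed are assumptions for illustration). Since deviance residuals from a binary response are far from Gaussian, the lm-based band should be read as a rough guide only.

```r
library(splines)   # for bs()

# Hypothetical simulated data (NOT the dataset from probit.R)
set.seed(1)
n  <- 500
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(-0.5 + X1 - 0.5 * X2))

reg   <- glm(Y ~ X1 + X2, family = binomial)
score <- predict(reg)
res   <- residuals(reg)

# Spline regression of the residuals on the score, with standard errors
rl <- lm(res ~ bs(score, df = 8))
y  <- predict(rl, se.fit = TRUE)

lower <- y$fit - 2 * y$se.fit
upper <- y$fit + 2 * y$se.fit

# Share of observations where the band contains the dotted line (zero)
covers_zero <- lower <= 0 & upper >= 0
print(mean(covers_zero))
```

A share close to 1 would support treating the dark line as indistinguishable from the dotted one.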

Actually, that graph is probably not the only way to look at the residuals. What about plotting them against the two explanatory variables? For instance, if we plot the residuals against the second one, we get:

> plot(X2,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X2,residuals(reg)),col="black",lwd=2)
> lines(lowess(X2[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X2[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")

The graph is similar to the one we had earlier, and again, there is not much to say.

If we now look at the relationship with the first one, it starts to be more interesting:

> plot(X1,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X1,residuals(reg)),col="black",lwd=2)
> lines(lowess(X1[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X1[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")

We can clearly identify a quadratic effect. This graph suggests that we should run a regression on the square of the first variable. And it can be seen as a significant effect.
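One way to check the significance of such a quadratic term is sketched below: a Wald test read from the coefficient table, and a likelihood-ratio test comparing the models with and without the squared term. The data are simulated with a quadratic effect built in; that choice, the coefficients and the seed are assumptions for illustration, not the contents of probit.R.

```r
# Hypothetical simulated data with a quadratic effect in X1
# (NOT the dataset from probit.R)
set.seed(1)
n  <- 1000
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(-1 + 0.5 * X1 + X1^2 - 0.5 * X2))

reg  <- glm(Y ~ X1 + X2, family = binomial)            # without the square
reg2 <- glm(Y ~ X1 + I(X1^2) + X2, family = binomial)  # with the square

# Wald test on the squared term, from the coefficient table
wald_p <- coef(summary(reg2))["I(X1^2)", "Pr(>|z|)"]

# Likelihood-ratio test between the two nested models
lrt   <- anova(reg, reg2, test = "Chisq")
lrt_p <- lrt[2, "Pr(>Chi)"]

print(c(wald = wald_p, lrt = lrt_p))
```

Small p-values in both tests indicate that the squared term improves the fit beyond what chance would explain.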

Now, if we run a regression including this quadratic effect, what do we have?

> reg=glm(Y~X1+I(X1^2)+X2,family=binomial)
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(predict(reg)[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(predict(reg)[Y==1],residuals(reg)[Y==1]),col="red")
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
> abline(h=0,lty=2,col="grey")

Actually, it looks like we're back where we were initially ... so what is my point? My point is that:

  • graphs (yes, plural) can be used to see what might go wrong and to get more intuition about possible nonlinear transformations
  • graphs are not everything, and they will never be perfect! Here, in theory, the plain line should be a horizontal straight line. But we also want a model that's as simple as possible. So, at some stage, we should probably give up and rely on statistical tests and confidence intervals. Yes, an almost-flat line can be interpreted as flat.
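As a sketch of what relying on tests and confidence intervals could look like, one can add a further term (here a cubic in X1, chosen purely for illustration) and look at its Wald confidence interval and at a likelihood-ratio test. The simulated data, coefficients and seed below are assumptions, not the original dataset.

```r
# Hypothetical simulated data with a quadratic effect in X1
# (NOT the dataset from probit.R)
set.seed(1)
n  <- 1000
X1 <- rnorm(n)
X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(-1 + 0.5 * X1 + X1^2 - 0.5 * X2))

reg2 <- glm(Y ~ X1 + I(X1^2) + X2, family = binomial)
reg3 <- glm(Y ~ X1 + I(X1^2) + I(X1^3) + X2, family = binomial)

# Wald 95% confidence interval for the extra (cubic) coefficient
ci <- confint.default(reg3)["I(X1^3)", ]

# Likelihood-ratio test: does the cubic term improve the fit?
lrt_p <- anova(reg2, reg3, test = "Chisq")[2, "Pr(>Chi)"]

print(ci)
print(lrt_p)
```

If the interval contains zero and the test does not reject, the simpler model is a reasonable place to stop, even when the smoothed residual line is not perfectly flat.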


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
