# What to Do with Residuals from a Logistic Regression

I always claim that graphs are important in econometrics and statistics! Of course, it is usually not that simple. Let me come back to a recent experience. Yesterday I got an email from Sami, who sent me a graph of residuals obtained from a logistic regression and asked what could be done with it. To get a better understanding, let us consider the following dataset. The data are simulated, but let us assume, as in practice, that we do not know the true model (this is why the code that generates them is tucked away in an R source file).

```r
source("http://freakonometrics.free.fr/probit.R")
reg=glm(Y~X1+X2,family=binomial)
```
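
If the remote script cannot be reached, a stand-in dataset can be simulated locally. The sketch below is an assumption, not the content of `probit.R`: the seed, the sample size and the coefficients (including the squared term in `X1`) are hypothetical choices, picked only so that the rest of the walkthrough has something to find.

```r
# Hypothetical stand-in for probit.R (the real script may generate data differently)
set.seed(123)                      # arbitrary seed, for reproducibility
n   <- 1000                        # assumed sample size
X1  <- rnorm(n)
X2  <- rnorm(n)
eta <- -1 + X1 + X1^2 - X2         # assumed linear predictor, quadratic in X1
Y   <- rbinom(n, size = 1, prob = plogis(eta))

# Same model as in the text: X1 and X2 enter linearly
reg <- glm(Y ~ X1 + X2, family = binomial)
```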

If we use R’s diagnostic plots, the first one is the scatterplot of the residuals against the predicted values (the score, actually, i.e. the linear predictor):

`plot(reg,which=1)`

which is simply:

```r
plot(predict(reg),residuals(reg))
abline(h=0,lty=2,col="grey")
```

Why do we have those two lines of points? Because we predict a probability for a variable taking values 0 or 1. If the true value is 0, then we always predict more, and residuals have to be negative (the blue points), and if the true value is 1, then we underestimate, and residuals have to be positive (the red points). And of course, within each group there is a monotone relationship between the score and the residual. We can see more clearly what’s going on when we use colors.
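
To make the sign argument concrete, here is a minimal sketch that rebuilds the residuals by hand. It relies on the fact that `residuals()` on a `glm` object returns deviance residuals by default. The simulated data are hypothetical and only there to make the example runnable.

```r
# Simulated data (hypothetical; only used to make the example runnable)
set.seed(1)
n   <- 200
X1  <- rnorm(n); X2 <- rnorm(n)
Y   <- rbinom(n, 1, plogis(X1 - X2))
reg <- glm(Y ~ X1 + X2, family = binomial)

p <- fitted(reg)                   # predicted probabilities, strictly between 0 and 1
# Deviance residual for a 0/1 response:
# sign(y - p) * sqrt(-2 * log-likelihood contribution of the observation)
d <- sign(Y - p) * sqrt(-2 * (Y * log(p) + (1 - Y) * log(1 - p)))

all.equal(unname(d), unname(residuals(reg)))   # same values as residuals(reg)
range(d[Y == 0])                               # residuals when the true value is 0
range(d[Y == 1])                               # residuals when the true value is 1
```

For a fixed value of `Y`, `d` depends on the data only through `p`, which is itself an increasing function of the score. This is why each group traces a single curve.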

```r
plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
abline(h=0,lty=2,col="grey")
```

Within each group, the points lie exactly on a smooth curve, as a function of the predicted value.

Now, there is nothing to learn from this graph as it stands. If we want to understand, we have to run a local regression to see what’s going on.

`lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)`

This is exactly what we get with the first function, `plot(reg,which=1)`. But this local regression does not come with a confidence interval. Can’t we pretend that the solid dark line is very close to the dotted line?

```r
library(splines)   # provides bs()
rl=lm(residuals(reg)~bs(predict(reg),8))
#rl=loess(residuals(reg)~predict(reg))
y=predict(rl,se.fit=TRUE)
# vertical segments: fitted value plus or minus two standard errors
segments(predict(reg),y$fit+2*y$se.fit,predict(reg),y$fit-2*y$se.fit,col="green")
```

Yes, we can. And even if we suspected that the model could be improved, what would this graph suggest?
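
The visual check can also be summarised with a number. The sketch below is one rough way to do it, on hypothetical simulated data: it computes the share of observations for which the band of plus or minus two standard errors contains zero. The band is pointwise and comes from an ordinary `lm` fit, so it should be read as an informal indication, not as a formal test.

```r
library(splines)   # provides bs()

# Simulated data (hypothetical; only used to make the example runnable)
set.seed(1)
n   <- 500
X1  <- rnorm(n); X2 <- rnorm(n)
Y   <- rbinom(n, 1, plogis(X1 - X2))
reg <- glm(Y ~ X1 + X2, family = binomial)

score <- predict(reg)              # linear predictor
res   <- residuals(reg)            # deviance residuals
rl    <- lm(res ~ bs(score, 8))    # spline smoother, as in the text
y     <- predict(rl, se.fit = TRUE)

lower  <- y$fit - 2 * y$se.fit
upper  <- y$fit + 2 * y$se.fit
inside <- mean(lower <= 0 & upper >= 0)   # share of points whose band contains 0
inside
```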

Actually, that graph is probably not the only way to look at the residuals. What about plotting them against the two explanatory variables? For instance, if we plot the residuals against the second one, we get:

```r
plot(X2,residuals(reg),col=c("blue","red")[1+Y])
lines(lowess(X2,residuals(reg)),col="black",lwd=2)        # all observations
lines(lowess(X2[Y==0],residuals(reg)[Y==0]),col="blue")   # observations with Y = 0
lines(lowess(X2[Y==1],residuals(reg)[Y==1]),col="red")    # observations with Y = 1
abline(h=0,lty=2,col="grey")
```

The graph is similar to the one we had earlier, and again, there is not much to say.
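
Since the same plot is needed for each explanatory variable, the commands can be wrapped in a small function. The helper below, `plot_resid`, is a hypothetical name introduced here for illustration and is not part of the original code. The simulated data are also an assumption, included only so that the block runs on its own.

```r
# Hypothetical helper (not from the original post): residuals against one covariate,
# with an overall lowess curve and one lowess curve per value of the response
plot_resid <- function(x, fit, y, xlab = "covariate") {
  r <- residuals(fit)
  plot(x, r, col = c("blue", "red")[1 + y], xlab = xlab, ylab = "residuals")
  overall <- lowess(x, r)
  lines(overall, col = "black", lwd = 2)
  lines(lowess(x[y == 0], r[y == 0]), col = "blue")
  lines(lowess(x[y == 1], r[y == 1]), col = "red")
  abline(h = 0, lty = 2, col = "grey")
  invisible(overall)               # return the overall smooth for further inspection
}

# Simulated data (hypothetical; only used to make the example runnable)
set.seed(1)
n   <- 500
X1  <- rnorm(n); X2 <- rnorm(n)
Y   <- rbinom(n, 1, plogis(-1 + X1 + X1^2 - X2))
reg <- glm(Y ~ X1 + X2, family = binomial)
```

With these objects, `plot_resid(X2, reg, Y, xlab = "X2")` draws the graph for the second variable, and `plot_resid(X1, reg, Y, xlab = "X1")` the one for the first.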

If we now look at the relationship with the first one, it starts to be more interesting:

```r
plot(X1,residuals(reg),col=c("blue","red")[1+Y])
lines(lowess(X1,residuals(reg)),col="black",lwd=2)        # all observations
lines(lowess(X1[Y==0],residuals(reg)[Y==0]),col="blue")   # observations with Y = 0
lines(lowess(X1[Y==1],residuals(reg)[Y==1]),col="red")    # observations with Y = 1
abline(h=0,lty=2,col="grey")
```

We can clearly identify a quadratic effect. This graph suggests that we should include the square of the first variable in the regression, and that term turns out to be significant.
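
One way to check the significance of the squared term is sketched below, using a Wald test from `summary()` and a likelihood-ratio test from `anova()`. The data are simulated under an assumed model with a quadratic effect in `X1`, so the numbers it prints are not those of the original dataset.

```r
# Simulated data (hypothetical; the true model is assumed quadratic in X1)
set.seed(1)
n  <- 1000
X1 <- rnorm(n); X2 <- rnorm(n)
Y  <- rbinom(n, 1, plogis(-1 + X1 + X1^2 - X2))

reg0 <- glm(Y ~ X1 + X2, family = binomial)             # linear in X1
reg1 <- glm(Y ~ X1 + I(X1^2) + X2, family = binomial)   # adds the squared term

# Wald test: estimate, standard error, z value and p-value of the squared term
summary(reg1)$coefficients["I(X1^2)", ]

# Likelihood-ratio test between the two nested models
lrt <- anova(reg0, reg1, test = "Chisq")
lrt
```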

Now, if we run a regression including this quadratic effect, what do we have?

```r
reg=glm(Y~X1+I(X1^2)+X2,family=binomial)   # same model, plus the square of X1
plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
lines(lowess(predict(reg)[Y==0],residuals(reg)[Y==0]),col="blue")
lines(lowess(predict(reg)[Y==1],residuals(reg)[Y==1]),col="red")
lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
abline(h=0,lty=2,col="grey")
```

Actually, it looks like we're back where we were initially ... so what is my point? My point is that:

• graphs (yes, plural) can be used to see what might go wrong and to get some intuition about possible nonlinear transformations
• graphs are not everything, and they will never be perfect! Here, in theory, the solid line should be straight and horizontal. But we also want a model that's as simple as possible. So, at some stage, we should probably give up and rely on statistical tests and confidence intervals. Yes, an almost-flat line can be interpreted as flat.

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.