Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

DZone's Guide to

### A data scientist and programmer looks at how to R to analyze data from quantile regressions and visual that data in graphs.

· Big Data Zone ·
Free Resource

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Learn how to operationalize machine learning and data science projects to monetize your AI initiatives. Download the Gartner report now.

After my series of posts on classification algorithms, it’s time to get back to R, this time for quantile regression. Yes, I still want to get a better understanding of optimization routines in R. Before looking at the quantile regression, let us compute the median, or the quantile, from a sample.

## Median

Consider a sample {y1,⋯,yn}. To compute the median, solve:

which can be solved using linear programming techniques. More precisely, this problem is equivalent to

ai,bi≥0 and yiμ=aibii=1,⋯,n.

To illustrate, consider a sample from a lognormal distribution,

n = 101
set.seed(1)
y = rlnorm(n)
median(y)
[1] 1.077415

For the optimization problem, use the matrix form, with 3n constraints, and 2n+1 parameters:

library(lpSolve)
A1 = cbind(diag(2*n),0)
A2 = cbind(diag(n), -diag(n), 1)
r = lp("min", c(rep(1,2*n),0),
rbind(A1, A2),c(rep("&gt;=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution,1) [1] 1.077415 It looks like it’s working well… ## Quantile Of course, we can adapt our previous code for quantiles: tau = .3 quantile(x,tau) 30% 0.6741586 The linear program is now with ai,bi≥0 and yiμ=aibii=1,⋯,n. The R code is now: A1 = cbind(diag(2*n),0) A2 = cbind(diag(n), -diag(n), 1) r = lp("min", c(rep(tau,n),rep(1-tau,n),0), rbind(A1, A2),c(rep("&gt;=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,1)
[1] 0.6741586

So far so good…

## Quantile Regression (Simple)

Consider the following dataset, with rents of flats in a major German city, as a function of the surface, the year of construction, etc.

base=read.table("http://freakonometrics.free.fr/rent98_00.txt",header=TRUE)

The linear program for the quantile regression is now

with ai,bi≥0 and

yi−[β0τ+β1τxi]=aibi

i=1,⋯,n. So use here

require(lpSolve)
tau = .3
n=nrow(base)
X = cbind( 1, base$area) y = base$rent_euro
A1 = cbind(diag(2*n), 0,0)
A2 = cbind(diag(n), -diag(n), X)
r = lp("min",
c(rep(tau,n), rep(1-tau,n),0,0), rbind(A1, A2),
c(rep("&gt;=", 2*n), rep("=", n)), c(rep(0,2*n), y))
tail(r$solution,2) [1] 148.946864 3.289674 Of course, we can use an R function to fit that model: library(quantreg) rq(rent_euro~area, tau=tau, data=base) Coefficients: (Intercept) area 148.946864 3.289674 Here again, it seems to work quite well. We can use a different probability level, of course, and get a plot. plot(base$area,base$rent_euro,xlab=expression(paste("surface (",m^2,")")), ylab="rent (euros/month)",col=rgb(0,0,1,.4),cex=.5) sf=0:250 yr=r$solution[2*n+1]+r$solution[2*n+2]*sf lines(sf,yr,lwd=2,col="blue") tau = .9 r = lp("min", c(rep(tau,n), rep(1-tau,n),0,0), rbind(A1, A2), c(rep("&gt;=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,2)
[1] 121.815505   7.865536
yr=r$solution[2*n+1]+r$solution[2*n+2]*sf
lines(sf,yr,lwd=2,col="blue")

## Quantile Regression (Multiple)

Now that we understand how to run the optimization program with one covariate, why not try with two? For instance, let us see if we can explain the rent of a flat as a (linear) function of the surface and the age of the building.

require(lpSolve)
tau = .3
n=nrow(base)
X = cbind( 1, base$area, base$yearc )
y = base$rent_euro A1 = cbind(diag(2*n), 0,0,0) A2 = cbind(diag(n), -diag(n), X) r = lp("min", c(rep(tau,n), rep(1-tau,n),0,0,0), rbind(A1, A2), c(rep("&gt;=", 2*n), rep("=", n)), c(rep(0,2*n), y)) tail(r$solution,3)
[1] 0.000000 3.257562 0.077501

Unfortunately, this time, it is not working well…

library(quantreg)
rq(rent_euro~area+yearc, tau=tau, data=base)
Coefficients:
(Intercept)         area        yearc
-5542.503252     3.978135     2.887234

Results are quite different. And, actually, another technique can confirm the latter (IRLS – Iteratively Reweighted Least Squares)

eps = residuals(lm(rent_euro~area+yearc, data=base))
for(s in 1:500){
reg = lm(rent_euro~area+yearc, data=base, weights=(tau*(eps&gt;0)+(1-tau)*(eps&lt;0))/abs(eps))
eps = residuals(reg)
}
reg$coefficients (Intercept) area yearc -5484.443043 3.955134 2.857943 I could not figure out what went wrong with the linear program. Not only are the coefficients very different, but so are the predictions: yr = r$solution[2*n+1]+r$solution[2*n+2]*base$area+r$solution[2*n+3]*base$yearc
plot(predict(reg),yr)
abline(a=0,b=1,lty=2,col="red")

It’s now time to investigate….

Bias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Our Chief Data Scientist discusses the source of most headlines about AI failures here.

Topics:
big data ,quantile regression ,r

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.