Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

DZone's Guide to

# Non-observable vs. Observable Heterogeneity Factor

· Big Data Zone ·
Free Resource

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Hortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

Recently, in the ACT2040 class (on non-life insurance), we’ve discussed the difference between observable and non-observable heterogeneity in ratemaking (from an economic perspective). To illustrate that point (we will spend more time, later on, discussing observable and non-observable risk factors), we looked at the following simple example. Let  $X$ denote the height of a person. Consider the following dataset:
```> Davis=read.table(
+ "http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Davis```

There is a small typo in the dataset, so let us make manual changes here.

`> Davis[12,c(2,3)]=Davis[12,c(3,2)] `

Here, the variable of interest is the height of a given person:

`> X=Davis\$height `

If we look at the histogram, we have:

`> hist(X,col="light green", border="white",proba=TRUE,xlab="",main="")`

Can we assume that we have a Gaussian distribution?

$X\sim\mathcal{N}(\theta,\sigma^2)$ Maybe not … here, if we fit a Gaussian distribution, plot it, and add a kernel based estimator, we get:

```> (param <- fitdistr(X,"normal")\$estimate)
> f1 <- function(x) dnorm(x,param[1],param[2])
> x=seq(100,210,by=.2)
> lines(x,f1(x),lty=2,col="red")
> lines(density(X))```

If you look at that black line, you might think of a mixture, something like:

$X\sim p_1\cdot\mathcal{N}(\theta_1,\sigma_1^2)+p_2\cdot\mathcal{N}(\theta_2,\sigma_2^2)$

(using standard mixture notations). Mixtures are obtained when we have a non-observable heterogeneity factor: with probability $p_1$, we have a random variable $\mathcal{N}(\mu_1,\sigma_1^2)$ (call it type [1]), and with probability $p_2$, a random variable $\mathcal{N}(\mu_2,\sigma_2^2)$ (call it type [2]). So far, nothing new. And we can fit such a mixture distribution, such as:

```> library(mixtools)
> mix <- normalmixEM(X)
number of iterations= 335
> (param12 <- c(mix\$lambda[1],mix\$mu,mix\$sigma))
[1] 0.4002202 178.4997298 165.2703616 6.3561363 5.9460023```

If we plot that mixture of two Gaussian distributions, we get:

```> f2 <- function(x){ param12[1]*dnorm(x,param12[2],param12[4])
+ (1-param12[1])*dnorm(x,param12[3],param12[5]) }
> lines(x,f2(x),lwd=2, col="red") lines(density(X))```

Not bad. Actually, we can try to maximize the likelihood with our own codes:

```> logdf <- function(x,parameter){
+ p <- parameter[1]
+ m1 <- parameter[2]
+ s1 <- parameter[4]
+ m2 <- parameter[3]
+ s2 <- parameter[5]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0,0,0,
+ 0,0,0,0,1,0,0,0,0,0,0,0,0,1), 4, 5)
> bvec <- c(0,-1,0,0)
> constrOptim(c(.5,160,180,10,10), logL, NULL, ui = Amat, ci = bvec)\$par

[1]   0.5996263 165.2690084 178.4991624   5.9447675   6.3564746```

Here, we include some constraints, to insurance that the probability belongs to the unit interval, and that the variance parameters remain positive. Note that we have something close to the previous output.

Let us try something a little bit more complex now. What if we assume that the underlying distributions have the same variance, namely:

$X\sim p_1\cdot\mathcal{N}(\theta_1,\sigma^2)+p_2\cdot\mathcal{N}(\theta_2,\sigma^2)$

In that case, we have to use the previous code, and make small changes:

```> logdf <- function(x,parameter){
+ p <- parameter[1]
+ m1 <- parameter[2]
+ s1 <- parameter[4]
+ m2 <- parameter[3]
+ s2 <- parameter[4]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0,0,0,0,0,0,0,0,1), 3, 4)
> bvec <- c(0,-1,0)
> (param12c= constrOptim(c(.5,160,180,10), logL, NULL, ui = Amat, ci = bvec)\$par)

[1]   0.6319105 165.6142824 179.0623954   6.1072614```

This is what we can do if we cannot observe the heterogeneity factor. But wait … we actually have some information in the dataset. For instance, we have the sex of the person. Now, if we look at histograms of height per sex, and kernel-based density estimator of the height per sex, we have

So, it looks like the height for male, and the height for female are different. Maybe we can use that variable, that was actually observed, to explain the heterogeneity in our sample. Formally, the idea is to consider a mixture, with an observable heterogeneity factor: the sex.

$X\sim p_H\cdot\mathcal{N}(\theta_H,\sigma_H^2)+p_F\cdot\mathcal{N}(\theta_F,\sigma_F^2)$

We now have interpretation of what we used to call class [1] and [2] previously: male and female. And here, estimating parameters is quite simple,

```>  (pM <- mean(sex=="M"))
[1] 0.44
>  (paramF <- fitdistr(X[sex=="F"],"normal")\$estimate)
mean         sd
164.714286   5.633808
>  (paramM <- fitdistr(X[sex=="M"],"normal")\$estimate)
mean         sd
178.011364   6.404001```

And if we plot the density, we have

```> f4 <- function(x) pM*dnorm(x,paramM[1],paramM[2])+(1-pM)*dnorm(x,paramF[1],paramF[2])
> lines(x,f4(x),lwd=3,col="blue")```

What if, once again, we assume identical variance? Namely, the model becomes:

$X\sim p_H\cdot\mathcal{N}(\theta_H,\sigma^2)+p_F\cdot\mathcal{N}(\theta_F,\sigma^2)$

Then a natural idea to derive an estimator for the variance, based on previous computations, is to use:

$\sigma^2=\frac{1}{n-2}\left(\sum_{i:H} [X_i-\overline{X}_H>^2+\sum_{i:F} [X_i-\overline{X}_F]^2\right)$

The code is here:

```> s=sqrt((sum((height[sex=="M"]-paramM[1])^2)+sum((height[sex=="F"]-paramF[1])^2))/(nrow(Davis)-2))
> s
[1] 6.015068```

and again, it is possible to plot the associated density,

```> f5 <- function(x) pM*dnorm(x,paramM[1],s)+(1-pM)*dnorm(x,paramF[1],s)
> lines(x,f5(x),lwd=3,col="blue")```

Now, if we think a little about what we’ve just done, it is simply a linear regression on a factor, the sex of the person:

$X=\mu_H\cdot\boldsymbol{1}(H)+\mu_F\cdot\boldsymbol{1}(F)+\varepsilon$

where $\varepsilon\sim\mathcal{N}(0,\sigma^2)$.  And indeed, we can run the code to estimate this linear model:

```> summary(lm(height~sex,data=Davis))

Call:
lm(formula = height ~ sex, data = Davis)

Residuals:
Min       1Q   Median       3Q      Max
-16.7143  -3.7143  -0.0114   4.2857  18.9886

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 164.7143     0.5684  289.80   <2e-16 ***
sexM         13.2971     0.8569   15.52   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.015 on 198 degrees of freedom
Multiple R-squared:  0.5488,	Adjusted R-squared:  0.5465
F-statistic: 240.8 on 1 and 198 DF,  p-value: < 2.2e-16```

And we get the same estimators for the means and the variance as the ones obtained previously. So, as mentioned recently in class, if you have a non-observable heterogeneity factor, we can use a mixture model to fit a distribution, but if you can get a proxy of that factor that is observable, then you can run a regression. But most of the time, that observable variable is just a proxy of a non-observable one …

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub.  Join the discussion.

Topics:

Comment (0)

Save
{{ articles[0].views | formatCount}} Views

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.