# Overdispersion with Different Exposures



In actuarial science and insurance ratemaking, taking the **exposure** into account can be a nightmare (in datasets, some clients have been observed for a few years while others have been here for only a few months, or weeks; that observation time is what we call exposure). Somehow, simple results become more complicated to compute just because we have to take into account the fact that exposure is a heterogeneous variable.

The exposure in insurance ratemaking can be seen as a problem of censored data (in my dataset, the exposure $E$ is always smaller than 1 since observations are contracts, not policyholders):

- the number of claims $N$ on the period $[0,1]$ is unobserved,
- the number of claims $Y$ on $[0,E]$ is observed.

And as always, the variable of interest is the unobserved one, because we have to price insurance contracts with a cover period of one (full) year. So we have to model the yearly frequency of insurance claims.

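In our dataset, we have $(Y_i,E_i)$'s, or more generally also some additional covariates $X_i$'s. For ratemaking, we need to estimate $\mathbb{E}(N)$ and perhaps also $\text{Var}(N)$ (for instance to test whether the Poisson assumption is valid). To estimate the expected value, a natural estimate for $\mathbb{E}(N)$ (forget about covariates as a start) is

$$m_N=\frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i}$$

which is also the weighted average of annualized individual counts

$$m_N=\sum_{i=1}^n \omega_i\,\frac{Y_i}{E_i}\quad\text{with}\quad\omega_i=\frac{E_i}{\sum_{j=1}^n E_j}$$

That is, we consider the ratio of the total number of claims to the total exposure-to-risk. This estimate appears for instance if we consider a Poisson process, so that $N_i\sim\mathcal{P}(\lambda)$ while $Y_i\sim\mathcal{P}(\lambda E_i)$. Then, the likelihood is

$$\mathcal{L}(\lambda;Y,E)=\prod_{i=1}^n \frac{e^{-\lambda E_i}(\lambda E_i)^{Y_i}}{Y_i!}$$

i.e.

$$\log\mathcal{L}(\lambda;Y,E)=-\lambda\sum_{i=1}^n E_i+\sum_{i=1}^n Y_i\log(\lambda E_i)-\log\left(\prod_{i=1}^n Y_i!\right)$$

The first order condition is here

$$\frac{\partial}{\partial\lambda}\log\mathcal{L}(\lambda;Y,E)=-\sum_{i=1}^n E_i+\frac{1}{\lambda}\sum_{i=1}^n Y_i=0$$

which is satisfied if

$$\widehat{\lambda}=\frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i}=m_N$$

So, we do have an estimator for the expected value, and a natural estimator for $\mathbb{E}(N|X=x)$ is then (if we consider categorical covariates)

$$m_{N,x}=\frac{\sum_{i:X_i=x} Y_i}{\sum_{i:X_i=x} E_i}$$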
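As a quick sanity check, the three ways of writing the estimator (ratio of totals, exposure-weighted mean of annualized counts, Poisson maximum likelihood with log-exposure as an offset) can be compared numerically. This is only a sketch on simulated data: the portfolio size, the frequency `lambda` and the uniform exposures are assumptions made for illustration, not values taken from the dataset used later in the post.

```r
# Simulated portfolio (hypothetical values, not the article's dataset)
set.seed(1)
n      <- 10000
lambda <- 0.07                    # assumed yearly claim frequency
E      <- runif(n, 0.05, 1)       # exposures, in years
Y      <- rpois(n, lambda * E)    # claims observed on [0, E]

# Ratio of the total number of claims to the total exposure
m_ratio <- sum(Y) / sum(E)

# Exposure-weighted average of the annualized counts Y/E
m_weighted <- weighted.mean(Y / E, E)

# Poisson maximum likelihood, with log(E) entering as an offset
fit   <- glm(Y ~ 1, family = poisson("log"), offset = log(E))
m_mle <- unname(exp(coef(fit)))

print(c(ratio = m_ratio, weighted = m_weighted, mle = m_mle))
```

The first two quantities are algebraically identical, and the third should agree with them up to the numerical tolerance of the fitting algorithm.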

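Now, we need an estimate for the variance, or more precisely the conditional variance. Assume (as a starting point) that all contracts have the same exposure $E$. For instance, if $E$ is one half, insured were observed only the first six months. Then $N=Y_1+Y_2$ ($Y_1$ is the number of claims on the first six months, while $Y_2$ is the number of claims on the last six months), and $\text{Var}(N)=\text{Var}(Y_1)+\text{Var}(Y_2)$ if we assume independent increments. I.e., with $Y_1$ and $Y_2$ identically distributed,

$\text{Var}(N)=2\,\text{Var}(Y)$, or conversely $\text{Var}(Y)=\frac{1}{2}\text{Var}(N)$. More generally, it is reasonable to assume that

$$\text{Var}(Y)=E\cdot\text{Var}(N)$$

for all values of $E\in(0,1]$. And then

$$\text{Var}(N)=\frac{1}{E}\,\text{Var}(Y)$$

Thus, it seems legitimate to assume that the empirical variance of $N$ can be written

$$V_N=\frac{1}{E}\cdot\frac{1}{n}\sum_{i=1}^n\left(Y_i-\overline{Y}\right)^2$$

Since the average of $Y$ is $E\cdot m_N$, then

$$V_N=\frac{1}{nE}\sum_{i=1}^n\left(Y_i-m_N E\right)^2$$

or equivalently (since $nE$ is the total exposure)

$$V_N=\frac{\sum_{i=1}^n\left(Y_i-m_N E\right)^2}{\sum_{i=1}^n E}$$

Thus, with different $E_i$'s, it would be legitimate (I guess) to consider

$$V_N=\frac{\sum_{i=1}^n\left(Y_i-m_N E_i\right)^2}{\sum_{i=1}^n E_i}$$

Thus, an estimator for $\text{Var}(N|X=x)$ is

$$V_{N,x}=\frac{\sum_{i:X_i=x}\left(Y_i-m_{N,x}E_i\right)^2}{\sum_{i:X_i=x}E_i}$$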
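To see how the exposure-adjusted variance behaves, here is a small sketch on simulated data. The helper name `exposure_moments` and all numerical values (frequency 0.07, uniform exposures, Gamma heterogeneity) are assumptions for illustration. In the homogeneous Poisson portfolio the variance-to-mean ratio should be close to 1; in the heterogeneous one, where each contract has its own frequency, it should come out clearly above 1.

```r
# Exposure-adjusted mean and variance of the yearly claim count
# (hypothetical helper, written for this illustration)
exposure_moments <- function(Y, E) {
  m <- sum(Y) / sum(E)               # total claims / total exposure
  v <- sum((Y - m * E)^2) / sum(E)   # squared deviations / total exposure
  c(mean = m, variance = v, ratio = v / m)
}

set.seed(2)
n <- 100000
E <- runif(n, 0.05, 1)               # simulated exposures, in years

# Case 1: homogeneous Poisson portfolio, yearly frequency 0.07
Y_pois <- rpois(n, 0.07 * E)

# Case 2: heterogeneous portfolio, contract-specific frequency
# drawn from a Gamma distribution with mean 0.07
theta <- rgamma(n, shape = 0.1, rate = 0.1 / 0.07)
Y_mix <- rpois(n, theta * E)

res <- rbind(poisson = exposure_moments(Y_pois, E),
             mixed   = exposure_moments(Y_mix, E))
print(res)
```

Note that in the mixed case the variance of $Y$ is no longer proportional to the exposure, so the statistic should be read there as a diagnostic of overdispersion rather than as an exact estimate of the yearly variance.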

This can be used to test if the Poisson assumption is valid to model frequency. Consider the following dataset,

    > sinistre=read.table("http://freakonometrics.free.fr/sinistreACT2040.txt",
    + header=TRUE,sep=";")
    > sinistres=sinistre[sinistre$garantie=="1RC",]
    > sinistres=sinistres[sinistres$cout>0,]
    > contrat=read.table("http://freakonometrics.free.fr/contractACT2040.txt",
    + header=TRUE,sep=";")
    > T=table(sinistres$nocontrat)
    > T1=as.numeric(names(T))
    > T2=as.numeric(T)
    > nombre1 = data.frame(nocontrat=T1,nbre=T2)
    > I = contrat$nocontrat%in%T1
    > T1= contrat$nocontrat[I==FALSE]
    > nombre2 = data.frame(nocontrat=T1,nbre=0)
    > nombre=rbind(nombre1,nombre2)
    > baseFREQ = merge(contrat,nombre)

Here, we do have our two variables of interest, the exposure, per contract,

    > E <- baseFREQ$exposition

and the (observed) number of claims (during that time frame)

    > Y <- baseFREQ$nbre

It is possible to compute without covariates, the average (yearly) number of claims, per contract, and the associated variance

    > (mean=weighted.mean(Y/E,E))
    [1] 0.07279295
    > (variance=sum((Y-mean*E)^2)/sum(E))
    [1] 0.08778567

It looks like the variance is (slightly) larger than the average (we’ll see in a few weeks how to test it, more formally). It is possible to add covariates, for instance the density of population, in the area where the policyholder lives,
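One informal way to put a number on "slightly larger" is the Pearson dispersion reported by a quasi-Poisson fit with log-exposure as an offset. The sketch below uses simulated Poisson data (all values are assumptions), so the dispersion should be close to 1. It is not necessarily the formal test alluded to above, and the Pearson statistic weights observations differently from the variance estimator used in this post.

```r
# Simulated Poisson portfolio (hypothetical values)
set.seed(3)
n <- 50000
E <- runif(n, 0.05, 1)       # exposures, in years
Y <- rpois(n, 0.07 * E)      # observed claim counts

# Intercept-only quasi-Poisson model, exposure entering as an offset
fit <- glm(Y ~ 1, family = quasipoisson("log"), offset = log(E))

# Pearson dispersion: sum((Y - mu)^2 / mu) / residual degrees of freedom
phi <- summary(fit)$dispersion
print(phi)
```

On real data, a value well above 1 would point towards overdispersion relative to the Poisson model.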

    > X=as.factor(baseFREQ$densite)
    > nlev=length(levels(X))
    > meani=variancei=Ei=rep(NA,nlev)   # one value per zone
    > for(i in 1:nlev){
    + I=(X==levels(X)[i])
    + Ei[i]=sum(E[I]) # total exposure
    + meani[i]=weighted.mean(Y[I]/E[I],E[I]) # mean
    + variancei[i]=sum((Y[I]-meani[i]*E[I])^2)/Ei[i] # variance
    + cat("Density, zone",levels(X)[i],"average =",meani[i]," variance =",variancei[i],"\n")
    + }
    Density, zone 11 average = 0.07962411  variance = 0.08711477 
    Density, zone 21 average = 0.05294927  variance = 0.07378567 
    Density, zone 22 average = 0.09330982  variance = 0.09582698 
    Density, zone 23 average = 0.06918033  variance = 0.07641805 
    Density, zone 24 average = 0.06004009  variance = 0.06293811 
    Density, zone 25 average = 0.06577788  variance = 0.06726093 
    Density, zone 26 average = 0.0688496  variance = 0.07126078 
    Density, zone 31 average = 0.07725273  variance = 0.09067 
    Density, zone 41 average = 0.03649222  variance = 0.03914317 
    Density, zone 42 average = 0.08333333  variance = 0.1004027 
    Density, zone 43 average = 0.07304602  variance = 0.07209618 
    Density, zone 52 average = 0.06893741  variance = 0.07178091 
    Density, zone 53 average = 0.07725661  variance = 0.07811935 
    Density, zone 54 average = 0.07816105  variance = 0.08947993 
    Density, zone 72 average = 0.08579731  variance = 0.09693305 
    Density, zone 73 average = 0.04943033  variance = 0.04835521 
    Density, zone 74 average = 0.1188611  variance = 0.1221675 
    Density, zone 82 average = 0.09345635  variance = 0.09917425 
    Density, zone 83 average = 0.04299708  variance = 0.05259835 
    Density, zone 91 average = 0.07468126  variance = 0.3045718 
    Density, zone 93 average = 0.08197912  variance = 0.09350102 
    Density, zone 94 average = 0.03140971  variance = 0.04672329 

Perhaps graphs would be a nice tool to play with, to visualize that information

    > cexi=3*sqrt(Ei/max(Ei))   # circle area proportional to exposure (scaling is arbitrary)
    > plot(meani,variancei,cex=cexi,col="grey",pch=19,
    + xlab="Empirical average",ylab="Empirical variance")
    > points(meani,variancei,cex=cexi)
    > abline(a=0,b=1,lty=2)   # first diagonal: variance equal to the mean

The size of the circles is related to the size of the group (the area is proportional to the total exposure within the group). The first diagonal corresponds to the Poisson model, i.e. the variance should be equal to the mean. It is also possible to consider other covariates, like the gas type

or the car brand,

It is also possible to consider the age of the driver as a categorical variate

Actually, the age is interesting: we can observe on that dataset a feature that Jean-Philippe Boucher also observed on his own datasets. Let us look more carefully at where the different ages are,

On the right, we can observe young (inexperienced) drivers. That was expected. But some classes are *below* the first diagonal: the expected frequency is large, but not the variance. I.e. we know *for sure* that young drivers have more car accidents. It is not a heterogeneous class, on the contrary: young drivers can be seen as a relatively homogeneous class, with a high frequency of car accidents.

With the original dataset (here, I use only a subset with 50,000 clients), we do obtain the following graph:

Even if we do not observe *underdispersion* for young drivers here, those are incredibly homogeneous classes, with a clear impact of experience: the circles move downward from age 18 to 25.

Another disturbing story (once more, a suggestion from Jean-Philippe) is that it might be possible to include the (log) exposure as a standard explanatory variable, and see if its coefficient is actually equal to 1. Without any covariate,

    > reg=glm(Y~log(E),family=poisson("log"))
    > summary(reg)

    Call:
    glm(formula = Y ~ log(E), family = poisson("log"))

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -0.3988  -0.3388  -0.2786  -0.1981  12.9036  

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
    (Intercept) -2.83045    0.02822 -100.31   <2e-16 ***
    log(E)       0.53950    0.02905   18.57   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 12931  on 49999  degrees of freedom
    Residual deviance: 12475  on 49998  degrees of freedom
    AIC: 16150

    Number of Fisher Scoring iterations: 6

i.e. the estimated coefficient is clearly strictly smaller than 1. And this is neither a matter of significance,

    > library(car)
    > linearHypothesis(reg,"log(E)",1)
    Linear hypothesis test

    Hypothesis:
    log(E) = 1

    Model 1: restricted model
    Model 2: Y ~ log(E)

      Res.Df Df  Chisq Pr(>Chisq)    
    1  49999                         
    2  49998  1 251.19  < 2.2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

nor to the fact that I did not take into account covariates,

    > reg=glm(nbre~log(exposition)+carburant+as.factor(ageconducteur)+as.factor(densite),family=poisson("log"),data=baseFREQ)
    > summary(reg)

    Call:
    glm(formula = nbre ~ log(exposition) + carburant + as.factor(ageconducteur) + 
        as.factor(densite), family = poisson("log"), data = baseFREQ)

    Deviance Residuals: 
        Min       1Q   Median       3Q      Max  
    -0.7114  -0.3200  -0.2637  -0.1896  12.7104  

    Coefficients:
                                Estimate Std. Error z value Pr(>|z|)    
    (Intercept)                -14.07321  181.04892  -0.078 0.938042    
    log(exposition)              0.56781    0.03029  18.744  < 2e-16 ***
    carburantE                  -0.17979    0.04630  -3.883 0.000103 ***
    as.factor(ageconducteur)19  12.18354  181.04915   0.067 0.946348    
    as.factor(ageconducteur)20  12.48752  181.04902   0.069 0.945011    

(etc). So it might be too strong an assumption to treat the exposure as an exogenous variate here. But that’s another story!
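For readers who want to reproduce the "is the coefficient equal to 1?" check without the `car` package, here is a minimal sketch on simulated data, where the coefficient of log-exposure is equal to 1 by construction (all numerical values are assumptions). The constrained model is the usual Poisson regression with log(E) as an offset, and the free model estimates the coefficient. A Wald statistic and a likelihood ratio statistic are then computed by hand.

```r
# Simulated portfolio in which exposure acts as a pure offset
set.seed(4)
n <- 50000
E <- runif(n, 0.05, 1)       # exposures, in years
Y <- rpois(n, 0.07 * E)      # true coefficient of log(E) is 1

# Constrained model: coefficient of log(E) fixed to 1 through the offset
fit_offset <- glm(Y ~ 1, family = poisson("log"), offset = log(E))

# Free model: coefficient of log(E) estimated from the data
fit_free <- glm(Y ~ log(E), family = poisson("log"))

beta_hat <- unname(coef(fit_free)["log(E)"])
se_hat   <- summary(fit_free)$coefficients["log(E)", "Std. Error"]

# Wald test of H0: beta = 1
z      <- (beta_hat - 1) / se_hat
p_wald <- 2 * pnorm(-abs(z))

# Likelihood ratio test: the offset model is nested in the free model
lr   <- deviance(fit_offset) - deviance(fit_free)
p_lr <- pchisq(lr, df = 1, lower.tail = FALSE)

print(c(beta = beta_hat, se = se_hat, p_wald = p_wald, p_lr = p_lr))
```

On such simulated data the estimated coefficient should stay close to 1, in contrast with the value of about 0.54 obtained on the real portfolio above.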


Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.
