# Natura Non Facit Saltus

# Natura Non Facit Saltus

Join the DZone community and get the full member experience.

Join For FreeHortonworks Sandbox for HDP and HDF is your chance to get started on learning, developing, testing and trying out new features. Each download comes preconfigured with interactive tutorials, sample data and developments from the Apache community.

(see John Wilkins’ article on the – interesting – history of that phrase http://scienceblogs.com/evolvingthoughts/…). We will see several smoothing techniques for insurance ratemaking. As a starting point, assume that we do not want to use segmentation techniques: everyone will pay exactly the same price.

**no segmentation of the premium**

And that price should be related to the pure premium, which is proportional to the frequency (or the annualized frequency, as discussed previously), since

The probability measure is mentioned here just to recall that we can use any measure. Even (based on some covariates). Without any covariate, the expected frequency should be

> regglm0=glm(nbre~1+offset(log(exposition)),data=sinistres,family=poisson) > summary(regglm0) Call: glm(formula = nbre ~ 1 + offset(log(exposition)), family = poisson, data = sinistres) Deviance Residuals: Min 1Q Median 3Q Max -0.5033 -0.3719 -0.2588 -0.1376 13.2700 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.6201 0.0228 -114.9 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 12680 on 49999 degrees of freedom Residual deviance: 12680 on 49999 degrees of freedom AIC: 16353 Number of Fisher Scoring iterations: 6

> exp(coefficients(regglm0)) (Intercept) 0.07279295

Thus, if we do not want to take into account potential heterogeneity, we should assume that where is closed to 7.28%. Yes, as mentioned in class, it is rather common to see as a percentage, i.e. a probability, since

i.e. can be interpreted as the probability of not have a claim (see also the law of small numbers). Let us visualize this as a function of the age of the driver,

> a=18:100 > yp=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),type="response",se.fit=TRUE) > yp0=yp$fit > yp1=yp$fit+2*yp$se.fit > yp2=yp$fit-2*yp$se.fit > plot(a,yp0,type="l",ylim=c(.03,.12)) > abline(v=40,col="grey") > lines(a,yp1,lty=2) > lines(a,yp2,lty=2) > k=23 > points(a[k],yp0[k],pch=3,lwd=3,col="red") > segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)

We do predict the same frequency for all drivers, e.g. for some drive aged 40,

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k]) Frequency = 0.07279295 confidence interval 0.07611196 0.06947393

Let us now consider the case where we try to take into account heterogeneity, e.g. by age,

**The (standard) Poisson regression**

The idea of the (log-)Poisson regression is to assume that instead of having , we should have , where

in a very general setting. Here, let us consider only one explanatory variable, i.e.

Here, we have

> yp=predict(regglm1,newdata=data.frame(ageconducteur=a,exposition=1), + type="response",se.fit=TRUE) > yp0=yp$fit > yp1=yp$fit+2*yp$se.fit > yp2=yp$fit-2*yp$se.fit > plot(a,yp0,type="l",ylim=c(.03,.12)) > abline(v=40,col="grey") > lines(a,yp1,lty=2) > lines(a,yp2,lty=2) > points(a[k],yp0[k],pch=3,lwd=3,col="red") > segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)

i.e. the prediction for the annualized claim frequency for our 40 year old driver is now 7.74% (which is slightly higher than what we had before, 7.28%)

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k]) Frequency = 0.07740574 confidence interval 0.08117512 0.07363636

It is possible to compute not the expected frequency , but the ratio .

Above the horizontal blue line, the premium will be higher than the one obtained without segmentation, and (of course) lower below. Here, drivers younger than 44 year old will pay more, while driver older than 44 year old will be less. We have discussed, in the introduction, the necessity of segmentation. If we consider two companies, one segmenting, while the other one has a flat rate, then older drivers will go to the first company (since insurance is cheaper) while younger ones will go to the second one (again, it is cheaper). The problem is that the second company implicitly hopes that older drivers will compensate the risk. But since they’re gone, insurance will be too cheap, and the company will loose money (if not goes bankrupt). So companies have to use segmentation techniques to survive. Now, the problem is that we cannot be sure that this exponential decay of the premium is the proper way the premium should evolve as a function of the age. An alternative can be to use nonparametric techniques to visualize to *true* influence of the age on claims frequency.

**A pure nonparametric model**

A first model can be to consider a premium, per age. This can be done considering the age of the driver as a*factor* in the regression,

> regglm2=glm(nbre~as.factor(ageconducteur)+offset(log(exposition)), + data=sinistres,family=poisson) > yp=predict(regglm2,newdata=data.frame(ageconducteur=a0,exposition=1), + type="response",se.fit=TRUE) > yp0=yp$fit > yp1=yp$fit+2*yp$se.fit > yp2=yp$fit-2*yp$se.fit > plot(a0,yp0,type="l",ylim=c(.03,.12)) > abline(v=40,col="grey")

Here, the forecast for our 40 year old driver is slightly lower than be previous one, but the confidence interval is much larger (since we focus on a very small subclass of the portfolio: drivers aged *exactly* 40)

Frequency = 0.06686658 confidence interval 0.08750205 0.0462311

Here, we consider too small classes, and the premium is too erratic: the premium will decrease of 20% from age 40 to 41, and then increase of 50% from age 41 to 42,

> diff(log(yp0[23:25])) 24 25 -0.2330241 0.5223478

There is no chance that the company will keep the insured with this strategy. This *discontinuity* of the premium is clearly an important issue here.

**Using age classes**

**An alternative can be to consider age classes, from very young drivers to senior drivers.**

> level1=seq(15,105,by=5) > regglmc1=glm(nbre~cut(ageconducteur,level1)+offset(log(exposition)), + data=sinistres,family=poisson) > summary(regglmc1) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.6036 0.1741 -9.212 < 2e-16 *** cut(ageconducteur, level1)(20,25] -0.4200 0.1948 -2.157 0.0310 * cut(ageconducteur, level1)(25,30] -0.9378 0.1903 -4.927 8.33e-07 *** cut(ageconducteur, level1)(30,35] -1.0030 0.1869 -5.367 8.02e-08 *** cut(ageconducteur, level1)(35,40] -1.0779 0.1866 -5.776 7.65e-09 *** cut(ageconducteur, level1)(40,45] -1.0264 0.1858 -5.526 3.28e-08 *** cut(ageconducteur, level1)(45,50] -0.9978 0.1856 -5.377 7.58e-08 *** cut(ageconducteur, level1)(50,55] -1.0137 0.1855 -5.464 4.65e-08 *** cut(ageconducteur, level1)(55,60] -1.2036 0.1939 -6.207 5.40e-10 *** cut(ageconducteur, level1)(60,65] -1.1411 0.2008 -5.684 1.31e-08 *** cut(ageconducteur, level1)(65,70] -1.2114 0.2085 -5.811 6.22e-09 *** cut(ageconducteur, level1)(70,75] -1.3285 0.2210 -6.012 1.83e-09 *** cut(ageconducteur, level1)(75,80] -0.9814 0.2271 -4.321 1.55e-05 *** cut(ageconducteur, level1)(80,85] -1.4782 0.3371 -4.385 1.16e-05 *** cut(ageconducteur, level1)(85,90] -1.2120 0.5294 -2.289 0.0221 * cut(ageconducteur, level1)(90,95] -0.9728 1.0150 -0.958 0.3379 cut(ageconducteur, level1)(95,100] -11.4694 144.2817 -0.079 0.9366 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > yp=predict(regglmc1,newdata=data.frame(ageconducteur=a,exposition=1), + type="response",se.fit=TRUE) > yp0=yp$fit > yp1=yp$fit+2*yp$se.fit > yp2=yp$fit-2*yp$se.fit > plot(a,yp0,ylim=c(.03,.12),type="s") > abline(v=40,col="grey") > lines(a,yp1,lty=2,type="s") > lines(a,yp2,lty=2,type="s")

**Here we obtain the following predictions,**

**and for our 40 year old driver, the frequency is now 6.84%.**

Frequency = 0.0684573 confidence interval 0.07766717 0.05924742

**But our classes were defined arbitrarily here. Perhaps should we consider other classes, to see if the prediction is sensitive to the cutting values,**

> level2=level1-2 > regglmc2=glm(nbre~cut(ageconducteur,level2)+offset(log(exposition)), + data=sinistres,family=poisson)

**which yields the following values for our 40 year old driver,**

Frequency = 0.07050614 confidence interval 0.07980422 0.06120807

**So here, we did not remove the discontinuity problem. An idea here can be to consider moving regions: if the goal is to predict the frequency for a 40 year old driver, perhaps the class should be (somehow) centered around 40. And center the interval around 35 for drivers aged 35. Etc.**

**Moving average**

**Thus, it is natural to consider some local regressions, where only drivers aged almost 40 should be considered. This almost concept is related to the bandwidth. For instance, drivers between 35 and 45 can be considered as being almost40. In practice we can either consider a subset function, or we can use weights in the regressions**

> value=40 > h=5 > sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1 > regglmomega=glm(nbre~ageconducteur+offset(log(exposition)), + data=sinistres,family=poisson,weights=omega)

**To see what’s going on, let us consider an animated plot, where the age of interest is changing,**

**Here, for our 40 year old drive, we get**

Frequency = 0.06913391 confidence interval 0.07535564 0.06291218

**We do obtain a curve that can be interpreted as a local regression. But here, we do not take into account that 35 is not as close to 40 as 39 could be. An here, 34 is assumed to be very far away from 40. Clearly, we could improve that technique: kernel functions can considered, i.e. the closer to 40, the larger the weight.**

> value=40 > h=5 > sinistres$omega=dnorm(abs(sinistres$ageconducteur-value)/h) > regglmomega=glm(nbre~ageconducteur+offset(log(exposition)), + data=sinistres,family=poisson,weights=omega)

**which can be plotted below**

**Here, our prediction for our 40 year old drive is**

Frequency = 0.07040464 confidence interval 0.07981521 0.06099408

**This is the idea of kernel regression techniques. But as explained in the slides, other non parametric techniques can be considered, like spline functions.**

**Smoothing with splines**

**In R, it is simple to use spline function (somehow much more simple than kernel smoothers)**

> library(splines) > regglmbs=glm(nbre~bs(ageconducteur)+offset(log(exposition)), + data=sinistres,family=poisson)

**The prediction for our 40 year old driver is now**

Frequency = 0.06928169 confidence interval 0.07397124 0.06459215

**Note that this techniques is related to another class of models, the so-called Generalized Additive Models, i.e. GAMs.**

> library(mgcv) > reggam=gam(nbre~s(ageconducteur)+offset(log(exposition)), + data=sinistres,family=poisson)

**The prediction is extremely close to the one we obtained above (the main differences being observed for very old drivers)**

Frequency = 0.06912683 confidence interval 0.07501663 0.06323702

**Comparison of the different models**

**Somehow, one way or another, all those models are valid. So perhaps we should compare them,**

**On the graph above, we can visualize the upper and the lower bound of the prediction, for the 9 models. The horizontal line is the predicted value without taking into account heterogeneity. It is possible to consider relative values, with respect to this value,**

Hortonworks Community Connection (HCC) is an online collaboration destination for developers, DevOps, customers and partners to get answers to questions, collaborate on technical articles and share code examples from GitHub. Join the discussion.

Published at DZone with permission of Arthur Charpentier , DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}