Modeling Individual Losses with Mixtures

By Arthur Charpentier · Feb. 23, 2013


Usually, the sentence that I keep repeating in my regression classes is “please, look at your data”. In our previous post, we were playing like most econometricians: we did not look at the data. Actually, if we look at the distribution of individual losses in the dataset, we see the following:

> n=nrow(couts)
> plot(sort(couts$cout),(1:n)/(n+1),xlim=c(0,10000),type="s",lwd=2,col="green")

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.10.26.png

It looks like there are fixed-cost claims in our database. How do we deal with them in the standard framework (e.g., the one in the Loss Models textbook)? We can use a mixture of (at least) three distributions here,

f(y) = p_1 f_1(y) + p_2 \delta_{\kappa}(y) + p_3 f_3(y)

with

  • a distribution for small claims, f_1(\cdot), e.g. an exponential distribution
  • a Dirac mass at \kappa, i.e. \delta_{\kappa}(\cdot)
  • a distribution for larger claims, f_3(\cdot), e.g. a Gamma or a lognormal distribution
>  I1=which(couts$cout<1120)
>  I2=which((couts$cout>=1120)&(couts$cout<1220))
>  I3=which(couts$cout>=1220)
>  (p1=length(I1)/nrow(couts))
[1] 0.3284823
>  (p2=length(I2)/nrow(couts))
[1] 0.4152807
>  (p3=length(I3)/nrow(couts))
[1] 0.256237
>  X=couts$cout
>  (kappa=mean(X[I2]))
[1] 1171.998
>  X0=X[I3]-kappa
>  u=seq(0,10000,by=20)
>  F1=pexp(u,1/mean(X[I1]))
>  F2= (u>kappa)
>  F3=plnorm(u-kappa,mean(log(X0)),sd(log(X0))) * (u>kappa)
>  F=F1*p1+F2*p2+F3*p3
>  lines(u,F)

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.13.43.png
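To make the three-component structure concrete, here is a minimal simulation sketch from such a mixture; the weights, the fixed cost and the distribution parameters below are illustrative values only (chosen to be close to the empirical quantities computed above), not estimates from the data:

```r
set.seed(1)
n     <- 1e4
p     <- c(0.33, 0.42, 0.25)   # mixture weights (illustrative)
kappa <- 1172                  # fixed-cost level (illustrative)
# draw the latent component, then the cost conditional on it
z <- sample(1:3, n, replace = TRUE, prob = p)
y <- numeric(n)
y[z == 1] <- rexp(sum(z == 1), rate = 1/600)       # small claims
y[z == 2] <- kappa                                 # fixed-cost claims
y[z == 3] <- kappa + rlnorm(sum(z == 3), 7, 1.5)   # large claims
plot(sort(y), (1:n)/(n + 1), type = "s")           # same kind of plot as above
```

The flat step of the empirical cdf at the fixed-cost level, clearly visible in the figure above, is reproduced by the Dirac component.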

In our previous post, we’ve discussed the idea that all parameters might be related to some covariates, i.e.

f(y|\boldsymbol{X}) = p_1(\boldsymbol{X})\, f_1(y|\boldsymbol{X}) + p_2(\boldsymbol{X})\, \delta_{\kappa}(y) + p_3(\boldsymbol{X})\, f_3(y|\boldsymbol{X})

which yields the following premium model,

\mathbb{E}(Y|\boldsymbol{X}) = \underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq s_1)}_{A}\cdot \mathbb{P}(Y\leq s_1|\boldsymbol{X}) + \underbrace{\mathbb{E}(Y|Y\in(s_1,s_2],\boldsymbol{X})}_{B}\cdot \mathbb{P}(Y\in(s_1,s_2]|\boldsymbol{X}) + \underbrace{\mathbb{E}(Y|Y>s_2,\boldsymbol{X})}_{C}\cdot \mathbb{P}(Y>s_2|\boldsymbol{X})

For the A, B and C terms, that's easy: we can use the standard models seen in the course. For the probabilities, we should use a multinomial model. Recall that, in the logistic regression model, if (\pi,1-\pi)=(\pi_1,\pi_2), then

\log \frac{\pi}{1-\pi} = \log \frac{\pi_1}{\pi_2} = \boldsymbol{X}'\boldsymbol{\beta}

i.e.

\pi_1 = \frac{\exp(\boldsymbol{X}'\boldsymbol{\beta})}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta})}

and

\pi_2 = \frac{1}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta})}
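These two expressions can be checked numerically; the value of the linear predictor below is an arbitrary illustration:

```r
# Binary logit: probabilities from a linear predictor eta = X'beta
eta <- 0.8                      # illustrative value
pi1 <- exp(eta) / (1 + exp(eta))
pi2 <- 1 / (1 + exp(eta))
log(pi1 / pi2)                  # recovers eta, the log-odds
pi1 + pi2                       # the two probabilities sum to 1
```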

To derive a multivariate extension, write

\pi_1 = \frac{\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)}

\pi_2 = \frac{\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)}

and

\pi_3 = \frac{1}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)}
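These three class probabilities can also be checked numerically; the two linear predictors below are arbitrary illustrative values:

```r
# Multinomial logit: class probabilities from two linear predictors
eta1 <- 0.4                     # X'beta_1 (illustrative)
eta2 <- -0.2                    # X'beta_2 (illustrative)
den  <- 1 + exp(eta1) + exp(eta2)
prob <- c(exp(eta1), exp(eta2), 1) / den
prob                            # the three class probabilities
sum(prob)                       # sums to 1 by construction
```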

Again, maximum likelihood techniques can be used, since

\mathcal{L}(\boldsymbol{\pi},\boldsymbol{y}) \propto \prod_{i=1}^n \prod_{j=1}^3 \pi_{i,j}^{Y_{i,j}}

where the variable Y_i, which takes three levels, is split into three indicator variables Y_{i,1}, Y_{i,2}, Y_{i,3} (as for any categorical explanatory variable in a standard regression model). Thus,

\log \mathcal{L}(\boldsymbol{\beta},\boldsymbol{y}) \propto \sum_{i=1}^n \sum_{j=1}^2 Y_{i,j}\, \boldsymbol{X}_i'\boldsymbol{\beta}_j - \sum_{i=1}^n \log\left[1+\exp(\boldsymbol{X}_i'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}_i'\boldsymbol{\beta}_2)\right]

and, as in the logistic regression, the Newton-Raphson algorithm can then be used to compute the maximum likelihood estimates numerically. In R, we first have to define the levels, e.g.

> seuils=c(0,1120,1220,1e+12)
> couts$tranches=cut(couts$cout,breaks=seuils,
+ labels=c("small","fixed","large"))
> head(couts,5)
  nocontrat    no garantie    cout exposition zone puissance agevehicule
1      1870 17219      1RC 1692.29       0.11    C         5           0
2      1963 16336      1RC  422.05       0.10    E         9           0
3      4263 17089      1RC  549.21       0.65    C        10           7
4      5181 17801      1RC  191.15       0.57    D         5           2
5      6375 17485      1RC 2031.77       0.47    B         7           4
  ageconducteur bonus marque carburant densite region tranches
1            52    50     12         E      73     13    large
2            78    50     12         E      72     13    small
3            27    76     12         D      52      5    small
4            26   100     12         D      83      0    small
5            46    50      6         E      11     13    large

Then, we can run a multinomial regression, from

> library(nnet)

using some selected covariates

> reg=multinom(tranches~ageconducteur+agevehicule+zone+carburant,data=couts)
# weights:  30 (18 variable)
initial  value 2113.730043 
iter  10 value 2063.326526
iter  20 value 2059.206691
final  value 2059.134802 
converged

The output is here

> summary(reg)
Call:
multinom(formula = tranches ~ ageconducteur + agevehicule + zone + 
    carburant, data = couts)

Coefficients:
      (Intercept) ageconducteur agevehicule      zoneB      zoneC
fixed  -0.2779176   0.012071029  0.01768260 0.05567183 -0.2126045
large  -0.7029836   0.008581459 -0.01426202 0.07608382  0.1007513
           zoneD      zoneE      zoneF   carburantE
fixed -0.1548064 -0.2000597 -0.8441011 -0.009224715
large  0.3434686  0.1803350 -0.1969320  0.039414682

Std. Errors:
      (Intercept) ageconducteur agevehicule     zoneB     zoneC     zoneD
fixed   0.2371936   0.003738456  0.01013892 0.2259144 0.1776762 0.1838344
large   0.2753840   0.004203217  0.01189342 0.2746457 0.2122819 0.2151504
          zoneE     zoneF carburantE
fixed 0.1830139 0.3377169  0.1106009
large 0.2160268 0.3624900  0.1243560

To visualize the impact of a single covariate, one can also use spline functions

> library(splines)
> reg=multinom(tranches~agevehicule,data=couts)
# weights:  9 (4 variable)
initial  value 2113.730043 
final  value 2072.462863 
converged
> reg=multinom(tranches~bs(agevehicule),data=couts)
# weights:  15 (8 variable)
initial  value 2113.730043 
iter  10 value 2070.496939
iter  20 value 2069.787720
iter  30 value 2069.659958
final  value 2069.479535 
converged

For instance, if the covariate is the age of the car, we obtain the following probabilities

> predict(reg,newdata=data.frame(agevehicule=5),type="probs")
    small     fixed     large 
0.3388947 0.3869228 0.2741825

and for all ages from 0 to 20,

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.02.55.png
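A sketch of how such a figure can be produced: since the couts dataset is not bundled with this post, the sketch below fits the same kind of model on simulated stand-in data (which carries no real signal; the point is only the mechanics of prediction over a grid of ages):

```r
library(nnet)
library(splines)
set.seed(42)
# stand-in data: uniform classes, no real signal (the couts data is not public here)
d <- data.frame(agevehicule = sample(0:20, 500, replace = TRUE),
                tranches = factor(sample(c("small", "fixed", "large"),
                                         500, replace = TRUE),
                                  levels = c("small", "fixed", "large")))
reg   <- multinom(tranches ~ bs(agevehicule), data = d, trace = FALSE)
proba <- predict(reg, newdata = data.frame(agevehicule = 0:20), type = "probs")
matplot(0:20, proba, type = "l", lty = 1,
        xlab = "age of the car", ylab = "probability")
```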

For instance, for new cars, the proportion of fixed-cost claims is rather small (here in purple), and it increases with the age of the car. If the covariate is the population density of the area where the driver lives, we obtain the following probabilities

> reg=multinom(tranches~bs(densite),data=couts)
# weights:  15 (8 variable)
initial  value 2113.730043 
iter  10 value 2068.469825
final  value 2068.466349 
converged
> predict(reg,newdata=data.frame(densite=90),type="probs")
    small     fixed     large 
0.3484422 0.3473315 0.3042263

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-13-a%CC%80-16.05.29.png

Based on those probabilities, it is then possible to derive the expected cost of a claim, given some covariates (e.g. the density). But first, let us define subsets of the whole dataset

> sousbaseA=couts[couts$tranches=="small",]
> sousbaseB=couts[couts$tranches=="fixed",]
> sousbaseC=couts[couts$tranches=="large",]

with a threshold given by

> (k=mean(sousbaseB$cout))
[1] 1171.998

Then, let us run our four models,

> reg=multinom(tranches~bs(densite),data=couts)
> regA=glm(cout~bs(densite),data=sousbaseA,family=Gamma(link="log"))
> regB=glm(cout~1,data=sousbaseB,family=Gamma(link="log"))
> regC=glm((cout-k)~bs(densite),data=sousbaseC,family=Gamma(link="log"))

We can now compute predictions based on those models,

> nouveau=data.frame(densite=seq(10,100))
> proba=predict(reg,newdata=nouveau,type="probs")
> predA=predict(regA,newdata=nouveau,type="response")
> predB=predict(regB,newdata=nouveau,type="response")
> predC=predict(regC,newdata=nouveau,type="response")+k
> pred=cbind(predA,predB,predC)

To visualize the impact of each component on the premium, we can compute the probabilities, as well as the expected costs (given that the cost lies in each subset),

> cbind(proba,pred)[seq(10,90,by=10),]
       small     fixed     large    predA    predB    predC
10 0.3344014 0.4241790 0.2414196 423.3746 1171.998 7135.904
20 0.3181240 0.4471869 0.2346892 428.2537 1171.998 6451.890
30 0.3076710 0.4626572 0.2296718 438.5509 1171.998 5499.030
40 0.3032872 0.4683247 0.2283881 451.4457 1171.998 4615.051
50 0.3052378 0.4620219 0.2327404 463.8545 1171.998 3961.994
60 0.3136136 0.4417057 0.2446807 472.3596 1171.998 3586.833
70 0.3279413 0.4056971 0.2663616 473.3719 1171.998 3513.601
80 0.3464842 0.3534126 0.3001032 463.5483 1171.998 3840.078
90 0.3652932 0.2868006 0.3479061 440.4925 1171.998 4912.379
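The pure premium at a given density is then the probability-weighted sum of the three conditional expected costs. A quick check on the first row of the table above (densite = 10), with the values copied by hand:

```r
# E(Y | X) = sum_j P(class j | X) * E(Y | class j, X), here for densite = 10
proba10 <- c(small = 0.3344014, fixed = 0.4241790, large = 0.2414196)
pred10  <- c(423.3746, 1171.998, 7135.904)
sum(proba10 * pred10)           # expected claim cost at densite = 10
```

Doing this for every row, i.e. rowSums(proba * pred), gives the totals that the stacked barplot below adds up to.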

Now, it is possible to plot those figures in a graph,

> barplot(t(proba*pred))
> abline(h=mean(couts$cout),lty=2)

http://f.hypotheses.org/wp-content/blogs.dir/253/files/2013/02/Capture-d%E2%80%99e%CC%81cran-2013-02-15-a%CC%80-11.50.47.png

(the dotted horizontal line is the average cost of a claim, in our dataset).


Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.
