A few months ago, I published a (long) post entitled "some thoughts on economics, mathematics, econometrics, machine learning, etc". In that post, I discussed possible differences between the foundations of econometrics and those of machine learning. Today, I want to come back to an important point, related to training and validation datasets when we have temporal data.

This morning, I was discussing with a student of the *Data Science for Actuaries* program an interesting point related to claim frequency models for insurance ratemaking. Since the goal is to predict claims frequency (to assess the level of the insurance premium), he suggested using older data to train the model, and more recent data to test it. The problem is that the model did not incorporate any temporal pattern, and we got surprising results.

Consider here a simple dataset:

```
> set.seed(1)
> n=50000
> X1=runif(n)
> T=sample(2000:2015,size=n,replace=TRUE)
> L=exp(-3+X1-(T-2000)/20)
> E=rbeta(n,5,1)
> Y=rpois(n,L*E)
> B=data.frame(Y,X1,L,T,E)
```

Claims frequency is driven by a Poisson process, with one covariate, X1, and we assume that the intensity decreases over time (at an exponential rate). Consider here a standard Poisson regression, with the logarithm of the exposure as an offset, and without any time effect:

```
> reg=glm(Y~X1+offset(log(E)),data=B,
+ family=poisson)
```

We can also compute the empirical annualized claims frequency:

```
> u=seq(0,1,by=.01)
> v=predict(reg,newdata=data.frame(X1=u,E=1))
> p=function(x){
+ B=B[abs(B$X1-x)<.1,]
+ sum(B$Y)/sum(B$E)
+ }
> vp=Vectorize(p)(seq(.05,.95,by=.1))
```

And plot the two curves on the same graph:

```
> plot(seq(.05,.95,by=.1),vp,type="b")
> lines(u,exp(v),lty=2,col="red")
```

This is what we usually do in econometrics. In machine learning, to assess the quality of a model and for model selection, it is common to split the dataset into two parts: a training sample and a validation sample. Consider some randomized training/validation samples, fit a model on the training sample, and finally use it to get a prediction:

```
> idx=sample(1:nrow(B),size=nrow(B)*7/8)
> B_a=B[idx,]
> B_t=B[-idx,]
> reg=glm(Y~X1+offset(log(E)),data=B_a,
+ family=poisson)
> u=seq(0,1,by=.01)
> v=predict(reg,newdata=data.frame(X1=u,E=1))
> p=function(x){
+ B=B_a[abs(B_a$X1-x)<.1,]
+ sum(B$Y)/sum(B$E)
+ }
> vp_a=Vectorize(p)(seq(.05,.95,by=.1))
> plot(seq(.05,.95,by=.1),vp_a,col="blue")
> lines(u,exp(v),lty=2)
> p=function(x){
+ B=B_t[abs(B_t$X1-x)<.1,]
+ sum(B$Y)/sum(B$E)
+ }
> vp_t=Vectorize(p)(seq(.05,.95,by=.1))
> lines(seq(.05,.95,by=.1),vp_t,col="red")
```

The blue points are the empirical frequencies on the training sample, with the fitted model as the dashed line (as we usually do in econometrics), and the red curve is the empirical frequency on the testing sample. Here, volatility probably comes from the small size of the testing sample (1 observation out of 8).

Now, what if we use the year as the splitting criterion? We fit a model on old years, and we test it on recent years:

```
> B_a=subset(B,T<2014)
> B_t=subset(B,T>=2014)
> reg=glm(Y~X1+offset(log(E)),data=B_a,family=poisson)
> u=seq(0,1,by=.01)
> v=predict(reg,newdata=data.frame(X1=u,E=1))
> p=function(x){
+ B=B_a[abs(B_a$X1-x)<.1,]
+ sum(B$Y)/sum(B$E)
+ }
> vp_a=Vectorize(p)(seq(.05,.95,by=.1))
> plot(seq(.05,.95,by=.1),vp_a,col="blue")
> lines(u,exp(v),lty=2)
> p=function(x){
+ B=B_t[abs(B_t$X1-x)<.1,]
+ sum(B$Y)/sum(B$E)
+ }
> vp_t=Vectorize(p)(seq(.05,.95,by=.1))
> lines(seq(.05,.95,by=.1),vp_t,col="red")
```

Clearly we missed something here…

We were looking at such a graph this morning, and it took me some time to understand how the training and validation samples were designed, and that there was a possible temporal effect (actually, this morning, it was based on a 3-year training sample and a 1-year validation sample).
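To put a number on that gap, one possibility is to compare the total number of claims predicted by the model with the number actually observed, on each sample. The sketch below regenerates the simulated portfolio with the same seed and parameters as above; the names `pred_a`, `pred_t`, `ratio_a` and `ratio_t` are introduced here for illustration only.

```r
# Re-create the simulated portfolio (same seed and parameters as in the post)
set.seed(1)
n  <- 50000
X1 <- runif(n)
T  <- sample(2000:2015, size = n, replace = TRUE)  # note: T masks the shorthand for TRUE
L  <- exp(-3 + X1 - (T - 2000) / 20)
E  <- rbeta(n, 5, 1)
Y  <- rpois(n, L * E)
B  <- data.frame(Y, X1, L, T, E)

# Out-of-time split: old years for training, recent years for validation
B_a <- subset(B, T < 2014)
B_t <- subset(B, T >= 2014)

# Poisson regression without any time effect
reg <- glm(Y ~ X1 + offset(log(E)), data = B_a, family = poisson)

# Expected claim counts; each policy's own exposure enters through the offset
pred_a <- predict(reg, newdata = B_a, type = "response")
pred_t <- predict(reg, newdata = B_t, type = "response")

# Ratio of predicted to observed claims on each sample
ratio_a <- sum(pred_a) / sum(B_a$Y)
ratio_t <- sum(pred_t) / sum(B_t$Y)
print(c(training = ratio_a, validation = ratio_t))
```

On the training sample the ratio is one by construction (a Poisson regression with a log link and an intercept reproduces the total number of claims). On the validation sample, a ratio well above one means that the model, calibrated on years with a higher frequency, over-predicts the recent years.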

Since there is a temporal pattern, let us capture it. As an econometrician, let me use a regression model:

```
> reg=glm(Y~X1+T+offset(log(E)),data=B,
+ family=poisson)
> C=coefficients(reg)
> u=seq(1999,2016,by=.1)
> v=exp(-(u-2000)/20-3)
> plot(2000:2015,exp(C[1]+C[3]*(2000:2015)))
> lines(u,v,lty=2,col="red")
```

(I focus only on the evolution of the temporal variate on that graph).

Here, we use a linear model, but there is usually no reason to assume linearity. So we might consider splines:

```
> library(splines)
> reg=glm(Y~X1+bs(T)+offset(log(E)),
+ data=B,family=poisson)
> u=seq(1999,2016,by=.1)
> v=exp(-(u-2000)/20-3)
> v2=predict(reg,newdata=data.frame(X1=0,
+ T=2000:2015,E=1))
> plot(2000:2015,exp(v2),type="b")
> lines(u,v,lty=2,col="red")
```

But here again, why should we assume that there is an underlying smooth function? There might be some structural breaks… so let us consider a regression on factors:

```
> reg=glm(Y~0+X1+as.factor(T)+offset(log(E)),
+ data=B,family=poisson)
> C=coefficients(reg)
> u=seq(1999,2016,by=.1)
> v=exp(-(u-2000)/20-3)
> plot(2000:2015,exp(C[2:17]),type="b")
> lines(u,v,lty=2,col="red")
```

An alternative might be to consider a more general model, like a regression tree:

```
> library(rpart)
> reg=rpart(Y~X1+T+offset(log(E)),data=B,
+ method="poisson",cp=1e-4)
> p=function(t){
+ B=B[B$T==t,]
+ B$E=1
+ mean(predict(reg,newdata=B))
+ }
> y_m=Vectorize(function(t) p(t))(2000:2015)
> u=seq(1999,2016,by=.1)
> v=exp(-(u-2000)/20-3+.5)
> plot(2000:2015,y_m,ylim=c(.02,.085),type="b")
> lines(u,v,lty=2,col="red")
```

Here, it seems that something went wrong. I guess it’s coming from the exposure. So consider a simpler model, on the annualized frequency, and with weights that are related to the exposure:

```
> reg=rpart(Y/E~X1+T,data=B,weights=B$E,cp=1e-4)
> p=function(t){
+ B=B[B$T==t,]
+ B$E=1
+ mean(predict(reg,newdata=B))
+ }
> y_m=Vectorize(function(t) p(t))(2000:2015)
> u=seq(1999,2016,by=.1)
> v=exp(-(u-2000)/20-3+.5)
> plot(2000:2015,y_m,ylim=c(.02,.085),type="b")
> lines(u,v,lty=2,col="red")
```
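Another way to pass the exposure to a Poisson tree, which might be worth trying, is the two-column response that the `rpart` documentation describes for `method="poisson"`: the first column is the observation time (here the exposure `E`) and the second one is the number of events (here `Y`). The sketch below is only an illustration under that assumption, not the model used above; `true_freq` is a name introduced here, and it uses the exact mean of `exp(X1)` for a uniform `X1`, namely `exp(1)-1`, where the graphs above use the approximation `exp(.5)`.

```r
library(rpart)

# Re-create the simulated portfolio (same seed and parameters as in the post)
set.seed(1)
n  <- 50000
X1 <- runif(n)
T  <- sample(2000:2015, size = n, replace = TRUE)  # note: T masks the shorthand for TRUE
L  <- exp(-3 + X1 - (T - 2000) / 20)
E  <- rbeta(n, 5, 1)
Y  <- rpois(n, L * E)
B  <- data.frame(Y, X1, L, T, E)

# Poisson tree with a two-column response: cbind(exposure, number of claims)
reg <- rpart(cbind(E, Y) ~ X1 + T, data = B, method = "poisson", cp = 1e-4)

# Predicted annual frequency, averaged over the policies observed each year
y_m <- sapply(2000:2015, function(t) mean(predict(reg, newdata = B[B$T == t, ])))

# Theoretical annual frequency, averaged over X1 ~ U(0,1): E[exp(X1)] = exp(1) - 1
true_freq <- exp(-3 - (0:15) / 20) * (exp(1) - 1)

print(round(data.frame(year = 2000:2015, tree = y_m, theoretical = true_freq), 4))
```

The predictions of such a tree are rates per unit of exposure, so there is no need to set `E=1` before calling `predict`.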

That was the econometrician's perspective. From a machine learning perspective, consider a training sample (here based on old data) and a validation sample (based on more recent data):

```
> B_a=subset(B,T<2014)
> B_t=subset(B,T>=2014)
```

If we consider a model with a linear time trend, it is easy to get a prediction for recent years, even if the model was fitted on older ones:

```
> reg_a=glm(Y~X1+T+offset(log(E)),
+ data=B_a,family=poisson)
> C=coefficients(reg_a)
> u=seq(1999,2016,by=.1)
> v=exp(-(u-2000)/20-3)
> plot(2000:2015,exp(C[1]+C[3]*c(2000:2013,
+ NA,NA)),type="b")
> lines(u,v,lty=2,col="red")
> points(2014:2015,exp(C[1]+C[3]*2014:2015),
+ pch=19,col="blue")
```
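One way to check numerically that the time trend helps out of sample is to compare, on the validation years, the Poisson deviance of the model with and without `T`. This is a minimal sketch on the same simulated data; the helper `pois_dev` and the names `reg_0`, `reg_T`, `dev_0` and `dev_T` are introduced here for illustration.

```r
# Re-create the simulated portfolio (same seed and parameters as in the post)
set.seed(1)
n  <- 50000
X1 <- runif(n)
T  <- sample(2000:2015, size = n, replace = TRUE)  # note: T masks the shorthand for TRUE
L  <- exp(-3 + X1 - (T - 2000) / 20)
E  <- rbeta(n, 5, 1)
Y  <- rpois(n, L * E)
B  <- data.frame(Y, X1, L, T, E)

B_a <- subset(B, T < 2014)   # training sample: old years
B_t <- subset(B, T >= 2014)  # validation sample: recent years

# Poisson deviance between observed counts y and expected counts mu
pois_dev <- function(y, mu) {
  2 * sum(ifelse(y > 0, y * log(y / mu), 0) - (y - mu))
}

# Same model, without and with a linear time trend, both fitted on old years
reg_0 <- glm(Y ~ X1 + offset(log(E)), data = B_a, family = poisson)
reg_T <- glm(Y ~ X1 + T + offset(log(E)), data = B_a, family = poisson)

# Out-of-time deviance on 2014 and 2015
dev_0 <- pois_dev(B_t$Y, predict(reg_0, newdata = B_t, type = "response"))
dev_T <- pois_dev(B_t$Y, predict(reg_T, newdata = B_t, type = "response"))
print(c(without_T = dev_0, with_T = dev_T))
```

A lower deviance for the model with `T` indicates that extrapolating the trend gives better predictions for the recent years than ignoring time altogether.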

But if we use years as factors, things are more complicated:

```
> reg_a=glm(Y~0+X1+as.factor(T)+offset(log(E)),
+ data=B_a,family=poisson)
> C=coefficients(reg_a)
> RMSE=function(A){
+ L=exp(C[1]*B_t$X1+ A[1]*(B_t$T==2014) + A[2]*(B_t$T==2015))
+ Y_t=L*B_t$E
+ sum( (Y_t - B_t$Y )^2)}
> i=optim(c(.4,.4),RMSE)$par
> plot(2000:2015,c(exp(C[2:15]),NA,NA))
> u=seq(1999,2016,by=.1)
> v=exp(-(u-2000)/20-3)
> lines(u,v,lty=2,col="red")
> points(2014:2015,exp(i),pch=19,col="blue")
```

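The complication is that we need predictions for factor levels (2014 and 2015) that were not in our training sample. Here, we minimize the squared prediction error on the validation sample (the function named RMSE above, which has the same minimizer as the RMSE) to estimate the factor levels for recent years, and the output is not that bad.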
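To see why the factor levels for recent years have to be estimated separately, one can try to call `predict` directly on the validation sample: the fitted model has no coefficient for levels it has never seen, and R refuses to extrapolate. The sketch below wraps the call in `tryCatch` so that the error message is captured rather than stopping the script; the name `res` is introduced here for illustration.

```r
# Re-create the simulated portfolio (same seed and parameters as in the post)
set.seed(1)
n  <- 50000
X1 <- runif(n)
T  <- sample(2000:2015, size = n, replace = TRUE)  # note: T masks the shorthand for TRUE
L  <- exp(-3 + X1 - (T - 2000) / 20)
E  <- rbeta(n, 5, 1)
Y  <- rpois(n, L * E)
B  <- data.frame(Y, X1, L, T, E)

B_a <- subset(B, T < 2014)   # training sample: years 2000 to 2013
B_t <- subset(B, T >= 2014)  # validation sample: years 2014 and 2015

# Years as factors, fitted on the training sample only
reg_a <- glm(Y ~ 0 + X1 + as.factor(T) + offset(log(E)),
             data = B_a, family = poisson)

# Predicting on unseen years fails: capture the error message instead of stopping
res <- tryCatch(
  predict(reg_a, newdata = B_t, type = "response"),
  error = function(e) conditionMessage(e)
)
print(res)
```

If the call had succeeded, `res` would be a numeric vector of predictions; here it is the text of the error about new factor levels.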

So yes, it is possible to train a model on older data and test it on recent years. But one should be careful, and take temporal patterns into account.
