Fitting Models to Short Time Series

Following my post on fitting models to long time series, I thought I'd tackle the opposite problem, which is more common in business environments.

I often get asked how few data points can be used to fit a time series model. As with almost all sample size questions, there is no easy answer. It depends on the number of model parameters to be estimated and the amount of randomness in the data: the required sample size increases with both.

Using least squares estimation, or some other non-regularized estimation method, it is possible to estimate a model only if you have more observations than parameters. (If you use the LASSO, or some other regularization technique, it is possible to estimate a model with fewer observations than parameters.) However, there is no guarantee that a fitted model will be any good for forecasting, especially when the data are noisy.
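To make the counting argument concrete, here is a minimal sketch in base R. The data are simulated and purely hypothetical (they are not from the M-competition), and `lm()` stands in for any non-regularized least squares fit. With 4 observations and 6 parameters, only as many coefficients as there are observations can be estimated, and no degrees of freedom remain to measure the noise.

```r
# Hypothetical simulated data: 4 observations, 5 predictors plus an intercept
set.seed(42)
n <- 4
X <- matrix(rnorm(n * 5), nrow = n)
y <- rnorm(n)

fit <- lm(y ~ X)

coef(fit)                           # coefficients that cannot be estimated are NA
n_estimated <- sum(!is.na(coef(fit)))
n_estimated                         # cannot exceed the number of observations
df.residual(fit)                    # zero: nothing is left to estimate the noise
```

Even the coefficients that are returned here fit the data exactly, so they say nothing about how the model would forecast.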

Some textbooks provide rules-of-thumb giving minimum sample sizes for various time series models. These are misleading and unsubstantiated in theory or practice. Further, they ignore the underlying variability of the data and often overlook the number of parameters to be estimated as well. There is, for example, no justification whatever for the magic number of 30 often given as a minimum for ARIMA modelling.

The only reasonable approach is to first check that there are enough observations to estimate the model, and then to test if the model performs well out-of-sample. With short series, there is not enough data to allow some observations to be withheld for testing purposes. However, the AIC can be used as a proxy for the one-step forecast out-of-sample MSE (see here). The AIC allows both the number of parameters and the amount of noise to be taken into account.
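As a rough illustration of AIC-based selection on a short series, the sketch below uses only base R. The series is simulated (an assumed AR(1) process with 15 observations), and the candidate orders are an arbitrary small set chosen for illustration. All candidates share the same order of differencing, because AIC values are not comparable across different differencing orders.

```r
# Hypothetical short series: 15 observations from a simulated AR(1) process
set.seed(1)
y <- arima.sim(model = list(ar = 0.5), n = 15)

# Candidate ARMA orders (p, d, q), all with d = 0 so the AICs are comparable
orders <- list(c(0, 0, 0), c(1, 0, 0), c(2, 0, 0), c(1, 0, 1))

aic <- sapply(orders, function(ord) {
  # Fitting can fail on very short series, so record NA rather than stopping
  fit <- tryCatch(arima(y, order = ord, method = "ML"),
                  error = function(e) NULL)
  if (is.null(fit)) NA_real_ else AIC(fit)
})
names(aic) <- sapply(orders, paste, collapse = ",")

aic                            # one AIC value per candidate model
best <- names(which.min(aic))  # order with the smallest AIC
best
```

Each extra parameter costs 2 units of AIC, so a larger model is chosen only if it improves the likelihood enough to pay for that penalty.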

What tends to happen with short series is that the AIC suggests very simple models because anything with more than one or two parameters will produce poor forecasts due to the estimation error. I applied the auto.arima() function from the forecast package in R to all the series from the M-competition with fewer than 20 observations. There were a total of 144 series, of which 32 had models with zero parameters (random walks), 95 had models with one parameter, 15 had models with two parameters and 2 series had models with three parameters. For what it's worth, here is the code.

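library(forecast)  # auto.arima()
library(Mcomp)     # M1: the M-competition series

# Lengths of all M1 series; keep those with fewer than 20 observations
n <- unlist(lapply(M1, function(x) length(x$x)))
n <- n[n < 20]
series <- names(n)
nparam <- numeric(length(n))
for (i in seq_along(n)) {
  fit <- auto.arima(M1[[series[i]]]$x)
  nparam[i] <- length(fit$coef)  # number of estimated coefficients
}
table(nparam)  # how many series have 0, 1, 2, ... parameters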

Seasonal models bring their own difficulties because the seasonality usually takes up m-1 degrees of freedom where m is the seasonal period (e.g., m=12 for monthly data). Fourier terms are one way to reduce the problem, and are useful whenever the ratio of m to the sample size is large. Further comments on seasonality and sample size are in my short Foresight paper with Andrey Kostenko: "Minimum sample size requirements for seasonal forecasting models", although I wrote that for a statistically unsophisticated audience, so there is no mention of the LASSO or AIC as possible solutions.
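The saving in degrees of freedom from Fourier terms can be sketched in base R. Everything here is an assumption made for illustration: the monthly series is simulated, K = 2 harmonics is an arbitrary choice, and a plain linear regression via `lm()` is used rather than a full time series model. Seasonal dummies use m-1 = 11 parameters, while K pairs of sine and cosine terms use only 2K.

```r
# Hypothetical monthly series: 30 observations with a smooth seasonal pattern
set.seed(123)
m <- 12                      # seasonal period
n <- 30                      # sample size
K <- 2                       # number of Fourier harmonics (assumed)
t <- seq_len(n)
y <- 10 + 3 * sin(2 * pi * t / m) + rnorm(n)

# Build the 2K Fourier regressors: one sine/cosine pair per harmonic
X <- do.call(cbind, lapply(seq_len(K), function(k) {
  cbind(sin(2 * pi * k * t / m), cos(2 * pi * k * t / m))
}))
colnames(X) <- paste0(rep(c("S", "C"), K), rep(seq_len(K), each = 2))

# Seasonal dummies: one factor level per month of the cycle
month <- factor(rep(seq_len(m), length.out = n))

fit_fourier <- lm(y ~ X)      # intercept + 2K seasonal parameters
fit_dummy   <- lm(y ~ month)  # intercept + (m - 1) seasonal parameters

length(coef(fit_fourier))     # 5
length(coef(fit_dummy))       # 12
c(fourier = AIC(fit_fourier), dummy = AIC(fit_dummy))
```

The AIC values in the last line can then be compared in the usual way, and K itself can be chosen by minimizing the AIC over a few candidate values.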

Published at DZone with permission of
