The Problem with Too Narrow Prediction Intervals

Almost all prediction intervals from time series models are too narrow. This is a well-known phenomenon and arises because they do not account for all sources of uncertainty. In my 2002 IJF paper, we measured the size of the problem by computing the actual coverage percentage of the prediction intervals on hold-out samples. We found that for ETS models, nominal 95% intervals may only provide coverage between 71% and 87%. The difference is due to missing sources of uncertainty.
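
To make "actual coverage percentage" concrete, here is a minimal sketch in base R. The helper name `empirical_coverage` and the toy numbers are hypothetical, chosen only for illustration; they are not taken from the paper. Coverage is simply the share of hold-out observations that fall inside their prediction intervals.

```r
# Empirical coverage: share of hold-out observations inside their intervals.
# (Hypothetical helper, not part of the forecast package.)
empirical_coverage <- function(actual, lower, upper) {
  stopifnot(length(actual) == length(lower),
            length(actual) == length(upper))
  mean(actual >= lower & actual <= upper)
}

# Toy hold-out sample (made-up numbers, for illustration only)
actual <- c(10.2, 11.5,  9.8, 14.1, 12.0)
lower  <- c( 9.0,  9.5, 10.0, 10.5, 11.0)
upper  <- c(11.0, 11.6, 12.0, 12.5, 13.0)

# Three of the five observations lie inside their intervals, so this is 0.6
empirical_coverage(actual, lower, upper)
```

If the intervals were well calibrated, this proportion would be close to the nominal level (for example 0.95) when averaged over many hold-out observations.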

There are at least four sources of uncertainty in forecasting using time series models:

  1. The random error term;
  2. The parameter estimates;
  3. The choice of model for the historical data;
  4. The continuation of the historical data generating process into the future.

When we produce prediction intervals for time series models, we generally only take into account the first of these sources of uncertainty. It would be possible to account for 2 and 3 using simulations, but that is almost never done because it would take too much time to compute. As computing speeds increase, it might become a viable approach in the future.

Even if we ignore the model uncertainty and the DGP uncertainty (sources 3 and 4), and just try to allow for parameter uncertainty as well as the random error term (sources 1 and 2), there are no closed-form solutions apart from some simple special cases.

One such special case is an ARIMA(0,1,0) model with drift, which can be written as

$$y_t = c + y_{t-1} + e_t,$$

where $e_t$ is a white noise process. In this case, it is easy to compute the uncertainty associated with the estimate of c, and then allow for it in the forecasts.
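
A short derivation shows why this case is tractable. It is a sketch of the standard result, assuming the model is $y_t = c + y_{t-1} + e_t$ with uncorrelated errors of constant variance $\sigma^2$, that $T$ observations are available, and that the drift is estimated by the mean of the first differences; the symbols $T$, $h$, $\sigma^2$ and $\hat{c}$ are notation introduced here.

$$\hat{c} = \frac{1}{T-1}\sum_{t=2}^{T}(y_t - y_{t-1}) = \frac{y_T - y_1}{T-1}, \qquad \operatorname{Var}(\hat{c}) = \frac{\sigma^2}{T-1}.$$

The $h$-step forecast is $\hat{y}_{T+h|T} = y_T + h\hat{c}$, so the forecast error is

$$y_{T+h} - \hat{y}_{T+h|T} = h(c - \hat{c}) + \sum_{i=1}^{h} e_{T+i}.$$

The first term depends only on past errors and the second only on future errors, so the two are uncorrelated and

$$\operatorname{Var}\left(y_{T+h} - \hat{y}_{T+h|T}\right) = \frac{h^2\sigma^2}{T-1} + h\sigma^2 = h\sigma^2\left(1 + \frac{h}{T-1}\right).$$

Ignoring the uncertainty in $\hat{c}$ leaves only $h\sigma^2$, so the interval is too narrow by a factor of $\sqrt{1 + h/(T-1)}$, which grows with the forecast horizon. With normally distributed errors, an approximate 95% interval is $y_T + h\hat{c} \pm 1.96\,\hat{\sigma}\sqrt{h\,(1 + h/(T-1))}$.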

This model can be fitted using either the Arima function or the rwf function from the forecast package for R. If the Arima function is used, the uncertainty in c is ignored, but if the rwf function is used, the uncertainty in c is included in the prediction intervals. The difference can be seen in the following simulated example.

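library(forecast)

# Simulate a random walk with drift: 50 steps with mean -2.5 and sd 4
set.seed(22)
x <- ts(cumsum(rnorm(50, -2.5, 4)))

# rwf() allows for uncertainty in the drift estimate; Arima() ignores it
RWD.x <- rwf(x, h=40, drift=TRUE, level=95)
ARIMA.x <- Arima(x, order=c(0,1,0), include.drift=TRUE)

# Plot the Arima() forecasts, then overlay the rwf() limits as dashed lines
plot(forecast(ARIMA.x, h=40, level=95))
lines(RWD.x$lower, lty=2)
lines(RWD.x$upper, lty=2)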
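
The effect on coverage can also be checked by simulation. The following is a rough sketch in base R only, not the code behind any published result. It assumes the standard variance formulas for a random walk with drift: `h * s2` when only the random error is counted, and `h * s2 * (1 + h / (T - 1))` when the uncertainty in the drift estimate is added. The variable names and the simulation settings (`n_obs`, `n_sims` and so on) are choices made here for illustration.

```r
# Coverage of nominal 95% intervals for a random walk with drift,
# with and without allowing for uncertainty in the drift estimate.
set.seed(1)

n_obs  <- 50      # length of each training series
h      <- 40      # forecast horizon
n_sims <- 2000    # number of simulated series
drift  <- -2.5    # true drift c
sigma  <- 4       # true standard deviation of the errors
z      <- qnorm(0.975)

hit_naive <- logical(n_sims)
hit_full  <- logical(n_sims)

for (i in seq_len(n_sims)) {
  # Simulate n_obs + h observations; the final one is the hold-out value
  y      <- cumsum(rnorm(n_obs + h, drift, sigma))
  train  <- y[seq_len(n_obs)]
  actual <- y[n_obs + h]

  d      <- diff(train)     # first differences, equal to c + e_t
  c_hat  <- mean(d)         # drift estimate
  s2     <- var(d)          # estimate of the error variance
  n_diff <- length(d)       # T - 1

  point    <- train[n_obs] + h * c_hat
  se_naive <- sqrt(h * s2)                     # source 1 only
  se_full  <- sqrt(h * s2 * (1 + h / n_diff))  # sources 1 and 2

  hit_naive[i] <- abs(actual - point) <= z * se_naive
  hit_full[i]  <- abs(actual - point) <= z * se_full
}

coverage_naive <- mean(hit_naive)
coverage_full  <- mean(hit_full)
print(c(naive = coverage_naive, full = coverage_full))
```

Under these assumptions the interval that ignores the drift uncertainty should cover noticeably less often than 95% at long horizons, while the interval that includes it should come much closer to the nominal level. Sources 3 and 4 are absent by construction here, because the data really do come from the fitted model.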


Published at DZone with permission of
