Over a million developers have joined DZone.

Visualization of probabilistic forecasts

· Big Data Zone

Hortonworks DataFlow is an integrated platform that makes data ingestion fast, easy, and secure. Download the white paper now.  Brought to you in partnership with Hortonworks

This week my research group dis­cussed Adrian Raftery’s recent paper on “Use and Com­mu­ni­ca­tion of Prob­a­bilis­tic Fore­casts” which pro­vides a fas­ci­nat­ing but brief sur­vey of some of his work on mod­el­ling and com­mu­ni­cat­ing uncer­tain futures. Coin­ci­den­tally, today I was also sent a copy of David Spiegelhalter’s paper on “Visu­al­iz­ing Uncer­tainty About the Future”. Both are well-​​worth reading.

It made me think about my own efforts to com­mu­ni­cate future uncer­tainty through graph­ics. Of course, for time series fore­casts I nor­mally show pre­dic­tion inter­vals. I pre­fer to use more than one inter­val at a time because it helps con­vey a lit­tle more infor­ma­tion. The default in the fore­cast pack­age for R is to show both an 80% and a 95% inter­val like this:

It is some­times prefer­able to use a 50% and a 95% inter­val, rather like a box­plot:

In some cir­cles (espe­cially macro­eco­nomic fore­cast­ing), fan charts are pop­u­lar:

Per­son­ally, I don’t like these at all as they lose any spe­cific prob­a­bilis­tic inter­pretabil­ity. What does the darker shaded region actu­ally refer to? At least in the pre­vi­ous ver­sion, it is clear that the dark region con­tains 50% of the probability.

The above three exam­ples are eas­ily pro­duced using the fore­cast package:

1.fit <- ets(hsales)

For multi-​​modal dis­tri­b­u­tions I like to use high­est den­sity regions. Here is an exam­ple applied to Nicholson’s blowfly data using a thresh­old model:

The dark region has 50% cov­er­age and the light region has 95% cov­er­age. The fore­cast dis­tri­b­u­tions become bimodal after the first ten iter­a­tions, and so the 50% region is split in two to show that. This graphic was taken from a J. Fore­cast­ing paper I wrote in 1996, so these ideas have been around for a while!

It is easy enough to pro­duce fore­cast HDR with time series. Here is some R code to do it:

01.#HDR for time series object
02.# Assumes that forecasts can be computed and futures simulated from object
03.forecasthdr <- function(object, h = ifelse(object$m > 1, 2 * object$m, 10),
04.                        nsim=2000, plot=TRUE, level=c(50,95), xlim=NULL, ylim=NULL, ...)
06.  require(hdrcde)
07.  # Compute forecasts
08.  fc <- forecast(object)
09.  ft <- time(fc$mean)
11.  # Simulate future sample paths
12.  sim <- matrix(0,nrow=h,ncol=nsim)
13.  for(i in 1:nsim)
14.    sim[,i] <- simulate(object, nsim=h)
16.  # Compute HDRs
17.  nl <- length(level)
18.  hd <- array(0, c(h,nl,10))
19.  mode <- numeric(h)
20.  for(k in 1:h)
21.  {
22.    hdtmp <- hdr(sim[k,], prob=level)
23.    hd[k,,1:ncol(hdtmp$hdr)] <- hdtmp$hdr
24.    mode[k] <- hdtmp$mode
25.  }
26.  # Remove unnecessary sections of HDRs
27.  nz <- apply(abs(hd),3,sum) > 0
28.  hd <- hd[,,nz]
29.  dimnames(hd)[[1]] <- 1:h
30.  dimnames(hd)[[2]] <- level
33.  # Produce plot if required
34.  if(plot)
35.  {
36.    if(is.<span>null</span>(xlim))
37.      xlim <- range(time(object$x),ft)
38.    if(is.<span>null</span>(ylim))
39.      ylim <- range(object$x, hd)
40.    plot(object$x,xlim=xlim, ylim=ylim, ...)
41.    # Show HDRs
42.    cols <- rev(colorspace::sequential_hcl(52))[level - 49]
43.    for(k in 1:h)
44.    {
45.      for(j in 1:nl)
46.      {
47.        hdtmp <- hd[k,j,]
48.        nint <- length(hdtmp)/2
49.        for(l in 1:nint)
50.        {
51.          polygon(ft[k]+c(-1,-1,1,1)/object$m/2,
52.                  c(hdtmp[2*l-1],hdtmp[2*l],hdtmp[2*l],hdtmp[2*l-1]),
53.                col=cols[j], border=FALSE)
54.        }
55.      }
56.      points(ft[k], mode[k], pch=19, col="blue",cex=0.8)
57.    }
58.    #lines(fc$mean,col='blue',lwd=2)
59.  }
61.  # Return HDRs
62.  return(list(hdr=hd,mode=mode,level=level))

We can apply it using the exam­ple I started with:

1.z <- forecasthdr(fit,xlim=c(1986,1998),nsim=5000,
2.              xlab="Year",ylab="US monthly housing sales")

The dots are modes of the fore­cast dis­tri­b­u­tions, and the 50% and 95% high­est den­sity regions are also shown. In this case, the dis­tri­b­u­tions are uni­modal, and so all the regions are inter­vals.

Learn how you can modernize your data warehouse with Apache Hadoop. View an on-demand webinar now. Brought to you in partnership with Hortonworks.


Published at DZone with permission of Rob J Hyndman, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}