Batch forecasting in R

I sometimes get asked about forecasting many time series automatically. Here is a recent email, for example:

I have looked but cannot find any info on generating forecasts on multiple data sets in sequence. I have been using Analysis Services for SQL Server to generate fitted time series but it is too much of a black box (or I don't know enough to tweak/manage the inputs). In short, what package should I research that will allow me to load data, generate a forecast (presumably best fit), export the forecast then repeat for a few thousand items. I have read that R does not like 'loops' but not sure if the current CPU power offsets that or not. Any guidance would be greatly appreciated. Thank you!!

My response

Loops are fine in R. They are frowned upon because people use them inappropriately when much more efficient vectorized versions are often available. But for this task, a separate model has to be fitted to each series, so a loop is the only approach.
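To illustrate the distinction, here is a minimal sketch using made-up data (the matrix x and the variable names are purely illustrative and not part of the original question). Computing column means is a case where a vectorized function exists, so the loop is unnecessary:

```r
# Illustrative data only: 1000 rows, 5 columns of random numbers
set.seed(1)
x <- matrix(rnorm(5000), ncol = 5)

# Loop version: works, but there is a built-in vectorized equivalent
means_loop <- numeric(ncol(x))
for (i in seq_len(ncol(x))) {
  means_loop[i] <- mean(x[, i])
}

# Vectorized version: one call, no explicit loop
means_vec <- colMeans(x)

# Compare the two results, allowing for floating-point rounding
same <- all.equal(means_loop, means_vec)
```

Fitting a forecasting model has no such vectorized shortcut, which is why iterating over the series is reasonable there.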

Reading data and exporting forecasts is standard R and does not require any additional packages. To generate the forecasts, use the forecast package: either the ets() function or the auto.arima() function, depending on what type of data you are modelling. If it's high-frequency data (frequency greater than 24), then you would need the tbats() function, but that is very slow.
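As a rough sketch of that choice, the helper below picks a modelling function from the seasonal frequency of the series. The function name fit_series and its method argument are hypothetical, introduced only for illustration, and AirPassengers is a monthly series that ships with R rather than the questioner's data:

```r
library(forecast)

# Hypothetical helper: choose a modelling function from the seasonal frequency.
# The cut-off of 24 follows the rule of thumb given in the text.
fit_series <- function(y, method = c("ets", "arima")) {
  method <- match.arg(method)
  if (frequency(y) > 24) {
    return(tbats(y))          # high-frequency data; can be very slow
  }
  switch(method,
         ets   = ets(y),
         arima = auto.arima(y))
}

# AirPassengers is monthly (frequency 12), so ets() is used here
fit <- fit_series(AirPassengers, method = "ets")
fc  <- forecast(fit, h = 12)
fc$mean   # point forecasts for the next 12 months
```

Whether ets() or auto.arima() suits a given series better depends on the data, so the default here is an arbitrary choice.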

Some sample code

In the following example, there are many columns of monthly data in a CSV file, with the first column containing the month of observation (beginning with April 1982). Forecasts have been generated by applying forecast() directly to each time series. That will select an ETS model using the AIC, estimate the parameters, and generate forecasts. Although it returns prediction intervals, in the following code I've simply extracted the point forecasts (named mean in the returned forecast object because they are usually the mean of the forecast distribution).

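library(forecast)

# Read the data; the first column holds the month of observation
retail <- read.csv("http://robjhyndman.com/data/ausretail.csv", header=FALSE)
# Drop the date column and build a monthly multiple time series starting April 1982
retail <- ts(retail[,-1], frequency=12, start=1982+3/12)

ns <- ncol(retail)   # number of series
h <- 24              # forecast horizon in months
fcast <- matrix(NA, nrow=h, ncol=ns)
# Forecast each series in turn and keep only the point forecasts
for(i in seq_len(ns))
  fcast[,i] <- forecast(retail[,i], h=h)$mean

# Write all forecasts at once, one row per forecast horizon
write(t(fcast), file="retailfcasts.csv", sep=",", ncolumns=ncol(fcast))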

Note that the transpose of the fcast matrix is used in write() because the file is written row-by-row rather than column-by-column.
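The effect of the transpose is easier to see on a tiny matrix. The sketch below uses an illustrative 3 x 2 matrix m and a temporary file (both are assumptions for demonstration, not part of the original example), and also shows write.csv() as one possible alternative that keeps the matrix layout without a transpose:

```r
# Small illustrative matrix: 3 rows (forecast horizons) x 2 columns (series)
m <- matrix(1:6, nrow = 3)

tmp <- tempfile(fileext = ".csv")

# write() emits values in storage (column-major) order, so pass t(m)
# to get one line of output per row of m
write(t(m), file = tmp, sep = ",", ncolumns = ncol(m))
lines_written <- readLines(tmp)

# Alternative: write.csv() keeps the matrix layout and adds column names
colnames(m) <- c("series1", "series2")
write.csv(m, file = tmp, row.names = FALSE)
back <- read.csv(tmp)
```

With write.csv() the output file also gains a header row, which may or may not be wanted downstream.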

This code does not actually do what the questioner asked as I am writing all forecasts at once rather than exporting them at each iteration. The latter is much less efficient.
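For completeness, exporting at each iteration might look something like the sketch below. It is only an illustration under stated assumptions: the data are two monthly series that ship with R (mdeaths and fdeaths) rather than the retail data, and the output directory and one-file-per-series layout are hypothetical choices:

```r
library(forecast)

# Illustrative data: two monthly series that ship with R
series <- cbind(mdeaths, fdeaths)
h <- 12

# Hypothetical output location: one CSV file per series
out_dir <- file.path(tempdir(), "fcasts")
dir.create(out_dir, showWarnings = FALSE)

for (i in seq_len(ncol(series))) {
  fc <- forecast(series[, i], h = h)
  out <- data.frame(step = seq_len(h), forecast = as.numeric(fc$mean))
  # Export this series' forecasts before moving on to the next one
  write.csv(out,
            file = file.path(out_dir, paste0(colnames(series)[i], ".csv")),
            row.names = FALSE)
}

list.files(out_dir)
```

The extra cost comes from opening and writing a file on every pass, which adds up over a few thousand items.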

If ns is large, this could probably be more efficiently coded using the parallel package.
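One way this might be done is with parLapply() from the parallel package, as in the sketch below. This is an assumption about how the idea could be implemented, not code from the original post: the two-worker cluster, the illustrative data (mdeaths and fdeaths, which ship with R) and the variable names are all chosen for demonstration:

```r
library(forecast)
library(parallel)

# Illustrative data: two monthly series that ship with R
series <- cbind(mdeaths, fdeaths)
h <- 12

# Start a small local cluster (2 workers is an arbitrary choice)
cl <- makeCluster(2)
# Each worker needs the forecast package loaded
clusterEvalQ(cl, library(forecast))

# Forecast each column on a worker; y and h are passed through explicitly
fc_list <- parLapply(cl, seq_len(ncol(series)), function(i, y, h) {
  as.numeric(forecast(y[, i], h = h)$mean)
}, y = series, h = h)

stopCluster(cl)

# Combine the per-series point forecasts into an h x ns matrix
fcast <- do.call(cbind, fc_list)
colnames(fcast) <- colnames(series)
```

For only a handful of series the cost of starting the cluster is likely to outweigh any gain, so this is worth trying mainly when ns is in the thousands.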

Published at DZone with permission of Rob J Hyndman, DZone MVB.