Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Measuring Time Series Characteristics

DZone's Guide to

Measuring Time Series Characteristics

· Big Data Zone
Free Resource

Effortlessly power IoT, predictive analytics, and machine learning applications with an elastic, resilient data infrastructure. Learn how with Mesosphere DC/OS.

A few years ago, I was work­ing on a project where we mea­sured var­i­ous char­ac­ter­is­tics of a time series and used the infor­ma­tion to deter­mine what fore­cast­ing method to apply or how to clus­ter the time series into mean­ing­ful groups. The two main papers to come out of that project were:

  1. Wang, Smith and Hyn­d­man (2006) Characteristic-​​​​based clus­ter­ing for time series data. Data Min­ing and Knowl­edge Dis­cov­ery, 13(3), 335–364.
  2. Wang, Smith-​​Miles and Hyn­d­man (2009) “Rule induc­tion for fore­cast­ing method selec­tion: meta-​​​​learning the char­ac­ter­is­tics of uni­vari­ate time series”, Neu­ro­com­puting, 72, 2581–2594.

I’ve since had a lot of requests for the code which one of my coau­thors has been help­fully email­ing to any­one who asked. But to make it eas­ier, we thought it might be help­ful if I post some updated code here. This is not the same as the R code we used in the paper, as I’ve improved it in sev­eral ways (so it will give dif­fer­ent results). If you just want the code, skip to the bot­tom of the post.

Find­ing the period of the data

Usu­ally in time series work, we know the period of the data (if the obser­va­tions are monthly, the period is 12, for exam­ple). But in this project, some of our data was of unknown period and we wanted a method to auto­mat­i­cally deter­mine the appro­pri­ate period. The method we used was based on local peaks and troughs in the ACF. But I’ve since devised a bet­ter approach (prompted on cross​val​i​dated​.com) using an esti­mate of the spec­tral density:

find.freq <- function(x)
{
  n <- length(x)
  spec <- spec.ar(c(na.contiguous(x)),plot=FALSE)
  if(max(spec$spec)>10) # Arbitrary threshold chosen by trial and error.
  {
    period <- round(1/spec$freq[which.max(spec$spec)])
    if(period==Inf) # Find next local maximum
    {
      j <- which(diff(spec$spec)>0)
      if(length(j)>0)
      {
        nextmax <- j[1] + which.max(spec$spec[j[1]:500])
        if(nextmax <= length(spec$freq))
          period <- round(1/spec$freq[nextmax])
        else
          period <- 1
      }
      else
        period <- 1
    }
  }
  else
    period <- 1
 
  return(period)
}

The func­tion is called find.freq because time series peo­ple often call the period of sea­son­al­ity the “fre­quency” (which is of course highly confusing).

Decom­pos­ing the data into trend and sea­sonal components

We needed a mea­sure of the strength of trend and the strength of sea­son­al­ity, and to do this we decom­posed the data into trend, sea­sonal and error terms.

Because not all data could be decom­posed addi­tively, we first needed to apply an auto­mated Box-​​Cox trans­for­ma­tion. We tried a range of Box-​​Cox para­me­ters on a grid, and selected the one which gave the most nor­mal errors. That worked ok, but I’ve since found some papers that pro­vide quite good auto­mated Box-​​Cox algo­rithms that I’ve imple­mented in the fore­cast pack­age. So this code uses Guerrero’s (1993) method instead.

For sea­sonal time series, we decom­posed the trans­formed data using an stl decom­po­si­tion with peri­odic seasonality.

For non-​​seasonal time series, we esti­mated the trend of the trans­formed data using penal­ized regres­sion splines via the mgcv pack­age.

decomp <- function(x,transform=TRUE)
{
  require(forecast)
  # Transform series
  if(transform & min(x,na.rm=TRUE) >= 0)
  {
    lambda <- BoxCox.lambda(na.contiguous(x))
    x <- BoxCox(x,lambda)
  }
  else
  {
    lambda <- NULL
    transform <- FALSE
  }
  # Seasonal data
  if(frequency(x)>1)
  {
    x.stl <- stl(x,s.window="periodic",na.action=na.contiguous)
    trend <- x.stl$time.series[,2]
    season <- x.stl$time.series[,1]
    remainder <- x - trend - season
  }
  else #Nonseasonal data
  {
    require(mgcv)
    tt <- 1:length(x)
    trend <- rep(NA,length(x))
    trend[!is.na(x)] <- fitted(gam(x ~ s(tt)))
    season <- NULL
    remainder <- x - trend
  }
  return(list(x=x,trend=trend,season=season,remainder=remainder,
    transform=transform,lambda=lambda))
}

Putting every­thing on a [0,1] scale

We wanted to mea­sure a range of char­ac­ter­is­tics such as strength of sea­son­al­ity, strength of trend, level of non­lin­ear­ity, skew­ness, kur­to­sis, ser­ial cor­re­lat­ed­ness, self-​​similarity, level of chaotic­ity (is that a word?) and the peri­od­ic­ity of the data. But we wanted all these on the same scale which meant map­ping the nat­ural range of each mea­sure onto [0,1]. The fol­low­ing two func­tions were used to do this.

# f1 maps [0,infinity) to [0,1]
f1 <- function(x,a,b)
{
  eax <- exp(a*x)
  if (eax == Inf)
    f1eax <- 1
  else
    f1eax <- (eax-1)/(eax+b)
  return(f1eax)
}
 
# f2 maps [0,1] onto [0,1]
f2 <- function(x,a,b)
{
  eax <- exp(a*x)
  ea <- exp(a)
  return((eax-1)/(eax+b)*(ea+b)/(ea-1))
}

The val­ues of a and b in each func­tion were cho­sen so the mea­sure had a 90th per­centile of 0.10 when the data were iid stan­dard nor­mal, and a value of 0.9 using a well-​​known bench­mark time series.

Cal­cu­lat­ing the measures

Now we are ready to cal­cu­late the mea­sures on the orig­i­nal data, as well as on the adjusted data (after remov­ing trend and seasonality).

measures <- function(x)
{
  require(forecast)
 
  N <- length(x)
  freq <- find.freq(x)
  fx <- c(frequency=(exp((freq-1)/50)-1)/(1+exp((freq-1)/50)))
  x <- ts(x,f=freq)
 
  # Decomposition
  decomp.x <- decomp(x)
 
  # Adjust data
  if(freq > 1)
    fits <- decomp.x$trend + decomp.x$season
  else # Nonseasonal data
    fits <- decomp.x$trend
  adj.x <- decomp.x$x - fits + mean(decomp.x$trend, na.rm=TRUE)
 
  # Backtransformation of adjusted data
  if(decomp.x$transform)
    tadj.x <- InvBoxCox(adj.x,decomp.x$lambda)
  else
    tadj.x <- adj.x
 
  # Trend and seasonal measures
  v.adj <- var(adj.x, na.rm=TRUE)
  if(freq > 1)
  {
    detrend <- decomp.x$x - decomp.x$trend
    deseason <- decomp.x$x - decomp.x$season
    trend <- ifelse(var(deseason,na.rm=TRUE) < 1e-10, 0, 
      max(0,min(1,1-v.adj/var(deseason,na.rm=TRUE))))
    season <- ifelse(var(detrend,na.rm=TRUE) < 1e-10, 0,
      max(0,min(1,1-v.adj/var(detrend,na.rm=TRUE))))
  }
  else #Nonseasonal data
  {
    trend <- ifelse(var(decomp.x$x,na.rm=TRUE) < 1e-10, 0,
      max(0,min(1,1-v.adj/var(decomp.x$x,na.rm=TRUE))))
    season <- 0
  }
 
  m <- c(fx,trend,season)
 
  # Measures on original data
  xbar <- mean(x,na.rm=TRUE)
  s <- sd(x,na.rm=TRUE)
 
  # Serial correlation
  Q <- Box.test(x,lag=10)$statistic/(N*10)
  fQ <- f2(Q,7.53,0.103)
 
  # Nonlinearity
  p <- terasvirta.test(na.contiguous(x))$statistic
  fp <- f1(p,0.069,2.304)
 
  # Skewness
  sk <- abs(mean((x-xbar)^3,na.rm=TRUE)/s^3)
  fs <- f1(sk,1.510,5.993)
 
  # Kurtosis
  k <- mean((x-xbar)^4,na.rm=TRUE)/s^4
  fk <- f1(k,2.273,11567)
 
  # Hurst=d+0.5 where d is fractional difference.
  H <- fracdiff(na.contiguous(x),0,0)$d + 0.5
 
  # Lyapunov Exponent
  if(freq > N-10)
      stop("Insufficient data")
  Ly <- numeric(N-freq)
  for(i in 1:(N-freq))
  {
    idx <- order(abs(x[i] - x))
    idx <- idx[idx < (N-freq)]
    j <- idx[2]
    Ly[i] <- log(abs((x[i+freq] - x[j+freq])/(x[i]-x[j])))/freq
    if(is.na(Ly[i]) | Ly[i]==Inf | Ly[i]==-Inf)
      Ly[i] <- NA
  }
  Lyap <- mean(Ly,na.rm=TRUE)
  fLyap <- exp(Lyap)/(1+exp(Lyap))
 
  m <- c(m,fQ,fp,fs,fk,H,fLyap)
 
  # Measures on adjusted data
  xbar <- mean(tadj.x, na.rm=TRUE)
  s <- sd(tadj.x, na.rm=TRUE)
 
  # Serial
  Q <- Box.test(adj.x,lag=10)$statistic/(N*10)
  fQ <- f2(Q,7.53,0.103)
 
  # Nonlinearity
  p <- terasvirta.test(na.contiguous(adj.x))$statistic
  fp <- f1(p,0.069,2.304)
 
  # Skewness
  sk <- abs(mean((tadj.x-xbar)^3,na.rm=TRUE)/s^3)
  fs <- f1(sk,1.510,5.993)
 
  # Kurtosis
  k <- mean((tadj.x-xbar)^4,na.rm=TRUE)/s^4
  fk <- f1(k,2.273,11567)
 
  m <- c(m,fQ,fp,fs,fk)
  names(m) <- c("frequency", "trend","seasonal",
    "autocorrelation","non-linear","skewness","kurtosis",
    "Hurst","Lyapunov",
    "dc autocorrelation","dc non-linear","dc skewness","dc kurtosis")
 
  return(m)
}

Here is a quick exam­ple applied to Aus­tralian monthly gas production:

library(forecast)
measures(gas)
     frequency              trend           seasonal    autocorrelation 
        0.1096             0.9989             0.9337             0.9985 
    non-linear           skewness           kurtosis              Hurst 
        0.4947             0.1282             0.0055             0.9996 
      Lyapunov dc autocorrelation      dc non-linear        dc skewness 
        0.5662             0.1140             0.0538             0.1743 
   dc kurtosis 
        0.9992

The func­tion is far from per­fect, and it is not hard to find exam­ples where it fails. For exam­ple, it doesn’t work with mul­ti­ple sea­son­al­ity — try measure(taylor) and check the sea­son­al­ity. Also, I’m not con­vinced the kur­to­sis pro­vides any­thing use­ful here, or that the skew­ness mea­sure is done in the best way pos­si­ble. But it was really a proof of con­cept, so we will leave it to oth­ers to revise and improve the code.

In our papers, we took the mea­sures obtained using R, and pro­duced self-​​organizing maps using Vis­cov­ery. There is now a som pack­age in R for that, so it might be pos­si­ble to inte­grate that step into R as well. The den­do­gram was gen­er­ated in mat­lab, although that could now also be done in R using the ggden­dro pack­age for example.

Down­load the code in a sin­gle file.

Learn to design and build better data-rich applications with this free eBook from O’Reilly. Brought to you by Mesosphere DC/OS.

Topics:

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}