# Applying Time Series Analysis to Uncover Insights From Incident Data

### A detailed tutorial on how to use the R programming language to analyze time series data in order to uncover trends and better understand your data.

Time series analysis is a technique of methodically studying timestamped observations, or incidents, to discover and separate broader trends and recurring patterns from random occurrences.

*Example of Time Series Data*

I have generalized a set of observations to call it simply "Incidents." These could be machine or equipment failure events, for instance, or may represent some other business information, such as revenue or cost. The point is we hope to determine the overall trend (growing/declining, increasing/decreasing) if any, and recurring themes around such trends.

So, how do we start?

We read in (using any language) a given set of observations and convert it into a time-series object. For instance, R has a function called `ts` that takes in the observations in vector or matrix form.

You will also need to know the frequency at which the data is sampled. This can be expressed as the time between two successive samples relative to the period over which the pattern repeats, such as hourly over a day (1/24) or daily over a week (1/7), and so on.
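As a minimal sketch, assuming the observations are a plain numeric vector (the counts below are made up), the conversion might look like this:

```r
# Hypothetical daily incident counts (made-up numbers)
incidents <- c(12, 15, 11, 18, 22, 19, 25, 28, 24, 31, 35, 30, 38, 41)

# Convert to a time-series object: sampled daily, with a weekly cycle
tseriesX <- ts(incidents, frequency = 7)

frequency(tseriesX)  # 7
```

The `frequency` argument tells downstream functions (such as `decompose`) how many observations make up one full cycle.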

Visually, we may already see a pattern and trend. However, the collected data must pass a **stationarity** test; otherwise, we will need to transform the data to make it stationary. The patterns and variance must be consistent in order to draw inferences about what influences the data's behavior.

For instance, if we see a trend, we need to **de-trend** in order to observe the underlying patterns more closely.

This is usually done using linear regression techniques. Analyzing the residuals from this regression fit is the next step.
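A minimal sketch of this de-trending step, using base R's `lm` on a made-up series:

```r
# De-trending via linear regression: regress the series on a time index,
# then keep the residuals as the de-trended series (made-up data)
incidents <- c(12, 15, 11, 18, 22, 19, 25, 28, 24, 31, 35, 30, 38, 41)
t_index   <- seq_along(incidents)

trend_fit <- lm(incidents ~ t_index)  # linear trend fit
detrended <- residuals(trend_fit)     # what remains after removing the trend

# Residuals of a least-squares fit average out to (essentially) zero
mean(detrended)
```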

For instance, the Augmented Dickey-Fuller stationarity test (`adf.test` from the tseries package) applies a linear regression fit on a number of time-lagged values of the series depicted earlier.

```r
> adf.test(ts(tseriesX), alternative = "stationary")

	Augmented Dickey-Fuller Test

data:  ts(tseriesX)
Dickey-Fuller = -0.93943, Lag order = 3, p-value = 0.9332
alternative hypothesis: stationary
```

Based on the `p-value`, we can see that the null hypothesis of non-stationarity can't be rejected.

Another way is to simply take a difference, so that further analysis focuses on differentials rather than absolute magnitudes.

Let's try the same test on the differenced observation set:


```r
> adf.test(ts(diff(tseriesX)), alternative = "stationary")

	Augmented Dickey-Fuller Test

data:  ts(diff(tseriesX))
Dickey-Fuller = -5.9245, Lag order = 3, p-value = 0.01
alternative hypothesis: stationary
```

Now, the null hypothesis can be rejected and, thus, we can conclude that the differenced form is indeed stationary.

So we have removed the upward-trending effect, and the underlying pattern is now before us. It should now have a constant variance around its mean.

If the **variance** is also not constant, an additional transformation such as a **log** or **power** function can be applied. To verify this, you could take random samples of the transformed observations and then run **Bartlett's** or **Levene's** test to check that the variances among the samples are equal.
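A sketch of that check using base R's `bartlett.test` (Levene's test would need the `car` package); the series and grouping here are stand-ins:

```r
# Split the (already transformed) series into equal chunks and test
# whether their variances are equal
set.seed(42)
diffX  <- rnorm(25)  # stand-in for the differenced/transformed series
groups <- gl(5, 5)   # 5 consecutive chunks of 5 observations each

bartlett.test(diffX, groups)  # a large p-value suggests equal variances
```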

*Boxplot of samples taken*


```r
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  4  0.2158 0.9266
      20
```

In this specific case, we do not need an additional transformation, since the goal of stationarity is achieved already.

At this time, it would be good to know how many time stamps we need to look back at (i.e. the time lags).

We partly got an answer to this when we performed the **adf** test, which computes (or rather sets a limit on) the number of lags needed for a regression fit. You can see that the test results showed a lag order of **3**.

You can also do an auto-correlation plot to understand how many lags happen to be significant.
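A sketch of how such ACF/PACF values can be computed in R; `diffX` here is a simulated stand-in for the differenced series:

```r
# Compute autocorrelations and partial autocorrelations without plotting;
# arima.sim generates a stand-in series with strong negative lag-1 ACF
set.seed(1)
diffX <- arima.sim(model = list(ma = -0.9), n = 60)

a <- acf(diffX, lag.max = 8, plot = FALSE)   # autocorrelations
p <- pacf(diffX, lag.max = 8, plot = FALSE)  # partial autocorrelations

round(a$acf[2:9], 2)  # lags 1..8 (a$acf[1] is lag 0, always 1)
round(p$acf[1:8], 2)

# Lags outside roughly +/- 1.96/sqrt(length(diffX)) are not significant
```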


```r
      [,1]  [,2]  [,3]  [,4] [,5] [,6] [,7]  [,8]
ACF  -0.72  0.40 -0.22  0.10 0.05 -0.1 0.11 -0.10
PACF -0.72 -0.24 -0.08 -0.08 0.16  0.1 0.09  0.03
```

An ACF plot lays out the correlation between the present value of the series and its time-lagged values. From the above output, it is easy to see that the **first 3 lags** are the most significant.

A PACF plot explains the additional impact of including one more lag on the regression model fit. Here, significance does not extend **beyond the first 2 lags**.

Let's try decomposing the stationary time series to uncover trends and seasonalities.

```r
diffX <- diff(tseriesX)  # the differenced series (avoid shadowing diff())
centrifuged <- decompose(diffX, type = "additive")
plot(centrifuged)
```

*Series decomposition to reveal underlying components.*

```r
library(forecast)                 # provides auto.arima
fit <- auto.arima(centrifuged$x)  # $x is the observed (differenced) series
fit
```

```r
Series: centrifuged$x 
ARIMA(1,0,1) with non-zero mean 

Coefficients:
          ar1      ma1     mean
      -0.5674  -0.5133  12.0965
s.e.   0.1610   0.1554   3.6513

sigma^2 estimated as 4855:  log likelihood=-197.23
AIC=402.47   AICc=403.8   BIC=408.69
```

This means that the present value of Incidents, as in the observation set we considered, is correlated with the previous lagged value along with the previous forecast error term. Here is an example equation to model it.

```
IncidentsFitted(t) = Konstant + k1*Incidents(t-1) + k2*ForecastError(t-1)
where ForecastError(t-1) = Incidents(t-1) - IncidentsFitted(t-1)
```
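As a sketch, the same ARIMA(1,0,1)-with-mean structure can be fitted and forecast with base R alone (`arima` and `predict`; `auto.arima` needs the forecast package), here on a made-up stationary series:

```r
# Fit an ARIMA(1,0,1) with mean using base R, then forecast ahead
set.seed(7)
x <- ts(rnorm(60, mean = 12))        # stand-in stationary series

fit  <- arima(x, order = c(1, 0, 1)) # AR(1) + MA(1) + intercept (mean)
pred <- predict(fit, n.ahead = 5)

pred$pred  # point forecasts for the next 5 time stamps
```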

You can also choose to cross-check this with other modeling approaches, such as ETS or Holt-Winters.
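For instance, base R's `HoltWinters` gives a quick cross-check on a made-up series (`ets` would require the forecast package):

```r
# Holt's linear-trend smoothing: gamma = FALSE disables the seasonal
# component for a non-seasonal series (made-up data)
set.seed(3)
x <- ts(cumsum(rnorm(60)) + 50)

hw <- HoltWinters(x, gamma = FALSE)
predict(hw, n.ahead = 5)  # forecasts for the next 5 steps
```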
