# Lessons Learned While Solving Time-Series Forecasting as a Supervised ML Problem: Part 1

### Let's take a look at some lessons learned while solving time-series forecasting as a supervised ML problem.

Time-series forecasting is one of the most common prediction problems that statisticians and data scientists have worked on over the past several decades, with demand/sales forecasting and weather forecasting being the classic examples. The popular methods for solving these problems are Holt-Winters, Exponential Smoothing (ETS), and ARIMA, all of which are statistical univariate time-series methods. Statisticians also use a few other methods, such as Prophet from Facebook, THIEF, and the Unobserved Components Model (UCM). However, with the increased prominence of ML in the last decade, a lot of research has gone into solving forecasting as a supervised machine learning problem. Tree-based algorithms like Gradient Boosted Trees and Random Decision Forests, and deep learning-based algorithms like LSTM, all framed as supervised learning, have outperformed traditional methods on many occasions. In this article, we share our learnings and best practices for solving time-series forecasting problems as supervised ML, specifically in the areas of data transformations, feature engineering, and modeling approaches.

## Data Transformations

### Missing Value Imputation

One of the most commonly encountered problems when working with time-series data is missing values. It is therefore necessary to impute missing values before applying any ML-specific transforms.

There are multiple strategies for tackling this problem, and the right one depends on a) the decomposition of the time series into trend and seasonality and b) the number of missing observations as a percentage of total observations. As the number of missing values goes up, the choice of imputation technique becomes more significant.

In general, the popular statistical methods of mean, median, mode, and random imputation do not work well for time series that have many missing values or that show trend or seasonal variation. Such series require time-series-specific techniques that work on the assumption that adjacent observations are similar to each other. Some of the faster and easier-to-implement techniques are Last Observation Carried Forward (LOCF), Next Observation Carried Backward (NOCB), or a combination of the two, such as the mean of the last and next observations (LOCF+NOCB) when more than one value is missing. More advanced techniques include linear or spline interpolation and moving-average methods such as simple, weighted, and exponential smoothing. For a seasonal time series, one may also remove the seasonal component, impute the missing values with interpolation or smoothing, and then add the seasonal component back.
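As a minimal sketch of the simpler techniques named above, the following uses pandas on a made-up daily series. The series values and variable names are assumptions chosen for illustration, not data from a real use case.

```python
import numpy as np
import pandas as pd

# Hypothetical daily series with two gaps (values are made up for illustration).
idx = pd.date_range("2017-09-01", periods=8, freq="D")
s = pd.Series([10.0, 12.0, np.nan, np.nan, 18.0, 20.0, np.nan, 24.0], index=idx)

locf = s.ffill()                          # Last Observation Carried Forward
nocb = s.bfill()                          # Next Observation Carried Backward
mean_lo_no = (locf + nocb) / 2            # mean of the last and next observed values
linear = s.interpolate(method="linear")   # straight line between known points
```

On this series, LOCF fills the first gap with 12 and 12, the LOCF+NOCB mean fills it with 15 and 15, and linear interpolation fills it with 14 and 16, which shows how much the choice matters once a trend is present.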

Please note that the right imputation strategy depends on the time series itself, so no single approach can be called the most accurate for all kinds of time series. However, the following steps can be used as a starting point for selecting an appropriate strategy.

1. No trend, no seasonality, and only a few missing values
   - Popular methods of mean, median, mode, random, or LOCF/NOCB

2. Presence of trend but no seasonality
   - Linear/spline interpolation or simple, weighted, or exponential smoothing

3. Presence of seasonality but no trend
   - Deseasonalize the time series, use the Step 1 techniques for imputation, and re-seasonalize the time series

4. Presence of trend as well as seasonality
   - Deseasonalize the time series, use the Step 2 techniques for imputation, and re-seasonalize the time series
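The deseasonalize, impute, and re-seasonalize idea from the last two cases can be sketched as follows. This is a simplified illustration under stated assumptions: the seasonality is additive, the period is known, and the seasonal component is estimated crudely as the per-slot mean of the observed values (a proper decomposition would give a better estimate when a strong trend is present). The function name `impute_seasonal` and the sample data are hypothetical.

```python
import numpy as np
import pandas as pd

def impute_seasonal(series: pd.Series, period: int) -> pd.Series:
    """Deseasonalize, interpolate linearly, then re-seasonalize (additive model)."""
    position = np.arange(len(series)) % period        # slot within each cycle
    # Rough seasonal estimate: mean of observed values per slot, centred on the overall mean.
    slot_means = series.groupby(position).transform("mean")
    seasonal = slot_means - series.mean()
    deseasonalized = series - seasonal
    filled = deseasonalized.interpolate(method="linear", limit_direction="both")
    return filled + seasonal

# Hypothetical series: a flat level of 100 plus a repeating 4-step pattern.
pattern = np.array([0.0, 5.0, -5.0, 0.0])
series = pd.Series(100.0 + np.tile(pattern, 3))
series.iloc[5] = np.nan                               # true value is 105.0
imputed = impute_seasonal(series, period=4)
```

Plain linear interpolation on the raw series would fill the gap with 97.5 (halfway between its neighbours 100 and 95), while the seasonal version recovers 105, because the interpolation runs on the deseasonalized values.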

In general, start with a couple of candidate strategies based on the time-series decomposition, and then choose between them in one of two ways: either compare the final forecast accuracy obtained with each strategy, or hold out a few known samples of the time series, impute them, compare the imputed values with the actual values, and select the strategy with the lowest absolute error.
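The hold-out comparison described above might look like the sketch below. The helper `score_imputers`, its parameters, and the sample series are hypothetical names and data introduced for illustration; mean absolute error is used as the "least absolute error" criterion.

```python
import numpy as np
import pandas as pd

def score_imputers(series, imputers, n_holdout=5, seed=0):
    """Mask a few known points, impute them, and return the MAE per strategy."""
    rng = np.random.default_rng(seed)
    observed = np.flatnonzero(series.notna().to_numpy())
    # Never mask the first or last position, so every strategy has neighbours to use.
    candidates = observed[(observed > 0) & (observed < len(series) - 1)]
    held_out = rng.choice(candidates, size=n_holdout, replace=False)
    masked = series.copy()
    masked.iloc[held_out] = np.nan
    truth = series.iloc[held_out]
    return {
        name: float((impute(masked).iloc[held_out] - truth).abs().mean())
        for name, impute in imputers.items()
    }

# Hypothetical series with a pure linear trend (made-up numbers).
series = pd.Series(np.arange(30, dtype=float) * 2.0 + 5.0)
scores = score_imputers(series, {
    "locf": lambda s: s.ffill(),
    "linear": lambda s: s.interpolate(method="linear"),
    "mean": lambda s: s.fillna(s.mean()),
})
best = min(scores, key=scores.get)   # strategy with the lowest mean absolute error
```

On a trending series like this one, linear interpolation wins, which matches the guidance for the "trend but no seasonality" case.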

### Sliding Window Transform

The next, more ML-specific step is to transform the time series into a supervised learning structure, wherein a few predictor variables are used to predict a target value. It is also essential that the resulting samples can be treated as IID (Independent and Identically Distributed), so that each sample is independent of the other samples and identically distributed in time. Once the series is transformed this way, we can use any of the standard IID ML methods for prediction.

A sliding window transform of size N can be applied to the time series to perform both of these tasks. When the window size is 2, two observations (the 1st and 2nd) are used to predict the next, or target, observation (the 3rd). The window is then slid forward by one unit of time, so that the 2nd and 3rd observations are used to predict the 4th observation, and so on. Once transformed, each row of the dataset holds N consecutive observations as predictors and the observation that follows them as the target.
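A minimal sketch of this transform with NumPy is shown below. The function name `sliding_window` and the sample values are assumptions made for illustration.

```python
import numpy as np

def sliding_window(values, window):
    """Return (X, y): each row of X holds `window` consecutive observations,
    and y holds the observation that immediately follows them."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window >= len(values):
        raise ValueError("window must be between 1 and len(values) - 1")
    # Each view row has window + 1 values: the predictors plus the target.
    windows = np.lib.stride_tricks.sliding_window_view(values, window + 1)
    return windows[:, :-1].copy(), windows[:, -1].copy()

X, y = sliding_window([10, 20, 30, 40, 50], window=2)
# X -> [[10, 20], [20, 30], [30, 40]]
# y -> [30, 40, 50]
```

Note that five observations yield only three training rows with a window of 2: a series of length T produces T - N rows for window size N.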

Again, the choice of sliding window size is very important for these problems. It depends on factors such as the frequency of the observations (daily, weekly, monthly, quarterly, yearly, etc.), how much history is available for prediction (2 years, 10 years, 50 years, and so on), the ACF and PACF cut-off values at various lags, and the seasonality the series demonstrates.
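To illustrate how autocorrelation can inform the window size, here is a small sketch that computes the sample ACF with NumPy on a synthetic monthly series. The `autocorrelation` helper and the synthetic data are assumptions for illustration; in practice a statistics library's ACF/PACF routines would normally be used.

```python
import numpy as np

def autocorrelation(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    x = np.asarray(values, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array(
        [np.dot(x[: len(x) - k], x[k:]) / denom for k in range(max_lag + 1)]
    )

# Hypothetical monthly series: 6 years with pure annual seasonality.
months = np.arange(72)
series = 100 + 10 * np.sin(2 * np.pi * months / 12)
acf = autocorrelation(series, max_lag=24)
# acf[12] is strongly positive (same month, one year earlier) and acf[6] is
# strongly negative (opposite point of the annual cycle).
```

A strong positive value at lag 12 suggests that a window covering at least one full year of monthly observations lets the model see the same month from the previous cycle.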

In general, when the window size is small, the model has a limited number of features, and hence limited parameters, available to it. On the other hand, a large window size increases the risk that the model learns from features that have no real impact on the target variable. The training data length (number of rows) also decreases as the window size increases, which in turn affects how well the model can learn. The choice of an appropriate window size is therefore a tradeoff between parameter risk and modeling risk.

A few sample window sizes that have worked for us are /24 for monthly observations with 5-7 years of history and annual seasonality, for weekly observations with 4-5 years of history and annual seasonality. It is advisable to run a few experiments with different values to arrive at the window size that works best for your forecasting problem.

## Feature Engineering

Another aspect of solving a time-series problem as a supervised learning problem is providing additional features so that the model can learn seasonal and trend variations that extend beyond the size of the sliding window. We list below some common features, derived from the univariate time series itself, that might enhance model performance. ML models can also naturally handle external features such as weather or macro-economic indices like inflation, industrial output, and GDP per capita, i.e., multivariate forecasting. However, we will not discuss those in this article.

### Date Features

Creating features like day of the month, day of the week, week of the year, month number, and quarter number helps in capturing the weekly, monthly, and annual seasonality of a time series. It also helps to create features for holidays and weekends, especially in sales forecasting use cases, where these days can have an impact on sales. Features that capture use-case-specific events are useful as well, such as end of the month or start of the month for a bank account balance use case. These features help ML models produce more accurate forecasts.
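These date features can be derived directly from a pandas `DatetimeIndex`, as in the sketch below. The function name `add_date_features`, the column names, and the `holidays` argument (a user-supplied list of dates) are hypothetical choices made for illustration.

```python
import pandas as pd

def add_date_features(df: pd.DataFrame, holidays=()) -> pd.DataFrame:
    """Add calendar features derived from a DatetimeIndex."""
    idx = df.index
    out = df.copy()
    out["day_of_month"] = idx.day
    out["day_of_week"] = idx.dayofweek                 # Monday=0 ... Sunday=6
    out["week_of_year"] = idx.isocalendar().week.astype(int).to_numpy()
    out["month"] = idx.month
    out["quarter"] = idx.quarter
    out["is_weekend"] = idx.dayofweek >= 5
    out["is_month_start"] = idx.is_month_start
    out["is_month_end"] = idx.is_month_end
    # `holidays` is a caller-supplied list of dates; no calendar is assumed here.
    out["is_holiday"] = idx.normalize().isin(pd.to_datetime(list(holidays)))
    return out

# Hypothetical daily sales around a month boundary (made-up values).
days = pd.date_range("2017-09-29", periods=4, freq="D")
sales = pd.DataFrame({"sales": [120.0, 95.0, 80.0, 130.0]}, index=days)
features = add_date_features(sales, holidays=["2017-10-02"])
```

Tree-based models can usually consume these integer and boolean columns as they are; other model families may need them encoded first.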

### Descriptive Features Using Historical Data

It also helps to add features like the mean, max, min, and median sales for the same period last year or last month, which the sliding window does not cover. For instance, if you are using the daily sales of September 2017 (30 data points) to forecast the next 7 days' sales (October 1-7, 2017), then it helps to add descriptive statistics for the previous 2 months as well as for the same period last year, i.e., mean daily sales in August 2017, mean daily sales in July 2017, and mean daily sales in October 2016. Adding these features helps the model learn trend and seasonal variations month over month and year over year.
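The September/October example above could be computed along these lines. The sales history here is randomly generated and the feature names are hypothetical; only the choice of months follows the example in the text.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales history (random numbers, for illustration only).
days = pd.date_range("2016-01-01", "2017-09-30", freq="D")
rng = np.random.default_rng(0)
sales = pd.Series(rng.integers(50, 150, len(days)).astype(float), index=days)

# Mean daily sales per calendar month, indexed by monthly Period.
monthly_mean = sales.groupby(sales.index.to_period("M")).mean()

forecast_month = pd.Period("2017-10", freq="M")
window_month = forecast_month - 1        # September 2017 feeds the sliding window
features = {
    "mean_prev_month_1": monthly_mean.loc[window_month - 1],            # August 2017
    "mean_prev_month_2": monthly_mean.loc[window_month - 2],            # July 2017
    "mean_same_month_last_year": monthly_mean.loc[forecast_month - 12], # October 2016
}
```

The same pattern extends to max, min, and median by swapping the aggregation, for example `.agg(["mean", "max", "min", "median"])` on the grouped series.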

These are some of the data transformation and feature engineering considerations to keep in mind while solving a time-series forecasting problem as supervised ML. We have used the Infosys NIA Machine Learning Platform to perform these data transformations for real-life time-series forecasting use cases. The Infosys NIA ML platform is an end-to-end data science platform that provides the ability to automate these data manipulation practices using its in-built author snippet procedure.

In part 2 of the article, we will share our learnings on the modeling strategies to use when solving time-series forecasting problems as supervised ML.

Opinions expressed by DZone contributors are their own.
