
Performance Forecasting via Random Forests


Using Treasure Data to forecast sales of a retail store up to six weeks in advance with predictive analytics.


The first challenge is predicting retail sales for the Rossmann stores (full details at Kaggle). We will use an ensemble learning technique known as Random Forest regression.

Rossmann is a pharmacy chain with over 3,000 stores in seven European countries. The manager of each store has been tasked with predicting the store's sales up to six weeks in advance. Sales at each store depend on various factors, such as the store's promotional activities, public and school holidays, and seasonal and regional characteristics.

For the competition, Rossmann has provided training data for each day in 1,115 of their stores in Germany, from January 1, 2013 to July 31, 2015. The test data set runs from August 1, 2015 to September 17, 2015.

Next we will explain in detail how to analyze this data on Treasure Data using Hivemall.

Preparing the Data

Importing Into Treasure Data

The Rossmann Store Sales competition provides three datasets in CSV format: the training data set (train.csv), the test data set (test.csv), and the store information data set (store.csv). Stores are identified by Store ID in all three datasets.

After downloading the data from Kaggle, you can drag-and-drop the files into Treasure Data through the File Upload option. Treasure Data automatically creates the table definition by looking at the header as well as the actual data columns.

Each data file is configured as follows:

Training Data (train.csv)

[Screenshot: table definition for train.csv in Treasure Data]

Testing Data (test.csv)

[Screenshot: table definition for test.csv in Treasure Data]

Store Information (store.csv)

[Screenshot: table definition for store.csv in Treasure Data]

With each column described below:

- Id (unique identifier): an ID representing a (Store, Date) pair within the test set.
- Sales (response variable, continuous): the sales for one day for that store (this is what we are predicting).
- Store (categorical): a unique ID assigned to each store.
- Customers (continuous): the number of customers visiting the store on a given day.
- Open (categorical): 0 if the store was closed that day, 1 if open.
- StateHoliday (categorical): indicates a state holiday, on which stores are normally closed. a = public holiday, b = Easter holiday, c = Christmas, 0 = none.
- SchoolHoliday (categorical): indicates whether the store was affected by the closure of public schools on that date.
- StoreType (categorical): which of four different store models the store is: a, b, c, or d.
- Assortment (categorical): the item assortment level: a = basic, b = extra, c = extended.
- CompetitionDistance (continuous): the distance in meters to the nearest competitor store.
- CompetitionOpenSince[Month/Year] (categorical): the approximate year and month the nearest competitor opened.
- Promo (categorical): indicates whether the store was running a promotion that day.
- Promo2 (categorical): a continuing, consecutive promotion for some stores: 0 = the store is not participating, 1 = participating.
- Promo2Since[Year/Week] (categorical): the year and calendar week the store started participating in Promo2.
- PromoInterval (categorical): the consecutive intervals in which Promo2 starts anew, naming the months each round begins. For example, "Feb,May,Aug,Nov" means each round starts in February, May, August, and November of any given year for that store.


Visualizing the Data

Let’s take a look at the sales data for Store ID = 1, visualized using the TD-pandas package and a Jupyter notebook.

[Chart: daily sales for Store ID = 1]

By visualizing the data, we are able to review both the cyclic patterns of the amount of sales as well as the spikes in the data.

Visualization helps us infer what features (the day of week, weekend/weekdays, etc.) can be used to account for periodicity.
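The weekly periodicity this kind of chart reveals can be checked with a quick pandas aggregation. The sketch below runs locally on synthetic data (the article's actual chart pulls the table via TD-pandas), averaging sales by day of week:

```python
import numpy as np
import pandas as pd

# Synthetic daily sales for one store: a weekly cycle plus noise
# (a stand-in for the real train.csv data).
dates = pd.date_range("2013-01-01", "2013-03-31", freq="D")
rng = np.random.default_rng(0)
sales = 5000 + 1500 * np.sin(2 * np.pi * dates.dayofweek / 7) \
        + rng.normal(0, 100, len(dates))
df = pd.DataFrame({"date": dates, "sales": sales})

# Mean sales per day of week: a wide spread here suggests that
# dayofweek is a useful feature for the model.
weekly_profile = df.groupby(df["date"].dt.dayofweek)["sales"].mean()
print(weekly_profile.round(0))
```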

Pre-processing the Data

Joining the Data

First, we want to join the store information (such as promotions and competing stores) to the training set.

Note that the date is a string in the format yyyy-mm-dd (for example, 2014-05-25), so we extract the year, month, and day as substrings. Additionally, we exclude from training any rows where that day's sales were zero.

WITH t1 as (
  SELECT  
    rowid() as rowid,
    t.stateholiday, t.store, t.promo, t.dayofweek, t.date, t.schoolholiday, t.sales, 
    SUBSTR(t.date,1,4) as year,
    SUBSTR(t.date,6,2) as month,
    SUBSTR(t.date,9,2) as day
  FROM
    train_original t 
  WHERE sales != 0
)
INSERT OVERWRITE TABLE training2
  SELECT
    t1.*, 
    t2.promo2sinceweek, 
    t2.competitionopensinceyear, 
    t2.assortment, 
    t2.promo2sinceyear, 
    t2.competitiondistance, 
    t2.promointerval, 
    t2.promo2, 
    t2.storetype, 
    t2.competitionopensincemonth
  FROM
    t1
    JOIN store_raw t2 ON (t1.store = t2.store)
;

The results of the above Hive query are stored in training2 and shown below:

[Table: query results stored in training2]

Generating the Feature Vector

The training2 table we created in the previous step contains non-numeric data. In this step, we convert the non-numeric data to numeric data so that we can create a feature vector to feed into the Random Forest algorithm. A feature vector is an array that represents each feature as a numerical value.
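The idea behind this conversion can be sketched in a few lines of Python; this is an illustrative re-implementation of the concept, not Hivemall's actual UDTF:

```python
def assign_ids(values):
    """Map each distinct value to a small integer ID, in order of first appearance."""
    ids = {}
    out = []
    for v in values:
        if v not in ids:
            ids[v] = len(ids)
        out.append(ids[v])
    return out, ids

# e.g. a categorical column such as Assortment
encoded, mapping = assign_ids(["a", "c", "a", "d", "c"])
print(encoded)  # -> [0, 1, 0, 2, 1]
```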

WITH train_ordered as (
  select * from training2
  order by rowid asc
), 
train_quantified as (
  select
    t0.rowid,
    t0.sales,
    t2.*,
    t0.competitiondistance
  from
    train_ordered t0 
    -- indexing non-number columns
    LATERAL VIEW quantify(true,
          stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, promointerval, promo2, storetype, competitionopensincemonth, year, month, day) t2
       as stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
)
INSERT OVERWRITE TABLE training3
SELECT
  t1.rowid,
  ARRAY( -- padding zero for missing values
    t1.stateholiday, t1.store, t1.promo, t1.dayofweek, t1.schoolholiday,
    IF(t1.promo2sinceweek IS NULL, 0, t1.promo2sinceweek), 
    IF(t1.competitionopensinceyear IS NULL, 0, t1.competitionopensinceyear), 
    t1.assortment,
    IF(t1.promo2sinceyear IS NULL, 0, t1.promo2sinceyear),
    IF(t1.promointerval IS NULL, 0, t1.promointerval),
    t1.promo2, t1.storetype, 
    IF(t1.competitionopensincemonth IS NULL, 0, t1.competitionopensincemonth),
    t1.year, t1.month, t1.day,
    IF(t1.competitiondistance IS NULL, 0, t1.competitiondistance)
  ) AS features,
  LN(1 + t1.sales) AS label, -- log scale conversion
  t1.sales
FROM
  train_quantified t1
;

The quantify function seen here converts the non-numeric columns to numeric by outputting a numeric ID. In addition, missing values are replaced with zeroes and the response variable, sales, is transformed using a log scale conversion.
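The log conversion and its inverse can be verified in isolation (illustrative Python; the query computes LN(1 + sales), and the prediction step later applies EXP(p) - 1):

```python
import math

sales = 5263.0
label = math.log1p(sales)      # LN(1 + sales), the training label
recovered = math.expm1(label)  # EXP(label) - 1 reverses the transform
print(label, recovered)
```

The log scale compresses the long right tail of the sales distribution, so the squared-error splits inside each tree are less dominated by a few high-volume stores.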

The converted table is as follows:

[Table: training3 with feature vectors and log-scaled labels]

Preparing the Testing Data 

We also need to transform the test data by joining the store information table with the testing data.

INSERT OVERWRITE TABLE testing2
SELECT
  rowid() as rowid,
  t1.id,
  t1.stateholiday, t1.store, t1.promo, t1.dayofweek, t1.date, t1.schoolholiday,
  SUBSTR(t1.date,1,4) as year,
  SUBSTR(t1.date,6,2) as month,
  SUBSTR(t1.date,9,2) as day,
  t2.promo2sinceweek,
  t2.competitionopensinceyear,
  t2.assortment,
  t2.promo2sinceyear,
  t2.competitiondistance,
  t2.promointerval,
  t2.promo2,
  t2.storetype,
  t2.competitionopensincemonth
FROM
  test_original t1
  JOIN store_raw t2 ON (t1.store = t2.store);

The resulting table is shown below:

[Table: query results stored in testing2]

Next, we need to make the same conversion of categorical to continuous variables for the test data that we made earlier for the training data, and the mapping must be shared between the two. For example, the test data might contain only store types a, b, and c, while the training data also contains a fourth type, d. If we mapped categorical variables separately for the training and test data sets, the IDs would not line up. The query below quantifies both sets together to account for this.
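Why the joint mapping matters can be sketched in Python (illustrative, with hypothetical store-type values; not Hivemall's quantify):

```python
def fit_ids(values):
    """Assign integer IDs over a sequence of values, in order of first appearance."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids))
    return ids

train_types = ["a", "b", "c", "d"]  # training data contains type d
test_types = ["a", "b", "c"]        # test data happens to lack it

# Fitting the mapping on the concatenation gives both sets one
# shared vocabulary, so the same letter always gets the same ID.
mapping = fit_ids(train_types + test_types)
train_enc = [mapping[v] for v in train_types]
test_enc = [mapping[v] for v in test_types]
print(train_enc, test_enc)  # -> [0, 1, 2, 3] [0, 1, 2]
```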

WITH train_test as (
      select
        1 as train_first, false as output_row,
	rowid, stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
      from
        training2
      union all
      select
        2 as train_first, true as output_row,
	rowid, stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
      from
        testing2
),
train_test_ordered as (
    select * from train_test
    order by train_first asc, rowid asc
),
test_quantified as (
  select
    t1.rowid,
    t2.*
  from 
    train_test_ordered t1 
    LATERAL VIEW quantify(output_row,
    	    stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
    ) t2 as stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
) 
INSERT OVERWRITE TABLE testing3
SELECT
  t3.rowid, t4.id,
  ARRAY(
    t3.stateholiday, t3.store, t3.promo, t3.dayofweek, t3.schoolholiday,
    IF(t3.promo2sinceweek IS NULL, 0, t3.promo2sinceweek), 
    IF(t3.competitionopensinceyear IS NULL, 0, t3.competitionopensinceyear),
    t3.assortment,
    IF(t3.promo2sinceyear IS NULL, 0, t3.promo2sinceyear), 
    IF(t3.promointerval IS NULL, 0, t3.promointerval),
    t3.promo2, t3.storetype, 
    IF(t3.competitionopensincemonth IS NULL, 0, t3.competitionopensincemonth),
    t3.year, t3.month, t3.day,
    IF(t3.competitiondistance IS NULL, 0, t3.competitiondistance)
 ) AS features
FROM
  test_quantified t3
  LEFT OUTER JOIN testing2 t4 ON (t3.rowid=t4.rowid);

The final test table is as follows:

[Table: testing3 with feature vectors]

Using Random Forest

Now with the training data we created in the last step, we can finally apply the Random Forest algorithm. Random Forest is an ensemble learning technique that constructs multiple decision trees via randomization. In this analysis, we will build 100 different trees.

The query below runs the algorithm. Here, we create the 100 trees by running five parallel regressions of 20 trees each and combining them with UNION ALL. The -attrs option declares which variables are categorical (marked C) and which are quantitative (marked Q); only the competitiondistance column is continuous.

INSERT OVERWRITE TABLE model 
  SELECT 
    -- C: Categorical Variable, Q: Quantitative Variable
    train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
  FROM
    training3
  UNION ALL
  SELECT 
    train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
  FROM
    training3
  UNION ALL
  SELECT 
    train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
  FROM
    training3
  UNION ALL
  SELECT 
    train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
  FROM
    training3
  UNION ALL
  SELECT 
    train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
  FROM
    training3;
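For readers working outside Hive, roughly the same model can be sketched with scikit-learn. The columns and data below are synthetic stand-ins for training3, so this is an illustrative local equivalent rather than the Hivemall pipeline:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)

# Tiny synthetic stand-in for training3: integer-coded categoricals
# plus one continuous column, and a log-scaled target.
X = np.column_stack([
    rng.integers(0, 7, 500),     # dayofweek (categorical ID)
    rng.integers(0, 2, 500),     # promo
    rng.uniform(50, 5000, 500),  # competitiondistance (continuous)
])
sales = 3000 + 800 * X[:, 1] - 0.2 * X[:, 2] + rng.normal(0, 50, 500)
y = np.log1p(sales)              # LN(1 + sales), as in the Hive query

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
pred_sales = np.expm1(model.predict(X))  # back to the sales scale
```

Note that scikit-learn treats every column as numeric, so integer-coded categoricals are split on as ordered values; that is a common approximation, whereas Hivemall's -attrs flag handles them as true categoricals.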

Selecting Variable Importance

From the results of the Random Forest query, we can measure how much each explanatory variable affects the model. Using a combination of Jupyter Notebook, pandas, and some visualization, we can render the variable importances as a bar chart.

The bar chart suggests that the store itself and the distance to its nearest competitor affect sales the most. This is consistent with the intuition that each store's own characteristics strongly influence its sales numbers, and that a nearby competitor can eat into them.

The Jupyter notebook that generated the visualization can be seen here.

Prediction

Next, we'll make a prediction using the model we created. The response variable at training time was LN(1 + sales), so the reverse conversion is EXP(predicted) - 1.

INSERT OVERWRITE TABLE prediction
SELECT 
  id,
  EXP(predicted)-1 as predicted
FROM(
  SELECT
     id,
     avg(predicted) AS predicted
  FROM(
    SELECT
      t.id,
      tree_predict(p.model_id, p.model_type, p.pred_model, t.features, false) as predicted
    FROM
      model p
      -- join with no condition: every tree scores every test row
      LEFT OUTER JOIN testing3 t
  ) t1
  group by
    id
) t2;
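What the nested query computes (every tree scoring every row, then an average per ID, followed by the EXP(p) - 1 back-transform) can be sketched with pandas on illustrative numbers:

```python
import numpy as np
import pandas as pd

# Each (tree, id) pair carries one tree's log-scale prediction,
# mirroring the cross join of the model table with testing3.
per_tree = pd.DataFrame({
    "id":        [1, 1, 1, 2, 2, 2],
    "tree":      [0, 1, 2, 0, 1, 2],
    "predicted": [8.50, 8.60, 8.55, 7.90, 8.00, 7.95],
})

# avg(predicted) grouped by id, then EXP(p) - 1 back to the sales scale.
forecast = np.expm1(per_tree.groupby("id")["predicted"].mean())
print(forecast.round(1))
```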

To submit the results to Kaggle, we sort the prediction results in ascending order of ID.

SELECT 
  id,
  predicted as sales
FROM
  prediction
ORDER BY id ASC;

Evaluation

To evaluate the strength of our predictions, we use the competition's metric, the Root Mean Square Percentage Error (RMSPE):

RMSPE = sqrt( (1/n) * sum_{i=1..n} ((y_i - yhat_i) / y_i)^2 )

where y_i is the sales on the i-th day for that store and yhat_i is the predicted value; days with zero sales are ignored in the scoring.
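A minimal Python implementation of this metric (skipping rows with zero actual sales, per the competition's convention):

```python
import numpy as np

def rmspe(y_true, y_pred):
    """Root Mean Square Percentage Error, ignoring zero-sales rows."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mask = y_true != 0
    return np.sqrt(np.mean(((y_true[mask] - y_pred[mask]) / y_true[mask]) ** 2))

print(rmspe([100, 200, 0], [110, 180, 50]))  # ≈ 0.1
```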

The code used to cross-validate can be found here.

Conclusion

Kaggle is the perfect place to learn the basics of data analysis. The sales prediction analysis shown today could be applied to various applications, like e-commerce or advertising campaigns.

Treasure Data provides a flexible, turn-key environment to upload, pre-process, and train on data sets for machine learning. If you are interested in learning more, please request a demo.



Published at DZone with permission of Diana Shealy, DZone MVB. See the original article here.

