The first challenge is predicting the retail sales for the Rossman stores (the full details at Kaggle). We will use an ensemble learning technique known as Random Forest regression.

Rossman is a pharmacy chain with over 3,000 stores in seven countries within Europe. The manager of each store has been tasked to **predict the sales of the store for up to six weeks in advance**. Sales of each store depends on various factors such as each store’s promotional activities, public and school holidays, seasonal and regional characteristics.

For the competition, Rossmann has provided training data for each day in 1,115 of their stores over six weeks in Germany from January 1, 2013 to July 31, 2015. The test data set runs from August 1, 2015 to September 17, 2015.

Next we will explain in detail how to analyze this data on Treasure Data using Hivemall.

## Preparing the Data

### Importing Into Treasure Data

Rossman Store Sales task provides three datasets in CSV format: the training data set (train.csv), the verification data set (test.csv) and the store information data set. Stores are identified by Store IDs in all three datasets.

After downloading the data from Kaggle, you can drag-and-drop the files into Treasure Data through the File Upload option. Treasure Data automatically creates the table definition by looking at the header as well as the actual data columns.

Each data file is configured as follows:

#### Training Data (train.csv)

#### Testing Data (test.csv)

#### Store Information (store.csv)

With data information described below:

COLUMN NAME | TYPE | DESCRIPTION |
---|---|---|

ID | Unique Identifier | An ID that represents a store and date duple within the test set |

Sales | Response Variable / Continuous | the sales for one day for that shop (This is what you are predicting) |

Store | Categorical | Unique ID assigned to each shop |

Customers | Continuous | The number of customers visiting that store in one day |

Open | Categorical | 0 if store is closed that day, 1 if open |

StateHoliday | Categorical | Indicates a state holiday. Normally stores are closed on a holiday: a = public holiday, b = Easter holiday, c = christmas, 0 = None |

SchoolHoliday | Categorical | Indicates if the store was affected by the closure of public schools on that date |

StoreType | Categorical | Defines which of four different store models the store is: a, b, c, d |

Assortment | Categorical | Describes the item assortment level: a = basic, b = extra, c = extended |

CompetitionDistance | Continuous | The distance in meters to the nearest competitor store |

CompetitionOpenSince [Month / Year] | Categorical | Gives the approximate year and month that the nearest competitor was opened |

Promo | Categorical | Indicates where a store is running a promotion that date |

Promo2 | Categorical | Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating |

Promo2Since [Year / Week] | Categorical | Describes the year and calendar week when the store started participating in Promo2 |

PromoInterval | Categorical | Describes the consecutive intervals Promo2 is started, naming the months the promotions is started anew. E.g. “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store. |

ID Unique Identifier An ID that represents a store and date duple within the test set Sales Response Variable / Continuous the sales for one day for that shop (This is what you are predicting) Store Categorical Unique ID assigned to each shop Customers Continuous The number of customers visiting that store in one day Open Categorical 0 if store is closed that day, 1 if open StateHoliday Categorical Indicates a state holiday. Normally stores are closed on a holiday: a = public holiday, b = Easter holiday, c = christmas, 0 = None SchoolHoliday Categorical Indicates if the store was affected by the closure of public schools on that date StoreType Categorical Defines which of four different store models the store is: a, b, c, d Assortment Categorical Describes the item assortment level: a = basic, b = extra, c = extended CompetitionDistance Continuous The distance in meters to the nearest competitor store CompetitionOpenSince [Month / Year] Categorical Gives the approximate year and month that the nearest competitor was opened Promo Categorical Indicates where a store is running a promotion that date Promo2 Categorical Promo2 is a continuing and consecutive promotion for some stores: 0 = store is not participating, 1 = store is participating Promo2Since [Year / Week] Categorical Describes the year and calendar week when the store started participating in Promo2 PromoInterval Categorical Describes the consecutive intervals Promo2 is started, naming the months the promotions is started anew. E.g. “Feb,May,Aug,Nov” means each round starts in February, May, August, November of any given year for that store

### Visualizing the Data

Let’s take a look at the sales data for Store ID = 1. This script visualizes the data using the TD-pandas package and Jupyter Notebook.

By visualizing the data, we are able to review both the cyclic patterns of the amount of sales as well as the spikes in the data.

Visualization helps us infer what features (the day of week, weekend/weekdays, etc.) can be used to account for periodicity.

### Pre-processing the Data

#### Joining the Data

First, we want to join the store information in the training set (including information such as the promotions and competing stores.

It should be noted that the date is a string in the format of yyyy-mm-dd (such as, for example, 2014-05-25), so we will need to extract year, month, and day as partial strings. Additionally, we excluded from the training any rows where sales were equal to zero that day.

```
WITH t1 as (
SELECT
rowid() as rowid,
t.stateholiday, t.store, t.promo, t.dayofweek, t.date, t.schoolholiday, t.sales,
SUBSTR(t.date,1,4) as year,
SUBSTR(t.date,6,2) as month,
SUBSTR(t.date,9,2) as day
FROM
train_original t
WHERE sales != 0
)
INSERT OVERWRITE TABLE training2
SELECT
t1.*,
t2.promo2sinceweek,
t2.competitionopensinceyear,
t2.assortment,
t2.promo2sinceyear,
t2.competitiondistance,
t2.promointerval,
t2.promo2,
t2.storetype,
t2.competitionopensincemonth
FROM
t1
JOIN store_raw t2 ON (t1.store = t2.store)
;
view raw
```

The results of the above Hive query are stored in training2 and shown below:

#### Generating the Feature Vector

The table, training2, that we created in the previous step, contains non-numeric data. In this step, we will convert the non-numeric data to numeric data so that we can create a feature vector that can be fed into the Random Forest algorithm. A feature vector is an array that represents each feature quantity as a numerical value.

```
WITH train_ordered as (
select * from training2
order by rowid asc
),
train_quantified as (
select
t0.rowid,
t0.sales,
t2.*,
t0.competitiondistance
from
train_ordered t0
-- indexing non-number columns
LATERAL VIEW quantify(true,
stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, promointerval, promo2, storetype, competitionopensincemonth, year, month, day) t2
as stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
)
INSERT OVERWRITE TABLE training3
SELECT
t1.rowid,
ARRAY( -- padding zero for missing values
t1.stateholiday, t1.store, t1.promo, t1.dayofweek, t1.schoolholiday,
IF(t1.promo2sinceweek IS NULL, 0, t1.promo2sinceweek),
IF(t1.competitionopensinceyear IS NULL, 0, t1.competitionopensinceyear),
t1.assortment,
IF(t1.promo2sinceyear IS NULL, 0, t1.promo2sinceyear),
IF(t1.promointerval IS NULL, 0, t1.promointerval),
t1.promo2, t1.storetype,
IF(t1.competitionopensincemonth IS NULL, 0, t1.competitionopensincemonth),
t1.year, t1.month, t1.day,
IF(t1.competitiondistance IS NULL, 0, t1.competitiondistance)
) AS features,
LN(1 + t1.sales) AS label, -- log scale conversion
t1.sales
FROM
train_quantified t1
;
```

The quantify function seen here converts the non-numeric columns to numeric by outputting a numeric ID. In addition, missing values are replaced with zeroes and the response variable, sales, is transformed using a log scale conversion.

The converted table is as follows:

#### Preparing the Testing Data

We also need to transform the test data by joining the store information table with the testing data.

```
INSERT OVERWRITE TABLE testing2
SELECT
rowid() as rowid,
t1.id,
t1.stateholiday, t1.store, t1.promo, t1.dayofweek, t1.date, t1.schoolholiday,
SUBSTR(t1.date,1,4) as year,
SUBSTR(t1.date,6,2) as month,
SUBSTR(t1.date,9,2) as day,
t2.promo2sinceweek,
t2.competitionopensinceyear,
t2.assortment,
t2.promo2sinceyear,
t2.competitiondistance,
t2.promointerval,
t2.promo2,
t2.storetype,
t2.competitionopensincemonth
FROM
test_original t1
JOIN store_raw t2 ON (t1.store = t2.store);
```

The resulting table is shown below:

Next, it is necessary to make the same conversion of categorical to continuous variables that we did earlier in the training data, and then process the test data. For example, StoreType can be either a, b, or c but there is also a fourth type d, in the training data. If we mapped categorical variables separately for the training and the test data sets, there would be a mismatch later. The above transformation accounts for this case.

```
WITH train_test as (
select
1 as train_first, false as output_row,
rowid, stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
from
training2
union all
select
2 as train_first, true as output_row,
rowid, stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
from
testing2
),
train_test_ordered as (
select * from train_test
order by train_first asc, rowid asc
),
test_quantified as (
select
t1.rowid,
t2.*
from
train_test_ordered t1
LATERAL VIEW quantify(output_row,
stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
) t2 as stateholiday, store, promo, dayofweek, schoolholiday, promo2sinceweek, competitionopensinceyear, assortment, promo2sinceyear, competitiondistance, promointerval, promo2, storetype, competitionopensincemonth, year, month, day
)
INSERT OVERWRITE TABLE testing3
SELECT
t3.rowid, t4.id,
ARRAY(
t3.stateholiday, t3.store, t3.promo, t3.dayofweek, t3.schoolholiday,
IF(t3.promo2sinceweek IS NULL, 0, t3.promo2sinceweek),
IF(t3.competitionopensinceyear IS NULL, 0, t3.competitionopensinceyear),
t3.assortment,
IF(t3.promo2sinceyear IS NULL, 0, t3.promo2sinceyear),
IF(t3.promointerval IS NULL, 0, t3.promointerval),
t3.promo2, t3.storetype,
IF(t3.competitionopensincemonth IS NULL, 0, t3.competitionopensincemonth),
t3.year, t3.month, t3.day,
IF(t3.competitiondistance IS NULL, 0, t3.competitiondistance)
) AS features
FROM
test_quantified t3
LEFT OUTER JOIN testing2 t4 ON (t3.rowid=t4.rowid);
```

The results of the final test tables is as follows:

### Using Random Forest

Now with the training data we created in the last step, we can finally apply the Random Forest algorithm. Random Forest is an ensemble learning technique that constructs multiple decision trees via randomization. In this analysis, we will build 100 different trees.

The query below runs the algorithm. Here, we create the 100 trees by running 5 parallel regressions that create 20 trees each and then using UNION ALL to bring them together. The ‘-Attrs’ flag defines what variables are categorical and which are continuous, with variables with the option C being categorical and those with an option Q being continuous. Only the column competition distance is a continuous variable.

```
INSERT OVERWRITE TABLE model
SELECT
-- C: Categorical Variable, Q: Quantitative Variable
train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
FROM
training3
UNION ALL
SELECT
train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
FROM
training3
UNION ALL
SELECT
train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
FROM
training3
UNION ALL
SELECT
train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
FROM
training3
UNION ALL
SELECT
train_randomforest_regr(features, label, '-trees 20 -attrs C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,Q')
FROM
training3;
```

#### Selecting Variable Importance

From the results of the Random Forest query, we can show how much each explanatory variable affects the model. Using a combination of Jupyter notebook, Pandas and some visualization, we can see the execution results as a bar chart.

The bar chart suggests that which store it is and the distance to its nearest competition affects the of the store the most. This is consistent with the intuition that stores themselves have a big influence on its sales numbers, and if there’s a competitor nearby, they can eat into your sales.

The Jupyter notebook that generated the visualization can be seen here.

### Prediction

Next we’ll make a prediction using the model we created. The response variable at the time of learning is LN(1 + t1.sales) after converting the scale, so the reverse conversion would be EXP(predicted-1).

```
INSERT OVERWRITE TABLE prediction
SELECT
id,
EXP(predicted)-1 as predicted
FROM(
SELECT
id,
avg(predicted) AS predicted
FROM(
SELECT
t.id,
tree_predict(p.model_id, p.model_type, p.pred_model, t.features, false) as predicted
FROM
model p
LEFT OUTER JOIN testing3 t
) t1
group by
id
) t2;
```

To submit the results to Kaggle, we’ll sort the prediction results in ascending order of Store ID.

```
SELECT
id,
predicted as sales
FROM
prediction
ORDER BY id ASC;
```

### Evaluation

To evaluate the strength of our predictions, we will use the following equation:

Where y_{i} is the sales of that the *i*th day for that store, and y^{^}_{i} is the predicted value.

The code use to cross-validate can be found here.

### Conclusion

Kaggle is the perfect place to learn the basics of data analysis. The sales prediction analysis shown today could be applied to various applications, like e-commerce or advertising campaigns.

Treasure Data provides a flexible and turn-key environment to upload, pre-process and train data sets for machine learning. If you are interested in learning more, please follow the link below to request a demo.

## {{ parent.title || parent.header.title}}

## {{ parent.tldr }}

## {{ parent.linkDescription }}

{{ parent.urlSource.name }}