
Programming Boosted Trees Using BigML's API


Anything that can be done with the BigML Dashboard can be done using the BigML API — including programming Boosted Trees.


In this, the fourth of our blog posts for the Winter 2017 release, we will explore how to use Boosted Trees from the API. Boosted Trees are the latest supervised learning technique in BigML's toolbox. As we have seen, they differ from more traditional ensembles in that no single tree tries to make a correct prediction on its own; instead, each tree is designed to nudge the overall ensemble toward the correct answer.

This post will be very similar to our second post about using Boosted Trees in the BigML Dashboard. Anything that can be done from the Dashboard can be done with our API. Resources created using the BigML API can all be seen in the Dashboard view, as well, so you can take full advantage of our visualizations.


If you have never used the API before, you will need to go through a quick setup. Simply set the environment variables BIGML_USERNAME, BIGML_API_KEY, and BIGML_AUTH. BIGML_USERNAME is just your username. Your BIGML_API_KEY can be found in the Dashboard by clicking on your username to pull up the Account page and then clicking API Key. BIGML_AUTH is set as a combination of the two:

"username=$BIGML_USERNAME;api_key=$BIGML_API_KEY;"
1. Upload Your Data

Just as with the Dashboard, your first step is uploading some data to be processed. You can point to a remote source, or upload directly from your computer in a variety of popular file formats.

To do this, you can use the terminal with curl (or any other command-line tool that can make HTTPS requests). In this example, we are uploading the local file 'oscars.csv'.

curl "https://bigml.io/source?$BIGML_AUTH"
       -F file=@oscars.csv

2. Create a Dataset

A BigML dataset resource is a serialized form of your data, with some simple statistics already calculated and ready to be processed by Machine Learning algorithms. To create a dataset from your uploaded data, use:

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST \
       -H 'content-type: application/json' \
       -d '{"source": "source/58c05080983efc27100012fd"}'

In order to know we are creating a meaningful Boosted Tree, we need to split this dataset into two parts: a training dataset to create the model and a test dataset to evaluate how the model is doing. We will need two more commands to do just that:

curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset" : "dataset/58c051f6983efc2710001302", \
            "sample_rate" : 0.8, "seed":"foo"}'
curl "https://bigml.io/dataset?$BIGML_AUTH"
       -X POST \
       -H 'content-type: application/json' \
       -d '{"origin_dataset" : "dataset/58c051f6983efc2710001302", \
            "sample_rate" : 0.8, “out_of_bag” : true, "seed":"foo"}'

This is pretty similar to how we created our dataset — with some key differences. First, since we are creating these datasets from another dataset, we need to use origin_dataset. We are sampling at a rate of 80% for the first training dataset, and then setting out_of_bag to true to get the other 20% for the second test dataset. The seed is arbitrary, but we need to use the same one for each dataset.
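To see why the shared seed matters, here is a toy Python sketch of seeded, per-row sampling. The hashing scheme is invented for illustration (it is not how BigML samples internally), but it shows the key behavior: with the same seed, the out_of_bag sample is the exact complement of the regular sample.

```python
import hashlib

def sampled(row_id, rate, seed, out_of_bag=False):
    """Deterministic per-row sampling decision keyed on (seed, row_id).

    A sketch of how a seeded split behaves; not BigML's actual scheme.
    """
    digest = hashlib.md5(f"{seed}:{row_id}".encode()).hexdigest()
    in_sample = int(digest, 16) / 16**32 < rate
    return not in_sample if out_of_bag else in_sample

rows = range(1000)
train = [r for r in rows if sampled(r, 0.8, "foo")]
test = [r for r in rows if sampled(r, 0.8, "foo", out_of_bag=True)]

# Same seed: the two samples are disjoint and together cover every row.
print(len(train), len(test))
```

With a different seed for the second call, the two samples would overlap, and some rows would appear in neither dataset.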

3. Create Your Boosted Trees

Using the training dataset, we will now make an ensemble. A BigML ensemble will construct Boosted Trees if it is passed a "boosting" parameter, which is a map of boosting-specific settings. In the example below, "boosting" will use ten iterations with a learning rate of 10%. BigML automatically picks the last field of your dataset as the objective field. If this is incorrect, you will want to explicitly pass it the objective field ID.

curl "https://bigml.io/ensemble?$BIGML_AUTH"
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c053ac983efc2708000bbf", \
            "objective_field":"000013", \
            "boosting": {"iterations":10, "learning_rate":0.10}}'

Some other parameters for Boosting include:

  • early_holdout: The portion of the dataset that will be held out for testing at the end of every iteration. If no significant improvement is made on the holdout, Boosting will stop early. The default is 0.

  • early_out_of_bag: Whether Out of Bag samples are tested after every iteration and may result in an early stop if no significant improvement is made. To use this option, an "ensemble_sample" must also be requested. The default is true.

  • ensemble_sample: The portion of the input dataset to be sampled for each iteration in the ensemble. The default rate is 1, with replacement true.

For example, we will try setting "early_out_of_bag" to true. To do this, we will also have to set an "ensemble_sample", say to 65%. This looks like:

curl "https://bigml.io/ensemble?$BIGML_AUTH"
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c053ac983efc2708000bbf", \
            "objective_field":"000013", \
            "boosting": {"iterations":10, "learning_rate":0.10, "early_out_of_bag":true} \
            "ensemble_sample": {"rate": 0.65, "replacement": false, "seed": "foo"}}'

4. Evaluate Your Boosted Trees

In order to see how well your model is performing, you will need to evaluate it against some test data. This will return an evaluation resource with a result object. For classification models, this will include accuracy, average_f_measure, average_phi, average_precision, average_recall, and a confusion_matrix for the model.

So that we can be sure the model is making useful predictions, we include these same statistics for two simplistic alternative predictors: one that picks random classes and one that always picks the most common class.

For regression models, we include the average_error, mean_squared_error, and r_squared. Similarly, we compare regression models to a random predictor and a predictor that always chooses the mean.
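A toy Python illustration of two of these classification statistics, accuracy and the confusion matrix, along with the most-common-class baseline described above (the labels are invented for the example):

```python
from collections import Counter

# Invented labels for illustration.
actual    = ["win", "lose", "lose", "win", "lose", "lose"]
predicted = ["win", "lose", "win",  "win", "lose", "lose"]

# Accuracy: fraction of predictions that match the actual class.
accuracy = sum(a == p for a, p in zip(actual, predicted)) / len(actual)

# Confusion matrix: rows are actual classes, columns are predicted classes.
classes = sorted(set(actual))
confusion = [[sum(1 for a, p in zip(actual, predicted) if a == r and p == c)
              for c in classes] for r in classes]

# Baseline that always predicts the most common class.
mode = Counter(actual).most_common(1)[0][0]
baseline_accuracy = sum(a == mode for a in actual) / len(actual)

print(accuracy, baseline_accuracy, confusion)
```

A useful model should clearly beat the baseline; if its accuracy is close to the most-common-class predictor's, it has learned little beyond the class distribution.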

curl "https://bigml.io/evaluation?$BIGML_AUTH"
       -X POST \
       -H 'content-type: application/json' \
       -d '{"dataset": "dataset/58c0543e983efc2702000c51", \
            "ensemble": "ensemble/58c05480983efc2710001306"}'

5. Make Predictions

Once you are satisfied with your evaluation, you can create one last Boosted Trees model with your entire dataset. Now, it is ready to make predictions on some new data. This is done in similar fashion to other BigML models.

curl "https://bigml.io/batchprediction?$BIGML_AUTH"
       -X POST \
       -H 'content-type: application/json' \
       -d '{"ensemble": "ensemble/58c05480983efc2710001306", \
            "dataset": "dataset/58c0543e983efc2702000c51"}'

In our next post of the series, we will see how to automate these steps with WhizzML, BigML’s domain-specific scripting language, and the Python bindings.


Topics: big data, data analytics, boosted trees, api, tutorial

Published at DZone with permission of Adam Ashenfelter, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.
