You can get started with powerful machine learning tools quickly and easily using open-source packages, but tuning these models is often a non-intuitive, time-consuming process. The tunable parameters (hyperparameters) of a model can greatly affect its accuracy. While these tools try to set reasonable default hyperparameters for you, the defaults often fail to give optimal results on real-world datasets. When every model evaluation can take hours or days on powerful clusters and the model fit can have a large impact on your overall system, it is important to find the best hyperparameters as quickly as possible.
In this post, we’ll show you how different hyperparameter optimization strategies like using model defaults, grid search, random search, and Bayesian Optimization (SigOpt) can change the model fit for various classifiers and famous datasets.
The poker dataset tries to classify poker hands given a set of five cards; we’ll train on 10,000 random poker hands and test using a different 10,000 random poker hands. The connect-4 dataset classifies the winner of a game of Connect Four given a game state; we’ll train on 60,000 games and test on 7,557 games. The USPS dataset tries to classify handwritten digits in zip codes; we’ll train on 7,291 images and test on 2,007 images. The satimage dataset attempts to classify soil types using satellite images; we’ll train on 9,539 images and test on 1,331 images. The hyperparameter tuning methods described below can be used for any dataset and any classifier. The code for running these examples is available on GitHub. You can easily modify it to use your data or classifier of choice.
Hyperparameters to Optimize
Each classifier tries to build a model from the training data that gives the best fit on the testing data. Both the gradient boosting classifier (GBC) and the support vector classifier (SVC) have several tunable hyperparameters that can greatly affect the model fit.
Default Hyperparameters vs. Bayesian Optimization
Scikit-learn makes it very easy to get these classifiers up and running and provides default hyperparameter values that try to fit a wide variety of datasets. Because these defaults are not tuned for any specific dataset, they often produce a sub-optimal fit for your specific problem. Bayesian Optimization (via SigOpt) lets you achieve a better model fit than the defaults.
Grid Search vs. Bayesian Optimization
Grid search cuts the space of possible hyperparameters into an equally spaced grid in each dimension and samples at each intersection of the grid. This provides a uniform search over the space, but the number of evaluations grows exponentially with the number of dimensions being searched. SigOpt finds better hyperparameters than grid search with fewer function evaluations. While grid search is exponential in the number of hyperparameters, we have found in practice that SigOpt finds optima in a linear number of evaluations. We can already see massive speed gains in these three- and four-dimensional spaces. For many complex machine learning tasks, evaluation can take hours or even days on supercomputers, so every evaluation is precious.
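As a rough illustration of the grid search described above, here is a minimal standard-library sketch. The `toy_score` function, the hyperparameter names, and the grid values are all hypothetical stand-ins chosen for this example; in practice you would replace `toy_score` with a real train-and-evaluate step for your classifier.

```python
import itertools
import math


def toy_score(learning_rate, max_depth, subsample):
    """Hypothetical stand-in for validation accuracy (higher is better)."""
    return (
        1.0
        - 0.05 * (math.log10(learning_rate) + 1.0) ** 2
        - 0.01 * (max_depth - 4) ** 2
        - (subsample - 0.8) ** 2
    )


def grid_search(score_fn, grid):
    """Evaluate score_fn at every intersection of the grid.

    Returns (best_score, best_params, n_evals).
    """
    names = list(grid)
    best_score, best_params, n_evals = float("-inf"), None, 0
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        n_evals += 1
        if score > best_score:
            best_score, best_params = score, params
    return best_score, best_params, n_evals


# Illustrative grid: 4 values in each of 3 dimensions (assumed, not from the post).
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "max_depth": [2, 4, 6, 8],
    "subsample": [0.4, 0.6, 0.8, 1.0],
}

best_score, best_params, n_evals = grid_search(toy_score, grid)
print(f"{n_evals} evaluations, best params: {best_params}")

# The cost is the product of the per-dimension grid sizes, so adding a
# fourth dimension with 4 values multiplies the number of evaluations by 4.
evals_3d = math.prod(len(values) for values in grid.values())
evals_4d = evals_3d * 4
print(evals_3d, evals_4d)
```

The evaluation count is fixed by the grid before any model is trained, which is why every added hyperparameter multiplies the cost.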
Column 1 shows the classifier and dataset being compared. "SigOpt speed" denotes how much faster SigOpt found an optimum than grid search, measured in model evaluations. "SigOpt versus Grid" shows how much better SigOpt's optimum was than the best point grid search had found with the same number of model evaluations. "SigOpt versus full grid" shows the gain SigOpt achieved, in its much smaller number of evaluations, over an exhaustive grid search (192 evaluations) of the space.
Random Search vs. Bayesian Optimization
Random search picks random hyperparameters from the space and observes how they change the model fit. After a fixed number of iterations, the best values observed are used. While this method lets the user potentially stumble upon the best hyperparameters, it fares worse than SigOpt, which can find better hyperparameters faster.
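The random search procedure described above can be sketched in a few lines of standard-library Python. Everything named here is an assumption made for illustration: `toy_score` is a hypothetical stand-in for a real train-and-evaluate step, and the bounds on `log10_C` and `log10_gamma` are arbitrary example ranges, not the ones used for the results in this post.

```python
import random


def toy_score(log10_C, log10_gamma):
    """Hypothetical stand-in for validation accuracy.

    Peaks at 1.0 when log10_C == 1 and log10_gamma == -2.
    """
    return 1.0 - 0.05 * (log10_C - 1.0) ** 2 - 0.05 * (log10_gamma + 2.0) ** 2


def random_search(score_fn, bounds, n_iter, seed=0):
    """Sample n_iter points uniformly within bounds and keep the best one.

    Returns (best_score, best_params, trace), where trace[i] is the best
    score seen after i + 1 evaluations.
    """
    rng = random.Random(seed)
    best_score, best_params, trace = float("-inf"), None, []
    for _ in range(n_iter):
        params = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        score = score_fn(**params)
        if score > best_score:
            best_score, best_params = score, params
        trace.append(best_score)
    return best_score, best_params, trace


# Assumed example search ranges (in log10 units).
bounds = {"log10_C": (-2.0, 3.0), "log10_gamma": (-4.0, 1.0)}
best_score, best_params, trace = random_search(toy_score, bounds, n_iter=50)
print(f"best score {best_score:.3f} at {best_params}")
```

Unlike grid search, the evaluation budget `n_iter` is chosen independently of the number of hyperparameters, but each sample ignores everything learned from earlier evaluations.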
Column 1 shows the classifier and dataset being compared. "SigOpt speed" denotes how much faster SigOpt found an optimum than a uniform random search, measured in the total number of model evaluations before finding an optimum. "SigOpt fit versus random fit" shows how much better SigOpt's optimum was than the best point random search had found with the same number of model evaluations.
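One possible way to compute comparison columns like the ones described above is from "best score so far" traces, one entry per model evaluation. This is only a sketch of one reading of those definitions, not necessarily how the numbers in this post were produced; the helper names are hypothetical and the two traces are made-up values used purely to exercise the code.

```python
def evals_to_reach(trace, target):
    """Return the 1-indexed evaluation at which trace first reaches target, or None."""
    for i, best in enumerate(trace, start=1):
        if best >= target:
            return i
    return None


def compare(candidate_trace, baseline_trace):
    """Compare a candidate optimizer against a baseline.

    speedup: evaluations the baseline needed to reach its own final best,
             divided by the evaluations the candidate needed to match it
             (None if the candidate never matches it).
    gain:    candidate's final best minus the baseline's best after the
             same number of evaluations (assumes the baseline trace is at
             least as long as the candidate trace).
    """
    baseline_best = baseline_trace[-1]
    n_baseline = evals_to_reach(baseline_trace, baseline_best)
    n_candidate = evals_to_reach(candidate_trace, baseline_best)
    speedup = None if n_candidate is None else n_baseline / n_candidate
    budget = len(candidate_trace)
    gain = candidate_trace[-1] - baseline_trace[budget - 1]
    return speedup, gain


# Made-up traces for illustration only; these are NOT results from the post.
candidate = [0.70, 0.85, 0.90, 0.92]
baseline = [0.60, 0.65, 0.80, 0.85, 0.85, 0.85]
speedup, gain = compare(candidate, baseline)
print(f"speedup: {speedup:.1f}x, gain at equal budget: {gain:.2f}")
```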
Model Tuning Matters
Evaluating different hyperparameters for a model becomes increasingly time-consuming and expensive as you train on more data. Model fit for tasks like click-through rate (CTR) prediction or user recommendations can have a large impact on your overall system and bottom line. Using the right tools to tune your models can save you time and money.