scikit-learn: Using GridSearch to Tune the Hyperparameters of VotingClassifier

When building a classification ensemble, you need to be sure that the right classifiers are being included and the wrong ones are being excluded. Here's how to do that.

In my last post, I showed how to create a multi-class classification ensemble using scikit-learn’s VotingClassifier, and finished by mentioning that I didn’t know which classifiers should be part of the ensemble.

Each classifier in the ensemble should improve the overall score; any that doesn't can be excluded.

We have a TF/IDF-based classifier as well as the classifiers I wrote about in the last post. This is the code describing the classifiers:

import pandas as pd
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
 
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
 
Y_COLUMN = "author"
TEXT_COLUMN = "text"
 
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
 
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])
 
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=3, max_features=None,
                              strip_accents='unicode', analyzer='word', token_pattern=r'\w{1,}',
                              ngram_range=(1, 3), use_idf=True, smooth_idf=True, sublinear_tf=True,
                              stop_words='english')),
    ('mnb', MultinomialNB())
])
 
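# The order here matters: each weight list in the grid search maps onto
# the classifiers in this order (ngram, unigram, tfidf).
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
    ("tfidf", tfidf_pipe),
]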
 
mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
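
For context on what voting="soft" does with weights, here is a minimal pure-Python sketch of a weighted average of class probabilities, which is the idea behind soft voting. The soft_vote helper and the probability values are made up for illustration and are not part of scikit-learn or the original code. It suggests why a weight of 0 gives the same prediction as leaving a classifier out.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-classifier class probabilities.

    probabilities: one list of class probabilities per classifier.
    weights: one numeric weight per classifier (not all zero).
    """
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Hypothetical predicted probabilities for one sample and three classes,
# one row per classifier (values invented for this example).
probs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.1, 0.1, 0.8],
]

# Giving the third classifier a weight of 0 ...
with_zero_weight = soft_vote(probs, [1, 1, 0])
# ... yields the same averaged probabilities as leaving it out.
without_third = soft_vote(probs[:2], [1, 1])

print(with_zero_weight)
print(without_third)
```

One caveat: in VotingClassifier a zero-weighted classifier is still fitted, so a weight of 0 removes its influence on the prediction but not its training cost.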

Now we’re ready to work out which classifiers are needed. We’ll use GridSearchCV to try every on/off combination of the classifiers by setting each one's voting weight to either 0 or 1.

from sklearn.model_selection import GridSearchCV
 
 
def combinations_on_off(num_classifiers):
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]
 
 
param_grid = dict(
    voting__weights=combinations_on_off(len(classifiers))
)
 
train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
y = train_df[Y_COLUMN].copy()
X = pd.Series(train_df[TEXT_COLUMN])
 
grid_search = GridSearchCV(mixed_pipe, param_grid=param_grid, n_jobs=-1, verbose=10, scoring="neg_log_loss")
 
grid_search.fit(X, y)
 
cv_results = grid_search.cv_results_
 
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)
 
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))

Let’s run the grid search and see what it comes up with:

{'voting__weights': [0, 0, 1]} -0.60533660756
{'voting__weights': [0, 1, 0]} -0.474562462086
{'voting__weights': [0, 1, 1]} -0.508363479586
{'voting__weights': [1, 0, 0]} -0.697231760084
{'voting__weights': [1, 0, 1]} -0.456599644003
{'voting__weights': [1, 1, 0]} -0.409406571361
{'voting__weights': [1, 1, 1]} -0.439084397238
 
Best score: -0.409
Best parameters set:
voting__weights: [1, 1, 0]

We can see from the output that we’ve tried every on/off combination of the three classifiers. The scores are negative log loss, so values closer to zero are better. The best combination, [1, 1, 0], suggests that we should only include the ngram_pipe and unigram_log_pipe classifiers. tfidf_pipe should not be included: adding it worsens the score from -0.409 to -0.439.
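
One possible follow-up is to turn the winning weights back into a list of classifiers. This is a sketch: select_classifiers is a hypothetical helper, and the pipeline objects are replaced by placeholder strings so the snippet runs on its own.

```python
def select_classifiers(classifiers, weights):
    """Keep only the (name, estimator) pairs whose weight is non-zero."""
    return [(name, estimator)
            for (name, estimator), weight in zip(classifiers, weights)
            if weight != 0]


# Placeholder strings stand in for the real Pipeline objects.
classifiers = [
    ("ngram", "ngram_pipe"),
    ("unigram", "unigram_log_pipe"),
    ("tfidf", "tfidf_pipe"),
]

# Best weights reported by the grid search output above.
best_weights = [1, 1, 0]

selected = select_classifiers(classifiers, best_weights)
print([name for name, _ in selected])
```

With the real pipelines in place of the strings, the selected list could be passed to VotingClassifier(selected, voting="soft") and refitted, which would avoid training the excluded classifier at all.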

The code is on GitHub if you want to see it all in one place.

Published at DZone with permission of
