# scikit-learn: Using GridSearch to Tune the Hyperparameters of VotingClassifier

### When building a classification ensemble, you need to be sure that the right classifiers are being included and the wrong ones are being excluded. Here's how to do that.

In my last post, I showed how to create a multi-class classification ensemble using scikit-learn’s `VotingClassifier`, and finished by mentioning that I didn’t know which classifiers should be part of the ensemble.

Each classifier should improve the ensemble’s score; any that doesn’t can be excluded.

We have a TF/IDF-based classifier as well as the classifiers I wrote about in the last post. This is the code describing the classifiers:

``````
import pandas as pd
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

Y_COLUMN = "author"
TEXT_COLUMN = "text"

# Unigram counts fed into logistic regression
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])

# Unigram and bigram counts fed into multinomial naive Bayes
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])

# TF/IDF features (1- to 3-grams) fed into multinomial naive Bayes.
# The boolean options are written as True rather than 1, since recent
# scikit-learn releases validate parameter types.
tfidf_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=3, max_features=None,
                              strip_accents='unicode', analyzer='word',
                              token_pattern=r'\w{1,}', ngram_range=(1, 3),
                              use_idf=True, smooth_idf=True,
                              sublinear_tf=True, stop_words='english')),
    ('mnb', MultinomialNB())
])

# The order of this list determines the order of the voting weights
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
    ("tfidf", tfidf_pipe),
]

mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
``````
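With `voting="soft"`, the ensemble's prediction is based on an average of the class probabilities predicted by each pipeline, optionally weighted per classifier. The following is a minimal pure-Python sketch of that idea, not scikit-learn's implementation; the `soft_vote` helper and the probabilities are made up for illustration.

```python
def soft_vote(probabilities, weights):
    """Weighted average of per-classifier class probabilities.

    probabilities: one list of class probabilities per classifier.
    weights: one number per classifier, in the same order.
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    n_classes = len(probabilities[0])
    return [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(n_classes)
    ]


# Hypothetical probabilities for one sentence over three authors
probs = [
    [0.5, 0.25, 0.25],   # "ngram"
    [0.25, 0.5, 0.25],   # "unigram"
    [0.0, 0.0, 1.0],     # "tfidf"
]

print(soft_vote(probs, [1, 1, 1]))  # [0.25, 0.25, 0.5]
print(soft_vote(probs, [1, 1, 0]))  # [0.375, 0.375, 0.25]
```

A weight of 0 removes a classifier's influence on the averaged probabilities, which is what makes 0/1 weights usable as an on/off switch.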

Now we’re ready to work out which classifiers are needed. We’ll use `GridSearchCV` to do this.

``````
from sklearn.model_selection import GridSearchCV


def combinations_on_off(num_classifiers):
    # Every 0/1 weight vector except all zeros, e.g. [0, 0, 1] ... [1, 1, 1]
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]


# A weight of 0 switches a classifier off, a weight of 1 switches it on
param_grid = dict(
    voting__weights=combinations_on_off(len(classifiers))
)

train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
y = train_df[Y_COLUMN].copy()
X = pd.Series(train_df[TEXT_COLUMN])

grid_search = GridSearchCV(mixed_pipe, param_grid=param_grid, n_jobs=-1,
                           verbose=10, scoring="neg_log_loss")

grid_search.fit(X, y)

cv_results = grid_search.cv_results_

# Mean cross-validated score for each weight combination
for mean_score, params in zip(cv_results["mean_test_score"], cv_results["params"]):
    print(params, mean_score)

print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters set:")
best_parameters = grid_search.best_estimator_.get_params()
for param_name in sorted(param_grid.keys()):
    print("\t%s: %r" % (param_name, best_parameters[param_name]))
``````
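To make the grid concrete, here is a small standalone sketch that prints what `combinations_on_off` generates for three classifiers and which classifiers each weight vector switches on. The `names` list is an assumption that mirrors the order of the `classifiers` list above.

```python
def combinations_on_off(num_classifiers):
    # Binary representation of 1 .. 2^n - 1, zero-padded to n digits
    return [[int(x) for x in list("{0:0b}".format(i).zfill(num_classifiers))]
            for i in range(1, 2 ** num_classifiers)]


# Assumed to match the order of the `classifiers` list
names = ["ngram", "unigram", "tfidf"]

for weights in combinations_on_off(len(names)):
    included = [name for name, w in zip(names, weights) if w]
    print(weights, included)
```

For three classifiers this yields seven combinations; the all-zeros vector is skipped because the range starts at 1.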

Let’s run the grid search and see what it comes up with:

``````
{'voting__weights': [0, 0, 1]} -0.60533660756
{'voting__weights': [0, 1, 0]} -0.474562462086
{'voting__weights': [0, 1, 1]} -0.508363479586
{'voting__weights': [1, 0, 0]} -0.697231760084
{'voting__weights': [1, 0, 1]} -0.456599644003
{'voting__weights': [1, 1, 0]} -0.409406571361
{'voting__weights': [1, 1, 1]} -0.439084397238

Best score: -0.409
Best parameters set:
	voting__weights: [1, 1, 0]
``````

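We can see from the output that we’ve tried every on/off combination of the three classifiers. The weights follow the order of the `classifiers` list (`ngram`, `unigram`, `tfidf`), so the best result, `[1, 1, 0]`, suggests that we should only include the `ngram_pipe` and `unigram_log_pipe` classifiers. `tfidf_pipe` should not be included; the score drops from -0.409 to -0.439 when it is added (with `neg_log_loss`, values closer to zero are better).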
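If it helps to read the results as plain log loss, the scores can be negated and ranked. This is just a sketch over the numbers printed above; the `names` list is an assumption that mirrors the order of the `classifiers` list.

```python
# Assumed to match the order of the `classifiers` list
names = ["ngram", "unigram", "tfidf"]

# Mean neg_log_loss per weight vector, copied from the grid search output
scores = {
    (0, 0, 1): -0.60533660756,
    (0, 1, 0): -0.474562462086,
    (0, 1, 1): -0.508363479586,
    (1, 0, 0): -0.697231760084,
    (1, 0, 1): -0.456599644003,
    (1, 1, 0): -0.409406571361,
    (1, 1, 1): -0.439084397238,
}

# Higher neg_log_loss (closer to zero) is better, so sort descending
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for weights, score in ranked:
    included = [name for name, w in zip(names, weights) if w]
    print("log loss %.3f  %s" % (-score, " + ".join(included)))

best_weights = ranked[0][0]
best_names = [name for name, w in zip(names, best_weights) if w]
```

Ranked this way, the `ngram` + `unigram` pair comes first and the full three-classifier ensemble comes second.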

The code is on GitHub if you want to see it all in one place.


