# scikit-learn: Building a Multi-Class Classification Ensemble

### Learn how to use a classification algorithm and predictive analytics to predict which author wrote a given sentence.

For the Kaggle Spooky Author Identification competition, I wanted to combine multiple classifiers into an ensemble, and found that scikit-learn's VotingClassifier does exactly that.

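We need to predict the probability that a sentence is written by one of three authors, so the VotingClassifier needs to make a "soft" prediction, which averages the class probabilities predicted by each classifier. If we only needed to know the most likely author, we could have it make a "hard" prediction instead, which takes a majority vote over the predicted labels.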
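
To make the difference concrete, here is a minimal sketch in plain Python. The labels and probabilities are made up for illustration, and the two helper functions are hypothetical stand-ins for the idea, not scikit-learn's implementation.

```python
from collections import Counter

LABELS = ["author_a", "author_b", "author_c"]  # hypothetical labels

# Made-up per-class probabilities from three classifiers for one sentence.
probas = [
    [0.45, 0.40, 0.15],
    [0.45, 0.40, 0.15],
    [0.05, 0.90, 0.05],
]


def soft_vote(probas):
    """Average the class probabilities across classifiers."""
    n = len(probas)
    return [sum(col) / n for col in zip(*probas)]


def hard_vote(probas):
    """Each classifier votes for its top class; the most common vote wins."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in probas]
    return Counter(votes).most_common(1)[0][0]


soft = soft_vote(probas)
soft_label = LABELS[max(range(len(soft)), key=soft.__getitem__)]
hard_label = LABELS[hard_vote(probas)]

print([round(p, 3) for p in soft])  # [0.317, 0.567, 0.117]
print(soft_label, hard_label)       # author_b author_a
```

Two classifiers lean slightly towards the first author, so the hard vote picks it, but the third classifier is confident enough in the second author to win the soft vote. Only the soft vote yields a probability per author, which is what the competition asks for.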

We start with two classifiers, each built on different n-gram features: a Naive Bayes model over unigrams and bigrams, and a logistic regression over unigrams. The code for those is as follows:

```python
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Naive Bayes over unigram and bigram counts
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])

# Logistic regression over unigram counts
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
```

We can combine those classifiers together like this:

```python
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]

mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
```
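
Before running it on the real data, it can help to see what a soft-voting ensemble returns. The following is a rough, self-contained sketch on a tiny made-up corpus; the sentences and the labels "a", "b", and "c" are illustrative assumptions, not competition data. It checks that the ensemble's output matches the mean of the individual classifiers' probabilities.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny made-up corpus; sentences and labels are illustrative only.
texts = [
    "the raven croaked at midnight",
    "a raven sat upon the door",
    "the creature rose from the sea",
    "an ancient creature slept beneath the sea",
    "the monster fled across the ice",
    "my monster wandered over the ice",
]
authors = ["a", "a", "b", "b", "c", "c"]

ngram = Pipeline([
    ("cv", CountVectorizer(ngram_range=(1, 2))),
    ("mnb", MultinomialNB()),
])
unigram = Pipeline([
    ("cv", CountVectorizer()),
    ("logreg", LogisticRegression()),
])

ensemble = VotingClassifier(
    [("ngram", ngram), ("unigram", unigram)], voting="soft"
)
ensemble.fit(texts, authors)

sentence = ["the raven flew over the ice"]
proba = ensemble.predict_proba(sentence)

# Average the fitted sub-classifiers' probabilities by hand.
individual = [est.predict_proba(sentence) for est in ensemble.estimators_]
manual = np.mean(individual, axis=0)

print(ensemble.classes_)  # column order of the probabilities
print(proba.round(3))
print(np.allclose(proba, manual))
```

Each row of `predict_proba` holds one probability per class, in the order given by `classes_`, and sums to 1.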

Now, it’s time to test our ensemble. I got the code for the test function from Sohier Dane's tutorial.

```python
import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedKFold
from sklearn import metrics

Y_COLUMN = "author"
TEXT_COLUMN = "text"


def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    # No random_state here: it only takes effect with shuffle=True, and recent
    # scikit-learn versions raise an error if it is set without shuffling.
    rskf = StratifiedKFold(n_splits=5)
    losses = []
    accuracies = []
    for train_index, test_index in rskf.split(X, y):
        # split() yields positional indices, so select rows with .iloc
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        nlp_pipeline.fit(X_train, y_train)
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))

    print("kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)
    ))


train_df = pd.read_csv("train.csv", usecols=[Y_COLUMN, TEXT_COLUMN])
test_pipeline(train_df, mixed_pipe)
```

Let’s run the script:

``kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849``

Looks good: a mean log loss of 0.393 and a mean accuracy of 0.849 across the five folds.
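
For context on the log loss figure, here is a small worked sketch of multi-class log loss in plain Python: the mean negative log of the probability assigned to the true class. The predictions are made up, and the helper `multiclass_log_loss` is a hypothetical simplification; unlike scikit-learn's `metrics.log_loss`, it does no input validation and cannot handle a probability of exactly 0.

```python
import math


def multiclass_log_loss(true_idx, probas):
    """Mean negative log of the probability given to the true class."""
    total = sum(math.log(p[y]) for y, p in zip(true_idx, probas))
    return -total / len(true_idx)


# Made-up predictions for three sentences over three classes.
probas = [
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.3, 0.3, 0.4],
]
true_idx = [0, 1, 2]  # index of the correct class for each sentence

confident = multiclass_log_loss(true_idx, probas)
# Baseline: always predict 1/3 for every class.
uniform = multiclass_log_loss(true_idx, [[1 / 3] * 3] * 3)

print(round(confident, 3), round(uniform, 3))  # 0.499 1.099
```

A model that always predicts one third for each of the three authors scores ln 3, roughly 1.099, so the ensemble's mean log loss of 0.393 is well below that baseline.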

I’ve actually got several other classifiers as well, but I’m not sure which ones should be part of the ensemble. In a future post, we’ll look at how to use scikit-learn's grid search (GridSearchCV) to work that out.

Published at DZone with permission of Mark Needham, DZone MVB.