# scikit-learn: Building a Multi-Class Classification Ensemble

### Learn how to use a classification ensemble and predictive analytics to predict which author wrote a given sentence.

For the Kaggle Spooky Author Identification competition, I wanted to combine multiple classifiers into an ensemble, and found that scikit-learn's VotingClassifier does exactly that.

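We need to predict the probability that a sentence was written by each of three authors, so the VotingClassifier needs to make a "soft" prediction, which averages the class probabilities predicted by the individual classifiers. If we only needed to know the most likely author, we could have it make a "hard" prediction instead, which takes a majority vote over the labels each classifier predicts.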
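To make the difference between the two modes concrete, here is a minimal pure-Python sketch of what soft and hard voting compute for a single sentence. The author labels and the probabilities are made up for illustration, and the helper names `soft_vote` and `hard_vote` are hypothetical; this is not scikit-learn's implementation.

```python
from collections import Counter

AUTHORS = ["A", "B", "C"]  # hypothetical author labels


def soft_vote(probabilities_per_classifier):
    """Average each class's probability across all classifiers."""
    n = len(probabilities_per_classifier)
    return [sum(column) / n for column in zip(*probabilities_per_classifier)]


def hard_vote(probabilities_per_classifier):
    """Each classifier votes for its most likely class; the majority wins."""
    votes = [AUTHORS[probs.index(max(probs))] for probs in probabilities_per_classifier]
    return Counter(votes).most_common(1)[0][0]


# Made-up predictions from three classifiers for one sentence
predictions = [
    [0.9, 0.1, 0.0],  # very confident in A
    [0.4, 0.5, 0.1],  # slightly prefers B
    [0.4, 0.5, 0.1],  # slightly prefers B
]

averaged = soft_vote(predictions)
soft_winner = AUTHORS[averaged.index(max(averaged))]
hard_winner = hard_vote(predictions)

print(averaged)     # one averaged probability per author
print(soft_winner)  # A: the confident classifier outweighs the two lukewarm ones
print(hard_winner)  # B: two votes against one
```

Soft voting keeps the full probability distribution, which is what a competition scored on predicted probabilities needs. Hard voting only returns a label.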

We start with two classifiers, each of which generates different n-gram based features. The code for those is as follows:

```python
from sklearn import linear_model
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Unigram and bigram counts fed into a multinomial naive Bayes model
ngram_pipe = Pipeline([
    ('cv', CountVectorizer(ngram_range=(1, 2))),
    ('mnb', MultinomialNB())
])

# Unigram counts fed into a logistic regression model
unigram_log_pipe = Pipeline([
    ('cv', CountVectorizer()),
    ('logreg', linear_model.LogisticRegression())
])
```

We can combine those classifiers together like this:

```python
classifiers = [
    ("ngram", ngram_pipe),
    ("unigram", unigram_log_pipe),
]

mixed_pipe = Pipeline([
    ("voting", VotingClassifier(classifiers, voting="soft"))
])
```
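As a quick sanity check of how a soft-voting ensemble behaves, the following self-contained sketch fits the same kind of ensemble on a tiny corpus. The sentences and the labels `EAP`, `HPL`, and `MWS` are made up for illustration and are not taken from the competition data, so the probabilities themselves mean nothing; the point is the shape of the output.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy sentences and author labels, for illustration only
texts = [
    "the raven sat upon the bust",
    "the raven spoke one word",
    "the creature fled into the ice",
    "the creature begged for a companion",
    "the old ones slept beneath the sea",
    "the old ones dreamed in the deep",
]
authors = ["EAP", "EAP", "MWS", "MWS", "HPL", "HPL"]

ensemble = VotingClassifier(
    estimators=[
        ("ngram", Pipeline([
            ("cv", CountVectorizer(ngram_range=(1, 2))),
            ("mnb", MultinomialNB()),
        ])),
        ("unigram", Pipeline([
            ("cv", CountVectorizer()),
            ("logreg", LogisticRegression()),
        ])),
    ],
    voting="soft",
)
ensemble.fit(texts, authors)

# One row per input sentence, one column per class in ensemble.classes_
probabilities = ensemble.predict_proba(["the raven dreamed"])
print(list(ensemble.classes_))
print(probabilities.shape)
```

The columns of `predict_proba` follow the order of `ensemble.classes_`, and each row sums to 1 because it is an average of the individual classifiers' probability distributions.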

Now, it’s time to test our ensemble. I got the code for the test function from Sohier Dane's tutorial.

```python
import numpy as np
import pandas as pd

from sklearn import metrics
from sklearn.model_selection import StratifiedKFold

Y_COLUMN = "author"
TEXT_COLUMN = "text"


def test_pipeline(df, nlp_pipeline):
    y = df[Y_COLUMN].copy()
    X = pd.Series(df[TEXT_COLUMN])
    # Unshuffled folds are already deterministic; recent scikit-learn versions
    # reject random_state unless shuffle=True, so it is omitted here.
    rskf = StratifiedKFold(n_splits=5)
    losses = []
    accuracies = []
    for train_index, test_index in rskf.split(X, y):
        # split() yields positional indices, so select rows with .iloc
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        nlp_pipeline.fit(X_train, y_train)
        losses.append(metrics.log_loss(y_test, nlp_pipeline.predict_proba(X_test)))
        accuracies.append(metrics.accuracy_score(y_test, nlp_pipeline.predict(X_test)))

    print("kfolds log losses: {0}, mean log loss: {1}, mean accuracy: {2}".format(
        str([str(round(x, 3)) for x in sorted(losses)]),
        round(np.mean(losses), 3),
        round(np.mean(accuracies), 3)
    ))


# train_df is assumed to hold the competition's training data,
# loaded elsewhere (for example with pd.read_csv).
test_pipeline(train_df, mixed_pipe)
```

Let’s run the script:

``kfolds log losses: ['0.388', '0.391', '0.392', '0.397', '0.398'], mean log loss: 0.393, mean accuracy: 0.849``

Looks good.
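To get a feel for what the log loss figures mean, here is a minimal pure-Python sketch of multi-class log loss: the mean negative log of the probability assigned to the true class. The labels and probabilities are made up, and the function name `multiclass_log_loss` is hypothetical. Unlike `metrics.log_loss`, this sketch does not guard against probabilities of exactly 0.

```python
import math


def multiclass_log_loss(true_labels, predicted_probabilities, classes):
    """Mean negative log of the probability given to each true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probabilities):
        p_true = probs[classes.index(label)]
        total += -math.log(p_true)
    return total / len(true_labels)


classes = ["A", "B", "C"]   # hypothetical author labels
true_labels = ["A", "B"]    # made-up ground truth for two sentences
predicted = [
    [0.8, 0.1, 0.1],        # confident and correct: small penalty
    [0.3, 0.5, 0.2],        # correct but unsure: larger penalty
]

loss = multiclass_log_loss(true_labels, predicted, classes)
print(round(loss, 3))  # 0.458
```

Read this way, a mean log loss of 0.393 corresponds to giving the correct author a probability of roughly exp(-0.393) ≈ 0.67 as a geometric mean across the test sentences.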

I’ve actually got several other classifiers as well, but I’m not sure which ones should be part of the ensemble. In a future post, we’ll look at how to use grid search (scikit-learn's GridSearchCV) to work that out.

Topics:
ai, tutorial, classification, algorithm, scikit-learn, predictive analytics

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.