Over a million developers have joined DZone.

TensorFlow: Kaggle Spooky Authors Bag of Words Model

DZone's Guide to

TensorFlow: Kaggle Spooky Authors Bag of Words Model

I recently submitted to Kaggle's Spooky Author Identification competition based on a text classification tutorial. Here's how I built my model!

· AI Zone ·
Free Resource

Bias comes in a variety of forms, all of them potentially damaging to the efficacy of your ML algorithm. Read how Alegion's Chief Data Scientist discusses the source of most headlines about AI failures here.

I've been playing around with some TensorFlow tutorials recently and wanted to see if I could create a submission for Kaggle's Spooky Author Identification competition, which I've written about recently.

My model is based on one from the text classification tutorial. The tutorial shows how to create custom Estimators, which you can learn more about in a post on the Google Developers blog.


Let's get started. First, our imports:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

We've obviously got Tensorflow, but also scikit-learn, which we'll use to split our data into a training and test sets as well as to convert the author names into numeric values.

Model-Building Functions

Next, we'll create a function to create a bag of words model. This function calls another one that creates different EstimatorSpecs depending on the context it's called from.

WORDS_FEATURE = 'words'  # Name of the input words feature.

def bag_of_words_model(features, labels, mode):
    bow_column = tf.feature_column.categorical_column_with_identity(WORDS_FEATURE, num_buckets=n_words)
    bow_embedding_column = tf.feature_column.embedding_column(bow_column, dimension=EMBEDDING_SIZE)
    bow = tf.feature_column.input_layer(features, feature_columns=[bow_embedding_column])
    logits = tf.layers.dense(bow, MAX_LABEL, activation=None)
    return create_estimator_spec(logits=logits, labels=labels, mode=mode)

def create_estimator_spec(logits, labels, mode):
    predicted_classes = tf.argmax(logits, 1)
    if mode == tf.estimator.ModeKeys.PREDICT:
        return tf.estimator.EstimatorSpec(
                'class': predicted_classes,
                'prob': tf.nn.softmax(logits),
                'log_loss': tf.nn.softmax(logits),

    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
    if mode == tf.estimator.ModeKeys.TRAIN:
        optimizer = tf.train.AdamOptimizer(learning_rate=0.01)
        train_op = optimizer.minimize(loss, global_step=tf.train.get_global_step())
        return tf.estimator.EstimatorSpec(mode, loss=loss, train_op=train_op)

    eval_metric_ops = {
        'accuracy': tf.metrics.accuracy(labels=labels, predictions=predicted_classes)
    return tf.estimator.EstimatorSpec(mode=mode, loss=loss, eval_metric_ops=eval_metric_ops)

Loading Data

Now, we're ready to load our data.

The only interesting thing here is the LabelEncoder. We'll keep that around, as we'll use it later, as well.

Y_COLUMN = "author"
TEXT_COLUMN = "text"
le = preprocessing.LabelEncoder()

train_df = pd.read_csv("train.csv")
X = pd.Series(train_df[TEXT_COLUMN])
y = le.fit_transform(train_df[Y_COLUMN].copy())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

Transforming Documents

At the moment, our training and test DataFrames contain text, but Tensorflow works with vectors, so we need to convert our data into that format. We can use the VocabularyProcessor to do this:

vocab_processor = tf.contrib.learn.preprocessing.VocabularyProcessor(MAX_DOCUMENT_LENGTH)

X_transform_train = vocab_processor.fit_transform(X_train)
X_transform_test = vocab_processor.transform(X_test)

X_train = np.array(list(X_transform_train))
X_test = np.array(list(X_transform_test))

n_words = len(vocab_processor.vocabulary_)
print('Total words: %d' % n_words)

Training Our Model

Finally, we're ready to train our model! We'll call the bag of words model that we created at the beginning and build a train input function where we pass in the training arrays that we just created:

model_fn = bag_of_words_model
classifier = tf.estimator.Estimator(model_fn=model_fn)

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={WORDS_FEATURE: X_train},
classifier.train(input_fn=train_input_fn, steps=100)

Evaluating Our Model

Let's see how our model fares. We'll call the evaluate function with our test data:

INFO:tensorflow:Saving checkpoints for 1 into /var/folders/k5/ssmkw9vd2yb3h5wnqlxnqbkw0000gn/T/tmpb6v4rrrn/model.ckpt.
INFO:tensorflow:loss = 1.0888131, step = 1
INFO:tensorflow:Saving checkpoints for 100 into /var/folders/k5/ssmkw9vd2yb3h5wnqlxnqbkw0000gn/T/tmpb6v4rrrn/model.ckpt.
INFO:tensorflow:Loss for final step: 0.18394235.
INFO:tensorflow:Starting evaluation at 2018-01-28-22:41:34
INFO:tensorflow:Restoring parameters from /var/folders/k5/ssmkw9vd2yb3h5wnqlxnqbkw0000gn/T/tmpb6v4rrrn/model.ckpt-100
INFO:tensorflow:Finished evaluation at 2018-01-28-22:41:34
INFO:tensorflow:Saving dict for global step 100: accuracy = 0.8246673, global_step = 100, loss = 0.44942895
Accuracy: 0.824667, Loss 0.449429

Not too bad! I managed to get a log loss score of ~0.36 with a scikit-learn ensemble model, but it is better than some of my first attempts.

Generating Predictions

I wanted to see how it'd do against Kaggle's test dataset, so I generated a CSV file with predictions:

test_df = pd.read_csv("test.csv")

X_test = pd.Series(test_df[TEXT_COLUMN])
X_test = np.array(list(vocab_processor.transform(X_test)))

test_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={WORDS_FEATURE: X_test},

predictions = classifier.predict(test_input_fn)
y_predicted_classes = np.array(list(p['prob'] for p in predictions))

output = pd.DataFrame(y_predicted_classes, columns=le.classes_)
output["id"] = test_df["id"]
output.to_csv("output.csv", index=False, float_format='%.6f')

Here we go:

The score is roughly the same as we saw with the test split of the training set. If you want to see all the code in one place, I've put it on my Spooky Authors GitHub repository.

Your machine learning project needs enormous amounts of training data to get to a production-ready confidence level. Get a checklist approach to assembling the combination of technology, workforce and project management skills you’ll need to prepare your own training data.

ai ,tensorflow ,deep learning ,text classification ,tutorial ,predictive analytics

Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}