A Twitter Sentiment Analysis Pipeline for U.S. Airlines

Learn how to use a sentiment analysis pipeline to analyze and classify tweets from U.S. airlines and use data visualization to further understand the trends.

The goal of this work is to build a pipeline that classifies tweets about U.S. airlines and to show a possible dashboard for understanding customer satisfaction trends.

Pipeline Architecture

For this work, the pipeline is composed of:

  • Kafka to ingest tweets.
  • Python services to classify the tweets with a neural network.
  • Elasticsearch to store the results.
  • Kibana to create the dashboard.

[Figure: pipeline architecture diagram]

Kafka ingests the tweets into the pipeline and buffers them in a topic, so the pipeline can absorb a high volume of tweets. Each ingested tweet is pulled from the topic by one or more Python services, which classify it with the neural network trained in my last post. The training results were good (prediction accuracy is about 94%), so this network could be production-ready for classification tasks.

Once a tweet is classified, the Python service saves it to an Elasticsearch index. On top of this index, a Kibana dashboard lets the user monitor U.S. airline trends.

Save the Neural Network

To build this pipeline, it's necessary to do a little refactoring of the training routine written in the previous post because the neural network needs to be saved.

The refactoring looks like this:

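from keras.callbacks import ModelCheckpoint

# Checkpoint callback: write the model to disk whenever validation accuracy improves.
# Note: depending on the Keras version and the metric name passed to model.compile,
# the monitored key may be 'val_accuracy' instead of 'val_acc'.
filepath = "weights.best.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='val_acc', verbose=1,
                             save_best_only=True, mode='max')
callbacks_list = [checkpoint]

# model, partial_X_train, partial_Y_train, X_val, and Y_val come from the
# training routine written in the previous post.
batch_size = 512
history = model.fit(partial_X_train,
                    partial_Y_train,
                    epochs=10,
                    batch_size=batch_size,
                    callbacks=callbacks_list,
                    validation_data=(X_val, Y_val))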

To save the neural network, it's necessary to declare a ModelCheckpoint and pass it to model.fit through the callbacks argument. With save_best_only=True, the neural network is saved only when the monitored result is better than that of the last saved model. Because save_weights_only is left at its default, the file contains the full model and not just the weights. After this step, the best trained neural network is stored in an hdf5 file and can be loaded by the pipeline.
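The "save only if better" rule can be illustrated in plain Python. This is only a sketch of the behavior, not Keras's actual implementation; the `BestTracker` class, its `update` method, and the accuracy values are hypothetical and exist purely for illustration.

```python
class BestTracker:
    """Track the best monitored value seen so far (equivalent to mode='max')."""

    def __init__(self):
        self.best = float("-inf")
        self.saved_epochs = []  # epochs at which a save would be triggered

    def update(self, epoch, val_acc):
        """Return True (and record the epoch) only when val_acc strictly improves."""
        if val_acc > self.best:
            self.best = val_acc
            self.saved_epochs.append(epoch)
            return True
        return False


tracker = BestTracker()
# Made-up validation accuracies, one per epoch, for illustration only.
for epoch, val_acc in enumerate([0.80, 0.85, 0.83, 0.90, 0.90]):
    tracker.update(epoch, val_acc)

print(tracker.saved_epochs)
```

An epoch that only ties the best value does not trigger a new save, so the file on disk always corresponds to the first epoch that reached the best score.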

Load the Neural Network

The core of the pipeline is one or more Python services that load the neural network and classify the tweets. This service scales well because it is completely stateless: it takes one tweet from the Kafka topic, classifies it, and writes the result to Elasticsearch.
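The take-classify-write loop could be sketched as follows. This is a minimal sketch under stated assumptions, not the service from this post: `run_service` is a hypothetical name, the JSON message format with a `text` field is an assumption, and the Kafka consumer, the neural network, and the Elasticsearch client are replaced by plain Python stand-ins so the example runs on its own.

```python
import json


def run_service(messages, classify, store):
    """Classify each incoming tweet and hand the result to the store.

    messages: iterable of raw message values (JSON-encoded bytes, assumed format)
    classify: callable mapping tweet text -> label string
    store:    callable that persists one result document
    """
    processed = 0
    for raw in messages:
        tweet = json.loads(raw.decode("utf-8"))["text"]
        store({"tweet": tweet, "classification": classify(tweet)})
        processed += 1
    return processed


# Stand-ins so the sketch runs without Kafka, Keras, or Elasticsearch.
fake_topic = [json.dumps({"text": t}).encode("utf-8")
              for t in ["great flight", "lost my bag"]]
fake_index = []


def fake_classify(text):
    # Placeholder rule, NOT the neural network from the post.
    return "positive" if "great" in text else "negative"


count = run_service(fake_topic, fake_classify, fake_index.append)
print(count, fake_index)
```

Because the loop keeps no state between iterations, several copies can run side by side. In a real deployment, `messages` would be a Kafka consumer, `classify` would wrap the loaded model, and `store` would call an Elasticsearch client.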

From the code point of view, loading the neural network is very simple:

from keras.models import load_model

def load_nn(nn_hdf5_filename):
    model = load_model(nn_hdf5_filename)
    return model

def main():
    #load neural network model
    model = load_nn('weights.best.hdf5')

if __name__ == "__main__":
    main()

With load_model from keras.models, it is possible to load the hdf5 file saved during the training phase.

Elasticsearch Index

An important step is to define the Elasticsearch index. To do this properly, it's important to know the use case: the dashboard should show the ratio of positive to negative tweets, along with the latest tweets and how they were classified.

So, the fields defined are:

  • tweet: text field that stores the tweet and allows a possible full-text search.
  • classification: keyword field that holds the predicted label.
  • positive_classification_confidence: float field, reserved for future use.
  • negative_classification_confidence: float field, reserved for future use.
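A document with these four fields could be assembled from the network's output roughly like this. It is a sketch only: `build_document` is a hypothetical helper, and the assumption that the model returns a (negative, positive) probability pair in that order is mine, not something stated in the post.

```python
def build_document(tweet, probabilities):
    """Turn a two-class probability pair into an index document.

    probabilities: (negative, positive) scores; this ordering is an assumption.
    """
    negative, positive = (float(p) for p in probabilities)
    label = "positive" if positive >= negative else "negative"
    return {
        "tweet": tweet,
        "classification": label,
        "positive_classification_confidence": positive,
        "negative_classification_confidence": negative,
    }


# Made-up scores for illustration only.
doc = build_document("Thanks for the smooth flight!", [0.08, 0.92])
print(doc["classification"])
```

Storing both confidence values alongside the label keeps the door open for later dashboards that filter out low-confidence predictions.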

The full index definition is the following:

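#!/bin/bash
# Create the index with an explicit mapping.
# Elasticsearch 6.0 and later require an explicit Content-Type header.
# The "classified-tweet" mapping type targets Elasticsearch 6.x; later versions
# deprecate and then remove mapping types.
INDEX_NAME="tweet-classification"
curl -XPUT "http://localhost:9200/$INDEX_NAME/" -H 'Content-Type: application/json' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "classified-tweet": {
      "_meta": {
        "version": "1.0"
      },
      "properties": {
        "tweet": {
            "type": "text"
        },
        "classification": {
            "type": "keyword"
        },
        "negative_classification_confidence": {
            "type": "float"
        },
        "positive_classification_confidence": {
            "type": "float"
        }
      }
    }
  }
}'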

The Dashboard

The dashboard created is very simple:

[Figure: Kibana dashboard screenshot]

The total number of positive and negative tweets can be seen in a pie chart, and there is a data table with the latest tweets and how they are classified by the neural network.
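Kibana builds its own queries, but the counts behind the pie chart correspond to a terms aggregation on the classification field. The following is a sketch of such a request body and of how the positive ratio could be computed from the response; the aggregation name `by_class`, the helper `positive_ratio`, the label values, and the sample counts are all assumptions made for illustration.

```python
import json

# Request body for a terms aggregation; "by_class" is an arbitrary name.
query = {
    "size": 0,  # no hits needed, only the bucket counts
    "aggs": {"by_class": {"terms": {"field": "classification"}}},
}


def positive_ratio(response):
    """Compute the share of positive tweets from a terms-aggregation response."""
    buckets = response["aggregations"]["by_class"]["buckets"]
    counts = {b["key"]: b["doc_count"] for b in buckets}
    total = sum(counts.values())
    return counts.get("positive", 0) / total if total else 0.0


# Hand-written response with made-up counts, shaped like an Elasticsearch reply.
sample = {"aggregations": {"by_class": {"buckets": [
    {"key": "negative", "doc_count": 30},
    {"key": "positive", "doc_count": 10},
]}}}

print(json.dumps(query))
print(positive_ratio(sample))
```

Defining classification as a keyword field is what makes this aggregation possible, since terms aggregations work on exact values and not on analyzed text.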

Conclusion

This is a toy use case: it's only an experiment in building a data analysis pipeline with a neural network, and a chance to play with some cool stuff like Elasticsearch, Kibana, and Kafka.

The dashboard shows very little information (only each tweet's classification) because the dataset used for training the neural network is very small (only one month of tweets).

But if the dataset were bigger, some interesting analysis could be done from an airline's point of view. For example:

  • Know what the main causes of a bad flight are. In the dataset, this information is available but there isn't enough data to train a good network. 
  • Know what the most problematic airports are.

Please feel free to let me know what you think about this post and how this work could be better.

Topics:
sentiment analysis, neural network, elasticsearch, kibana, kafka, ai, tutorial, twitter

Published at DZone with permission of