Movie Review Sentiment Analysis
A movie review website allows users to submit reviews describing what they liked or disliked about a particular movie. Mining these reviews to generate valuable metadata that describes their content provides an opportunity to understand the general sentiment around that movie in a democratized way. That’s a pretty cool thing if you think about it. Using machine learning, we can democratize subjectivity about anything in the world. We can make an objective analysis of subjective content, which helps us better understand trends around products and services and make better decisions as consumers.
Sentiment Analysis Data Model
One of the major barriers to unlocking this ability is the way we structure and transform our data. The current state-of-the-art methods include approaches such as Naive Bayes, Support Vector Machines, and Maximum Entropy. The challenge with these approaches remains how features are extracted from a text and structured as data in a way that costs the least in terms of performance. I decided to focus on the performance problem: how features are selected and extracted, and how available that data remains as the number of features grows over time.
Using a feature selection algorithm I describe here, I used the graph database Neo4j to solve the challenge of data transformation and availability. While state-of-the-art natural language parsing algorithms focus on sentence structure, I decided to pursue a statistical approach to natural language grammar induction. My approach focuses on generalizations across a vast corpus of text, generating new features by using deep learning to predict which features are most likely to appear to the left or right of a new feature.
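To make the left-or-right idea concrete, here is a minimal sketch in Python. It is not Graphify's implementation: it swaps the learned model for plain frequency counting, and the function name `best_extensions` and the four-sentence corpus are made up for illustration. Given a seed phrase, it counts every one-word extension to the left or right and ranks the extensions by their share of all observed extensions.

```python
from collections import Counter


def best_extensions(corpus, phrase, top_n=2):
    """Rank one-word extensions of `phrase` (left or right) by relative frequency.

    Hypothetical helper: plain counting stands in for the learned model.
    """
    target = phrase.lower().split()
    n = len(target)
    counts = Counter()
    for text in corpus:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] != target:
                continue
            if i > 0:
                # Extend the phrase by the word on its left
                counts[" ".join([tokens[i - 1]] + target)] += 1
            if i + n < len(tokens):
                # Extend the phrase by the word on its right
                counts[" ".join(target + [tokens[i + n]])] += 1
    total = sum(counts.values())
    if total == 0:
        return []
    # Share of each extension among all extensions seen for this phrase
    return [(p, c / total) for p, c in counts.most_common(top_n)]


# Made-up corpus, for illustration only
corpus = [
    "this is one of the worst movies of the year",
    "easily one of the worst films i have seen",
    "one of the worst movies ever made",
    "one of the best movies of the decade",
]

extensions = best_extensions(corpus, "of the worst", top_n=1)
print(extensions)
```

In this toy corpus the strongest extension of "of the worst" is the left extension "one of the worst", which is how a child feature would come to descend from its parent phrase.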
Graph-based NLP Example
Let’s assume that the phrase “one of the worst” has been extracted as a feature of a set of texts. This phrase was extracted because the parent phrase it descends from determined it to be the most statistically relevant, meaning that it had the best chance of being matched after the parent phrase. Using Neo4j we can determine the line of inheritance that produced this phrase as a feature.
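As a rough illustration of what a line of inheritance looks like, the sketch below stores child-to-parent links in a Python dictionary and walks them back to the root. The phrases in `PARENT` are invented for this example and are not taken from the trained model; in the real system these relationships live in Neo4j and are followed with a graph traversal.

```python
# Hypothetical child -> parent links, invented for illustration.
# In Graphify these relationships are stored in Neo4j, not in a dict.
PARENT = {
    "the worst": "worst",
    "of the worst": "the worst",
    "one of the worst": "of the worst",
    "one of the worst movies": "one of the worst",
}


def lineage(phrase, parent=PARENT):
    """Return the chain of phrases from the root down to `phrase`."""
    path = [phrase]
    # Follow parent links until we reach a phrase with no recorded parent
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return list(reversed(path))


path = lineage("one of the worst")
print(" -> ".join(path))
```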
The hierarchy reveals more possibilities as you move deeper from “one of the worst”. Expanding the path seen in the image above to include all possible features that descend from the phrase “one of the worst” reveals the following:
Thanks largely to Neo4j’s graph traversals, the result of this algorithm is that any natural language text can be parsed with sub-second performance to generate a subgraph that can be used with whichever classification algorithm makes the most sense for your dataset.
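One way to picture the hand-off to a classifier is shown below. This is a simplified sketch, not Graphify's API: the feature list and the names `matched_features` and `to_vector` are assumptions made for the example. It finds which known phrases occur in a piece of text and turns the matches into a binary vector that a classifier such as Naive Bayes could consume.

```python
import re

# Hypothetical feature phrases; a trained model would supply these
FEATURES = [
    "worst",
    "the worst",
    "one of the worst",
    "one of the best",
    "waste of time",
]


def matched_features(text, features=FEATURES):
    """Return the feature phrases that occur in `text` as whole-word matches."""
    # Lowercase and drop punctuation so phrases match on word boundaries
    normalized = " ".join(re.findall(r"[a-z0-9']+", text.lower()))
    padded = f" {normalized} "
    return [f for f in features if f" {f} " in padded]


def to_vector(text, features=FEATURES):
    """Encode `text` as a binary vector with one slot per known feature."""
    hits = set(matched_features(text, features))
    return [1 if f in hits else 0 for f in features]


review = "Honestly, one of the worst films I have ever seen."
print(matched_features(review))
print(to_vector(review))
```

Note that the matched phrases nest inside one another ("worst", "the worst", "one of the worst"), which mirrors the idea of a matched path through the feature hierarchy rather than a flat bag of words.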
Open Source Demo
For the movie review example I took 500 movie reviews for both negative and positive labels and trained a natural language parsing model in Neo4j using Graphify. In the next blog post in this series I will introduce you to a demo that can perform better at classifying movie reviews than a human, where the human classification error is 0.3, or 70% success.
If you’re looking to get your feet wet before then, take a look at the finished demo here: Graphify Sentiment Analysis for Movie Reviews
Notes
Neo4j is an open source graph database.
Graphify is an open source extension to Neo4j that adds classification-based algorithms for natural language processing.