
Applying Natural Language Processing to Decode a Movie


Learn how to apply natural language processing to understand the relationships between the characters of the Indian classic movie Sholay.


The objective of this article is to apply natural language processing to understand the relationships between the characters of the Indian classic movie Sholay. The text is taken from the Wiki page for Sholay, and only the first three paragraphs are used so that the context stays easy to follow. The focus of the article is on method rather than accuracy, so a small dataset is used. Readers who are interested in the accuracy of various methods in the NLP space can look at this article.


In recent years, several methods for representing words in a vector space have emerged and proved effective at capturing syntactic (grammatical structure) and semantic (meaning) details. Word embedding is one such technique that helps in NLP-related tasks.

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost performance in NLP tasks.

As per Wikipedia:

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word vectors turn out to capture relationships between words remarkably well. This is clear from the following image.

Word Vectors

This is a useful feature of word vectors: differences between vectors can express a relationship like the following, which holds approximately:

vector(King) - vector(Queen) = vector(Man) - vector(Woman)
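The relationship above can be checked numerically. The following is a minimal Python sketch, not the author's code. The three-dimensional vectors are invented purely for illustration (real embeddings are learned from a corpus), and the helper names `subtract` and `cosine` are hypothetical.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real word vectors are learned from text and have many more dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.15, 0.8],
}

def subtract(a, b):
    """Element-wise difference of two vectors."""
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# If the analogy holds, the two offsets point in nearly the same direction.
royal_offset = subtract(vectors["king"], vectors["queen"])
gender_offset = subtract(vectors["man"], vectors["woman"])
similarity = cosine(royal_offset, gender_offset)
print(round(similarity, 3))
```

A cosine similarity close to 1 between the two offsets is what "approximately equal" means in practice. With learned vectors the match is rarely exact.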

One notable and successful method is Word2Vec, developed by Mikolov et al. at Google.

There is another notable method called GloVe (Global Vectors), developed at Stanford by Jeffrey Pennington, Richard Socher, and Christopher D. Manning, which appears to perform better in some contexts. 

Both of the above models follow the approach: A word is known by the company it keeps.

Note: There are other approaches, as well, which have been used widely and have contributed to the evolution of NLP, that are not discussed here.

Applicability of Word Vector Representation

Word vector representation is quite useful and helps in a number of machine learning tasks. Some of them are listed below.

  • Topic modeling: A technique to extract abstract topics from a collection of documents. (Learn more here.)
  • Document similarity: Information on document similarity can be retrieved from word vector representations. 
  • Vectorization: An important step in ML pipelines and text mining. 
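As an illustration of the document similarity item above, one common and simple approach is to average the word vectors in each document and compare the averages with cosine similarity. The sketch below assumes that approach. The two-dimensional vectors and the example sentences are made up, and `document_vector` is a hypothetical helper, not part of any library.

```python
import math

# Hypothetical 2-dimensional word vectors, invented for illustration.
word_vectors = {
    "jai":     [0.9, 0.1],
    "veeru":   [0.8, 0.2],
    "village": [0.1, 0.9],
    "ramgarh": [0.2, 0.8],
}

def document_vector(tokens, vectors):
    """Average the vectors of the tokens that appear in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("document contains no known words")
    dims = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dims)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

doc_a = "jai and veeru".split()
doc_b = "veeru meets jai".split()
doc_c = "the village of ramgarh".split()

sim_ab = cosine(document_vector(doc_a, word_vectors), document_vector(doc_b, word_vectors))
sim_ac = cosine(document_vector(doc_a, word_vectors), document_vector(doc_c, word_vectors))
print(sim_ab > sim_ac)
```

Averaging ignores word order, so it is only a rough baseline, but it shows how word-level vectors can feed a document-level task.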

Methodology Used for the Analysis

In this instance, the GloVe method is used to analyze the first three paragraphs of the Wikipedia page. The vectors produced by the GloVe algorithm are then reduced with principal component analysis and plotted on a two-dimensional graph, labeled with the words and phrases that appear in the Wiki text.

Technical Environment

The code is written in R in an RStudio environment.

Word embeddings are learned using the text2vec and tm packages. Visualization is done with ggplot2 in a two-dimensional space.


The vocabulary is trimmed to remove the words whose count is less than three. This is to keep the visualization uncluttered.

The window of words is kept at 5.

The size of the vectors is kept at 20.
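The pruning and window settings above feed the co-occurrence counts that GloVe is trained on. The following Python sketch illustrates the idea only. It is not the text2vec implementation. It assumes pruned words are dropped before the window is applied, uses raw counts with no distance weighting, and treats the window as symmetric. The function names and the token list are hypothetical.

```python
from collections import Counter, defaultdict

def prune_vocabulary(tokens, min_count=3):
    """Keep only the words that occur at least `min_count` times."""
    counts = Counter(tokens)
    return {word for word, count in counts.items() if count >= min_count}

def cooccurrence_counts(tokens, vocab, window=5):
    """Count how often two kept words fall within `window` positions of each other."""
    kept = [t for t in tokens if t in vocab]  # assumption: prune before windowing
    counts = defaultdict(float)
    for i, word in enumerate(kept):
        # Look only to the right, then record the pair in both directions.
        for j in range(i + 1, min(i + window + 1, len(kept))):
            other = kept[j]
            counts[(word, other)] += 1.0
            counts[(other, word)] += 1.0
    return dict(counts)

# Made-up token stream; "thakur" occurs once and is pruned.
tokens = "jai veeru jai veeru jai veeru thakur".split()
vocab = prune_vocabulary(tokens, min_count=3)
counts = cooccurrence_counts(tokens, vocab, window=5)
print(sorted(vocab))
print(counts[("jai", "veeru")])
```

On a corpus of only three paragraphs, a threshold of three removes most of the vocabulary, which is what keeps the final plot readable.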

The model is trained by calling glove = GlobalVectors$new, available in the text2vec package.

Word Vectors Snapshot

The following is an example of the vectors learned by the above technique for some of the words appearing in the corpus.

Word vectors learned by GloVe

Note: Not all vectors are shown, for lack of space.

Relationship Between Phrases

After the vectors are learned, principal component analysis is applied and various words are drawn on a two-dimensional plot with two principal components, as it is difficult to visualize more than two dimensions. The resulting relationships are described by the following image.
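The projection step described above can be sketched in plain Python. This is an illustrative stand-in for a library PCA routine, not the author's R code. It finds the two leading principal components by power iteration with deflation. The three-dimensional input vectors are invented (the article's vectors have 20 dimensions), and all function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def center(rows):
    """Subtract the per-dimension mean from every row."""
    dims = len(rows[0])
    means = [sum(r[i] for r in rows) / len(rows) for i in range(dims)]
    return [[r[i] - means[i] for i in range(dims)] for r in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(dims)]
            for i in range(dims)]

def top_component(matrix, iterations=500):
    """Leading eigenvector and eigenvalue of a symmetric matrix by power iteration."""
    vec = [1.0 / (i + 1) for i in range(len(matrix))]  # arbitrary non-zero start
    norm = math.sqrt(dot(vec, vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(nxt, nxt))
        if norm == 0.0:
            break
        vec = [x / norm for x in nxt]
    eigenvalue = dot(vec, [dot(row, vec) for row in matrix])
    return vec, eigenvalue

def deflate(matrix, vec, eigenvalue):
    """Remove the found component so the next power iteration finds the second one."""
    dims = len(matrix)
    return [[matrix[i][j] - eigenvalue * vec[i] * vec[j] for j in range(dims)]
            for i in range(dims)]

def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    pc1, ev1 = top_component(cov)
    pc2, _ = top_component(deflate(cov, pc1, ev1))
    return [[dot(r, pc1), dot(r, pc2)] for r in centered]

# Invented 3-dimensional vectors standing in for the learned 20-dimensional ones.
words = ["jai", "veeru", "basanti", "radha"]
toy_vectors = [[1.0, 2.0, 0.5], [1.1, 2.1, 0.4], [3.0, 0.5, 1.5], [2.9, 0.4, 1.7]]
coords = dict(zip(words, pca_2d(toy_vectors)))
for word, (x, y) in coords.items():
    print(word, round(x, 2), round(y, 2))
```

Each word ends up with an (x, y) pair that can be passed to any plotting library. Words whose original vectors are close tend to stay close in the two-dimensional plot, though some distance information is always lost in the projection.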


Some interesting observations emerge:

1. Jai and Veeru appear together.

2. They appear far from the name of the movie Sholay. This could be due to an insufficient dataset.

3. Basanti, Sholay, and Ramgarh appear together.

4. The distances between Veeru/Basanti and Jai/Radha are comparable.


We see that even on a small dataset, GloVe vectors capture some of the relationships reasonably well. With vector representations, words become numerical vectors that can serve as building blocks for more complex machine learning tasks.

The source code and an explanation of the above analysis are available here.
