Python NLTK/Neo4j: Analysing the Transcripts of How I Met Your Mother
After reading Emil's blog post about dark data a few weeks ago I became intrigued about trying to find some structure in free text data, and I thought How I Met Your Mother's transcripts would be a good place to start.
I found a website which has the transcripts for all the episodes. After manually downloading the two pages which list all the episodes, I wrote a script to grab each of the transcripts so I could use them on my machine.
I wanted to learn a bit of Python and my colleague Nigel pointed me towards the requests and BeautifulSoup libraries to help me with my task. The script to grab the transcripts looks like this:
import requests
from bs4 import BeautifulSoup
from soupselect import select

episodes = {}
for i in range(1, 3):
    page = open("data/transcripts/page-" + str(i) + ".html", 'r')
    soup = BeautifulSoup(page.read())

    for row in select(soup, "td.topic-titles a"):
        parts = row.text.split(" - ")
        episodes[parts[0]] = {"title": parts[1], "link": row.get("href")}

for key, value in episodes.iteritems():
    parts = key.split("x")
    season = int(parts[0])
    episode = int(parts[1])
    filename = "data/transcripts/S%d-Ep%d" % (season, episode)
    print filename

    with open(filename, 'wb') as handle:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
        response = requests.get("transcript website" + value["link"], headers=headers)
        if response.ok:
            for block in response.iter_content(1024):
                if not block:
                    break
                handle.write(block)
{Editor's Note: Upon request from the owner of the transcript website, we have removed specific references to that site.}
The files containing the lists of episodes are named ‘page-1’ and ‘page-2’.
The code is reasonably simple – we find all the links inside the table, put them in a dictionary and then iterate through the dictionary and download the files to disk. The code to save the file is a bit of a monstrosity but there didn’t seem to be a ‘save’ method that I could use.
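As an aside, since each transcript is fairly small, the chunked iter_content loop could probably be replaced with a single write of response.content, which requests exposes as the full body in bytes. A sketch of what the inner block of the loop above might look like, assuming each transcript comfortably fits in memory:

with open(filename, 'wb') as handle:
    # same User-Agent header as above, abbreviated here
    headers = {'User-Agent': 'Mozilla/5.0'}
    response = requests.get("transcript website" + value["link"], headers=headers)
    if response.ok:
        # response.content is the whole response body as bytes, so one write is enough
        handle.write(response.content)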
Having downloaded the files, I thought through all sorts of clever things I could do, including generating a bag of words model for each episode or performing sentiment analysis on each sentence, which I'd learnt about from a Kaggle tutorial.
In the end I decided to start simple: extract all the words from the transcripts and count how many times each word occurred in a given episode.
I ended up with the following script which created a dictionary of (episode -> words + occurrences):
import csv
import nltk
import re

from bs4 import BeautifulSoup
from soupselect import select
from nltk.corpus import stopwords
from collections import Counter
from nltk.tokenize import word_tokenize

def count_words(words):
    tally = Counter()
    for elem in words:
        tally[elem] += 1
    return tally

episodes_dict = {}
with open('data/import/episodes.csv', 'r') as episodes:
    reader = csv.reader(episodes, delimiter=',')
    reader.next()
    for row in reader:
        print row
        transcript = open("data/transcripts/S%s-Ep%s" % (row[3], row[1])).read()
        soup = BeautifulSoup(transcript)
        rows = select(soup, "table.tablebg tr td.post-body div.postbody")

        raw_text = rows[0]
        [ad.extract() for ad in select(raw_text, "div.ads-topic")]
        [ad.extract() for ad in select(raw_text, "div.t-foot-links")]

        text = re.sub("[^a-zA-Z]", " ", raw_text.text.strip())
        words = [w for w in nltk.word_tokenize(text) if not w.lower() in stopwords.words("english")]

        episodes_dict[row[0]] = count_words(words)
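As an aside, Counter will do that tally in a single call, so the count_words helper could most likely be dropped. A tiny sketch (not from the original script):

from collections import Counter

words = ["legendary", "suit", "legendary"]
# Counter accepts any iterable and returns a dict-like tally of occurrences
print Counter(words)
# Counter({'legendary': 2, 'suit': 1})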
Next I wanted to explore the data a bit to see which words occurred across episodes or which word occurred most frequently and realised that this would be a much easier task if I stored the data somewhere.
s/somewhere/in Neo4j
Neo4j’s query language, Cypher, has a really nice ETL-esque tool called ‘LOAD CSV’ for loading in CSV files (as the name suggests!) so I added some code to save my words to disk:
with open("data/import/words.csv", "w") as words: writer = csv.writer(words, delimiter=",") writer.writerow(["EpisodeId", "Word", "Occurrences"]) for episode_id, words in episodes_dict.iteritems(): for word in words: writer.writerow([episode_id, word, words[word]])
This is what the CSV file contents look like:
$ head -n 10 data/import/words.csv
EpisodeId,Word,Occurrences
165,secondly,1
165,focus,1
165,baby,1
165,spiders,1
165,go,4
165,apartment,1
165,buddy,1
165,Exactly,1
165,young,1
Now we need to write some Cypher to get the data into Neo4j:
// words
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (word:Word {value: row.Word})
// episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MERGE (episode:Episode {id: TOINT(row.EpisodeId)})
// words to episodes
LOAD CSV WITH HEADERS FROM "file:/Users/markneedham/projects/neo4j-himym/data/import/words.csv" AS row
MATCH (word:Word {value: row.Word})
MATCH (episode:Episode {id: TOINT(row.EpisodeId)})
MERGE (word)-[:USED_IN_EPISODE {times: TOINT(row.Occurrences)}]->(episode);
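One thing worth doing before running those MERGE statements (not covered in the original post, but a common tip for LOAD CSV imports of this era) is to create unique constraints so the MERGE lookups hit an index rather than scanning every node with that label. A sketch, assuming Neo4j 2.x-style constraint syntax:

// create indexes/constraints so the MERGE statements above don't do label scans
CREATE CONSTRAINT ON (word:Word) ASSERT word.value IS UNIQUE;
CREATE CONSTRAINT ON (episode:Episode) ASSERT episode.id IS UNIQUE;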
Having done that, we can write some simple queries to explore the words used in How I Met Your Mother:
MATCH (word:Word)-[r:USED_IN_EPISODE]->(episode)
RETURN word.value, COUNT(episode) AS episodes, SUM(r.times) AS occurrences
ORDER BY occurrences DESC
LIMIT 10

==> +-------------------------------------+
==> | word.value | episodes | occurrences |
==> +-------------------------------------+
==> | "Ted"      | 207      | 11437       |
==> | "Barney"   | 208      | 8052        |
==> | "Marshall" | 208      | 7236        |
==> | "Robin"    | 205      | 6626        |
==> | "Lily"     | 207      | 6330        |
==> | "m"        | 208      | 4777        |
==> | "re"       | 208      | 4097        |
==> | "know"     | 208      | 3489        |
==> | "Oh"       | 197      | 3448        |
==> | "like"     | 208      | 2498        |
==> +-------------------------------------+
==> 10 rows
The main 5 characters occupy the top 5 positions, which is probably what you'd expect. The ‘m’ and ‘re’ in the next two positions look odd at first, but they're most likely the remnants of contractions (I'm, you're): the regex in the counting script replaces apostrophes with spaces before tokenising, and ‘m’ and ‘re’ survive the stop word filter.
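A quick way to check that theory is to run a sample sentence through the same cleaning step (a throwaway sketch, not part of the original scripts):

import re
import nltk

# the script's regex replaces apostrophes with spaces, so contractions fall apart
text = re.sub("[^a-zA-Z]", " ", "I'm sure you're going to love it")
print nltk.word_tokenize(text)
# ['I', 'm', 'sure', 'you', 're', 'going', 'to', 'love', 'it']

‘I’ and ‘you’ are then removed by the stop word filter (at least with the stop word list in use at the time), leaving the orphaned ‘m’ and ‘re’ to rack up thousands of occurrences.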
Our next query might focus on checking which character is referred to the most in each episode:
WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters MATCH (word:Word) WHERE word.value IN mainCharacters MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word) WITH episode, word, r ORDER BY episode.id, r.times DESC WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord RETURN episode.id, topWord.word AS word, topWord.times AS occurrences LIMIT 10 ==> +---------------------------------------+ ==> | episode.id | word | occurrences | ==> +---------------------------------------+ ==> | 72 | "Barney" | 75 | ==> | 143 | "Ted" | 16 | ==> | 43 | "Lily" | 74 | ==> | 156 | "Ted" | 12 | ==> | 206 | "Barney" | 23 | ==> | 50 | "Marshall" | 51 | ==> | 113 | "Ted" | 76 | ==> | 178 | "Barney" | 21 | ==> | 182 | "Barney" | 22 | ==> | 67 | "Ted" | 84 | ==> +---------------------------------------+ ==> 10 rows
If we dig into it further, there's actually quite a bit of variety in the number of times the top character in each episode is mentioned, which again probably says something about the data:
WITH ["Ted", "Barney", "Robin", "Lily", "Marshall"] as mainCharacters MATCH (word:Word) WHERE word.value IN mainCharacters MATCH (episode:Episode)<-[r:USED_IN_EPISODE]-(word) WITH episode, word, r ORDER BY episode.id, r.times DESC WITH episode, COLLECT({word: word.value, times: r.times})[0] AS topWord RETURN MIN(topWord.times), MAX(topWord.times), AVG(topWord.times), STDEV(topWord.times) ==> +-------------------------------------------------------------------------------------+ ==> | MIN(topWord.times) | MAX(topWord.times) | AVG(topWord.times) | STDEV(topWord.times) | ==> +-------------------------------------------------------------------------------------+ ==> | 3 | 259 | 63.90865384615385 | 42.36255207691068 | ==> +-------------------------------------------------------------------------------------+ ==> 1 row
Obviously this is a very simple way of deriving structure from text; here are some of the things I want to try out next:
- Detecting common phrases/memes used in the show (e.g. the yellow umbrella) – this should be possible by creating different length n-grams and then searching for those phrases across the corpus (see the sketch after this list).
- Pull out scenes – some of the transcripts use the keyword ‘scene’ to mark these, although some of them don't. Depending on how many transcripts contain scene demarcations, perhaps we could train a classifier to detect where scene boundaries should fall in the transcripts that lack them.
- Analyse who talks to each other or who talks about each other most frequently
- Create a graph of conversations as my colleagues Max and Michael have previously blogged about.
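As a starting point for the phrase detection idea, NLTK's ngrams helper generates candidate phrases directly from a token stream. A rough sketch (the sample line is made up purely for illustration):

from collections import Counter
from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# hypothetical snippet of dialogue, just to show the shape of the approach
text = "under the yellow umbrella she waited while the yellow umbrella dripped"
tokens = word_tokenize(text)

# tally every 2-gram and 3-gram; the same idea scales to a whole episode's tokens
phrases = Counter()
for n in (2, 3):
    for gram in ngrams(tokens, n):
        phrases[" ".join(gram)] += 1

print phrases.most_common(3)
# a phrase like 'yellow umbrella' should float towards the top of a list like this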
Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.