Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Python: Detecting the Speaker in HIMYM Using Parts of Speech (POS) Tagging

DZone's Guide to

Python: Detecting the Speaker in HIMYM Using Parts of Speech (POS) Tagging

· Web Dev Zone
Free Resource

Learn how to build modern digital experience apps with Crafter CMS. Download this eBook now. Brought to you in partnership with Crafter Software

Over the last couple of weeks I’ve been experimenting with differentclassifiers to detect speakers in HIMYM transcripts and in all my attempts so far the only features I’ve used have been words.

This led to classifiers that were overfitted to the training data so I wanted to generalise them by introducing parts of speech of the words in sentences which are more generic.

First I changed the function which generates the features for each word to also contain the parts of speech of the previous and next words as well as the word itself:

def pos_features(sentence, sentence_pos, i):
    features = {}
 
    features["word"] = sentence[i]
    features["word-pos"] = sentence_pos[i][1]
 
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-word-pos"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-word-pos"] = sentence_pos[i-1][1]
 
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
        features["next-word-pos"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
        features["next-word-pos"] = sentence_pos[i+1][1]
 
    return features

Next we need to tweak our calling code to calculate the parts of speech tags for each sentence and pass it in:

featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    sentence_pos = nltk.pos_tag(untagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag) )

I’m using nltk to do this and although it’s slower than some alternatives, the data set is small enough that it’s not an issue.

Now it’s time to train a Decision Tree with the new features. I created three variants – one with both words and POS; one with only words; one with only POS.

I took a deep copy of the training/test data sets and then removed the appropriate keys:

def get_rid_of(entry, *keys):
    for key in keys:
        del entry[key]
 
import copy
 
# Word based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
c.classify(tmp_test_data)
 
# POS based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
c.classify(tmp_test_data)

The full code is on my github but these were the results I saw:

$ time python scripts/detect_speaker.py
Classifier              speaker precision    speaker recall    non-speaker precision    non-speaker recall
--------------------  -------------------  ----------------  -----------------------  --------------------
Decision Tree All In             0.911765          0.939394                 0.997602              0.996407
Decision Tree Words              0.911765          0.939394                 0.997602              0.996407
Decision Tree POS                0.90099           0.919192                 0.996804              0.996008

There’s still not much in it – the POS one has slightly more false positives and false positives when classifying speakers but on other runs it performed better.

If we take a look at the decision tree that’s been built for the POS one we can see that it’s all about POS now as you’d expect:

>>> print(c.pseudocode(depth=2))
if next-word-pos == '

I like that it’s identified the ‘:‘ pattern:

if next-word-pos == ':':
  ...
  if prev-word-pos == '<START>': return True

Next I need to drill into the types of sentence structures that it’s failing on and work out some features that can handle those. I still need to see how well a random forest of decision trees would as well.

Crafter is a modern CMS platform for building modern websites and content-rich digital experiences. Download this eBook now. Brought to you in partnership with Crafter Software.

Topics:

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}