
Python: Detecting the Speaker in HIMYM Using Parts of Speech (POS) Tagging



Over the last couple of weeks I’ve been experimenting with different classifiers to detect speakers in HIMYM transcripts, and in all my attempts so far the only features I’ve used have been words.

This led to classifiers that were overfitted to the training data, so I wanted to generalise them by introducing the parts of speech of the words in each sentence, which are more generic features.

First I changed the function which generates the features for each word to also contain the parts of speech of the previous and next words as well as the word itself:

def pos_features(sentence, sentence_pos, i):
    features = {}
 
    features["word"] = sentence[i]
    features["word-pos"] = sentence_pos[i][1]
 
    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-word-pos"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-word-pos"] = sentence_pos[i-1][1]
 
    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
        features["next-word-pos"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
        features["next-word-pos"] = sentence_pos[i+1][1]
 
    return features
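As a quick sanity check, here’s what the function returns for the first word of a short transcript-style line. The function is repeated so the snippet runs on its own, and the POS tags are written out by hand rather than coming from a tagger:

```python
def pos_features(sentence, sentence_pos, i):
    features = {}

    features["word"] = sentence[i]
    features["word-pos"] = sentence_pos[i][1]

    if i == 0:
        features["prev-word"] = "<START>"
        features["prev-word-pos"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
        features["prev-word-pos"] = sentence_pos[i-1][1]

    if i == len(sentence) - 1:
        features["next-word"] = "<END>"
        features["next-word-pos"] = "<END>"
    else:
        features["next-word"] = sentence[i+1]
        features["next-word-pos"] = sentence_pos[i+1][1]

    return features

# "Ted" is a speaker token here; tags are supplied manually
sentence = ["Ted", ":", "Hi", "guys"]
sentence_pos = [("Ted", "NNP"), (":", ":"), ("Hi", "UH"), ("guys", "NNS")]

features = pos_features(sentence, sentence_pos, 0)
print(features)
# {'word': 'Ted', 'word-pos': 'NNP', 'prev-word': '<START>',
#  'prev-word-pos': '<START>', 'next-word': ':', 'next-word-pos': ':'}
```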

Next we need to tweak our calling code to calculate the parts of speech tags for each sentence and pass it in:

featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    sentence_pos = nltk.pos_tag(untagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        featuresets.append((pos_features(untagged_sent, sentence_pos, i), tag) )
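The snippets further down refer to train_data and test_data; the split itself isn’t shown in the post, but a shuffled split along these lines would produce them (the 80/20 ratio is an assumption, and a tiny stand-in featuresets is used so the snippet runs on its own):

```python
import random

# stand-in for the list of (features_dict, is_speaker) pairs built above
featuresets = [({"word": w}, w == "Ted")
               for w in ["Ted", ":", "Hi", "guys", "Marshall"]] * 20

random.seed(42)  # make the split repeatable
random.shuffle(featuresets)

cutoff = int(len(featuresets) * 0.8)
train_data, test_data = featuresets[:cutoff], featuresets[cutoff:]

print(len(train_data), len(test_data))
# 80 20
```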

I’m using nltk to do this, and although it’s slower than some alternatives, the data set is small enough that it’s not an issue.

Now it’s time to train a Decision Tree with the new features. I created three variants – one with both words and POS; one with only words; one with only POS.

I took a deep copy of the training/test data sets and then removed the appropriate keys:

def get_rid_of(entry, *keys):
    for key in keys:
        del entry[key]
 
import copy
 
# Word based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word-pos', 'word-pos', 'next-word-pos')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
# classify takes a single feature dict, so classify each test entry
predictions = [c.classify(features) for features, tag in tmp_test_data]
 
# POS based classifier
tmp_train_data = copy.deepcopy(train_data)
for entry, tag in tmp_train_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
tmp_test_data = copy.deepcopy(test_data)
for entry, tag in tmp_test_data:
    get_rid_of(entry, 'prev-word', 'word', 'next-word')
 
c = nltk.DecisionTreeClassifier.train(tmp_train_data)
# classify takes a single feature dict, so classify each test entry
predictions = [c.classify(features) for features, tag in tmp_test_data]

The full code is on my GitHub, but these were the results I saw:

$ time python scripts/detect_speaker.py
Classifier              speaker precision    speaker recall    non-speaker precision    non-speaker recall
--------------------  -------------------  ----------------  -----------------------  --------------------
Decision Tree All In             0.911765          0.939394                 0.997602              0.996407
Decision Tree Words              0.911765          0.939394                 0.997602              0.996407
Decision Tree POS                0.90099           0.919192                 0.996804              0.996008
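For reference, the speaker precision/recall columns in that table boil down to the usual definitions over the classifier’s predictions. A minimal, self-contained version of the calculation, using made-up predictions rather than the real ones:

```python
def precision_recall(predictions, actuals, label=True):
    # precision: of everything predicted as `label`, how much was right;
    # recall: of everything actually `label`, how much we found
    tp = sum(1 for p, a in zip(predictions, actuals) if p == label and a == label)
    fp = sum(1 for p, a in zip(predictions, actuals) if p == label and a != label)
    fn = sum(1 for p, a in zip(predictions, actuals) if p != label and a == label)
    return tp / (tp + fp), tp / (tp + fn)

# toy data: True = speaker token, False = non-speaker token
actuals     = [True, True, True, False, False, False, False, False]
predictions = [True, True, False, True, False, False, False, False]

precision, recall = precision_recall(predictions, actuals)
print(precision, recall)
# 0.6666666666666666 0.6666666666666666
```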

There’s still not much in it – the POS-only classifier has slightly more false positives and false negatives when classifying speakers, but on other runs it performed better.

If we take a look at the decision tree that’s been built for the POS-only classifier, we can see that it’s all about POS now, as you’d expect:

>>> print(c.pseudocode(depth=2))
if next-word-pos == '

I like that it’s identified the ':' pattern:

if next-word-pos == ':':
  ...
  if prev-word-pos == '<START>': return True

Next I need to drill into the types of sentence structures that it’s failing on and work out some features that can handle those. I still need to see how well a random forest of decision trees would perform as well.
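For that random forest experiment, one route would be scikit-learn’s RandomForestClassifier, vectorising the feature dicts first with DictVectorizer. This is a sketch only, with toy stand-in data; the real featuresets would slot in where the toy ones are:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer

# toy stand-ins for the (features, label) pairs built earlier
train = [
    ({"word": "Ted", "next-word-pos": ":"}, True),
    ({"word": "Hi", "next-word-pos": "NNS"}, False),
    ({"word": "Marshall", "next-word-pos": ":"}, True),
    ({"word": "guys", "next-word-pos": "<END>"}, False),
]

# turn the feature dicts into the numeric matrix sklearn expects
vec = DictVectorizer()
X = vec.fit_transform([features for features, label in train])
y = [label for features, label in train]

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# classify an unseen word that is followed by ':'
sample = vec.transform([{"word": "Lily", "next-word-pos": ":"}])
pred = forest.predict(sample)
print(pred)
```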


Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
