
scikit-learn: Creating a Matrix of Named Entity Counts

Here's how I tried to improve my score in Kaggle's Spooky Author Identification competition by building a model on named entities extracted with the polyglot NLP library.


I've been trying to improve my score on Kaggle's Spooky Author Identification competition, and my latest idea was building a model that used named entities extracted using the polyglot NLP library.

We'll start by learning how to extract entities from a sentence using polyglot, which isn't too tricky:

>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]

This text contains three entities. We'd like each entity to be a single string rather than an array of words, so let's refactor the code to do that:

>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']

That's it for the polyglot part of the solution. Now let's work out how to integrate that with scikit-learn.

I've been using scikit-learn's Pipeline abstraction for the other models I've created, so I'd like to take the same approach here. Here's an example of a model that builds a matrix of unigram counts and trains a Naive Bayes classifier on top of it:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nlp_pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])

...
# Train and Test the model
...
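For completeness, the training and testing step might look something like this (a minimal sketch, assuming the competition's train.csv with its text and author columns, and an 80/20 holdout split):

import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Kaggle's train.csv contains "id", "text", and "author" columns
train_df = pd.read_csv("train.csv")
X_train, X_test, y_train, y_test = train_test_split(
    train_df["text"], train_df["author"], test_size=0.2, random_state=42)

nlp_pipeline.fit(X_train, y_train)
probabilities = nlp_pipeline.predict_proba(X_test)
print(log_loss(y_test, probabilities, labels=nlp_pipeline.classes_))

The competition is scored on multi-class log loss, so that's the metric worth tracking locally.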

I was going to write a class similar to CountVectorizer, but after reading its code for a couple of hours, I realized that I could just pass in a custom analyzer instead. This is what I ended up with:

from polyglot.text import Text

# Cache extracted entities: entity extraction is slow, and the analyzer
# is called for the same document on every pass over the data.
entities = {}


def analyze(doc):
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]


nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(analyzer=analyze)),
    ('mnb', MultinomialNB())
])

I'm caching the results in a dictionary because the entity extraction is quite time-consuming and there's no point recalculating it each time the function is called.
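To see the matrix the vectorizer produces, we can call fit_transform directly on a couple of documents (a quick sketch; the exact entities depend on what polyglot extracts):

docs = ["My name is David Beckham. Hello from London, England",
        "David Beckham now lives in London"]

cv = CountVectorizer(analyzer=analyze)
matrix = cv.fit_transform(docs)

# On older scikit-learn versions, use cv.get_feature_names() instead
print(cv.get_feature_names_out())
print(matrix.toarray())

Each row of the matrix represents a document and each column counts one named entity, which is exactly the representation MultinomialNB consumes.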

Unfortunately, this model didn't help me improve my best score. It scores a log loss of around 0.5 — a bit worse than the 0.45 I've achieved using the unigram model.
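One way to estimate that log loss locally is cross-validation (a sketch, reusing train_df from the earlier snippet):

from sklearn.model_selection import cross_val_score

# scikit-learn maximizes scores, so log loss is exposed as "neg_log_loss"
scores = cross_val_score(nlp_pipeline, train_df["text"], train_df["author"],
                         cv=5, scoring="neg_log_loss")
print(-scores.mean())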

