scikit-learn: Creating a Matrix of Named Entity Counts

Here's how I tried to win Kaggle's Spooky Author Identification competition by building a model using named entities and using the polyglot NLP library.

I've been trying to improve my score on Kaggle's Spooky Author Identification competition, and my latest idea was building a model that used named entities extracted using the polyglot NLP library.

We'll start by learning how to extract entities from a sentence using polyglot, which isn't too tricky:

>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]

This text contains three entities. We'd like each entity to be a single string rather than an array of values, so let's refactor the code to do that:

>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']

That's it for the polyglot part of the solution. Now let's work out how to integrate that with scikit-learn.

I've been using scikit-learn's Pipeline abstraction for the other models I've created, so I'd like to take the same approach here. This is an example of a model that builds a matrix of unigram counts and trains a Naive Bayes classifier on top of it:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nlp_pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])

...
# Train and Test the model
...
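
The elided training and testing step might look something like this. Note that `train_df` and its "text" and "author" columns are my assumptions about how the competition CSV gets loaded, not something from the pipeline above:

import pandas as pd
from sklearn.model_selection import train_test_split

# A hedged sketch of the elided step; train_df with "text" and "author"
# columns is an assumption about how the Kaggle training file is loaded.
train_df = pd.read_csv("train.csv")

X_train, X_test, y_train, y_test = train_test_split(
    train_df["text"], train_df["author"], test_size=0.2, random_state=42)

nlp_pipeline.fit(X_train, y_train)                  # build vocabulary, fit Naive Bayes
probabilities = nlp_pipeline.predict_proba(X_test)  # one probability per author per document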

I was going to write a class similar to CountVectorizer, but after reading its code for a couple of hours, I realized that I could just pass in a custom analyzer instead. This is what I ended up with:

from polyglot.text import Text

# Cache of document -> extracted entities; entity extraction is slow,
# so we only want to run it once per document.
entities = {}


def analyze(doc):
    # Join multi-word entities with underscores so that each entity
    # becomes a single token in the matrix.
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]

nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(analyzer=analyze)),
    ('mnb', MultinomialNB())
])

I'm caching the results in a dictionary because the entity extraction is quite time-consuming and there's no point recalculating it each time the function is called.
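
As a quick sanity check, we can fit the pipeline on a couple of toy documents and inspect the vocabulary the CountVectorizer ends up with. The documents and labels below are made up purely for illustration:

# Made-up documents and labels, just to exercise the custom analyzer.
docs = ["My name is David Beckham. Hello from London, England",
        "Edgar Allan Poe lived in Baltimore, Maryland"]
labels = ["A", "B"]

nlp_pipeline.fit(docs, labels)
print(sorted(nlp_pipeline.named_steps["cv"].vocabulary_))
# something like: ['Baltimore', 'David_Beckham', 'Edgar_Allan_Poe', 'England', 'London', 'Maryland']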

Unfortunately, this model didn't help me improve my best score. It scores a log loss of around 0.5 — a bit worse than the 0.45 I've achieved using the unigram model.
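
For reference, Kaggle scores this competition with multi-class log loss, which can be estimated locally along these lines, reusing the hypothetical holdout split from earlier:

# A hedged sketch; X_test/y_test are the hypothetical holdout split above.
from sklearn.metrics import log_loss

probabilities = nlp_pipeline.predict_proba(X_test)
print(log_loss(y_test, probabilities))  # around 0.5 for the entity model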
