scikit-learn: Creating a Matrix of Named Entity Counts

Here's how I tried to win Kaggle's Spooky Author Identification competition by building a model using named entities and using the polyglot NLP library.

I've been trying to improve my score on Kaggle's Spooky Author Identification competition, and my latest idea was building a model that used named entities extracted using the polyglot NLP library.

We'll start by learning how to extract entities from a piece of text using polyglot, which isn't too tricky:

>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]

This text contains three entities. We'd like each entity to be a single string rather than a list of tokens, so let's join the tokens with an underscore:

>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']
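The joining step can be tried without polyglot installed. This is a minimal sketch that uses plain lists as stand-ins for the polyglot chunks above; the helper name `entities_to_strings` is made up for illustration. Joining the tokens keeps a multi-word entity such as David Beckham as one feature instead of two unrelated ones.

```python
def entities_to_strings(entities):
    """Join each entity's tokens with an underscore so it becomes one feature string."""
    return ["_".join(tokens) for tokens in entities]


# Plain lists standing in for the polyglot chunks (an assumption for illustration).
sample_entities = [["David", "Beckham"], ["London"], ["England"]]

entity_strings = entities_to_strings(sample_entities)
print(entity_strings)
```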

That's it for the polyglot part of the solution. Now let's work out how to integrate that with scikit-learn.

I've been using scikit-learn's Pipeline abstraction for the other models I've created, so I'd like to take the same approach here. This is an example of a pipeline that creates a matrix of unigram counts and trains a Naive Bayes model on top of it:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

nlp_pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])

# Train and Test the model
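The training and testing step is not shown in the original, so here is one way it might look. This is only a sketch: the sentences and the `author_a` / `author_b` labels are invented toy data, not the competition data, and the variable names are my own. It fits the pipeline, predicts class probabilities, and scores them with scikit-learn's `log_loss`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import log_loss
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy sentences and made-up author labels, for illustration only.
train_texts = [
    "the raven sat upon the chamber door",
    "the monster fled across the ice",
    "a raven croaked at the midnight door",
    "the creature wandered over the frozen ice",
]
train_labels = ["author_a", "author_b", "author_a", "author_b"]
test_texts = ["the raven tapped at the door", "the monster crossed the ice"]
test_labels = ["author_a", "author_b"]

nlp_pipeline = Pipeline([
    ("cv", CountVectorizer()),
    ("mnb", MultinomialNB()),
])

# Fit the vectorizer and the classifier in one go.
nlp_pipeline.fit(train_texts, train_labels)

# One row per test document, one column per class.
probabilities = nlp_pipeline.predict_proba(test_texts)

# Pass the class order explicitly so the columns line up with the labels.
classes = nlp_pipeline.named_steps["mnb"].classes_
score = log_loss(test_labels, probabilities, labels=classes)
print(score)
```

On real data you would hold out part of the training set or use cross-validation rather than a hand-written test list.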

I was going to write a class similar to CountVectorizer, but after reading its code for a couple of hours, I realized that I could just pass in a custom analyzer instead. This is what I ended up with:

from polyglot.text import Text

# Cache of document text -> extracted entity strings
entities = {}

def analyze(doc):
    # Entity extraction is slow, so only run polyglot once per document
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]

nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(analyzer=analyze)),
    ('mnb', MultinomialNB())
])

I'm caching the results in a dictionary because the entity extraction is quite time-consuming and there's no point recalculating it each time the function is called.
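To see the matrix that a custom analyzer produces, here is a small sketch that runs without polyglot. The `extract_entities` function and the `KNOWN_ENTITIES` tuple are hypothetical stand-ins for the real entity extraction, and `functools.lru_cache` is used in place of the hand-rolled dictionary as one alternative way to memoize the call.

```python
from functools import lru_cache

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical lookup table standing in for polyglot's entity recognizer.
KNOWN_ENTITIES = ("David Beckham", "London", "England")


@lru_cache(maxsize=None)
def extract_entities(doc):
    """Return one underscore-joined string per occurrence of a known entity."""
    found = []
    for name in KNOWN_ENTITIES:
        found.extend([name.replace(" ", "_")] * doc.count(name))
    # A tuple keeps the cached value immutable.
    return tuple(found)


docs = [
    "My name is David Beckham. Hello from London, England",
    "London calling. London is the capital of England",
]

# With a callable analyzer, CountVectorizer skips its own tokenization
# and counts whatever strings the callable returns.
vectorizer = CountVectorizer(analyzer=extract_entities)
matrix = vectorizer.fit_transform(docs)

feature_names = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(feature_names)      # ['David_Beckham', 'England', 'London']
print(matrix.toarray())   # [[1 1 1]
                          #  [0 1 2]]
```

Each row is a document and each column is an entity, so the second row records that London appears twice and David Beckham not at all.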

Unfortunately, this model didn't help me improve my best score. It scored a log loss of around 0.5, a bit worse than the 0.45 I achieved using the unigram model (lower is better).
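For readers unfamiliar with the metric, multi-class log loss is the average of the negative log of the probability assigned to the correct class. The sketch below computes it by hand; the function name, the clipping constant, and the toy predictions are my own choices for illustration. It also shows that a log loss of 0.5 corresponds to giving the correct author a probability of roughly 0.61 on (geometric) average, against roughly 0.64 for a log loss of 0.45.

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Average negative log probability assigned to the true class."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a probability of exactly 0 or 1 does not blow up the log.
        p = min(max(probs[label], eps), 1 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Toy predictions over two hypothetical authors.
labels = ["author_a", "author_b"]
predictions = [
    {"author_a": 0.7, "author_b": 0.3},
    {"author_a": 0.4, "author_b": 0.6},
]
print(multiclass_log_loss(labels, predictions))

# Geometric-mean probability of the correct class implied by each score.
entity_model_prob = math.exp(-0.5)
unigram_model_prob = math.exp(-0.45)
print(round(entity_model_prob, 2), round(unigram_model_prob, 2))
```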

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.
