scikit-learn: Creating a Matrix of Named Entity Counts
Here's how I tried to win Kaggle's Spooky Author Identification competition by building a model on named entities extracted with the polyglot NLP library.
I've been trying to improve my score on Kaggle's Spooky Author Identification competition, and my latest idea was to build a model that used named entities extracted with the polyglot NLP library.
We'll start by learning how to extract entities from a sentence using polyglot, which isn't too tricky:
>>> from polyglot.text import Text
>>> doc = "My name is David Beckham. Hello from London, England"
>>> Text(doc, hint_language_code="en").entities
[I-PER(['David', 'Beckham']), I-LOC(['London']), I-LOC(['England'])]
This text contains three entities. We'd like each entity to be a single string rather than a list of words, so let's refactor the code to do that:
>>> ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
['David_Beckham', 'London', 'England']
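One setup caveat: polyglot's entity extraction depends on per-language models that don't ship with the package, so if the snippet above complains about missing resources, they need downloading first. A quick sketch using polyglot's downloader (the resource names are taken from polyglot's documentation, not from the original post):

# One-off setup: fetch the English embeddings and NER model
# that polyglot's entity extraction relies on
from polyglot.downloader import downloader
downloader.download("embeddings2.en")
downloader.download("ner2.en")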
That's it for the polyglot part of the solution. Now let's work out how to integrate that with scikit-learn.
I've been using scikit-learn's pipeline abstraction for the other models I've created, so I'd like to take the same approach here. This is an example of a model that builds a matrix of unigram counts and trains a Naive Bayes classifier on top of it:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer()),
    ('mnb', MultinomialNB())
])
...
# Train and Test the model
...
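The train and test step is elided above. A minimal sketch of what it might look like, assuming the competition's train.csv with its "text" and "author" columns (the file layout is my assumption, not code from the original post):

import pandas as pd
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

# Assumed layout: Kaggle's train.csv with "text" and "author" columns
train_df = pd.read_csv("train.csv")
X_train, X_test, y_train, y_test = train_test_split(
    train_df["text"], train_df["author"], test_size=0.2, random_state=42)

# The competition is scored on multi-class log loss, so we evaluate
# predicted probabilities rather than hard class predictions
nlp_pipeline.fit(X_train, y_train)
probabilities = nlp_pipeline.predict_proba(X_test)
print(log_loss(y_test, probabilities))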
I was going to write a class similar to CountVectorizer, but after reading its code for a couple of hours, I realized that I could just pass in a custom analyzer instead. This is what I ended up with:
entities = {}

def analyze(doc):
    if doc not in entities:
        entities[doc] = ["_".join(entity) for entity in Text(doc, hint_language_code="en").entities]
    return entities[doc]
nlp_pipeline = Pipeline([
    ('cv', CountVectorizer(analyzer=analyze)),
    ('mnb', MultinomialNB())
])
I'm caching the results in a dictionary because the entity extraction is quite time-consuming and there's no point recalculating it each time the function is called.
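To check that this really does produce a matrix of named entity counts, the vectorizer can be fitted on its own. A small sketch with made-up documents, reusing the analyze function and CountVectorizer import from above (the output comments are illustrative, not taken from the post):

# Fit the vectorizer alone to inspect the entity-count matrix
docs = [
    "My name is David Beckham. Hello from London, England",
    "David Beckham was born in London",
]
cv = CountVectorizer(analyzer=analyze)
matrix = cv.fit_transform(docs)
print(cv.get_feature_names_out())  # e.g. ['David_Beckham' 'England' 'London'];
                                   # use get_feature_names() on scikit-learn < 1.0
print(matrix.toarray())            # one row of entity counts per document

Each column corresponds to one entity string, so the result has the same shape as the unigram count matrix, just with entities in place of words.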
Unfortunately, this model didn't help me improve my best score. It scores a log loss of around 0.5, a bit worse than the 0.45 I've achieved using the unigram model.