
Scikit-Learn: First Steps With log_loss

This article takes a first look at scikit-learn's log_loss function, the metric used to evaluate multi-class predictions in Kaggle machine learning competitions.


Over the last week, I’ve spent a little bit of time playing around with the data in the Kaggle TalkingData Mobile User Demographics competition and came across a notebook written by dune_dweller showing how to run a logistic regression algorithm on the dataset.

The metric used to evaluate the output in this competition is multi-class logarithmic loss, which is implemented by the log_loss function in the scikit-learn library.

I’ve not used it before, so I created a small example to get to grips with it.

Let’s say we have 3 rows to predict, and we happen to know that they should be labelled ‘bam’, ‘ham’, and ‘spam’ respectively:

>>> actual_labels = ["bam", "ham", "spam"]

To work out the log loss score, we need to make a prediction for what we think each label actually is. We do this by passing an array containing a probability between 0 and 1 for each label.

E.g. if we think the first label is definitely ‘bam’, we’d pass [1, 0, 0], whereas if we thought it had a 50-50 chance of being ‘bam’ or ‘spam’, we might pass [0.5, 0, 0.5]. As far as I can tell, the labels get sorted into alphabetical order, so we need to provide our probabilities in that same order (bam, ham, spam).
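We can double-check that ordering: log_loss binarizes the labels internally, and scikit-learn's LabelBinarizer shows us the column order it uses (a quick sketch; the exact array repr may vary by version):

>>> from sklearn.preprocessing import LabelBinarizer
>>> lb = LabelBinarizer().fit(actual_labels)
>>> lb.classes_
array(['bam', 'ham', 'spam'], dtype='<U4')
>>> lb.transform(actual_labels)
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])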

Let’s give it a try. First, we’ll import the function:

>>> from sklearn.metrics import log_loss

Now let’s see what score we get if we make a perfect prediction:

>>> log_loss(actual_labels, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
# effectively 0 - scikit-learn clips probabilities slightly away from 0 and 1,
# so a perfect prediction comes back as a tiny positive number rather than 0.0

What about if we make a completely wrong prediction?

>>> log_loss(actual_labels, [[0, 0, 1], [1, 0, 0], [0, 1, 0]])
# a large but finite penalty - ≈ 20.723 in the version used here;
# we reverse engineer this value below

We can reverse engineer this score to work out the probability that we’ve predicted the correct class.

If we look at the case where the average log loss exceeds 1, it is when log(p_ij) < -1 – in other words, when the probability assigned to the correct class is below e^-1 ≈ 0.368.
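A quick sanity check of that threshold:

>>> import math
>>> math.exp(-1)
0.36787944117144233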

This is the formula for log loss:

\[ \text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log(p_{ij}) \]

in which y_ij is 1 for the correct class and 0 for the other classes, and p_ij is the probability assigned to that class.

The interesting thing about this formula is that we only care about the correct class: the y_ij value of 0 cancels out the terms for all the wrong classes.
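To make this concrete, here’s a hand-rolled version of the formula (my own sketch, not a scikit-learn function – it hard-codes our three classes in alphabetical order):

>>> def hand_rolled_log_loss(actual, predicted, classes=("bam", "ham", "spam")):
...     # y_ij is 1 only for the true class, so only that p_ij survives the sum
...     per_row = [-math.log(probs[classes.index(label)])
...                for label, probs in zip(actual, predicted)]
...     return sum(per_row) / len(per_row)
...
>>> hand_rolled_log_loss(actual_labels, [[1, 0, 0], [0, 1, 0], [0, 0, 1]])
0.0

Unlike scikit-learn’s log_loss, this naive version would blow up on a probability of exactly 0 – the problem the epsilon trick below works around.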

In our two examples so far, we actually already know the probability estimate for the correct class – 100% in the first case and 0% in the second – but we can plug in the numbers to check that we end up with the same result.

First, we need to work out what value would have been passed to the log function, which is easy in this case: the p_ij for the correct class is 1 when every prediction is exactly right, and (effectively) 0 when every prediction is completely wrong.

# every prediction exactly right
>>> math.log(1)
0.0

>>> math.exp(0)
1.0

# every prediction completely wrong
>>> math.log(0.000000001)
-20.723265836946414

>>> math.exp(-20.72326583694641)
# ≈ 1e-09, recovering the tiny probability we fed in

I used a really small value instead of 0 in the second example because math.log(0) raises a ValueError – the function heads towards negative infinity as its argument approaches 0.
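scikit-learn avoids the same problem internally by clipping the predicted probabilities away from exactly 0 and 1 before taking the log, which is why the completely wrong prediction earlier scored large but finite. A rough sketch of the idea – the epsilon here is the 0.000000001 from above, purely for illustration; scikit-learn’s actual epsilon differs and has changed between versions:

>>> def clipped_log(p, eps=0.000000001):
...     # clamp p into [eps, 1 - eps] so log never sees exactly 0 or 1
...     return math.log(min(max(p, eps), 1 - eps))
...
>>> clipped_log(0)
-20.723265836946414
>>> clipped_log(1)
# a tiny negative number (roughly -1e-09) rather than exactly 0.0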

Let’s try another example where we have less certainty:

>>> print(log_loss(actual_labels, [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7]]))
# ≈ 0.363548039673

We’ll have to do a bit more work to figure out what value was being passed to the log function this time, but not too much. This is roughly the calculation being performed:

# 0.363548039673 = -1/3 * (log(0.8) + log(0.6) + log(0.7))

In this case, on average our probability estimate would be:

# we put in the negative value since we multiplied by -1/N
>>> math.exp(-0.363548039673)
# ≈ 0.6952

We assigned probabilities of 80%, 60%, and 70% to the correct classes for our 3 labels, so an overall probability of 69.5% seems about right.
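Put differently, math.exp(-log loss) is the geometric mean of the probabilities assigned to the correct classes – which is why 0.8, 0.6, and 0.7 combine to roughly 0.695 rather than their arithmetic mean of 0.7:

>>> (0.8 * 0.6 * 0.7) ** (1.0 / 3)
# ≈ 0.6952, the same value as math.exp(-0.363548039673)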

One more example. This time, we’ll add a fourth row with a very certain (90%) prediction for ‘spam’:

>>> print(log_loss(["bam", "ham", "spam", "spam"], [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7], [0.05, 0.05, 0.9]]))
# ≈ 0.299001158669

>>> math.exp(-0.299001158669)
# ≈ 0.7416

That’s an overall probability estimate of 74%, which sounds about right!
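As a final aside (my own extra example, not part of the original walkthrough): log loss punishes confident wrong predictions heavily. If that last row had actually been ‘ham’ rather than ‘spam’, the same 90% prediction more than triples the score:

>>> print(log_loss(["bam", "ham", "spam", "ham"], [[0.8, 0.1, 0.1], [0.3, 0.6, 0.1], [0.15, 0.15, 0.7], [0.05, 0.05, 0.9]]))
# ≈ 1.0216 - and math.exp(-1.0216) ≈ 0.36, so one confident miss drags the
# overall probability estimate from 74% down to about 36%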



Published at DZone with permission of Mark Needham, DZone MVB.
