Over a million developers have joined DZone.

Python: Creating a Skewed Random Discrete Distribution

· Web Dev Zone

Start coding today to experience the powerful engine that drives data application’s development, brought to you in partnership with Qlik.

I’m planning to write a variant of the TF/IDF algorithm over the HIMYM corpus which weights in favour of term that appear in a medium number of documents and as a prerequisite needed a function that when given a number of documents would return a weighting.

It should return a higher value when a term appears in a medium number of documents i.e. if I pass in 10 I should get back a higher value than 200 as a term that appears in 10 episodes is likely to be more interesting than one which appears in almost every episode.

I went through a few different scipy distributions but none of them did exactly what I want so I ended up writing a sampling function which chooses random numbers in a range with different probabilities:

import math
import numpy asf np
values = range(1, 209)
probs = [1.0 / 208] * 208
for idx, prob in enumerate(probs):
    if idx > 3 and idx < 20:
        probs[idx] = probs[idx] * (1 + math.log(idx + 1))
    if idx > 20 and idx < 40:
        probs[idx] = probs[idx] * (1 + math.log((40 - idx) + 1))
probs = [p / sum(probs) for p in probs]
sample =  np.random.choice(values, 1000, p=probs)
>>> print sample[:10]
[ 33   9  22 126  54   4  20  17  45  56]

Now let’s visualise the distribution of this sample by plotting a histogram:

import matplotlib
import matplotlib.pyplot as plt
binwidth = 2
plt.hist(sample, bins=np.arange(min(sample), max(sample) + binwidth, binwidth))
plt.xlim([0, max(sample)])

2015 03 30 23 25 05

It’s a bit hacky but it does seem to work in terms of weighting the values correctly. It would be nice if it I could smooth it off a bit better but I’m not sure how at the moment.

One final thing we can do is get the count of any one of the values by using the bincount function:

>>> print np.bincount(sample)[1]
>>> print np.bincount(sample)[10]
>>> print np.bincount(sample)[206]

Now I need to plug this into the rest of my code and see if it actually works!

Create data driven applications in Qlik’s free and easy to use coding environment, brought to you in partnership with Qlik.

python ,bigdata ,tips and tricks

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

The best of DZone straight to your inbox.

Please provide a valid email address.

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}