Over a million developers have joined DZone.
{{announcement.body}}
{{announcement.title}}

Python: Creating a Skewed Random Discrete Distribution

DZone's Guide to

Python: Creating a Skewed Random Discrete Distribution

· Web Dev Zone
Free Resource

Try RAD Studio for FREE!  It’s the fastest way to develop cross-platform Native Apps with flexible Cloud services and broad IoT connectivity. Start Your Trial Today!

I’m planning to write a variant of the TF/IDF algorithm over the HIMYM corpus which weights in favour of term that appear in a medium number of documents and as a prerequisite needed a function that when given a number of documents would return a weighting.

It should return a higher value when a term appears in a medium number of documents i.e. if I pass in 10 I should get back a higher value than 200 as a term that appears in 10 episodes is likely to be more interesting than one which appears in almost every episode.

I went through a few different scipy distributions but none of them did exactly what I want so I ended up writing a sampling function which chooses random numbers in a range with different probabilities:

import math
import numpy asf np
 
values = range(1, 209)
probs = [1.0 / 208] * 208
 
for idx, prob in enumerate(probs):
    if idx > 3 and idx < 20:
        probs[idx] = probs[idx] * (1 + math.log(idx + 1))
    if idx > 20 and idx < 40:
        probs[idx] = probs[idx] * (1 + math.log((40 - idx) + 1))
 
probs = [p / sum(probs) for p in probs]
sample =  np.random.choice(values, 1000, p=probs)
 
>>> print sample[:10]
[ 33   9  22 126  54   4  20  17  45  56]

Now let’s visualise the distribution of this sample by plotting a histogram:

import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
 
binwidth = 2
plt.hist(sample, bins=np.arange(min(sample), max(sample) + binwidth, binwidth))
plt.xlim([0, max(sample)])
plt.show()

2015 03 30 23 25 05

It’s a bit hacky but it does seem to work in terms of weighting the values correctly. It would be nice if it I could smooth it off a bit better but I’m not sure how at the moment.

One final thing we can do is get the count of any one of the values by using the bincount function:

>>> print np.bincount(sample)[1]
4
 
>>> print np.bincount(sample)[10]
16
 
>>> print np.bincount(sample)[206]
3

Now I need to plug this into the rest of my code and see if it actually works!

Get Your Apps to Customers 5X Faster with RAD Studio. Brought to you in partnership with Embarcadero.

Topics:
python ,bigdata ,tips and tricks

Published at DZone with permission of Mark Needham, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.

THE DZONE NEWSLETTER

Dev Resources & Solutions Straight to Your Inbox

Thanks for subscribing!

Awesome! Check your inbox to verify your email so you can start receiving the latest in tech news and resources.

X

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}