As I’ve started learning more about Bayesian statistical methods, the biggest light bulb moment for me was when I started thinking in terms of distributions. I was raised on the notion of single number answers. Because of that, I thought I had mastered Bayesian methods when I could recite P(A|B) = P(B|A) P(A) / P(B). WRONG.
Today’s algorithm will show a simple example of how to deal with probabilities as distributions. There are two key concepts: updating our distribution with data, and normalizing the distribution.
First allow me to introduce today’s concrete example: flipping a coin!
Chance of Heads
If you flip a coin and it comes up tails three times out of four, how likely is it that your coin is actually a fair coin? Let’s say fair means 45 to 55 percent of the time it comes up heads.
The chance it could come up heads could be anything between 0 to 100 percent, right? Zero percent would mean it’s completely unfair and somehow NEVER comes up heads. One hundred percent would be precisely the opposite. We’re hoping for somewhere in the middle.
Now, before we’ve flipped the coin, we need to decide what the probabilities of our hypotheses between 0 and 100 percent are. To keep it simple, we’re going to assume that each hypothesis is equally likely. This is known as a “uniform distribution.” In Python, that looks like this:
possibilities = [1.0]*101
Now, we need to normalize this so that it represents probabilities. For that, I wrote this function:
def normalize(possibilities): possibility_sum = sum(possibilities) for hypothesis in xrange(0,101): possibility = possibilities[hypothesis] possibilities[hypothesis] = possibility/possibility_sum
Basically, this just makes sure all of the likelihoods for our hypotheses add up to 100 percent like good little probabilities.
This is what this uniform distribution looks like visually:
Updating with New Information
So, now we flip our coin, and let’s say we get heads. Let’s think through what we expect to happen to that uniform distribution above. Starting from 0 percent and working right …
Zero percent: Now that we’ve gotten heads once, we know that it’s impossible to never get heads. This goes to zero.
One percent: The chances seem very low that, given that there’s only a 1 percent chance of getting heads, it’d be the first we’d run into.
Fifty percent: Completely possible. We’ve only flipped once, so we can’t have 50 percent probability. This should be somewhere in the middle.
One hundred percent: Well, we’ve only flipped heads once, so we can’t say this is certain, but we also can’t rule this out.
So, how do we update our data in Python?
We use this function when we flip heads:
def update_success(possibilities): for hypothesis in xrange(0, 101): likelihood = possibilities[hypothesis] possibilities[hypothesis] = likelihood * hypothesis/100.0 normalize(possibilities)
And this function when we flip tails:
def update_failure(possibilities): for hypothesis in xrange(0, 101): likelihood = possibilities[hypothesis] possibilities[hypothesis] = likelihood*(1.0-hypothesis/100.0) normalize(possibilities)
They are different in one subtle way. Look closely at the fourth line of each function.
On the success function, we are multiplying the likelihood by each hypothesized probability of us getting heads. Once we’ve done that for every hypothesis, we normalize to get back to a percentage. The failure function multiplies the likelihood of each hypothesis by the probability of the hypothesis being false (according to the hypothesis).
This leaves us with the following histogram of likelihoods for each hypothesis:
That’s all of the moving parts. Let’s walk through some more examples so you can convince yourself that this works.
Baby-stepping Through Our Scenario
Next, let’s say we flip tails and think through what we expect to see. We’ve already ruled out a 0 percent chance of seeing heads, and now we can rule out a 100 percent chance of seeing heads. We do now have one of each, though, so I expect the middle of the distribution to swell a bit. Whether it’s a triangle shape or a curve, I’m not sure, but something like that.
Here’s that distribution:
It’s curved! Of course! Well … It’s not really obvious, but it does seem rational, at least.
Now let’s say we flip two more times and they’re both heads:
Getting to an Answer
Let’s stop here and answer the question I initially asked. How likely is this coin to have a ~50 percent chance of flipping heads (where that really means 45 to 55 percent chance, because statistics is an inexact science)? Here’s how we can approximate the answer. From the histogram we have, we just need to add up the probability from each hypothesis.
- 45 percent 0.68%
- 46 percent 0.73%
- 47 percent 0.78%
- 48 percent 0.83%
- 49 percent 0.88%
- 50 percent 0.94%
- 51 percent 0.99%
- 52 percent 1.05%
- 53 percent 1.11%
- 54 percent 1.17%
- 55 percent 1.24%
Adding them all up equals a 10 percent chance.
Notice how each individual hypothesis has a really low likelihood? It being exactly 50 percent is only a .94 percent chance! What’s that about?
This is actually a really cool aspect of this method. You see, any time you deal with probabilities (even in frequentist statistics), you need to deal with ranges of likelihoods. Call them a confidence interval, credibility interval, highest density interval, etc. It’s the same concept.
Also, this whole idea of treating individual probabilities as hypotheses was life altering for me. This concept applies generally to anything where you want to evaluate likelihoods. The y-axis will always be probabilities but the x-axis can be any numerical value. Be it money, time, or what have you.
Actually, since I’ve mentioned credibility intervals, let’s get that from this data as well.
The simplest way to explain a credibility interval is to say it’s essentially a confidence interval that you can easily get from a histogram. It entails saying which hypotheses encompass 95 percent of the likelihoods you have.
I’m running short on time, so we’ll just estimate it. I’ve grabbed the data from the program and put it in a spreadsheet. My next step is to sort the data by its likelihoods in descending order, and grab the first N values that add up to 95 percent. Then I need to find the minimum hypothesis in that range and the maximum.
I’ve done that for our experiment. Ninety-five percent of the likelihood falls between 42 and 98 percent. We can interpret this to mean that we can’t eliminate the possibility that our coin may still come up heads 50 percent of the time … but then we also have some confidence that our coin will come up heads greater than 90 percent of the time. Really, this interval is so wide, it should be interpreted to mean that we need more data.
Wrap It up
That’s the algorithm. It’s extremely straight-forward, but those ~12 lines of code took me a while to truly understand. If you want to learn more, I suggest reading about the binomial distribution. I think you might see some similarities between that and what we did. Read more here.
Also, here’s my full code file in case you want to dive into the code to understand this more fully:
def normalize(possibilities): possibility_sum = sum(possibilities) for hypothesis in xrange(0,101): possibility = possibilities[hypothesis] possibilities[hypothesis] = possibility/possibility_sum def update_success(possibilities): for hypothesis in xrange(0, 101): likelihood = possibilities[hypothesis] possibilities[hypothesis] = likelihood*hypothesis/100.0 normalize(possibilities) def update_failure(possibilities): for hypothesis in xrange(0, 101): likelihood = possibilities[hypothesis] possibilities[hypothesis] = likelihood*(1.0-hypothesis/100.0) normalize(possibilities) # set every possibility to be equally possible possibilities = [1.0]*101 # Coin flip, probability of heads normalize(possibilities) update_success(possibilities) update_failure(possibilities) update_success(possibilities) update_success(possibilities) print possibilities
(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)