Over a million developers have joined DZone.

Understanding Multi-Armed Bandit Algorithms

DZone 's Guide to

Understanding Multi-Armed Bandit Algorithms

· Big Data Zone ·
Free Resource

The Scenario

Imagine you are in front of three slot machines, each with different likelihoods of paying out. How should you play them in order to maximize your payout?

This is the problem the multi-armed bandit algorithm solves.

You might say the solution lies in just trying them all out and seeing which one does better and you’d be right! But how much should you play each slot machine? How little is too little? Here is where it gets tricky. It isn’t too bad though. A little Bayesian stats trick can make this pretty easy. Really all we need to do is play the machine most likely to payout as often as possible.

Reducing the Unknown Iteratively

Now that we know we don’t need to know the exact likelihood for any of the options let’s explore how we can find the best option.

Let’s say we start with machine A, drop a quarter in, and pull back the handle. No pay out. How does that affect our belief in this machine? Turns out it’s just like the way we updated our beliefs in a coin flip in my earlier post. (You can read that here: http://www.databozo.com/2013/09/15/Bayesian_updating_of_probabi lity_distributions.html)

A key feature we want is experimentation of the different options. We can do that by collecting a set of credible hypotheses for the likelihood of a payout and randomly choosing one to try. There is a really handy statistical function we can use to accomplish this called the “Beta” function. The Beta function is defined like this:

random likelihood = Beta(1 + plays resulting in payout, 1 + plays NOT

resulting in payout)

Imagine you’ve played slot machine A ten times without a payout. The beta function would look like this:


%load_ext rmagic
from pretty import pprint

The rmagic extension is already loaded. To reload it, use: %reload_ext rmagic


curve(dbeta(x, 1, 11))

Example Beta Curve

This shows how the likelihoods nearest zero are most likely. Anything above .4 is practically impossible. Of course zero is the single most likely likelihood, but it is not at all the only possibility. So now if we were to sample a likelihood from the above beta function we should see that we get a low lielihood for this machine.



[1] 0.04452788

Here’s where it gets cool! We can use this same technique even for the slot machines we haven’t tried yet. Let’s see if we should try slot machine B next. We’ll evaluate the beta function with no prior information. That’s a Beta(1,1). Above we use rbeta with an extra parameter (the first) to specify how many samples we want to pull from the function. So this will be all one’s.


print("Likelihood for Slot Machine B to payout")

[1] "Likelihood for Slot Machine B to payout"

[1] 0.655021

Well shucks. If Slot Machine A only has a 4.5% chance to pay out but Slot Machine B has a 65.5% chance to pay out, well, we’d be crazy to play Machine A. But, let’s see what Slot Machine C’s likelihood is.


print("Likelihood for Slot Machine C to payout")

[1] "Likelihood for Slot Machine C to payout"

[1] 0.8438229

Just so happens that we sampled an even higher likelihood. Then we play Slot Machine C and track whether we get a success or failure. We can safely assume we fail the next play. We sample from the beta function again for each slot machine with the updated parameters.



print("Likelihood for Slot Machine A to payout")

print("Likelihood for Slot Machine B to payout")

print("Likelihood for Slot Machine C to payout")

[1] "Likelihood for Slot Machine A to payout"

[1] 0.07449518

[1] "Likelihood for Slot Machine B to payout"

[1] 0.740949 [1] "Likelihood for Slot Machine C to payout"

[1] 0.2574801

Since Slot Machine B samples the highest we play that and go on and on. Let’s run a full simulation and see how our results trend over time following this algorithm.


def create_slot_machines():
    return {
        'A': {'true_payout_rate': .1, 'payouts':0, 'plays':0},
        'B': {'true_payout_rate': .05, 'payouts':0, 'plays':0},
        'C': {'true_payout_rate': .01, 'payouts':0, 'plays':0}
{'A': {'payouts': 0, 'plays': 0, 'true_payout_rate': 0.1},
'B': {'payouts': 0, 'plays': 0, 'true_payout_rate': 0.05},
'C': {'payouts': 0, 'plays': 0, 'true_payout_rate': 0.01}}

These are the true likelihoods for each slot machine. Our multi-armed bandit algorithm won’t know about them. We just use them to decide if we let the machine pay out. We will also track the stats on how each machine has performed.

Now if the simulation works, what we should see is the best slot machine getting played a lot more than the worst, and the middle one getting somewhere in between them.


from numpy.random import beta
from random import random

def sample_likelihood(machine_info):
    payouts = machine_info['payouts']
    plays = machine_info['plays']
    return beta(1 + plays, 1 + plays - payouts, 1)[0]

def machine_to_play(machines):
    expected_likelihoods = [(name, sample_likelihood(machine_info)) for name, machine_info in machines.iteritems()]
    expected_likelihoods.sort(key=lambda element: element[1], reverse=True)
    return expected_likelihoods[:1][0][0]

def play_machine(chosen_machine, slot_machines):
    decision_number = random()
    slot_machines[chosen_machine]['plays'] += 1
    if decision_number <= slot_machines[chosen_machine]['true_payout_rate']:
        slot_machines[chosen_machine]['payouts'] += 1

def run_sim(iterations=1000):
    slot_machines = create_slot_machines()
    for i in xrange(iterations):
        chosen_machine = machine_to_play(slot_machines)
        machine_won = play_machine(chosen_machine, slot_machines)
{'A': {'true_payout_rate': 0.1, 'plays': 539, 'payouts': 50},
'C': {'true_payout_rate': 0.01, 'plays': 199, 'payouts': 2},
'B': {'true_payout_rate': 0.05, 'plays': 262, 'payouts': 8}}

Whew. So out of 1,000 plays, this algorithm played the best slot machine >50% of the time and played the worst one < 20% of the time. The payout rate of our algorithm so far is about 6%. Still much lower than our ideal of 10% but at least it’s better than the two lowest options. If we just randomly selected a slot machine we’d expect a payout rate of 5.3% which means we’re doing about 14% better than random. What happens if we ratchet up to 10,000?


{'A': {'true_payout_rate': 0.1, 'plays': 7576, 'payouts': 768},
'C': {'true_payout_rate': 0.01, 'plays': 738, 'payouts': 5},
'B': {'true_payout_rate': 0.05, 'plays': 1686, 'payouts': 88}}

Looks like >75% of the time it chooses to play the better slot machine and ~7% of the time is playing the worst one. Also it looks like our payout rate is at 8.6% and approaching our ideal rate of 10% (the best possible given the slot machines we have). Now we’re doing 63% better than if we just randomly chose slot machines!


I’m a little surprised to see how long it takes for the algorithm to begin exploiting an option. It doesn’t mean this isn’t optimal but I’m willing to be there are probably some hacks one could make or some general prior knowledge one could feed into the algorithm to improve it. That’s all there is to a simple multi-armed bandit algorithm. Hopefully I’ve explained it well enough that you can think of new ways to apply it on your own. Really it all comes down to an understanding of what the beta function is doing and what the parameters you pass to it mean.

There are probably more optimal and complex ways to go about this as well but MEH! It’s easier to learn something if it can be kept simple.

Thanks for reading!

(Note: This article and the opinions expressed are solely my own and do not represent those of my employer.)


Published at DZone with permission of

Opinions expressed by DZone contributors are their own.

{{ parent.title || parent.header.title}}

{{ parent.tldr }}

{{ parent.urlSource.name }}