# The 'Probability to Win' Is Hard to Estimate

A big data and data science expert looks into this difficult problem of statistics using the R language to find results and visualize the data.

Real-time computation (or estimation) of the "probability to win" is difficult. We've seen that in soccer games, in elections... but actually, as a professor, I see it frequently when I grade my students.

Consider a classical multiple choice exam. After each question, imagine that you try to compute the probability that the student will pass. Consider here the case where we have 50 questions. Students pass when they have 25 correct answers, or more. Just for simulations, I will assume that students just flip a coin at each question... I have n students, and 50 questions.

```
set.seed(1)
n=10
M=matrix(sample(0:1,size=n*50,replace=TRUE),50,n)
```

Let $X_{i,j}$ denote the score of student $i$ at question $j$. Let $S_{i,j}$ denote the cumulated score, i.e. $S_{i,j}=X_{i,1}+\cdots+X_{i,j}$. At step $j$, I can get some sort of prediction of the final score, using $T_{i,j}=50\times S_{i,j}/j$. Here is the code:

```
SM=apply(M,2,cumsum)
NB=SM*50/(1:50)
```

We can actually plot it:

```
plot(NB[,1],type="s",ylim=c(0,50))
abline(h=25,col="blue")
for(i in 2:n) lines(NB[,i],type="s",col="light blue")
lines(NB[,3],type="s",col="red")
```
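As a quick sanity check on the projection formula, here is a minimal sketch on a single made-up answer vector. The vector `x` and the names `s` and `proj` are assumptions for illustration; they are not part of the simulation above.

```r
# Hypothetical answers of one student on the first 10 questions (1 = correct)
x <- c(1, 0, 1, 1, 0, 1, 0, 0, 1, 1)

s <- cumsum(x)                  # cumulated score after each question
proj <- 50 * s / seq_along(x)   # projected final score: 50 * S_j / j

round(proj, 1)
```

The first projection is 50 (one correct answer out of one), and it drops to 25 after the second question, which shows how strongly the early answers drive the projected score.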

But that's *simply* the prediction of the final score at each step. That's not the probability to pass! Let's see how we can compute it. If, after $j$ questions, the student already has 25 correct answers, i.e. $S_{i,j}\geq 25$, the probability should be 1, since they can no longer fail. Another simple case is the following: if, after $j$ questions, the points they would get by answering all the remaining questions correctly are not sufficient, they will fail. That means that if $S_{i,j}+(50-j)<25$, the probability should be 0. Otherwise, the probability to succeed is quite straightforward to compute. It is the probability of obtaining at least $25-S_{i,j}$ correct answers out of the $50-j$ remaining questions, when the probability of success on each question is estimated by $S_{i,j}/j$. We recognize the survival function of a binomial distribution, $P(X\geq k)=1-P(X\leq k-1)$ with $k=25-S_{i,j}$. Note that in the code the loop index `i` runs over questions (rows) and `j` over students (columns). The code is then simply:

```
PB=NB*NA   # matrix of NA with the same shape as NB
for(i in 1:50){   # i = number of questions answered so far
for(j in 1:n){    # j = student
if(SM[i,j]>=25) PB[i,j]=1          # already passed
if(SM[i,j]+(50-i)<25) PB[i,j]=0    # can no longer pass
# otherwise: P(at least 25-SM[i,j] correct answers out of the 50-i remaining)
if((SM[i,j]<25)&(SM[i,j]+(50-i)>=25)) PB[i,j]=pbinom(25-SM[i,j]-1,size=(50-i),prob=SM[i,j]/i,lower.tail=FALSE)
}}
```
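The three cases described above can also be wrapped in a small function, which makes it easy to check a few values by hand. This is only a sketch: the names `prob_pass`, `n_questions` and `threshold` are illustrative and do not appear in the original code.

```r
# Probability to pass after j questions with s correct answers so far,
# assuming the remaining answers are independent with success rate s / j.
prob_pass <- function(s, j, n_questions = 50, threshold = 25) {
  needed    <- threshold - s     # correct answers still required
  remaining <- n_questions - j   # questions still to come
  if (needed <= 0) return(1)          # already passed
  if (needed > remaining) return(0)   # can no longer pass
  # P(X >= needed) for X ~ Binomial(remaining, s / j)
  pbinom(needed - 1, size = remaining, prob = s / j, lower.tail = FALSE)
}

prob_pass(25, 30)  # already has 25 correct answers
prob_pass(4, 30)   # needs 21 correct answers out of the 20 remaining
prob_pass(5, 10)   # needs 20 correct out of 40, at an estimated rate of 0.5
```

The last call is slightly above one half: with an estimated rate of 0.5, the student needs exactly half of the remaining 40 questions, and the event "exactly 20 correct" counts as a pass.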

So if we plot it, we get:

```
plot(PB[,1],type="s",ylim=c(0,1))
abline(h=0.5,col="blue")   # reference line at probability 0.5
for(i in 2:n) lines(PB[,i],type="s",col="light blue")
lines(PB[,3],type="s",col="red")
```

which is much more volatile than the previous curves we obtained! So yes, computing the "probability to win" is a complicated exercise! Don't blame those who find it hard to do!

Of course, things are slightly different if my students don't flip a coin... this is what we obtain if half of the students are good (2/3 probability to get a question correct) and half are not good (1/3 chance):
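A minimal sketch of how such a mixed population could be simulated, reusing the projection formula from earlier. The assignment of the first half of the columns to the good students, the colors, and the names `p`, `M2`, `SM2` and `NB2` are assumptions for illustration.

```r
set.seed(1)
n <- 10

# Assumed split: first half "good" (p = 2/3), second half "not good" (p = 1/3)
p <- rep(c(2/3, 1/3), each = n / 2)

# 50 x n matrix of answers: column k is drawn with success probability p[k]
M2 <- sapply(p, function(pk) rbinom(50, size = 1, prob = pk))

SM2 <- apply(M2, 2, cumsum)   # cumulated scores
NB2 <- SM2 * 50 / (1:50)      # projected final scores

# Good students in blue, the others in red
plot(NB2[, 1], type = "s", ylim = c(0, 50), col = "blue")
abline(h = 25)
for (k in 2:n) lines(NB2[, k], type = "s", col = ifelse(p[k] > 0.5, "blue", "red"))
```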

If we look at the probability to pass, we usually do not have to wait until the end (the 50 questions) to know who passed and who failed.

PS: I guess a less volatile solution can be obtained with a Bayesian approach... if I find some spare time this week, I will try to code it...
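One possible way to sketch such a Bayesian version, which is not the author's implementation: put a Beta prior on the student's success rate, update it with the answers seen so far, and use the resulting beta-binomial predictive distribution for the remaining questions. The uniform prior (`a = b = 1`) and the function name `prob_pass_bayes` are assumptions.

```r
# Posterior predictive probability to pass, under an assumed Beta(a, b) prior
# on the student's success rate (a = b = 1 is a uniform prior).
prob_pass_bayes <- function(s, j, n_questions = 50, threshold = 25, a = 1, b = 1) {
  needed    <- threshold - s     # correct answers still required
  remaining <- n_questions - j   # questions still to come
  if (needed <= 0) return(1)          # already passed
  if (needed > remaining) return(0)   # can no longer pass
  a_post <- a + s       # posterior Beta parameters after j questions
  b_post <- b + j - s
  x <- needed:remaining # numbers of future correct answers that lead to a pass
  # Beta-binomial probabilities, computed on the log scale for stability
  log_p <- lchoose(remaining, x) +
    lbeta(a_post + x, b_post + remaining - x) - lbeta(a_post, b_post)
  sum(exp(log_p))
}

prob_pass_bayes(2, 2)  # two correct answers out of two
```

With two correct answers out of two, the plug-in rate $S_{i,j}/j$ equals 1 and the binomial approach returns a probability of 1, whereas this version stays strictly below 1 because the prior keeps some uncertainty about the success rate.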

Published at DZone with permission of Arthur Charpentier, DZone MVB. See the original article here.

Opinions expressed by DZone contributors are their own.