Solving Common Probability Problems with Python Pt.1 — Binomial
In statistics, data analysis, or data science related projects, probability is always fundamental.
The entire world is full of randomness, but the creator of the universe makes it somehow certain with uncertainty. Like the Normal Distribution, the bell curve makes the disorder in orderly ways. I thought the randomness of the world is by-design because the randomness it is also somehow deterministic.
We sometimes try to estimate probability of events by making intuitive thoughts, but the way our brain “intuition” part often fool us. Luckily, we don’t have to install proprietary statistics software to do the job, some Python code will solve for us. The key is to translate the cases to fit in which styles of distribution, then parameterize variables and functions.
A dice game example: Rolling Dice bar game.
In my generation culture, people play a dice game in a bar, karaoke, which using five dices for each player. Rolling to get a value of ONE of a dice is considering most rewarding and precious because you then can arbitrary assign any value of (1–6) to that dice. (So that you would be more likely to generate a pair, a triple).
For example: if you get the rolling result of:
(ONE-FIVE-FIVE-TWO-TWO),
you then are eligible to turn it to
(5–5–5–2–2) or (2–5–5–2–2) etc.
In my hometown, the game is named “The Liar”. (There are many versions of the game). With the magic power of having ONE(s), your chances of beating others boost.
Scenario One:
If I want to have 3 ONEs with 5 rolling dices:
Question to answer:
Q: How likely to have at least 3 ONE(s) in 5 rolling dices?
(I roll one dice at a time, each dice rolling is independent.)
To feed to code, we identify it as Binomial Distribution, which means YES/NO.
(YES getting a ONE) or (NOT getting a ONE).
YES or NO, Success or Fail, binary style, that’s why it’s called “binomial”.
Getting a ONE is one of six chances, 1/6 = 16.67%.
Total Number of trials = 5.
Cumulative Times ≥ 3 (at least three ONES)
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
# Parameterize the case. Variable names are self explained.
total_trial = 5
yes_odds = 1 / 6
num_of_yes = 3
culm_less_equal = num_of_yes - 1
# Declare the statistics binom instance
binom = stats.binom(total_trial, yes_odds)
# Compute the probability of cumulative density of less and equal to success number of ONE(s). (0, 1, 2 times)
less_equal_cdf = binom.cdf(culm_less_equal)
# Remaining is cumulative of greater and equal than 3 ONEs. (3, 4, 5 times)
greater_cdf = 1 - less_equal_cdf
print(f"You have {round(greater_cdf * 100, 2)}% chance of having at least {num_of_yes} x ONEs. Good luck!")
# (optional below) Graph it.
# We want individual probability outcomes as list values.
# pmf() to get individual data point.
# Declare and initialize the empty data variable.
pmf_dict = {
"xtimes": [],
"probability": []
}
# Compute exact probability of 0xONE, 2xONE, 3xONE, ... so forth to 5xONE.
# add them to pmf dictionary. Using probability mass function (PMF)
for i in range(6):
pmf = binom.pmf(i)
pmf_dict["xtimes"].append(i)
pmf_dict["probability"].append(pmf)
# Plot. visualize the data
df = pd.DataFrame(pmf_dict)
print(df)
df.plot.bar(y="probability", x="xtimes")
plt.show()
# Simulation (optional) of counting test.
# Test for ever 100 rounds: how many success events (more than 3 ONEs) we have. do 10 x 100 rounds.
print("\n --- Simulation Starts --- \n")
for i in range(10):
event_count = 0
simulated = binom.rvs(100)
for j in range(100):
if simulated[j] >= 3:
event_count += 1
print(f"{event_count} TIMES of having at least three ONEs in 100 rounds.")
print("\n --- Simulation Ends --- \n")
xtimes probability
0: 0.401878
1: 0.401878
2: 0.160751
3: 0.032150
4: 0.003215
5: 0.000129
Each row represents exact time(s) of having ONE(s). As you can see, having no ONE and having only one ONE are equally likelihood of 40%; a pair of ONEs is 16% odds; while to have exact three ONEs, the chance drops significantly to 3.2%. To get at least three times of ONEs, sum up rows of (3 xtimes + 4 times + 5 times), we get 3.5%. Handy way is simply using CDF (cumulative method) shown above, if the total number of trials are large, we definitely should pick CDF to do the job.
By looking at the graph, straightforward.
Ok. The cumulation of having at least three ONEs of this game is 3.5%. (Surely, you don’t need any appearance of ONE to win). I am not a master of this game, but I do find myself unbeatable every time I own at least 3 ONEs. So, my invincible odds will be 3.5% (Sad)
It is a typical case that how our intuition wiring could fool us, I normally thought it was about 10–15% I can get at least 3 ONEs. (Maybe I am just too optimistic). But, it is surprisingly unlikely to nail it. Thanks to Binomial Distribution computation, it let me see through reality.
To remind that the requirements of fitting binomial is
- Has to be two possible outcomes. (Y/N)
- Each outcome is independently fixed.
Ok. I am really terrible at that game, and my friends beat me easily with sneaky strategies. (That’s out of discussions here). But it is fun indeed.
Simulations results:
— — Simulation Starts — -
3 TIMES of having at least three ONEs in 100 rounds.
5 TIMES of having at least three ONEs in 100 rounds.
4 TIMES of having at least three ONEs in 100 rounds.
4 TIMES of having at least three ONEs in 100 rounds.
0 TIMES of having at least three ONEs in 100 rounds.
4 TIMES of having at least three ONEs in 100 rounds.
8 TIMES of having at least three ONEs in 100 rounds.
6 TIMES of having at least three ONEs in 100 rounds.
1 TIMES of having at least three ONEs in 100 rounds.
4 TIMES of having at least three ONEs in 100 rounds.— — Simulation Ends — -
Scenario Two:
Let’s now see how Geometric Distribution plot comes handy. Same context of dice game:
Question to answer:
What is the probability of first time getting a ONE in each rolling. (E.g., what is the likelihood of getting a ONE in the very first trial; 2nd, 3rd, 4th, 5th? )
It is a style of Geometric Distribution question.
Let’s fit into the function
from scipy import stats
import matplotlib.pyplot as plt
import pandas as pd
# Parameterize the case.
total_trial = 5
yes_odds = 1 / 6
# Geometric Distribution instance
geom = stats.geom(yes_odds)
# Declare the empty data variable.
pmf_dict = {
"num_of_trial": [],
"probability": []
}
for i in range(total_trial):
pmf_dict["num_of_trial"].append(i + 1)
pmf_dict["probability"].append(geom.pmf(i + 1))
# Make the DataFrame instance.
df = pd.DataFrame(pmf_dict)
print(df)
# Plot
df.plot.bar(x="num_of_trial")
plt.show()
# Get the cumulative probability
cdf = geom.cdf(5)
print(f"\nFor each round(5 rolling), we have {round(cdf * 100, 2)}% chance of having a ONE.")
num_of_trial probability
1 : 0.166667
2 : 0.138889
3 : 0.115741
4 : 0.096451
5 : 0.080376
Telling from data:
- to have a ONE showing in the very first rolling is 16.67% which is very intuitive 1/6.
- To have a ONE first time showing in the very 2nd rolling is 13.89%,
- in 3rd 11.56%, in 4th 9.65%, in 5th 8.04%.
To sum them up, you get a cumulative value of 59.81%, which means the total probability of showing up a ONE within 5 total trials is 59.8%. (Use cdf function does the cumulation. ) We are nearly 60% likely to get a ONE in a round. Not bad.
Wrapping up.
Binomial Distribution aka (Bernoulli Distribution) is so commonly used and close to our lives as we constantly face 1/0 style scenario: win or go-home, female or male, voted or not voted, sold or not sold, pass or fail. The key is translating the parameters of the case into fitting which distribution functions.
The objective of this post is to illustrate how a couple of quick lines of code (from stats library) can help us to compute probability stuff to back up intuitive decisions. And call for pandas for putting them into data rows and columns, (think of it as a tiny database), and with matplotlib, we can visualize them to do simple analysis. I use Pyto iOS (tiny IDE) on my iPhone to code above and plot them while I was on the train, portable and fast for quick probability calculations. (Thanks for the open sources and all these libraries.) If you prefer GUI, proprietary apps, for example Microsoft Excel function BINOMDIST() can also do it.
To extend, the most fascinating distribution is the famous Normal Distribution (aka Gaussian Distribution). Let’s do the computation of Normal Distribution for the next post.
For more references: