Weiming Chen

Oct 17, 2021

Solving Common Probability Problems with Python Pt.2 — Continuous Data

Continuous data measures quantities on a scale, e.g., height, weight, speed, or time. A value can vary continuously: between any two numbers, there is no limit to how many values it can take.

For example: 1.0000001 meters, 65.98574123213 kg, 100.5 kg, 100.999333 kg. Between any two values there are infinitely many possible values. It is practically impossible to count each exact value, but we can group them into intervals for a better representation.
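As a minimal sketch of grouping continuous values into intervals, the snippet below bins a few weights with NumPy. The weight values and the 20 kg bin width are made up for illustration only.

```python
import numpy as np

# Hypothetical weight measurements in kg (illustrative values, not real data).
weights = np.array([58.573, 65.98574123213, 100.5, 100.999333, 72.4, 59.1])

# Bin edges every 20 kg: 40, 60, 80, 100, 120 (an arbitrary choice).
edges = np.arange(40, 121, 20)

# Count how many measurements fall into each interval.
counts, _ = np.histogram(weights, bins=edges)

for low, high, count in zip(edges[:-1], edges[1:], counts):
    print(f"{low}-{high} kg: {count}")
```

Each exact value is unique, but the interval counts give a compact picture of where the data sits.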

(In contrast, discrete data comes from counting, e.g., the results of coin tosses or the number of sales. The value is always a whole number: the number of heads or tails, the number of items sold, etc.)

A continuous probability distribution describes the probability that a value falls in a particular range. One common continuous probability distribution is the Normal distribution (Gaussian distribution), the bell-shaped curve.

Both the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF) can be applied to continuous data.

The Probability Density Function (PDF) gives the density at an exact data point, which indicates relative likelihood rather than a probability. For continuous data, the probability of any single exact value is effectively zero. (Such as asking, “what is the probability of a weight of exactly 58.573 kg among all women in an area?”) The probability of a single continuous value would not tell you much.
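To see why a single exact value carries essentially no probability, here is a small sketch that shrinks an interval around 58.573 kg. The mean of 60 kg and standard deviation of 8 kg are assumptions chosen purely for illustration, not measured data.

```python
from scipy.stats import norm

# Assumed parameters for illustration only (not real measurements).
weight_dist = norm(loc=60.0, scale=8.0)

point = 58.573
density = weight_dist.pdf(point)  # a density (per kg), not a probability

# Probability of landing within +/- half_width of the point.
half_widths = (1.0, 0.1, 0.001)
probs = [
    weight_dist.cdf(point + w) - weight_dist.cdf(point - w)
    for w in half_widths
]

for w, p in zip(half_widths, probs):
    print(f"within +/-{w} kg: {p:.6f}")
```

As the interval narrows, the probability heads toward zero, while the density at the point stays the same.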

For continuous data, a more meaningful approach is to use the Cumulative Distribution Function (CDF) to determine the probability within an interval. (Change the question above to ask about a range of weights, such as 58–59 kg or 55–60 kg.)

For example, about 71% of the NBA player height distribution falls in the 1.90 m to 2.10 m interval.

Subtract the CDF at the lower bound from the CDF at the upper bound to get the probability in between.

# norm_dist: a scipy.stats.norm instance built from the player heights (in cm)
norm_dist.cdf(210) - norm_dist.cdf(190)

# 0.71463211317374
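For a version of this snippet that runs on its own, the sketch below builds the distribution from the rounded mean and standard deviation shown in the describe() output of this post. Using rounded statistics instead of the CSV file is my shortcut here, so the last few digits may differ slightly.

```python
from scipy.stats import norm

# Mean and standard deviation in cm, copied (rounded) from the describe() output.
norm_dist = norm(loc=198.704922, scale=9.269761)

# P(190 cm < height < 210 cm) = CDF(210) - CDF(190)
p_190_210 = norm_dist.cdf(210) - norm_dist.cdf(190)
print(p_190_210)  # roughly 0.71
```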

The pdf() method is fine for comparing relative likelihoods at individual points or for plotting the shape of the curve.

For continuous data, it is more meaningful to use cdf() to find the probability between two values.

Below is an example working with NBA Players’ heights.

import numpy as np
import pandas as pd
from scipy.stats import norm
import matplotlib.pyplot as plt

df = pd.read_csv('files/Players.csv')


df['height'].describe()


"""
count 3921.000000
mean 198.704922
std 9.269761
min 160.000000
25% 190.000000
50% 198.000000
75% 206.000000
max 231.000000
Name: height, dtype: float64
"""


# Get a Normal Distribution instance for our height data.
dist = norm(df['height'].mean(), df['height'].std())

# Calculate the interval probability
p = dist.cdf(220) - dist.cdf(210)

# 0.10071770329577223
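The same interval can be expressed with z-scores on the standard normal distribution, which is a handy sanity check. This is a sketch that reuses the rounded mean and standard deviation from the describe() output instead of reading the CSV file.

```python
from scipy.stats import norm

mean, std = 198.704922, 9.269761  # rounded values from the describe() output (cm)

# Direct calculation on the fitted distribution.
dist = norm(mean, std)
p_direct = dist.cdf(220) - dist.cdf(210)

# Same interval expressed as z-scores on the standard normal N(0, 1).
z_low = (210 - mean) / std
z_high = (220 - mean) / std
p_standard = norm.cdf(z_high) - norm.cdf(z_low)

print(p_direct, p_standard)
```

Both routes should agree, since standardizing only rescales the axis.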

Quickly check the distribution shape.

# x-axis: all height data points
# y-axis: probability density at each data point (calculated by pdf())

plt.bar(df['height'], dist.pdf(df['height']))
plt.show()

Note that the total area under the curve is always equal to 1, representing the probability of all possible values combined.
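A quick numerical check of the area claim: the sketch below integrates the density with scipy.integrate.quad. It assumes the rounded mean and standard deviation from the describe() output, and integrates over the mean plus or minus 10 standard deviations as a practical stand-in for the whole axis.

```python
from scipy.integrate import quad
from scipy.stats import norm

mean, std = 198.704922, 9.269761  # rounded values from the describe() output (cm)
dist = norm(mean, std)

# Total area: integrate the density over a range wide enough to hold nearly all mass.
total_area, _ = quad(dist.pdf, mean - 10 * std, mean + 10 * std)

# Area of one slice: should match the CDF difference for the same interval.
slice_area, _ = quad(dist.pdf, 210, 220)
cdf_difference = dist.cdf(220) - dist.cdf(210)

print(total_area, slice_area, cdf_difference)
```

The slice area and the CDF difference are two ways of computing the same probability.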

The sliced part is the interval we are interested in. In this case, it is about 10% of the area. (Apologies for my terrible hand drawing.)

The probability that an NBA player is between 2.10 m and 2.20 m tall is about 10%.
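The sliced area can also be drawn in code. This is only a sketch: plot_interval is a hypothetical helper name of mine, and the distribution uses the rounded mean, standard deviation, min, and max from the describe() output rather than the CSV file.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm


def plot_interval(dist, low, high, x_min=160, x_max=231):
    """Plot the density curve and shade the area between low and high."""
    x = np.linspace(x_min, x_max, 500)
    fig, ax = plt.subplots()
    ax.plot(x, dist.pdf(x))

    # Shade only the part of the curve that lies inside the interval.
    inside = (x >= low) & (x <= high)
    ax.fill_between(x[inside], dist.pdf(x[inside]), alpha=0.4)

    ax.set_xlabel("Height (cm)")
    ax.set_ylabel("Probability density")
    return ax


dist = norm(198.704922, 9.269761)  # rounded values from the describe() output
ax = plot_interval(dist, 210, 220)
# Call plt.show() afterwards to display the figure.
```

The shaded region corresponds to the CDF difference between the two bounds.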

You can use the Probability Density Function and the Cumulative Distribution Function to work with continuous data.

A continuous probability distribution can take the form of a Normal distribution (aka Gaussian distribution), with a central tendency and a bell-shaped curve. We can use scipy.stats.norm to create a distribution instance by passing in the mean and standard deviation, and then find the probability within an interval.

For more references: