tl;dr: why Bayes' theorem is a neat way to work with probabilities

The concept of probability, a cornerstone of machine learning, admits a multitude of interpretations and definitions. Most people already have an intuitive grasp of it in everyday situations, yet in other situations they may be genuinely surprised by the facts. This blog post covers the fundamentals of probability theory, establishing a rigorous grasp of the mathematical underpinnings so that intuition remains robust when scaling to contemporary complex, high-dimensional models.

From events to probability

Consider a coin tossing experiment with heads $H$ and tails $T$, with the sample space being the set $\Omega=\{H, T\}$. An event is a subset of the sample space, and the event space $\mathcal{F}$ includes all subsets: $\emptyset$, $\{H\}$, $\{T\}$ and $\{H,T\}$. For example, the event space of a standard deck of playing cards includes events such as “hearts” (13 cards), “any card” (52 cards) and “Aces” (4 cards). Finally, outcomes are the elements of the sample space, $\omega\in\Omega$, and are the possible results of a trial. In the coin tossing example the outcomes are either $H$ or $T$; if a trial with a deck of cards entails drawing from the deck once, then the outcomes are the 52 individual cards.

Now, let us define a random variable $X$. This is a measurable function $X: \Omega \rightarrow A$ that maps outcomes $\omega$ from the sample space $\Omega$ to a state space $A$ (e.g., $\{0,1\}$ or $\mathbb{R}$). The role of the random variable is to map potential outcomes to a space with measurable properties, essentially translating reality to data. Think of a restaurant with raw ingredients (the sample space $\Omega$) and a menu (the event space $\mathcal{F}$). To have a functioning restaurant, we need to pair the ingredients with the menu that lists what can be ordered. This pairing is the structure known as the measurable space $(\Omega, \mathcal{F})$. We can then define random variables that assign specific values to each dish, such as its price or calories. A probability measure $P$ can then assign values (likelihoods) to these events; however, this can only be done once we have defined the full probability triplet $(\Omega, \mathcal{F}, P)$. In other words, probabilities can only be assigned to items actually listed on the menu ($\mathcal{F}$).

Crucially, random variables map outcomes to numbers, while probability measures map events to probabilities. This enables us to ask questions such as: what is the probability of a subset of the state space, i.e., $S \subseteq A$? For example, what is the probability of an order costing more than €10? Since the probability measure $P$ acts on events (sets of dishes) and not on numbers, we must first map the target values back to the menu. This is known as the pull-back $X^{-1}(S)$, which creates a subset of $\Omega$ that can be assigned a probability as

$$ P_X(S) = P(X^{-1}(S)) = P(\{\omega \in \Omega \mid X(\omega)\in S \}). $$
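Staying with the restaurant analogy, the pull-back can be sketched in a few lines of Python. The menu, the prices and the uniform measure below are hypothetical, purely for illustration:

```python
from fractions import Fraction

# Hypothetical sample space: dishes (outcomes), mapped by the random
# variable X to their prices in euros.
X = {"soup": 8, "pasta": 12, "steak": 24, "salad": 9}

# A uniform probability measure P over the four outcomes.
P = {dish: Fraction(1, len(X)) for dish in X}

def pullback(X, S):
    """X^{-1}(S): the set of outcomes whose image lies in the target set S."""
    return {omega for omega, value in X.items() if S(value)}

def P_X(X, S):
    """P_X(S) = P(X^{-1}(S)): the probability assigned to the target set S."""
    return sum(P[omega] for omega in pullback(X, S))

# Probability of an order costing more than EUR 10:
print(P_X(X, lambda price: price > 10))  # 1/2 (pasta and steak)
```

Note that the lambda plays the role of the target set $S$: it selects the numeric values of interest, and the pull-back translates them back into outcomes that $P$ can act on.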

If the state space of $ X $ is countable, we refer to $ X $ as a discrete random variable. If it is uncountable, i.e., $ X $ takes values in continuous ranges, it is referred to as a continuous random variable. For example, a discrete random variable can be the number of cards in your hand, while a continuous random variable can be the time it takes to shuffle the deck. The relationship of a continuous random variable with probability is described by a probability density function (PDF), which assigns probability to ranges of values rather than to single points. A point value of a continuous $ X $ has zero probability, and its value under the PDF is referred to as the likelihood. In the discrete case, described by the probability mass function (PMF), the terminology for likelihood and probability coincides.
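To make the PMF/PDF distinction concrete, here is a small standard-library sketch contrasting the PMF of a fair coin with the standard normal PDF; the example distributions are illustrative:

```python
import math

# Discrete case: a PMF assigns actual probabilities to individual points.
pmf = {0: 0.5, 1: 0.5}  # fair coin
print(pmf[1])  # P(X = 1) = 0.5

# Continuous case: a PDF assigns densities (likelihoods), not probabilities;
# any single point of a continuous X has probability zero.
def normal_pdf(x, mu=0.0, sigma=1.0):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

print(normal_pdf(0.0))  # likelihood ~0.3989, not a probability
```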

(Figure: probability density function)

(Figure: probability mass function)
Useful identities

Consider a probability space $(\Omega, \mathcal{F}, P)$ and two events $ A, B \in \mathcal{F} $. The probability of the union of these events is given by the inclusion-exclusion principle \begin{equation} P(A \cup B) = P(A) + P(B) - P(A \cap B). \end{equation} If the events $ A $ and $ B $ are statistically independent, the probability of their intersection factorizes as \begin{equation} P(A \cap B) = P(A) P(B). \end{equation}

A special case arises when two events are mutually exclusive (or disjoint), meaning they cannot occur simultaneously. In this scenario, the intersection contains no outcomes, i.e., $P(A \cap B) = 0$. Consequently, the inclusion-exclusion principle simplifies to the direct sum of their individual probabilities: \begin{equation} P(A \cup B) = P(A) + P(B). \end{equation}

We must also consider the scenario where event $A$ does not occur, described by the complement of the event. Since an event and its complement collectively cover the entire sample space (exhausting all possibilities), their probabilities must sum to unity: \begin{equation} P(\neg A) = 1 - P(A). \end{equation}

Finally, we define conditional probability, which allows us to update our beliefs based on new information. The probability of $A$ occurring, given that event $B$ has already been observed, is defined by the identity \begin{equation}\label{eq:condpr} P(A \vert B) = \frac{P(A \cap B)}{P(B)}, \end{equation} provided that $ P(B) \neq 0 $. Conceptually, this operation restricts our sample space from the total universe $\Omega$ down to just the outcomes in $B$.

We can extend this concept to the Law of Total Probability. Let us assume a countable set of disjoint events $\{B_n\}$ that forms a partition of the sample space $\Omega$, meaning that $\bigcup_n B_n = \Omega$ and $\sum_n P(B_n) = 1$. The probability of any event $A$ can then be found by summing its intersections with each part of this partition.
By applying the product rule, we arrive at the standard formulation: \begin{equation}\label{eq:lotp} P(A) = \sum_n P(A, B_n) = \sum_n P(A \vert B_n) P(B_n). \end{equation} This law effectively allows us to determine the marginal probability of $A$ by considering all possible scenarios of $B$. As a consistency check, consider the case where $A$ is statistically independent of all $B_n$: then $P(A \vert B_n) = P(A)$ for every $n$, and the sum reduces to $P(A) \sum_n P(B_n) = P(A)$, as expected.
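As a small illustration, the Law of Total Probability can be evaluated directly. The partition (weather states) and all the numbers below are invented for the example:

```python
# Hypothetical partition {B_n}: the day's weather; event A: "the train is late".
P_B = {"sunny": 0.6, "rain": 0.3, "snow": 0.1}           # P(B_n), sums to 1
P_A_given_B = {"sunny": 0.05, "rain": 0.2, "snow": 0.5}  # P(A | B_n)

# Law of Total Probability: P(A) = sum_n P(A | B_n) P(B_n)
P_A = sum(P_A_given_B[b] * P_B[b] for b in P_B)
print(P_A)  # 0.03 + 0.06 + 0.05 = 0.14
```

Each term of the sum is one "scenario" of $B$, weighted by how likely that scenario is; no single conditional needs to be complicated for the marginal to be informative.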

Joint probabilities

Consider a simple example of a discrete probability mass function $ P_X $ of a random variable $ X \in \{0, 1\} $. We define another random variable $ Y \in \{0, 1\} $ and describe their joint probability distribution in the table below:

| $ P_{X,Y} $ | $ X = 1 $ | $ X = 0 $ |
|---|---|---|
| $ Y = 1 $ | 0.1 | 0.3 |
| $ Y = 0 $ | 0.05 | 0.55 |

Note that we use the comma to indicate the intersection of events ($P(X, Y) \equiv P(X \cap Y)$). We can calculate the marginal probability of $X$ using the Law of Total Probability. By summing over all possible states of $Y$, we obtain:

$$ \begin{aligned} P_X(X=0) &= \sum_{y \in \{0,1\}} P_{X,Y}(X=0, Y=y) \\ &= P_{X,Y}(0,0) + P_{X,Y}(0,1) \\ &= 0.55 + 0.3 = 0.85. \end{aligned} $$
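The same marginalization can be verified programmatically; a minimal sketch using the joint table above:

```python
# Joint PMF P_{X,Y} from the table, keyed as (x, y) -> probability.
joint = {(1, 1): 0.10, (0, 1): 0.30,
         (1, 0): 0.05, (0, 0): 0.55}

# Marginalize out Y: P_X(x) = sum_y P_{X,Y}(x, y)
P_X = {x: sum(p for (xi, y), p in joint.items() if xi == x) for x in (0, 1)}
print(P_X)  # marginals: P_X(0) = 0.85, P_X(1) = 0.15
```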


In other words, the probability of $ X=0 $ regardless of the value of $ Y $ is the sum of the cases where $ Y=0 $ and $ Y=1 $. This concept extends to continuous variables. Assuming $ X, Y \in \mathbb{R} $ are continuous, the summation is replaced by an integral. The Law of Total Probability allows us to find the marginal density $ p_X(x) $ by integrating out $ Y $:

$$ p_X(x) = \int_{-\infty}^{\infty} p_{X,Y}(x,y) \, \mathrm{d}y = \int_{-\infty}^{\infty} p_{X \mid Y}(x \mid y) p_Y(y) \, \mathrm{d}y. $$

While these identities may seem abstract, they form the backbone of probabilistic modeling for two practical reasons.

  1. Decomposition: It is often difficult to estimate a complex probability directly from raw data. However, it is much easier to estimate conditional relationships such as $P(X \mid Y)$. The Law of Total Probability allows us to build the complex “global” probability $P_X$ by combining simpler, “local” building blocks $P_{X\mid Y}$.
  2. Marginalization: In real-world data, our observations ($X$) are often affected by noise or environmental factors ($Y$) that we cannot control. For example, a speech recognition model receives audio data ($X$) corrupted by background noise ($Y$). By integrating out $Y$ (marginalization), we can make predictions about the speech content that are robust to the varying noise conditions, effectively “averaging them out”.

Bayes’ rule

A fundamental property of the joint probability is its symmetry. The probability of $X$ and $Y$ occurring together is identical regardless of the order in which we consider them, i.e., $P(X, Y) = P(Y, X)$. By expanding both sides using the conditional identity defined earlier and rearranging the terms, we can express the posterior probability of $Y$ given $X$ in terms of the reverse conditional probability

$$ \underbrace{P(Y \mid X)}_{\text{posterior}} = \frac{\overbrace{P(X \mid Y)}^{\text{likelihood}} \, \overbrace{P(Y)}^{\text{prior}}}{\underbrace{P(X)}_{\text{evidence}}}. $$

This simple algebraic rearrangement represents a profound shift in perspective. While the Law of Total Probability (discussed in the previous section) allows us to reason forward from causes to effects, Bayes’ Theorem allows us to perform inverse or reverse probability: reasoning backwards from observed effects (data $X$) to uncertain causes ($Y$). The denominator, or evidence, serves as a normalization constant ensuring the posterior sums to 1. While the likelihood describes how the data is generated, the prior acts as the bridge between past knowledge and current inference. The prior also imposes structure, effectively penalizing unlikely explanations, which is often the source of both the power and the controversy of Bayesian methods. Critics often argue that the prior introduces subjectivity. However, from a Bayesian perspective, all models make assumptions. A frequentist model that relies solely on the likelihood implicitly assumes a uniform (flat) prior, implying that all outcomes are equally probable a priori. Bayes’ rule forces us to make these assumptions explicit and mathematically tractable.
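Bayes' rule is mechanical to implement once a prior and a likelihood are given. The sketch below, with an invented fair-vs-biased coin example, normalizes by the evidence computed via the Law of Total Probability:

```python
def bayes_update(prior, likelihood):
    """Posterior P(Y | X) ∝ P(X | Y) P(Y); the evidence P(X) normalizes."""
    unnormalized = {y: likelihood[y] * prior[y] for y in prior}
    evidence = sum(unnormalized.values())  # P(X) via the Law of Total Probability
    return {y: p / evidence for y, p in unnormalized.items()}

# Hypothetical example: is the coin fair or biased, given one observed head?
prior = {"fair": 0.5, "biased": 0.5}
likelihood_heads = {"fair": 0.5, "biased": 0.9}  # P(heads | hypothesis)
posterior = bayes_update(prior, likelihood_heads)
print(posterior)  # mass shifts towards "biased" (0.45/0.7 ≈ 0.64)
```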

Occam’s razor

Consider a hypothesis $H$ as a strategy for predicting which data $D$ will occur. A simple hypothesis is strict and specific; it concentrates its probability mass over a narrow range of outcomes, resulting in high-magnitude peaks. Conversely, a complex hypothesis is flexible and capable of explaining a wide variety of outcomes. To account for this broad range of potential data, it must spread its probability mass thinly, as the likelihood function must integrate to exactly $1$ over the entire space of possible data outcomes. Consequently, for any single specific observation $D$, the likelihood $P(D|H)$ of the complex model is comparatively small. When substituted into Bayes’ theorem, a simple hypothesis yields a much larger numerator compared to a complex one, because it did not waste probability mass on outcomes that did not occur. Thus, Bayes’ rule naturally implements Occam’s Razor, where it effectively penalizes the complex model for its vagueness.
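This penalty can be made numerically explicit. In the hypothetical sketch below, a "simple" hypothesis concentrates its mass on 10 outcomes while a "complex" one spreads it over all 100; both explain the observed datum, but the sharper one earns the larger posterior:

```python
# Hypothetical likelihoods over outcomes 0..99 (each sums to 1 over the space):
def lik_simple(d):
    return 0.1 if d < 10 else 0.0   # mass concentrated on 10 outcomes

def lik_complex(d):
    return 0.01                     # mass spread over all 100 outcomes

prior = {"simple": 0.5, "complex": 0.5}
d = 3  # observed datum, compatible with both hypotheses

unnormalized = {"simple": lik_simple(d) * prior["simple"],
                "complex": lik_complex(d) * prior["complex"]}
evidence = sum(unnormalized.values())
posterior = {h: p / evidence for h, p in unnormalized.items()}
print(posterior)  # simple ≈ 0.91, complex ≈ 0.09: the vaguer model is penalized
```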

Preservation of uncertainty

Bayes’ rule can also be viewed as a mechanism for intellectual honesty, governing how the spread of our belief distribution changes (or refuses to change) in light of new data. If the observed data is noisy, sparse, or ambiguous, the resulting likelihood function will be diffuse, failing to strongly distinguish between hypotheses. Mathematically, since the posterior $P(H|D)$ is the product of the prior and the likelihood, a weak likelihood fails to “sharpen” the prior, causing the posterior to remain broad. Unlike optimization methods that force convergence to a single point estimate (effectively a Dirac delta), Bayes’ rule preserves the distribution’s width as long as possible. It refuses to narrow the range of possibilities unless the data provides overwhelming evidence to do so, acting as a conservative update rule that prevents overconfidence.

Influence of the prior

To illustrate the critical role of the prior, consider a rare disease $ K $ with prevalence $ P(K^+)=0.01 $ and a test $ T $ with 99% accuracy, meaning both its sensitivity $ P(T^+ \vert K^+) $ and its specificity $ P(T^- \vert \neg K^+) $ equal $0.99$. While intuition suggests that a positive test result $ T^+ $ implies a near-certain infection, Bayes’ Theorem reveals a different reality. By expanding the evidence term $ P(T^+) $ via marginalization (Law of Total Probability), we find:

$$ \begin{aligned} P(K^+\vert T^+) &= \frac{P(T^+\vert K^+)P(K^+)}{P(T^+)} \\ &=\frac{P(T^+\vert K^+)P(K^+)}{\underbrace{P(T^+\vert K^+) P(K^+) + P(T^+\vert \neg K^+) P(\neg K^+)}_{\text{Law of Total Probability}}} \\ &= \frac{0.99\times 0.01}{0.99\times 0.01 + 0.01\times 0.99} = 0.5, \end{aligned} $$

which is significantly lower than the test accuracy suggests. This demonstrates that without incorporating the prior (the low base rate of the disease), relying solely on the likelihood leads to a false sense of certainty. To expand on this, we can observe how Bayesian inference naturally handles accumulating evidence. Suppose we administer a second test, $ T_2 $, which is conditionally independent of the first. Because the tests are independent, the likelihood of two false positives drops drastically ($ 0.01^2 $), while the likelihood of two true positives remains high ($ 0.99^2 $). We apply Bayes’ rule again for the joint events $ T_1^+, T_2^+ $:

\begin{equation} P(K^+ \vert T_1^+, T_2^+) = \frac{P(T_1^+ \vert K^+) P(T_2^+ \vert K^+) P(K^+)}{P(T_1^+, T_2^+)} \approx 0.99. \end{equation}

With the second confirmation, the probability jumps from 50% to 99%. This illustrates the core strength of the Bayesian framework. It serves as an iterative process where the posterior belief becomes the prior for the next observation, allowing the model to learn and become more confident as data accumulates.
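This iterative process is straightforward to sketch in code. Assuming, as above, that the test's sensitivity and specificity are both 0.99 and that repeated tests are conditionally independent:

```python
def posterior_positive(prior, accuracy):
    """P(K+ | T+) for a test with sensitivity = specificity = accuracy."""
    true_pos = accuracy * prior
    false_pos = (1.0 - accuracy) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)  # Bayes' rule, evidence via LTP

belief = 0.01  # prior: the disease prevalence P(K+)
for _ in range(2):
    belief = posterior_positive(belief, 0.99)  # posterior becomes the next prior
    print(belief)  # ≈ 0.5 after the first positive test, ≈ 0.99 after the second
```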

Closing thoughts

The foundational knowledge of probability theory helps us understand how to reason under uncertainty and update beliefs based on new information, thereby making our models more trustworthy. However, translating these concepts to contemporary machine learning models is not straightforward. Fortunately for us, this problem has been researched for decades and has resulted in many interesting probabilistic machine learning architectures.