tl;dr: how we can model distributions and their uncertainty

Probabilistic frameworks move beyond deterministic, single-point estimates, such as hard zeros and ones, to a full spectrum of confidence values, allowing us to properly model and capture uncertainty. Producing a probability distribution, rather than a single deterministic answer, enables more principled and risk-aware decision-making. This blog post introduces the probabilistic modeling framework derived from Bayes’ theorem and shows how it enables the characterization of uncertainty.

Generative classification

Consider a classification problem with \(K\) classes and dataset \(\mathcal{D}=\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^N\) containing \(N\) independent and identically distributed (i.i.d.) pairs of one-hot encoded ground truths $\mathbf{y}\in\mathbb{R}^{K\times D}$ and images $\mathbf{x}\in\mathbb{R}^{C\times D}$ sampled from \(P_{\mathbf{Y},\mathbf{X}}\), where $C$ denotes the channel depth of the images and $D$ their spatial dimensions. Bayes’ rule can be employed to find the image-conditional class distribution \begin{equation} p(\mathbf{y}\vert \mathbf{x}) = \frac{p(\mathbf{x}\vert\mathbf{y}) p(\mathbf{y})}{p(\mathbf{x})}, \end{equation} where \(p(\mathbf{x}\vert\mathbf{y})\) is the likelihood function, \(p(\mathbf{y})\) the prior belief in the distribution of the classes, and \(p(\mathbf{x})=\sum_{\mathbf{y}} p(\mathbf{x},\mathbf{y})\) the evidence. Assuming we have found a way to model these distributions, the Bayesian paradigm offers the best possible posterior, i.e., the image-conditional class distribution. This does not mean it is the true posterior. Bayes’ theorem guarantees consistency: if the true data-generating process is contained within the model family and enough data is available, the posterior converges to the truth. Hence, the assumptions made during this process significantly shape the resulting posterior. This can pose challenges, especially when comprehensive domain knowledge about the data is lacking. Nonetheless, this way of modeling can perform well with small datasets and can exploit unlabeled data through $p(\mathbf{x})$. We also learn which features of the data determine its class-conditional likelihood. This is a very powerful advantage, since we can explore the data and generate new data by sampling from the likelihood.
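To make this recipe concrete, here is a minimal sketch, assuming hypothetical 1-D Gaussian class-conditional likelihoods (instead of images) and a uniform class prior; all the numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D data: two classes with different means (hypothetical example).
x0 = rng.normal(-2.0, 1.0, size=200)  # samples from class 0
x1 = rng.normal(+2.0, 1.0, size=200)  # samples from class 1

# Model the class-conditional likelihoods p(x|y) as Gaussians (an assumption),
# fitted with their sample mean and standard deviation.
mu = np.array([x0.mean(), x1.mean()])
sigma = np.array([x0.std(), x1.std()])
prior = np.array([0.5, 0.5])  # p(y): assume balanced classes

def posterior(x):
    """p(y|x) via Bayes' rule: likelihood * prior / evidence."""
    lik = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    joint = lik * prior
    return joint / joint.sum()  # evidence = sum of the joint over classes

print(posterior(-2.0))  # mass concentrated on class 0
print(posterior(0.0))   # near the boundary: roughly uniform
```

Because the likelihood is modeled explicitly, new data could also be generated by first sampling a class from $p(\mathbf{y})$ and then an input from the fitted Gaussian $p(\mathbf{x}\mid\mathbf{y})$.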

Discriminative classification

It is crucial to acknowledge that we have thus far deferred the discussion of parameter estimation to focus on inference under the assumption that the parameters and the induced distributions are known. Before addressing parameter estimation, it is pertinent to question the necessity of explicitly modeling intermediate densities when the primary objective is frequently the posterior distribution alone. As Vapnik1 famously cautioned, one should avoid solving a more general problem as an intermediate step when a direct solution is available. This principle suggests that we should focus immediately on the specific task at hand. Since our primary objective is to determine the class label, there is no need to incur the computational cost of modeling the complete underlying data distribution when we can solve for the decision boundary directly. This is known as discriminative modeling, where we can learn the posterior directly, rather than inferring the posterior via Bayes’ rule.

Given a functional form of the posterior $ p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) $, parameterized by $ \boldsymbol{\theta} $, this approach circumvents the need to model the likelihood $ p(\mathbf{x} \mid \mathbf{y}) $, effectively ignoring the high dimensionality of the input space to focus solely on the class distinctions. In essence, we are modeling the conditional distribution that defines the decision boundary. While discriminative modeling is computationally easier and robust to outliers without requiring extensive domain knowledge (since we do not need to make assumptions about the data distribution), it comes at a cost as well. These methods typically demand large labeled datasets, which can be expensive to acquire, and they often lack the interpretability inherent to generative models.
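For contrast with the generative approach, a minimal sketch of discriminative modeling on hypothetical 1-D data: a logistic-regression posterior fitted by gradient descent, with no model of $p(\mathbf{x}\mid\mathbf{y})$ or $p(\mathbf{x})$ anywhere:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1-D binary data (hypothetical): class 0 around -2, class 1 around +2.
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(2, 1, 200)])
y = np.concatenate([np.zeros(200), np.ones(200)])

# Discriminative model: learn p(y=1|x, theta) = sigmoid(w*x + b) directly.
w, b = 0.0, 0.0
lr = 0.1
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    # Gradient step on the negative log-likelihood (binary cross-entropy).
    w -= lr * np.mean((p - y) * x)
    b -= lr * np.mean(p - y)

def predict(x_new):
    """Learned posterior p(y=1 | x_new)."""
    return 1.0 / (1.0 + np.exp(-(w * x_new + b)))

print(predict(-2.0), predict(2.0))  # confident on either side of the boundary
```

Only the decision boundary is learned; nothing in the fitted model can be sampled to produce new inputs $\mathbf{x}$.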


Generative classification: use the likelihood of the data itself


Discriminative classification: directly learn the decision boundary

Parameter estimation and uncertainty

As mentioned in the previous section, we can avoid performing cumbersome Bayesian inference over the labels with *discriminative modeling*. The question remains: how can we obtain the optimal parameters? We can use Bayesian inference as

$$ p(\boldsymbol{\theta} \mid \mathbf{x}, \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) p(\boldsymbol{\theta})}{p(\mathbf{y} \mid \mathbf{x})}, $$

assuming independence of $\mathbf{x}$ from $\boldsymbol{\theta}$, and where $p(\boldsymbol{\theta})$ represents the prior belief on the parameter distribution and $p(\mathbf{y}\vert\mathbf{x})$ the conditional data likelihood (i.e., the *evidence*). After obtaining a posterior with dataset $\mathcal{D}=\{\mathbf{x}_i, \mathbf{y}_i\}_{i=1}^N$ containing $N$ images, the predictive distribution for a new datapoint $\mathbf{x}^*$ can be written as

$$ p(\mathbf{y}\mid \mathbf{x}^*, \mathcal{D})=\int \underbrace{p(\mathbf{y}\mid \mathbf{x}^*,\boldsymbol{\theta})}_{\text{Data}} \overbrace{p(\boldsymbol{\theta}\mid \mathcal{D})}^{\text{Model}} \mathrm{d}\boldsymbol{\theta}. $$

As is evident, both the variability in the empirical data and the inferred parameters of the model influence the predictive distribution. Given this, a straightforward way to quantify the total uncertainty is to compute the predictive entropy, $H[\,\mathbf{y}\vert\mathbf{x}^*,\mathcal{D}\,]$, i.e. the entropy of the predictive distribution $p(\mathbf{y}\vert \mathbf{x}^*, \mathcal{D})$, as

$$ \begin{align} H[\,\mathbf{y}\vert\mathbf{x}^*,\mathcal{D}\,]= \overbrace{I[\mathbf{y},\boldsymbol{\theta} \mid \mathbf{x}^*,\mathcal{D}]}^{\text{epistemic}} + \underbrace{\mathbb{E}_{q(\boldsymbol{\theta}\mid\mathcal{D})}[\,H[\mathbf{y}\mid\mathbf{x}^*,\boldsymbol{\theta}]\,]}_{\text{aleatoric}}, \tag{3} \end{align} $$

where $I$ represents the mutual information. The first term is the expected information gain about the model parameters $\boldsymbol{\theta}$ from observing the true label $\mathbf{y}$. A high value indicates that the plausible models disagree about the prediction, meaning observing the true label would provide a large information gain; this is the epistemic uncertainty. The second term represents the inherent randomness in the data, found by averaging the predictive entropy across the posterior of plausible models. This reflects the irreducible noise that remains regardless of how much more training data is observed; this is the aleatoric uncertainty.
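Given samples from an (approximate) posterior, this decomposition can be estimated directly. A minimal sketch, assuming we already have softmax outputs for one input from a handful of hypothetical posterior samples (e.g. ensemble members or MC-dropout passes):

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy in nats, guarding against log(0)."""
    return -np.sum(p * np.log(np.clip(p, 1e-12, None)), axis=axis)

# Hypothetical softmax outputs for one input x* from S posterior samples,
# shape (S, K).
confident = np.array([[0.98, 0.01, 0.01]] * 5)            # all members agree
disagreeing = np.array([[0.98, 0.01, 0.01],
                        [0.01, 0.98, 0.01],
                        [0.01, 0.01, 0.98]])              # members disagree

def decompose(probs):
    total = entropy(probs.mean(axis=0))          # H[y | x*, D]
    aleatoric = entropy(probs, axis=-1).mean()   # E_q[ H[y | x*, theta] ]
    epistemic = total - aleatoric                # mutual information I[y, theta]
    return total, aleatoric, epistemic

print(decompose(confident))    # low total, near-zero epistemic
print(decompose(disagreeing))  # high epistemic: confident members that conflict
```

Note how the disagreeing ensemble has high epistemic uncertainty even though every individual member is confident: the disagreement, not the per-member entropy, carries the model uncertainty.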

Hence, the nature of the modeled uncertainty strongly determines appropriate decision-making (related blogpost). Nonetheless, determining the nature of uncertainty is rarely straightforward. For example, Hullermeier and Waegeman2 state that “by allowing the learner to change the setting, the distinction between these two types of uncertainty will be somewhat blurred”. This sentiment is also shared by Der Kiureghian and Ditlevsen3, noting that “in one model an addressed uncertainty may be aleatory, in another model it may be epistemic”. Especially with increasingly complex methodologies, treating the uncertainties as separate concepts is mostly a theoretical (often even philosophical) exercise and highly non-trivial in practice.

Maximum Likelihood Estimation

Discriminative modeling essentially omits the need to model intermediate densities not directly relevant to our problem. However, we still need to perform Bayesian inference w.r.t. the parameters and thus have to face the difficult term in the denominator, \( p(\mathbf{y} \mid \mathbf{x}) \). This value can be determined through marginalization over the parameters, \begin{equation} p(\mathbf{y} \mid \mathbf{x}) = \int p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) p(\boldsymbol{\theta}) \, \mathrm{d}\boldsymbol{\theta},\label{eq:intract} \end{equation} but this integral quickly becomes computationally infeasible as the number of parameters in our model grows. Since we are ultimately interested in the likelihoods, it can make sense to simply maximize \( p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \) directly with Maximum Likelihood Estimation (MLE) as

$$ \hat{\boldsymbol{\theta}}_{\text{MLE}} = \arg\max_{\boldsymbol{\theta}} \, \mathbb{E}_{ P_{\mathbf{Y} \mid \mathbf{X}}} \left[ \log p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \right], $$

completely omitting the infeasible integral. A common approach is to interpret the output of a function as the parameters of a Probability Mass Function (PMF). To model $p(\mathbf{y}\mid\mathbf{x}, \boldsymbol{\theta})$, we use a function $f_{\boldsymbol{\theta}}:\mathbb{R}^{C\times D}\rightarrow\mathbb{R}^{K\times D}$ that infers the parameters of the PMF over the $K$ classes at each of the $D$ spatial locations. For instance, consider a CNN with output $\mathbf{a}=f_{\boldsymbol{\theta}}(\mathbf{x})$. Then, we can write

$$ p(\mathbf{y}=k\,\vert\,\mathbf{x},\boldsymbol{\theta})=\frac{e^{\mathbf{a}_k}}{\sum_{k'} e^{\mathbf{a}_{k'}}}, $$

with channel-wise indexing over the denominator, which is commonly known as the SoftMax activation.
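A minimal sketch of this activation, assuming a hypothetical logit vector for a single location with $K=3$ classes:

```python
import numpy as np

def softmax(a):
    """Numerically stable softmax over the class axis."""
    a = a - a.max(axis=-1, keepdims=True)  # shift logits; the result is unchanged
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical logits a = f_theta(x) for K = 3 classes.
logits = np.array([2.0, 0.5, -1.0])
p = softmax(logits)
print(p, p.sum())  # a valid PMF: non-negative entries summing to 1
```

The max-subtraction exploits the shift invariance of the softmax to avoid overflow in `exp` for large logits.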


The SoftMax activation to approximate the class probability mass function.

Hence, this approach is probabilistic modeling in the technical sense, although it is not referred to as such in common nomenclature. It is also possible to incorporate prior knowledge on the parameter distribution. This is known as Maximum a Posteriori (MAP) estimation and is denoted as

$$ \hat{\boldsymbol{\theta}}_{\text{MAP}} = \arg\max_{\boldsymbol{\theta}} \, \mathbb{E}_{P_{\mathbf{Y} \mid \mathbf{X}}} \left[ \log p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \right] + \log p(\boldsymbol{\theta}). $$

This can already be enforced with regularization on the parameters. For example, penalizing the L1-norm of the weights corresponds to a Laplacian prior on the weight distribution, while L2-norm regularization corresponds to a Gaussian prior. You are likely more familiar with MLE for likelihood modeling than you think. In fact, standard softmax-activated neural networks are a direct implementation of the theory described above: the network’s output is normalized to form a valid probability distribution over the classes, and the objective function then maximizes the likelihood assigned to the correct class index.
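To see the L2 correspondence concretely, here is a small sketch (with a hypothetical weight-decay strength `lam`) showing that the L2 penalty and the exact Gaussian negative log-prior differ only by a constant that does not depend on the parameters, so they share the same optimum:

```python
import numpy as np

lam = 0.01  # weight-decay strength; corresponds to a prior variance of 1/lam

def neg_log_gaussian_prior(theta):
    """-log N(theta; 0, (1/lam) I): the exact negative log of a Gaussian prior."""
    d = theta.size
    return 0.5 * lam * np.sum(theta ** 2) + 0.5 * d * np.log(2 * np.pi / lam)

def l2_penalty(theta):
    """The usual weight-decay term added to the training loss."""
    return 0.5 * lam * np.sum(theta ** 2)

# The difference is theta-independent, so both objectives have the same argmin:
# L2 regularization is Gaussian MAP in disguise.
t1, t2 = np.array([0.5, -1.2]), np.array([3.0, 0.1])
c1 = neg_log_gaussian_prior(t1) - l2_penalty(t1)
c2 = neg_log_gaussian_prior(t2) - l2_penalty(t2)
print(c1, c2)  # equal: the leftover constant does not depend on theta
```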

The KL divergence

In parametric estimation, we do not learn the posterior directly. Instead, we select a parametric family of distributions $\mathcal{P} = \{ p_{\boldsymbol{\theta}} \mid \boldsymbol{\theta} \in \Theta \}$ and seek the parameter vector $\boldsymbol{\theta}$ that best approximates $P_{\mathbf{Y}|\mathbf{X}}$. In essence, we are looking to minimize the statistical distance between the true and estimated distributions, which is characterized by the Kullback-Leibler (KL) divergence (or relative entropy), described as

$$ D_{\text{KL}}[P \parallel Q] = \mathbb{E}_{P_\mathbf{X}} \left[ \log \frac{P(\mathbf{x})}{Q(\mathbf{x})} \right] = \sum_{\mathbf{x}} P(\mathbf{x}) \log \frac{P(\mathbf{x})}{Q(\mathbf{x})}, $$

describing the information lost when a true distribution $P$ is approximated with a modeled $Q$. The KL divergence has some crucial properties, such as non-negativity, asymmetry ($D_{\text{KL}}[P \parallel Q] \neq D_{\text{KL}}[Q \parallel P]$), and convexity. In particular, the asymmetry of the KL divergence results in an interesting design choice. The forward KL, $D_{\text{KL}}[P \parallel Q]$, is mode-covering: it penalizes $Q$ for assigning low probability to an outcome $\mathbf{x}$ that is likely under $P$. The reverse KL, $D_{\text{KL}}[Q \parallel P]$, is mode-seeking: it penalizes $Q$ for generating samples $\mathbf{x}$ that are unlikely under $P$.
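The asymmetry is easy to verify numerically. A minimal sketch with hypothetical discrete distributions, where a unimodal $Q$ approximates a bimodal $P$:

```python
import numpy as np

def kl(p, q):
    """Discrete KL divergence D_KL[p || q] in nats."""
    return np.sum(p * np.log(p / q))

# A bimodal "true" distribution P and a unimodal approximation Q that
# concentrates on one mode (illustrative numbers).
P = np.array([0.49, 0.02, 0.49])
Q = np.array([0.90, 0.05, 0.05])

forward = kl(P, Q)   # mode-covering: punishes Q for missing P's second mode
reverse = kl(Q, P)   # mode-seeking: punishes Q for mass where P is small
print(forward, reverse)  # the two directions disagree: KL is asymmetric
```

Here the forward KL is the larger of the two, since $Q$ ignores an entire mode of $P$; a mode-covering objective would push $Q$ to spread out over both modes.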


Mode-covering vs. mode-seeking behavior of the forward and reverse KL-divergence

Considering our classification problem, minimizing the KL divergence between the true and approximated model distribution can be expressed as

$$ \begin{align} D_{\text{KL}} [ P_{\mathbf{Y} \mid \mathbf{X}} \,\|\, p_{\mathbf{Y} \mid \mathbf{X}}^{\boldsymbol{\theta}} ] &= \int \log \frac{P(\mathbf{y} \mid \mathbf{x})}{p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})} P(\mathbf{y} \mid \mathbf{x}) \, \mathrm{d} \mathbf{y} \\ &= \mathbb{E}_{P_{\mathbf{Y} \mid \mathbf{X}}} \left[ \log \frac{P(\mathbf{y} \mid \mathbf{x})}{p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta})} \right] \\ &= \mathbb{E}_{P_{\mathbf{Y} \mid \mathbf{X}}} \left[ \log P(\mathbf{y} \mid \mathbf{x}) - \log p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) \right]. \end{align} $$

The term \( P(\mathbf{y} \mid \mathbf{x}) \) is constant with respect to $\boldsymbol{\theta}$. Hence, minimizing the KL divergence is equivalent to maximizing the likelihood. The Law of Large Numbers guarantees that the empirical log-likelihood converges to the true expected log-likelihood, i.e.,

$$ \mathbb{E}_{P_{\mathbf{Y}|\mathbf{X}}} [\log p(\mathbf{y}\mid \mathbf{x}, \boldsymbol{\theta})] \approx \frac{1}{N} \sum_{n=1}^{N} \log p(\mathbf{y}_n\mid \mathbf{x}_n, \boldsymbol{\theta}). $$

This confirms that unknown distributions can be effectively approximated using sample-based MLE. In that sense, MLE can be viewed as minimizing an empirical estimate of the cross-entropy between the data-generating distribution and the model family. When a prior $p(\boldsymbol{\theta})$ is included, MAP corresponds to minimizing the same empirical objective augmented with a regularization term $\frac{1}{N} \log p(\boldsymbol{\theta})$. Finally, if the true conditional $P_{\mathbf{Y}\mid\mathbf{X}}$ is not contained in the model class, the estimator converges (in the large-sample limit) to the parameters $\boldsymbol{\theta}^*$ that yield the closest approximation in KL divergence, i.e., the best-in-class model.
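The cross-entropy view is visible directly in code: with one-hot targets, the familiar cross-entropy loss is exactly the negative empirical mean log-likelihood. A small sketch with hypothetical softmax outputs for $N=3$ samples over $K=3$ classes:

```python
import numpy as np

# Hypothetical softmax outputs, shape (N, K), with one correct class per row.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.2, 0.6]])
labels = np.array([0, 1, 2])   # correct class indices
onehot = np.eye(3)[labels]     # one-hot encoded ground truth

# Cross-entropy with one-hot targets...
cross_entropy = -np.mean(np.sum(onehot * np.log(probs), axis=1))
# ...is exactly the negative empirical mean log-likelihood: the one-hot
# vector selects the log-probability of the correct class in each row.
nll = -np.mean(np.log(probs[np.arange(3), labels]))
print(cross_entropy, nll)  # identical values
```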

Closing thoughts

In reality, probabilistic models retain what we typically assume away. Rather than collapsing uncertainty into point estimates, they maintain full distributions over quantities of interest. Of course, treating every variable as a continuous distribution is rarely feasible. As we have shown, a common and practical middle ground is to keep parameters as point estimates while modeling the output as a distribution. However, in our examples so far, we have only learned conditional class distributions $p(\mathbf{y} \mid \mathbf{x})$. But what if we want to go further and learn the full data distribution $p(\mathbf{x})$ itself? This is a fundamentally harder problem, and tackling it requires introducing latent variables, which are hidden representations that capture the underlying structure of the data. Instead of modeling the observations directly, we learn a compact representation and a mapping from this latent space back to data space. This will be treated in another blog post.