Probabilistic Modeling
Statistical modeling is the practice of describing real-world data using mathematical models with unknown parameters. In machine learning, statistical models are often expressed in probabilistic form, meaning we assume data is generated from a probability distribution. Probabilistic modeling is therefore the practice of describing how data is generated using probability distributions. Instead of treating observations as deterministic, we assume they are generated by an underlying random process, often with noise and uncertainty.
Info
The following sources were consulted in preparing this material:
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 3: Probability and Information Theory.
- Grosse, R. (2020). Lecture 7: Probabilistic Models. CSC 311: Introduction to Machine Learning, University of Toronto.
Important
Some concepts in this material are simplified for pedagogical purposes. These simplifications slightly reduce precision but preserve the core ideas relevant to deep learning.
Likelihood
In probability, we often write expressions like \(p(y \mid \theta)\), where \(\theta\) is a parameter of a model, and \(y\) is a possible outcome. The same expression can be interpreted in two different ways: as a probability or as a likelihood. The likelihood is not a different formula — it is the same function, interpreted differently.
- Probability treats \(\theta\) as fixed and \(y\) as uncertain. It answers: If the model parameter is \(\theta\), how likely is outcome \(y\)?
- Likelihood treats \(y\) as fixed (because we already observed it) and views the same expression as a function of \(\theta\). It answers: Given the observed outcome \(y\), which values of \(\theta\) make it most plausible?
For continuous variables, \(p(y\mid\theta)\) is a probability density rather than a probability. The likelihood function is defined as: $$ L(\theta \mid y) = p(y \mid \theta). $$
Important
Likelihood is not a probability distribution over \(\theta\). In general: $$ \int L(\theta \mid y)\,d\theta \ne 1. $$ Likelihood values only measure relative support for different values of \(\theta\).
Example
A coin toss can be modeled as \(Y \sim \mathrm{Bernoulli}(\theta),\) where \(Y \in \{0,1\}\), and \(\theta\) is the probability of observing \(Y=1\) (e.g., heads). If the coin is weighted, then \(\theta \ne 0.5\). The probability of observing outcome \(y\) is: $$ p(y \mid \theta) = \theta^y(1-\theta)^{1-y}. $$
If \(\theta\) is fixed, this is a probability statement about the random outcome \(Y\). But after observing \(y\), the same expression becomes a likelihood function of \(\theta\): $$ L(\theta \mid y) = \theta^y(1-\theta)^{1-y}. $$
For example, if we observe \(y=1\) (heads), then: $$ L(\theta \mid y=1) = \theta, $$ which is maximized at \(\theta=1\). If we observe \(y=0\) (tails), then: $$ L(\theta \mid y=0) = 1-\theta, $$ which is maximized at \(\theta=0\).
In practice, likelihood values can become extremely small, because they often involve multiplying many probabilities. For this reason, we usually work with the log-likelihood: $$ \log L(\theta \mid y) = \log p(y \mid \theta). $$
Note
Log-likelihood is used because it turns products into sums. If we assume i.i.d. samples \(y^{(1)},\dots,y^{(m)}\), then: $$ p(y^{(1)},\dots,y^{(m)} \mid \theta) = \prod_{i=1}^{m} p(y^{(i)} \mid \theta), $$ so the log-likelihood becomes: $$ \log p(y^{(1)},\dots,y^{(m)} \mid \theta) = \sum_{i=1}^{m} \log p(y^{(i)} \mid \theta). $$
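As a small sketch of this sum, the log-likelihood of several i.i.d. Bernoulli samples can be computed directly (the helper name and sample data are illustrative):

```python
import math

def bernoulli_log_likelihood(samples, theta):
    """Sum of per-sample Bernoulli log-likelihoods: sum_i log p(y_i | theta)."""
    return sum(y * math.log(theta) + (1 - y) * math.log(1 - theta) for y in samples)

samples = [1, 0, 1, 1, 0, 1]  # six coin tosses, four heads
print(bernoulli_log_likelihood(samples, 0.5))
print(bernoulli_log_likelihood(samples, 4 / 6))  # higher (less negative): matches the data better
```

A theta matching the observed fraction of heads yields a higher (less negative) log-likelihood than the fair-coin value.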
Maximum Likelihood Estimation
Once we choose a probabilistic model \(p(x \mid \theta)\), the main question becomes: which parameter values \(\theta\) best explain the observed dataset? Given a dataset of \(m\) samples: $$ D = \{x^{(1)},x^{(2)},\dots,x^{(m)}\}, $$ the likelihood of the dataset is: $$ p(D \mid \theta) = \prod_{i=1}^{m} p(x^{(i)} \mid \theta), $$ assuming the samples are i.i.d. Maximum likelihood estimation (MLE) chooses the parameter values that maximize this likelihood: $$ \hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \prod_{i=1}^{m} p(x^{(i)} \mid \theta). $$
With log-likelihood, the formula becomes: $$ \hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta \sum_{i=1}^{m} \log p(x^{(i)} \mid \theta). $$
Because the logarithm is monotonic, maximizing likelihood and maximizing log-likelihood produce the same solution.
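To see this concretely, a minimal sketch that grid-searches the Bernoulli log-likelihood recovers the sample mean as the maximizer (the grid resolution and data are illustrative):

```python
import math

samples = [1, 0, 1, 1, 0, 1]  # six coin tosses, four heads

def log_likelihood(theta):
    """Log-likelihood of the i.i.d. Bernoulli samples at parameter theta."""
    return sum(y * math.log(theta) + (1 - y) * math.log(1 - theta) for y in samples)

# Search a fine grid of theta values in (0, 1) for the maximizer.
thetas = [i / 1000 for i in range(1, 1000)]
theta_mle = max(thetas, key=log_likelihood)
print(theta_mle)  # 0.667, i.e. the fraction of heads 4/6
```

For the Bernoulli model the MLE has a closed form (the sample mean); the grid search simply confirms it numerically.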
Note
MLE is the standard statistical justification behind many deep learning loss functions. In practice, training often means choosing parameters \(\theta\) so that the observed dataset becomes as likely as possible under the model.
Negative Log-Likelihood
In deep learning, we usually convert MLE into a minimization problem. This leads to the negative log-likelihood loss: $$ \mathcal{L}(\theta) = -\log p(D\mid \theta). $$
Minimizing negative log-likelihood is equivalent to maximizing likelihood, so negative log-likelihood is the most common probabilistic form of a training objective.
Binary Cross-Entropy
In binary classification, the label is \(y \in \{0,1\}\) and the model predicts the probability of the positive class: $$ \hat{p} = p_\theta(Y=1 \mid x), $$ where \(\theta\) represents the model parameters (weights).
Under a Bernoulli model, the likelihood of observing \(y\) is: $$ p_\theta(y\mid x) = \hat{p}^{\,y}(1-\hat{p})^{1-y}. $$
Taking the negative logarithm gives the binary cross-entropy loss: $$ \mathcal{L}_{\mathrm{BCE}}(y,\hat{p}) = -\big(y\log\hat{p} + (1-y)\log(1-\hat{p})\big). $$
So binary cross-entropy is exactly the negative log-likelihood of a Bernoulli model.
Example
Suppose the true label is \(y=1\) (positive class). If the model predicts \(\hat{p}=0.9\), then: $$ \mathcal{L}_{\mathrm{BCE}} = -\log(0.9) \approx 0.105. $$
If the model predicts \(\hat{p}=0.1\), then: $$ \mathcal{L}_{\mathrm{BCE}} = -\log(0.1) \approx 2.303. $$
The loss is small when the model assigns high probability to the correct label, and large when it assigns low probability.
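This behavior can be checked with a minimal sketch of the Bernoulli negative log-likelihood (the function name is illustrative):

```python
import math

def bce(y, p_hat):
    """Binary cross-entropy: -(y*log(p_hat) + (1-y)*log(1-p_hat))."""
    return -(y * math.log(p_hat) + (1 - y) * math.log(1 - p_hat))

print(round(bce(1, 0.9), 3))  # 0.105 — confident and correct
print(round(bce(1, 0.1), 3))  # 2.303 — confident and wrong
```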
Categorical Cross-Entropy
In multiclass classification with \(k\) classes, the label is: $$ y \in \{1,2,\dots,k\}. $$
The model outputs a probability vector \(\hat{p}\in\mathbb{R}^k\). The likelihood of observing class \(y\) is the probability assigned to that class: $$ p(y\mid x) = \hat{p}_y. $$
Therefore, the negative log-likelihood becomes: $$ \mathcal{L}(y,\hat{p}) = -\log(\hat{p}_y). $$
If we represent the label as a one-hot vector \(e_y\), the same loss can be written as: $$ \mathcal{L}(y,\hat{p}) = -\sum_{i=1}^k e_{y,i}\log(\hat{p}_i). $$
So categorical cross-entropy is exactly the negative log-likelihood of a categorical distribution.
Note
Cross-entropy loss is widely used because minimizing negative log-likelihood is equivalent to maximizing the likelihood of the observed labels under the model. In practice, neural networks usually output logits (unnormalized scores) rather than probabilities. Cross-entropy is computed using numerically stable implementations that combine softmax and log into a single operation: $$ \mathcal{L}(y,z) = -\log\left(\mathrm{softmax}(z)_y\right), $$ where \(z \in \mathbb{R}^k\) is the logits vector.
Example
Suppose we have \(k=3\) classes and the true class is \(y=2\). If the model predicts \(\hat{p}=(0.1,0.8,0.1)\), then: $$ \mathcal{L} = -\log(0.8) \approx 0.223. $$
If the model predicts \(\hat{p}=(0.4,0.2,0.4)\), then: $$ \mathcal{L} = -\log(0.2) \approx 1.609. $$
The loss increases sharply when the model assigns low probability to the correct class.
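The numerically stable logits form mentioned above can be sketched with the log-sum-exp trick (the function name and logit values are illustrative):

```python
import math

def cross_entropy_from_logits(logits, y):
    """Stable -log(softmax(logits)[y]): log-sum-exp(z) - z_y, shifted by max(z)."""
    m = max(logits)  # subtracting the max avoids overflow in exp
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[y]

logits = [1.0, 3.0, 1.0]  # softmax ≈ (0.106, 0.787, 0.106)
print(round(cross_entropy_from_logits(logits, 1), 3))
```

Computing the loss this way never forms the softmax probabilities explicitly, which is why frameworks fuse softmax and log into one operation.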
Bayesian Inference
MLE treats the model parameters \(\theta\) as fixed but unknown. In Bayesian inference, we instead treat \(\theta\) as a random variable and represent uncertainty about its value using probability distributions. Before observing data, we express our belief about \(\theta\) as a prior distribution \(p(\theta)\). After observing a dataset \(D\), we update this belief using Bayes' rule: $$ p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}. $$
Here:
- \(p(\theta)\) is the prior, representing our belief about \(\theta\) before seeing data.
- \(p(D\mid\theta)\) is the likelihood, measuring how well \(\theta\) explains the observed data.
- \(p(\theta\mid D)\) is the posterior, representing our updated belief after seeing data.
- \(p(D)\) is the marginal likelihood (or evidence), which normalizes the posterior: $$ p(D) = \int p(D \mid \theta)\,p(\theta)\,d\theta. $$
Note
Bayesian inference provides a way to combine prior assumptions with observed data. Instead of producing a single best estimate of parameters, it produces a full probability distribution over plausible parameter values. When the dataset is small, the posterior remains broad. As more data is observed, the posterior typically becomes more concentrated around parameter values that explain the data well.
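As a concrete sketch of posterior updating, assume a Beta prior over the Bernoulli coin parameter (the Beta-Bernoulli conjugate pair is an illustrative choice, not required by the discussion above). The posterior then has a closed form:

```python
# With prior Beta(a, b) and h heads out of n tosses, the posterior over the
# Bernoulli parameter is Beta(a + h, b + n - h) — a simple parameter shift.
def posterior_params(a, b, tosses):
    h = sum(tosses)
    return a + h, b + len(tosses) - h

a, b = posterior_params(2, 2, [1, 0, 1, 1, 0, 1])  # 4 heads, 2 tails
print(a, b)  # Beta(6, 4); posterior mean = 6 / 10 = 0.6
```

With only six tosses the posterior is still broad; adding more data shifts the parameters further and concentrates the distribution, as the note describes.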
Maximum A Posteriori Estimation
MLE chooses parameters \(\theta\) that maximize the likelihood of the observed dataset: $$ \hat{\theta}_{\mathrm{MLE}} = \arg\max_\theta p(D \mid \theta). $$
In Bayesian inference, we instead compute the posterior distribution: $$ p(\theta \mid D) = \frac{p(D \mid \theta)\,p(\theta)}{p(D)}. $$
Maximum a posteriori estimation (MAP) chooses the parameter value that maximizes the posterior: $$ \hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta p(\theta \mid D). $$
Since \(p(D)\) does not depend on \(\theta\), maximizing the posterior is equivalent to: $$ \hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta p(D \mid \theta)\,p(\theta). $$
Taking the logarithm gives the common optimization form: $$ \hat{\theta}_{\mathrm{MAP}} = \arg\max_\theta \Big( \log p(D \mid \theta) + \log p(\theta) \Big). $$
MAP can be seen as MLE with an additional term \(\log p(\theta)\) that encourages parameter values that are consistent with the prior.
Example
Consider linear regression with Gaussian noise: $$ y = w^\top x + \epsilon, \qquad \epsilon \sim \mathcal{N}(0,\sigma^2). $$
This implies the likelihood: $$ p(D \mid w) \propto \exp\left( -\frac{1}{2\sigma^2}\sum_{i=1}^m (y^{(i)} - w^\top x^{(i)})^2 \right). $$
Maximizing this likelihood (MLE) is equivalent to minimizing the mean squared error. Now assume a Gaussian prior over weights: $$ w \sim \mathcal{N}(0,\tau^2 I). $$
The MAP objective becomes: $$ \hat{w}_{\mathrm{MAP}} = \arg\max_w \Big( \log p(D \mid w) + \log p(w) \Big). $$
Since the Gaussian prior contributes a penalty term proportional to \(\|w\|^2\), MAP becomes equivalent to minimizing: $$ \sum_{i=1}^m (y^{(i)} - w^\top x^{(i)})^2 + \lambda \|w\|^2. $$
This is exactly \(L_2\) regularization (ridge regression). Therefore, MAP estimation provides a probabilistic justification for weight decay in deep learning.
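The equivalence can be checked numerically with the closed-form ridge solution (the synthetic data and the value of \(\lambda\) are illustrative):

```python
import numpy as np

# MAP with a Gaussian prior = ridge regression, with closed-form solution
# w = (X^T X + lam * I)^{-1} X^T y, where lam plays the role of sigma^2 / tau^2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=50)

lam = 1.0
w_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)  # ridge / MAP
w_mle = np.linalg.solve(X.T @ X, X.T @ y)                    # ordinary least squares / MLE
print(np.linalg.norm(w_map), np.linalg.norm(w_mle))  # the prior shrinks the weights
```

The MAP weights always have a smaller norm than the MLE weights, which is exactly the effect of weight decay.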
Latent Variable Models
Quote
But the latent process of which we speak, is far from being obvious to men’s minds, beset as they now are. For we mean not the measures, symptoms, or degrees of any process which can be exhibited in the bodies themselves, but simply a continued process, which, for the most part, escapes the observation of the senses. ~ Francis Bacon (Novum Organum)
In many real-world problems, the observed data \(x\) is influenced by hidden (latent) factors that we do not directly measure. A latent variable model introduces an unobserved random variable \(z\) to represent this hidden structure. The model assumes that data is generated in two steps:
- Sample a latent variable: \(z \sim p(z).\)
- Generate the observation conditioned on \(z\): \(x \sim p(x \mid z).\)
Together, this defines the joint distribution: $$ p(x,z) = p(x \mid z)\,p(z). $$
Since \(z\) is not observed, the probability of an observation \(x\) is obtained by marginalizing over all possible latent values: $$ p(x) = \int p(x,z)\,dz = \int p(x \mid z)\,p(z)\,dz. $$
Note
In practice, latent variable models are powerful because they can represent complex data-generating processes, such as clustering, hidden states, or abstract representations. However, they are also more difficult to train, because computing the marginal likelihood often requires intractable integration (or summation) over latent variables. Many important machine learning models can be viewed as latent variable models, including mixture models, hidden Markov models (HMMs), and variational autoencoders (VAEs).
Mixture Models
In the probability section, we introduced mixture distributions as weighted combinations of simpler distributions: $$ p(x) = \sum_{k=1}^{K} \pi_k\,p_k(x), \qquad \pi_k \ge 0, \qquad \sum_{k=1}^{K} \pi_k = 1. $$
In probabilistic modeling, the same idea is interpreted as a latent variable model. We assume that each data point was generated by one of \(K\) hidden components. This is modeled by introducing a latent variable: $$ Z \in \{1,2,\dots,K\}, $$ which represents the unknown component assignment. The generative process is:
- Sample a component index: \(Z \sim \mathrm{Categorical}(\pi_1,\dots,\pi_K).\)
- Sample the observation from the chosen component: \(X \sim p(x \mid Z=k).\)
The marginal distribution of \(X\) is therefore: $$ p(x) = \sum_{k=1}^{K} \pi_k\,p(x \mid Z=k). $$
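The two-step generative process can be sketched directly (the Gaussian components and mixture weights are illustrative):

```python
import random

def sample_mixture(pis, mus, sigmas):
    """Sample once from a 1-D Gaussian mixture: first Z, then X | Z."""
    k = random.choices(range(len(pis)), weights=pis)[0]  # Z ~ Categorical(pi)
    return random.gauss(mus[k], sigmas[k])               # X ~ p(x | Z = k)

random.seed(0)
xs = [sample_mixture([0.3, 0.7], [-2.0, 3.0], [0.5, 0.5]) for _ in range(5)]
print(xs)
```

Each draw discards the component index \(Z\), so an observer of the samples sees only the marginal distribution \(p(x)\).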
Note
The key modeling idea is that \(Z\) is not observed. Learning a mixture model means learning both the component distributions and the hidden assignments of data points.
Expectation-Maximization
Many probabilistic models contain latent variables, such as mixture models. In these models, the likelihood involves marginalizing over hidden variables. If the latent variable \(z\) is discrete, the likelihood contains a sum: $$ p(D \mid \theta) = \prod_{i=1}^{m} \sum_{z^{(i)}} p(x^{(i)}, z^{(i)} \mid \theta). $$
If the latent variable \(z\) is continuous, the likelihood contains an integral: $$ p(D \mid \theta) = \prod_{i=1}^{m} \int p(x^{(i)}, z^{(i)} \mid \theta)\,dz^{(i)}. $$
Directly maximizing this likelihood is often difficult, because of the intractable sum (or integral) over latent variables. The Expectation-Maximization (EM) algorithm is an iterative method for maximum likelihood estimation in latent variable models. It alternates between estimating the latent variables (softly) and updating the parameters.
In the E-step (Expectation), we compute the posterior distribution of the latent variable given the current parameters: $$ p(z^{(i)} \mid x^{(i)}, \theta^{(t)}). $$
This gives a soft assignment of each data point to latent states or mixture components.
In the M-step (Maximization), we update the parameters by maximizing the expected log-likelihood under these soft assignments: $$ \theta^{(t+1)} = \arg\max_\theta \sum_{i=1}^{m} \mathbb{E}_{z^{(i)} \sim p(z^{(i)} \mid x^{(i)},\,\theta^{(t)})}\Big[\log p(x^{(i)}, z^{(i)} \mid \theta)\Big]. $$
This step improves the likelihood by fitting the model parameters to the expected latent structure. EM repeats these two steps until convergence. Each iteration is guaranteed not to decrease the data likelihood.
Example
EM is most commonly associated with Gaussian Mixture Models (GMMs). Consider a GMM with \(K\) Gaussian components: $$ p(x) = \sum_{k=1}^{K} \pi_k\,\mathcal{N}(x;\mu_k,\Sigma_k). $$
Each data point \(x^{(i)}\) is assumed to be generated by an unknown component \(z^{(i)} \in \{1,\dots,K\}\).
E-step. Compute the posterior probability that point \(x^{(i)}\) belongs to component \(k\): $$ \gamma_{ik} = p(z^{(i)}=k \mid x^{(i)},\theta) = \frac{ \pi_k\,\mathcal{N}(x^{(i)};\mu_k,\Sigma_k) }{ \sum_{j=1}^{K} \pi_j\,\mathcal{N}(x^{(i)};\mu_j,\Sigma_j) }. $$
The values \(\gamma_{ik}\) are called responsibilities, and they act as soft cluster assignments.
M-step. Update the parameters using these responsibilities: $$ N_k = \sum_{i=1}^{m} \gamma_{ik}, \qquad \pi_k = \frac{N_k}{m}, $$ $$ \mu_k = \frac{1}{N_k}\sum_{i=1}^{m} \gamma_{ik} x^{(i)}, $$ $$ \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{m} \gamma_{ik} (x^{(i)}-\mu_k)(x^{(i)}-\mu_k)^\top. $$
EM repeats these steps until the parameters converge. Intuitively, the E-step estimates soft cluster memberships, and the M-step recomputes the cluster parameters based on those memberships.
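The E- and M-step updates above can be sketched for a 1-D, two-component GMM (the initialization and synthetic data are illustrative):

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_step(xs, pis, mus, vars_):
    K, m = len(pis), len(xs)
    # E-step: responsibilities gamma[i][k] = p(z_i = k | x_i, theta)
    gamma = []
    for x in xs:
        w = [pis[k] * normal_pdf(x, mus[k], vars_[k]) for k in range(K)]
        s = sum(w)
        gamma.append([wk / s for wk in w])
    # M-step: re-estimate pi_k, mu_k, var_k from the soft assignments
    N = [sum(gamma[i][k] for i in range(m)) for k in range(K)]
    pis = [N[k] / m for k in range(K)]
    mus = [sum(gamma[i][k] * xs[i] for i in range(m)) / N[k] for k in range(K)]
    vars_ = [sum(gamma[i][k] * (xs[i] - mus[k]) ** 2 for i in range(m)) / N[k]
             for k in range(K)]
    return pis, mus, vars_

# Synthetic data from two well-separated Gaussians
random.seed(0)
xs = [random.gauss(-2, 1) for _ in range(200)] + [random.gauss(3, 1) for _ in range(200)]
pis, mus, vars_ = [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0]
for _ in range(30):
    pis, mus, vars_ = em_step(xs, pis, mus, vars_)
print(sorted(round(mu, 1) for mu in mus))  # close to the true means -2 and 3
```

Thirty iterations are more than enough here; in practice one monitors the log-likelihood and stops when it plateaus.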
Structured Probabilistic Models
In many deep learning problems, we work with multiple random variables. Modeling the full joint distribution \(p(x_1,\dots,x_n)\) directly is often impractical, because the number of possible interactions grows rapidly with \(n\). Instead, we exploit the fact that most variables interact only with a small subset of others.
A structured probabilistic model (also called a graphical model) represents a joint probability distribution using a graph. Each node represents a random variable, and edges represent direct probabilistic dependencies.
In a directed graphical model (also called a Bayesian network), edges are arrows that indicate conditional dependence. A Bayesian network must be a directed acyclic graph (DAG) (it cannot contain directed cycles). The joint distribution factorizes into conditional probabilities: $$ p(x_1,\dots,x_n) = \prod_{i=1}^{n} p(x_i \mid \mathrm{Pa}(x_i)), $$ where \(\mathrm{Pa}(x_i)\) denotes the parents of \(x_i\) in the graph.
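The factorization can be illustrated with a tiny two-variable network \(A \to B\) (the conditional probability table values are made up for illustration):

```python
# Joint distribution of a two-node Bayesian network A -> B:
# p(a, b) = p(a) * p(b | a), read off from two small tables.
p_a = {0: 0.6, 1: 0.4}
p_b_given_a = {0: {0: 0.9, 1: 0.1},
               1: {0: 0.3, 1: 0.7}}

def joint(a, b):
    return p_a[a] * p_b_given_a[a][b]

total = sum(joint(a, b) for a in (0, 1) for b in (0, 1))
print(total)  # sums to 1 (up to floating-point rounding)
```

Because each factor is a proper conditional distribution, the product is automatically normalized, with no partition function needed.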
In an undirected graphical model (Markov Random Field), edges do not have direction. Instead of conditional probabilities, the distribution is represented using non-negative functions called potential functions. The joint distribution is written as: $$ p(x) = \frac{1}{Z} \prod_{i=1}^{m} \phi^{(i)}(C^{(i)}), $$ where each \(\phi^{(i)}\) is a potential function over a clique \(C^{(i)}\), and \(Z\) is a normalizing constant called the partition function. Unlike conditional probabilities, potential functions are not required to sum to \(1\). They only need to be non-negative.
Note
In an undirected graph, edges do not represent causal direction. Instead, an edge between two nodes indicates that the variables directly interact.
Generative vs Discriminative Modeling
Probabilistic models are often divided into two categories. A generative model describes how the data is generated by modeling the joint distribution \(p(x,y)\). Using the joint distribution, we can answer different types of questions, such as generating new samples \(x\) or performing classification using Bayes' rule. Examples of generative models include Naive Bayes, GMMs, HMMs, and VAEs.
A discriminative model focuses directly on predicting the target variable from the input by modeling: \(p(y \mid x),\) or by learning a direct decision function. Discriminative models do not attempt to model the full data distribution \(p(x)\), and are often preferred when the main goal is prediction. Examples include logistic regression, support vector machines (SVMs), and neural networks trained with cross-entropy.
Note
Generative models are typically more flexible and can be used for sampling and missing-data problems, but they can be harder to train. Discriminative models often achieve better performance in supervised prediction tasks when large labeled datasets are available.
In this page, we introduced probabilistic modeling as a way to describe how data can be generated from probability distributions. We saw how the likelihood function turns observed data into a tool for choosing good model parameters. This leads to maximum likelihood estimation, and to negative log-likelihood, which is the basis of many loss functions used in deep learning. We also introduced Bayesian inference, where model parameters are treated as uncertain and described using probability distributions. MAP estimation was presented as a practical way to combine the likelihood with a prior. Finally, we discussed latent variable models, mixture models, and the Expectation-Maximization algorithm, which are useful when data is generated by hidden factors. We also introduced structured probabilistic models, which use graphs to represent dependencies between random variables, and explained the difference between generative and discriminative modeling.