Maximum Likelihood Estimation: How It Works

MLE & MAP

One concept that I consistently ran into early on when studying machine learning, yet often felt like it wasn't given the proper treatment, was maximum likelihood estimation (MLE). Given a model with learnable parameters $\theta$ and training data $X$ sampled from an unknown discrete distribution, the maximum likelihood estimate is defined by:

$$\begin{equation} \label{MLE} \hat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmax}} \ P(X \ | \ \theta) \end{equation}$$

where $P(X \ | \ \theta)$ is our model, representing our estimate for the probability of observing $X$. It’s important to understand what exactly is going on here. Our goal is to find the optimal parameters $\theta$ where our model best fits the distribution that $X$ is sampled from. In other words, we want to find $\theta$ such that $X$ is most likely to have been generated by our model.
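To make this concrete, here is a minimal sketch in Python (using NumPy, with a hypothetical coin-flip dataset) that brute-forces the $\text{argmax}$ over a grid of candidate Bernoulli parameters:

```python
import numpy as np

# Hypothetical dataset: 100 coin flips drawn from an unknown Bernoulli
# distribution (true bias 0.7, which the estimator does not get to see).
rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=100)

def likelihood(theta, X):
    # P(X | theta) for a Bernoulli model: product of per-flip probabilities.
    return np.prod(theta ** X * (1 - theta) ** (1 - X))

# Brute-force the argmax over a grid of candidate parameters.
thetas = np.linspace(0.01, 0.99, 99)
theta_mle = thetas[np.argmax([likelihood(t, X) for t in thetas])]
print(theta_mle)  # lands near X.mean(), the analytic MLE for a Bernoulli
```

For a Bernoulli model the grid search is overkill (the MLE is just the sample mean), but the same maximize-over-$\theta$ pattern carries over to models with no closed-form solution.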

An alternative question is: what model parameters $\theta$ are most probable given $X$? This method is known as maximum a posteriori estimation (MAP), which instead defines the optimal parameters as:

$$\begin{equation} \nonumber \hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \ P(\theta \ | \ X)\end{equation}$$

Applying Bayes’ Theorem to the above gives us:

$$\begin{equation} \nonumber \hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \ \frac{P(X \ | \ \theta )P(\theta)}{P(X)} \end{equation}$$

Since our dataset $X$ never changes throughout training and we are maximizing only over $\theta$, the term $P(X)$ is a constant, which we can ignore by a property of the $\text{argmax}$:

$$\begin{equation} \label{step_3} \hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \ P(X \ | \ \theta)P(\theta) \end{equation}$$

MAP is commonly used in the above form when we have some information about the prior distribution $P(\theta)$. In many situations, however, we have no information about the prior and no reason to believe it should follow any particular pattern, so we may assume $\theta \sim \text{Uniform}$. This is sometimes (controversially) referred to as an “uninformative prior”. Under this assumption, $P(\theta)$ is also a constant, which by the same property of the $\text{argmax}$ implies:

$$\begin{equation} \nonumber \hat{\theta}_{\text{MAP}} = \underset{\theta}{\text{argmax}} \ P(X \ | \ \theta) = \hat{\theta}_{\text{MLE}} \end{equation}$$

by Equations (\ref{MLE}) and (\ref{step_3}). In this case, $\hat{\theta}_{\text{MLE}}$ and $\hat{\theta}_{\text{MAP}}$ turn out to be equal.
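As a sketch of the difference, we can extend the grid search above with a hypothetical $\text{Beta}(2, 2)$ prior (chosen only for illustration); the normalizing constant of the prior drops out of the $\text{argmax}$, so an unnormalized density suffices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.binomial(1, 0.7, size=20)  # a small dataset, where the prior matters
thetas = np.linspace(0.01, 0.99, 99)

def likelihood(theta):
    return np.prod(theta ** X * (1 - theta) ** (1 - X))

def prior(theta):
    # Unnormalized Beta(2, 2) density; constants drop out of the argmax.
    return theta * (1 - theta)

theta_mle = thetas[np.argmax([likelihood(t) for t in thetas])]
theta_map = thetas[np.argmax([likelihood(t) * prior(t) for t in thetas])]
# The Beta(2, 2) prior pulls theta_map toward 0.5; replacing prior() with a
# constant (a uniform prior) makes theta_map coincide with theta_mle.
```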


The Log Trick

We often assume our data is i.i.d. (independent and identically distributed). Denoting our dataset by $X = \{x^{(1)}, \dots, x^{(m)}\}$, we can then rewrite Equation (\ref{MLE}) as:

$$\begin{eqnarray} \nonumber \hat{\theta}_{\text{MLE}} &=& \underset{\theta}{\text{argmax}} \ P(X \ | \ \theta) \\ \nonumber &=& \underset{\theta}{\text{argmax}} \ P(x^{(1)}, \dots, x^{(m)} \ | \ \theta) \\ \label{MLEiid3} &=& \underset{\theta}{\text{argmax}} \prod_{i = 1}^m P(x^{(i)} \ | \ \theta) \end{eqnarray}$$

where the last line follows by independence. This presents one immediate problem: computing a product of many probabilities is prone to numerical underflow (the product becomes too small to represent in floating point), which is problematic in practice when $m$ is large. We can solve this by using the $\log$ function. Note that $\log$ is an increasing function, so maximizing a function $f(x)$ is equivalent to maximizing $\log(f(x))$ if we are only interested in the maximizer $x^*$. In other words,

$$\begin{equation} \nonumber x^* = \underset{x}{\text{argmax}} f(x) = \underset{x}{\text{argmax}} \log(f(x))\end{equation}$$

assuming $f> 0$ so $\log(f)$ is well-defined. Therefore by Equation (\ref{MLEiid3}) we have:

$$\begin{equation} \nonumber \label{MLElog} \hat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmax}} \log \bigg(\prod_{i = 1}^m P(x^{(i)} \ | \ \theta) \bigg) = \underset{\theta}{\text{argmax}} \sum_{i = 1}^m \log(P(x^{(i)} \ | \ \theta)) \end{equation}$$

where the second equality follows by the additivity of $\log$ over products. Now we need only compute sums instead of products, solving our numerical underflow problem. By convention, we often take the negative log-likelihood and minimize it instead:

$$\begin{equation} \nonumber \label{MLEneglog} \hat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmin}} \bigg( -\log \bigg(\prod_{i = 1}^m P(x^{(i)} \ | \ \theta) \bigg)\bigg) = \underset{\theta}{\text{argmin}}\sum_{i = 1}^m (-\log(P(x^{(i)} \ | \ \theta))) \end{equation}$$

This gives rise to an interesting connection to information theory: you might notice that $-\log(P(x^{(i)} \ | \ \theta))$ is the self-information of $x^{(i)}$.
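A quick sketch of why this matters numerically: with even modestly small per-example probabilities, the naive product underflows to zero in float64 while the sum of negative logs stays perfectly representable:

```python
import numpy as np

# Hypothetical per-example probabilities P(x_i | theta), each equal to 0.5.
probs = np.full(10_000, 0.5)

naive = np.prod(probs)        # 0.5**10000 ~ 1e-3011: underflows to exactly 0.0
nll = -np.sum(np.log(probs))  # 10000 * log(2) ~ 6931.5: no underflow
print(naive, nll)
```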

Using the $\log$ method has a few other advantages as well. Computing derivatives also becomes faster: without the $\log$, the product rule forces us to expand the derivative of the full product:

$$\begin{eqnarray} \nonumber \frac{\partial P(X \ | \ \theta)}{\partial \theta_i} &=& \frac{\partial P(x^{(1)} \ | \ \theta)}{\partial \theta_i} \cdot P(x^{(2)} \ | \ \theta) \cdots P(x^{(m)} \ | \ \theta) \\ \nonumber &+& P(x^{(1)} \ | \ \theta) \cdot \frac{\partial P(x^{(2)} \ | \ \theta)}{\partial \theta_i} \cdots P(x^{(m)} \ | \ \theta) \\ &+& \cdots + P(x^{(1)} \ | \ \theta) \cdot P(x^{(2)} \ | \ \theta) \cdots \frac{\partial P(x^{(m)} \ | \ \theta)}{\partial \theta_i} \nonumber \end{eqnarray}$$

which takes a total of $m(m-1)$ multiplications and $m - 1$ additions, i.e. $O(m^2)$ operations. If we instead differentiate the log-likelihood, we get:

$$\begin{eqnarray} \nonumber \frac{\partial}{\partial \theta_i} \log(P(X \ | \ \theta)) &=& \sum_{j = 1}^m \frac{\partial}{\partial \theta_i}\log(P(x^{(j)} \ | \ \theta)) \\ \nonumber &=& \sum_{j = 1}^m \frac{1}{P(x^{(j)} \ | \ \theta)} \cdot \frac{\partial P(x^{(j)} \ | \ \theta)}{\partial \theta_i} \end{eqnarray}$$

which is a total of $2m$ multiplications (counting each division) and $m - 1$ additions, giving us $O(m)$ operations. Needing fewer operations to compute the derivative is particularly useful in gradient descent, where our loss function is commonly based on $P(X \ | \ \theta)$.
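As a sketch of this in action, consider a unit-variance Gaussian with unknown mean $\mu$: each per-example term $\log P(x^{(i)} \ | \ \mu)$ differentiates independently, so the gradient is a single $O(m)$ sum that we can feed straight into gradient ascent (the learning rate and iteration count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 1.0, size=1000)  # unit-variance Gaussian, unknown mean

def grad_log_likelihood(mu):
    # log P(x_i | mu) = -0.5 * (x_i - mu)**2 + const, so each per-example
    # derivative is (x_i - mu); the full gradient is one O(m) sum.
    return np.sum(x - mu)

mu = 0.0
for _ in range(100):
    mu += 1e-4 * grad_log_likelihood(mu)  # gradient ascent on the log-likelihood
print(mu)  # converges toward x.mean(), the analytic MLE for the mean
```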

Conditional MLE

What about supervised learning? Given a labelled dataset $D = \{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$, we train our model with parameters $\theta$ to predict the label $y$ given $x$; we are therefore modeling the conditional distribution $Y \ | \ X$, where $Y = \{y^{(1)}, \dots, y^{(m)}\}$ and $X = \{x^{(1)}, \dots, x^{(m)}\}$. Just as we optimized for the probability of observing $X$ given $\theta$ in the original formulation of MLE (Equation (\ref{MLE})), we now optimize for the probability of observing $D$ given $\theta$:

$$\begin{equation} \label{condMLE1} \nonumber \hat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmax}} \ P(D \ | \ \theta) \end{equation}$$

By a similar i.i.d. assumption (on the random variables the pairs $(x^{(i)}, y^{(i)})$ were realized from), we can write the above as a product:

$$\begin{eqnarray} \nonumber \hat{\theta}_{\text{MLE}} &=& \underset{\theta}{\text{argmax}} \ P(D \ | \ \theta) \\ \nonumber &=& \underset{\theta}{\text{argmax}} \ P((x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)}) \ | \ \theta) \\ \label{iidcond3} &=& \underset{\theta}{\text{argmax}} \prod_{i =1}^m P(x^{(i)}, y^{(i)} \ | \ \theta) \end{eqnarray}$$

Now, for each $1 \leq i \leq m$, the definition of conditional probability lets us rewrite:

$$\begin{equation} P(x^{(i)}, y^{(i)} \ | \ \theta) = P(y^{(i)} \ | \ x^{(i)}, \theta) P(x^{(i)} \ | \ \theta) \nonumber \end{equation}$$

Since we are only parametrizing the conditional distribution $Y \ | \ X$, the marginal $P(x^{(i)} \ | \ \theta) = P(x^{(i)})$ does not depend on $\theta$: it is a constant with respect to $\theta$ for each $1 \leq i \leq m$. By the same property of the $\text{argmax}$ as before, we can drop these factors from the product in Equation (\ref{iidcond3}), giving:
$$\begin{eqnarray} \hat{\theta}_{\text{MLE}} &=& \underset{\theta}{\text{argmax}} \prod_{i =1}^m P(x^{(i)}, y^{(i)} \ | \ \theta) \nonumber \\ &=& \underset{\theta}{\text{argmax}} \prod_{i =1}^m P(y^{(i)} \ | \ x^{(i)}, \theta) \nonumber \end{eqnarray}$$

Applying the $\log$ trick and minimizing the negative log-likelihood via $\text{argmin}$, we arrive at the canonical expression:

$$\begin{equation} \hat{\theta}_{\text{MLE}} = \underset{\theta}{\text{argmax}} \ P(Y \ | \ X, \theta) = \underset{\theta}{\text{argmin}} \ \sum_{i = 1}^m (-\log( P(y^{(i)} \ | \ x^{(i)}, \theta))) \nonumber \end{equation}$$
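For a concrete (assumed) instance, this is exactly the binary cross-entropy loss of logistic regression: minimizing $\sum_i -\log(P(y^{(i)} \ | \ x^{(i)}, \theta))$ with a sigmoid model recovers the conditional MLE. A minimal sketch with a single scalar weight (synthetic data, arbitrary learning rate):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 500
x = rng.normal(size=m)
# Synthetic labels generated with a hypothetical true weight of 2.0.
y = (rng.random(m) < 1 / (1 + np.exp(-2.0 * x))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def nll(theta):
    # Negative log-likelihood of Y | X, theta: the binary cross-entropy loss.
    p = sigmoid(theta * x)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Gradient descent on the NLL; the gradient is mean((p_i - y_i) * x_i)
# (averaged rather than summed here, which only rescales the step size).
theta = 0.0
for _ in range(500):
    p = sigmoid(theta * x)
    theta -= 1.0 * np.mean((p - y) * x)
print(theta, nll(theta))  # theta approaches the weight that generated the labels
```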
