The EM algorithm is used to find (local) maximum likelihood parameters of a statistical model in cases where the equations cannot be solved directly. Typically these models involve latent variables in addition to unknown parameters and known data observations. In this post, we will describe how the algorithm works, prove its correctness, and finally walk through a concrete and very common use case of the algorithm.
Description
The typical setting of the EM algorithm is a parameter estimation problem in which we have
- A training set $\mathbf{X} = \{x^{(1)}, \dots, x^{(n)}\}$ consisting of $n$ independent observed data points. Each $x^{(i)}$ may be discrete or continuous. Associated with each data point may be a vector of observations.
- A set of unobserved latent data or missing values $\mathbf{Z} = \{z^{(1)}, \dots, z^{(n)}\}$ associated with the data points. They are discrete, drawn from a fixed number of values, with one latent variable $z^{(i)}$ per observed unit.
- A vector of unknown parameters $\boldsymbol\theta$, which are continuous and of two kinds:
- Parameters that are associated with all data points
- Those associated with a specific value of a latent variable (i.e., associated with all data points whose corresponding latent variable has that value).
We wish to fit the parameters of a model (which may or may not include the latent variables $\mathbf{Z}$), and the log marginal likelihood of all observed data points to maximize is given by
$$\log p(\mathbf{X} \mid \boldsymbol\theta) = \log \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)$$
However, maximizing this explicitly might be difficult because $\mathbf{Z}$ are latent random variables. So let's refine it by Bayes' theorem:
$$p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta) = p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta)\, p(\mathbf{X} \mid \boldsymbol\theta)$$
Denote:
$$\ell(\boldsymbol\theta) := \log p(\mathbf{X} \mid \boldsymbol\theta)$$
Then:
$$\ell(\boldsymbol\theta) = \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta) - \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta)$$
This equation will be used to prove the correctness; let's denote it as equation $(1)$.
The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying these two steps:
Expectation step (E step)
Define $Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$ as the expected value of the log likelihood function of $\boldsymbol\theta$, with respect to the current conditional distribution of $\mathbf{Z}$ given $\mathbf{X}$ and the current estimate of the parameters $\boldsymbol\theta^{(t)}$:
$$Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z} \sim p(\cdot \mid \mathbf{X}, \boldsymbol\theta^{(t)})}\!\left[\log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta)\right]$$
Maximization step (M step)
Find the parameters that maximize this quantity:
$$\boldsymbol\theta^{(t+1)} = \underset{\boldsymbol\theta}{\operatorname{arg\,max}}\ Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$$
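As a rough sketch (my own illustration, not part of the original description), the two steps fit into a simple loop. Here `e_step` and `m_step` stand for the model-specific computations, and the parameters are assumed to be packed into a flat NumPy array so that convergence can be checked by the change in parameters:

```python
import numpy as np

def em(theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM skeleton.

    e_step(theta) should return whatever expectations define Q(. | theta_t);
    m_step(stats) should return argmax_theta Q(theta | theta_t) as a flat array.
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta)              # E step: expectations under p(Z | X, theta_t)
        theta_new = m_step(stats)          # M step: maximize Q(theta | theta_t)
        if np.max(np.abs(theta_new - theta)) < tol:   # stop when the parameters stabilize
            return theta_new
        theta = theta_new
    return theta
```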
Proof of Correctness
The trick of the EM algorithm is to improve $Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$ rather than directly improving $\log p(\mathbf{X} \mid \boldsymbol\theta)$. Here we show that improvements to the former imply improvements to the latter.
We take the expectation over possible values of the unknown data $\mathbf{Z}$ under the current parameter estimate $\boldsymbol\theta^{(t)}$ on both sides of equation $(1)$, by multiplying by $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)})$ and summing (or integrating) over $\mathbf{Z}$. The left-hand side is the expectation of a constant, since it is independent of $\mathbf{Z}$, so we get:
$$\log p(\mathbf{X} \mid \boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}) \log p(\mathbf{X}, \mathbf{Z} \mid \boldsymbol\theta) - \sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}) \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta) = Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) + H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$$
where $H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) = -\sum_{\mathbf{Z}} p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)}) \log p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta)$ is defined by the negated sum it is replacing.
This last equation holds for every value of $\boldsymbol\theta$, including $\boldsymbol\theta = \boldsymbol\theta^{(t)}$:
$$\log p(\mathbf{X} \mid \boldsymbol\theta^{(t)}) = Q(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)}) + H(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)})$$
and subtracting this last equation from the previous equation gives
$$\log p(\mathbf{X} \mid \boldsymbol\theta) - \log p(\mathbf{X} \mid \boldsymbol\theta^{(t)}) = Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)}) + H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) - H(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)})$$
Note that $H(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)})$ is the entropy of the distribution $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta^{(t)})$, while $H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$ is its cross entropy with $p(\mathbf{Z} \mid \mathbf{X}, \boldsymbol\theta)$.
Based on Gibbs' inequality: the information entropy of a distribution P is less than or equal to its cross entropy with any other distribution Q, which means
$$H(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) \ge H(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)})$$
Hence, we can conclude that
$$\log p(\mathbf{X} \mid \boldsymbol\theta) - \log p(\mathbf{X} \mid \boldsymbol\theta^{(t)}) \ge Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)}) - Q(\boldsymbol\theta^{(t)} \mid \boldsymbol\theta^{(t)})$$
In words, choosing $\boldsymbol\theta$ to improve $Q(\boldsymbol\theta \mid \boldsymbol\theta^{(t)})$ causes $\log p(\mathbf{X} \mid \boldsymbol\theta)$ to improve at least as much.
Alternatively, you can also prove the correctness by using Jensen's inequality. More details in this note.
Gaussian Mixture Model (GMM)
A typical application of the EM algorithm is the Gaussian mixture model (GMM), which addresses the following clustering problem:
- A training set $\{x^{(1)}, \dots, x^{(n)}\}$ which is a sample of $n$ independent observations from a mixture of $k$ multivariate normal distributions of dimension $d$. (Since we are in the unsupervised learning setting, these points do not come with any labels.)
- Associated with each $x^{(i)}$ is a latent variable $z^{(i)}$ that determines the component from which the observation originates, where $z^{(i)} \in \{1, \dots, k\}$ and the parameter $\phi_j$ gives $p(z^{(i)} = j)$.
- The model posits that each $x^{(i)}$ was generated by randomly choosing $z^{(i)}$ from $\{1, \dots, k\}$, and then $x^{(i)}$ was drawn from one of $k$ Gaussians depending on $z^{(i)}$. Then,
$$x^{(i)} \mid z^{(i)} = j \;\sim\; \mathcal{N}(\mu_j, \Sigma_j)$$
- The parameters to estimate are $\phi$, $\mu$, and $\Sigma$. For simplicity, we write $\mathcal{N}(x; \mu_j, \Sigma_j)$ for the density of the $j$-th Gaussian component.
And we wish to model the data by specifying the joint distribution $p(x^{(i)}, z^{(i)}) = p(x^{(i)} \mid z^{(i)})\, p(z^{(i)})$. Hence, the log likelihood is
$$\ell(\phi, \mu, \Sigma) = \sum_{i=1}^{n} \log p(x^{(i)}; \phi, \mu, \Sigma) = \sum_{i=1}^{n} \log \sum_{z^{(i)}=1}^{k} p(x^{(i)} \mid z^{(i)}; \mu, \Sigma)\, p(z^{(i)}; \phi)$$
However, if we set to zero the derivatives of this formula with respect to the parameters and try to solve, we’ll find that it is not possible to find the maximum likelihood estimates of the parameters in closed form. (Try this yourself at home.) This is where the EM algorithm comes in.
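As a quick illustration (not from the original post), here is one way to evaluate this log likelihood with NumPy/SciPy; the array names and shapes are assumptions: `X` is `(n, d)`, `phi` is `(k,)`, `mu` is `(k, d)`, and `Sigma` is `(k, d, d)`:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def log_likelihood(X, phi, mu, Sigma):
    """ell(phi, mu, Sigma) = sum_i log sum_j phi_j * N(x^(i); mu_j, Sigma_j)."""
    n, k = X.shape[0], phi.shape[0]
    log_terms = np.empty((n, k))
    for j in range(k):
        # log phi_j + log N(x^(i); mu_j, Sigma_j) for every data point at once
        log_terms[:, j] = np.log(phi[j]) + multivariate_normal.logpdf(X, mean=mu[j], cov=Sigma[j])
    # log-sum-exp over components, then sum over data points
    return logsumexp(log_terms, axis=1).sum()
```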
E-step
- Define
$$w_j^{(i)} := Q_i(z^{(i)} = j)$$
where $Q_i$ is a distribution over the possible values of $z^{(i)}$ (so $w_j^{(i)} \ge 0$ and $\sum_{j=1}^{k} w_j^{(i)} = 1$); then $w_j^{(i)}$ represents our current "soft" guess for the probability that $x^{(i)}$ came from component $j$.
- Compute
$$w_j^{(i)} = p(z^{(i)} = j \mid x^{(i)}; \phi, \mu, \Sigma)$$
Given our current estimate of the parameters $\phi, \mu, \Sigma$, the conditional distribution of the $z^{(i)}$ is determined by Bayes' theorem to be the normalized Gaussian density weighted by $\phi_j$:
$$w_j^{(i)} = \frac{\phi_j\, \mathcal{N}(x^{(i)}; \mu_j, \Sigma_j)}{\sum_{l=1}^{k} \phi_l\, \mathcal{N}(x^{(i)}; \mu_l, \Sigma_l)}$$
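A minimal sketch of this E-step in NumPy/SciPy, under the same assumed shapes as above (`X`: `(n, d)`, `phi`: `(k,)`, `mu`: `(k, d)`, `Sigma`: `(k, d, d)`); it works in log space and normalizes with the log-sum-exp trick for numerical stability:

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, phi, mu, Sigma):
    """Return the (n, k) matrix of responsibilities w[i, j] = p(z^(i) = j | x^(i); phi, mu, Sigma)."""
    n, k = X.shape[0], phi.shape[0]
    log_w = np.empty((n, k))
    for j in range(k):
        # Unnormalized log posterior: log phi_j + log N(x^(i); mu_j, Sigma_j)
        log_w[:, j] = np.log(phi[j]) + multivariate_normal.logpdf(X, mean=mu[j], cov=Sigma[j])
    # Normalize each row so that the responsibilities sum to 1
    log_w -= logsumexp(log_w, axis=1)[:, None]
    return np.exp(log_w)
```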
M-step
We need to maximize, with respect to our parameters $\phi, \mu, \Sigma$, the quantity
$$\sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \frac{p(x^{(i)} \mid z^{(i)} = j; \mu, \Sigma)\, p(z^{(i)} = j; \phi)}{w_j^{(i)}} = \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \frac{\frac{1}{(2\pi)^{d/2} |\Sigma_j|^{1/2}} \exp\!\left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) \phi_j}{w_j^{(i)}}$$
(A NumPy sketch of all three resulting updates follows the list below.)
- Update $\phi$
Grouping together only the terms that depend on $\phi_j$, we find that we need to maximize
$$\sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j$$
with respect to $\phi$, subject to the constraint $\sum_{j=1}^{k} \phi_j = 1$.
So we construct the Lagrangian
$$\mathcal{L}(\phi) = \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} \log \phi_j + \beta \left(\sum_{j=1}^{k} \phi_j - 1\right)$$
where $\beta$ is the Lagrange multiplier. Taking the derivative, we find
$$\frac{\partial}{\partial \phi_j} \mathcal{L}(\phi) = \sum_{i=1}^{n} \frac{w_j^{(i)}}{\phi_j} + \beta$$
Setting this to zero and solving, we get
$$\phi_j = \frac{\sum_{i=1}^{n} w_j^{(i)}}{-\beta}$$
Using the constraint $\sum_{j=1}^{k} \phi_j = 1$ and knowing the fact that $\sum_{j=1}^{k} w_j^{(i)} = 1$ (probabilities sum to 1), we easily find
$$-\beta = \sum_{i=1}^{n} \sum_{j=1}^{k} w_j^{(i)} = n$$
We therefore have the update for the parameters $\phi_j$:
$$\phi_j = \frac{1}{n} \sum_{i=1}^{n} w_j^{(i)}$$
- Update $\mu$
Taking the derivative of this quantity with respect to $\mu_j$ (only the quadratic term inside the Gaussian depends on $\mu_j$) gives
$$\nabla_{\mu_j} \sum_{i=1}^{n} w_j^{(i)} \left(-\frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j)\right) = \sum_{i=1}^{n} w_j^{(i)}\, \Sigma_j^{-1} (x^{(i)} - \mu_j)$$
Setting this to zero and solving for $\mu_j$ therefore yields the update rule
$$\mu_j = \frac{\sum_{i=1}^{n} w_j^{(i)}\, x^{(i)}}{\sum_{i=1}^{n} w_j^{(i)}}$$
- Update $\Sigma$
Similarly, the terms that depend on $\Sigma_j$ are $\sum_{i=1}^{n} w_j^{(i)} \left(-\frac{1}{2}\log|\Sigma_j| - \frac{1}{2}(x^{(i)} - \mu_j)^T \Sigma_j^{-1}(x^{(i)} - \mu_j)\right)$, whose gradient with respect to $\Sigma_j$ is
$$-\frac{1}{2}\sum_{i=1}^{n} w_j^{(i)} \left(\Sigma_j^{-1} - \Sigma_j^{-1}(x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T \Sigma_j^{-1}\right)$$
Setting the partial derivative to zero and solving for $\Sigma_j$ therefore yields the update rule
$$\Sigma_j = \frac{\sum_{i=1}^{n} w_j^{(i)} (x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T}{\sum_{i=1}^{n} w_j^{(i)}}$$
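To make the three update rules above concrete, here is a minimal NumPy sketch of the whole M-step (my own illustration; the array names and shapes `X`: `(n, d)` and `w`: `(n, k)` are assumptions, with `w` being the responsibility matrix from the E-step):

```python
import numpy as np

def m_step(X, w):
    """One GMM M-step: returns updated (phi, mu, Sigma) from data X and responsibilities w."""
    n, d = X.shape
    k = w.shape[1]
    nk = w.sum(axis=0)                   # effective number of points per component, shape (k,)
    phi = nk / n                         # phi_j = (1/n) sum_i w_j^(i)
    mu = (w.T @ X) / nk[:, None]         # mu_j = sum_i w_j^(i) x^(i) / sum_i w_j^(i)
    Sigma = np.empty((k, d, d))
    for j in range(k):
        diff = X - mu[j]                 # (n, d): x^(i) - mu_j for every i
        Sigma[j] = (w[:, j, None] * diff).T @ diff / nk[j]
    return phi, mu, Sigma
```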
More details on the derivative calculus
Note that $\Sigma_j$ is an invertible, symmetric, square, positive-definite matrix, so the following hold:
- The inverse of a symmetric matrix is symmetric, so $(\Sigma_j^{-1})^T = \Sigma_j^{-1}$
- $\Sigma_j$ is positive-definite, so $\Sigma_j^{-1}$ is positive-definite
- A positive-definite matrix is invertible, so $\Sigma_j^{-1}$ exists
So we get
$$\nabla_{\Sigma_j} \log|\Sigma_j| = \Sigma_j^{-1}$$
and
$$\nabla_{\Sigma_j} \left( (x^{(i)} - \mu_j)^T \Sigma_j^{-1} (x^{(i)} - \mu_j) \right) = -\Sigma_j^{-1}(x^{(i)} - \mu_j)(x^{(i)} - \mu_j)^T \Sigma_j^{-1}$$
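If you want to sanity-check these two identities numerically, here is a small finite-difference experiment (my own sketch, not from the post): it compares the directional derivative along a random symmetric perturbation `E` with the inner product of `E` and the claimed gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
Sigma = A @ A.T + d * np.eye(d)                 # random symmetric positive-definite matrix
v = rng.normal(size=d)                          # plays the role of (x^(i) - mu_j)
E = rng.normal(size=(d, d))
E = (E + E.T) / 2                               # symmetric perturbation direction
eps = 1e-6
Sigma_inv = np.linalg.inv(Sigma)

# Identity 1: grad_Sigma log|Sigma| = Sigma^{-1}
f = lambda S: np.linalg.slogdet(S)[1]
num1 = (f(Sigma + eps * E) - f(Sigma - eps * E)) / (2 * eps)   # directional derivative along E
ana1 = np.sum(Sigma_inv * E)                                   # <Sigma^{-1}, E>

# Identity 2: grad_Sigma v^T Sigma^{-1} v = -Sigma^{-1} v v^T Sigma^{-1}
g = lambda S: v @ np.linalg.inv(S) @ v
num2 = (g(Sigma + eps * E) - g(Sigma - eps * E)) / (2 * eps)
ana2 = np.sum((-Sigma_inv @ np.outer(v, v) @ Sigma_inv) * E)

print(np.isclose(num1, ana1), np.isclose(num2, ana2))          # both should print True
```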
More about GMM
Compared to K-Means, which uses hard assignments, GMM uses soft assignments, which are a good way to represent uncertainty. But GMM is limited in the number of features it can handle, since it requires storing a covariance matrix whose size is quadratic in the number of features. Even when the number of features is small enough for this to be feasible, the algorithm may perform poorly on high-dimensional data.
This is because high-dimensional data:
- is difficult to cluster at all (based on statistical/theoretical arguments)
- causes numerical issues with Gaussian distributions