EM Algorithm

Author

Jiaye Xu

Published

May 4, 2023

Introduction

The expectation–maximization (EM) algorithm is an iterative optimization strategy motivated by a notion of missingness and by consideration of the conditional distribution of what is missing given what is observed.

The popularity of the EM algorithm stems from how simple it can be to implement and how reliably it climbs uphill through stable steps; in general, however, it is only guaranteed to converge to a local optimum (or a stationary point) of the likelihood.

  • In a frequentist setting, we may conceive of observed data generated from random variables \(X\) along with missing or unobserved data from random variables \(Z\). We envision complete data generated from \(Y = (X, Z)\). Given observed data \(x\), we wish to maximize a likelihood \(L(\theta \vert x)\). Often it will be difficult to work with this likelihood and easier to work with the densities of \(Y\vert \theta\) and \(Z\vert(x, \theta)\). The EM algorithm sidesteps direct consideration of \(L(\theta \vert x)\) by working with these easier densities.

  • The missing data may not truly be missing: they may be only a conceptual device that simplifies the problem. In this case, \(Z\) is often referred to as latent. It may seem counter-intuitive that optimization can sometimes be simplified by introducing this new element into the problem; this strategy is known as the data augmentation method.

  • In a Bayesian application, interest often focuses on estimating the mode of a posterior distribution \(f(\theta \vert x)\). Again, optimization can sometimes be simplified by consideration of unobserved random variables \(\psi\) in addition to the parameters of interest, \(\theta\).

The EM Algorithm

Why the EM Algorithm Works in General

Let \(X\) be the observed variable (observed data) and \(Z\) the latent variable (missing data), so that the complete data are \(Y=(X, Z)\). We assume that we observe \(X_1, \ldots, X_n\), jointly distributed from \(g(x \vert \theta)\), which satisfies \[g(x \vert \theta)=\int_{\mathcal Z}f(x,z\vert \theta)\,dz,\] and our goal is to compute \(\hat \theta=\arg \max L(\theta \vert x)=\arg \max g( x\vert \theta)\).

Since the augmented data are \(\mathbf z\), where \((X, Z)\sim f(x,z \vert \theta)\), the conditional distribution of the latent variable \(Z\) given the observed data \(x\) is \[k(z\vert \theta, x)=\frac{f(x,z \vert \theta)}{g(x \vert \theta)}.\] Equivalently, \(g(x \vert \theta)=f(x,z \vert \theta)/k(z\vert \theta, x).\) Taking the logarithm of this expression leads to the following relationship between the complete-data likelihood \(L(\theta \vert x, z)\) and the observed-data likelihood \(L(\theta \vert x)\); in terms of log-likelihoods, \[\ell(\theta ; X)=\ell_0(\theta ; Y)-\ell_1(\theta; Z\vert X),\] where \(\ell_1(\theta; Z\vert X)=\log k(Z\vert \theta, X)\). Taking conditional expectations with respect to the distribution of \(Z\vert X\), i.e., \(k(z \vert \theta_0, \mathbf x)\), for any value \(\theta_0\),

\[\ell(\theta ; X)=\mathbf E_{\theta_0} \left[\ell_0(\theta ; Y)\right]-\mathbf E_{\theta_0} \left[\ell_1(\theta; Z\vert X)\right].\]

In the EM algorithm, while we aim to maximize \(\ell(\theta ; X)\), only the first term on the right-hand side of the above equation is considered. Denoting

\[Q\left({\theta} \mid {\theta}_0, \mathbf x\right)=\mathbf E_{\theta_0} \left[\ell_0(\theta ; Y)\right],\]

the EM algorithm proceeds iteratively by maximizing \(Q\left({\theta} \mid {\theta}_0, \mathbf x\right)\) at each iteration and, if \(\theta^{(1)}\) is the value of \(\theta\) maximizing \(Q\left({\theta} \mid {\theta}_0, \mathbf x\right)\), by replacing \(\theta_0\) with the updated value \(\theta^{(1)}\). In this manner, a sequence of estimators \(\{\theta^{(t)} \}_t\) is obtained, where \(\theta^{(t)}\) is defined as the value of \(\theta\) maximizing \(Q\left({\theta} \mid {\theta}^{(t-1)}, \mathbf x\right)\); that is, \[Q\left({\theta}^{(t)} \mid {\theta}^{(t-1)}, \mathbf x\right)=\max_{\theta}\, Q\left({\theta} \mid {\theta}^{(t-1)}, \mathbf x\right).\]

This iterative scheme thus contains both an expectation step and a maximization step, giving the algorithm its name.
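
To see why these uphill steps are stable, evaluate the decomposition above at two successive iterates, with \(\theta_0=\theta^{(t)}\): \[\ell(\theta^{(t+1)} ; X)-\ell(\theta^{(t)} ; X)=\left[Q\left({\theta}^{(t+1)} \mid {\theta}^{(t)}, \mathbf x\right)-Q\left({\theta}^{(t)} \mid {\theta}^{(t)}, \mathbf x\right)\right]-\mathbf E_{\theta^{(t)}}\left[\log \frac{k(Z\vert \theta^{(t+1)}, \mathbf x)}{k(Z\vert \theta^{(t)}, \mathbf x)}\right].\] The first bracket is nonnegative because \(\theta^{(t+1)}\) maximizes \(Q\left(\cdot \mid {\theta}^{(t)}, \mathbf x\right)\), and the expectation term is nonpositive by Jensen's inequality (it is the negative of a Kullback-Leibler divergence), so \(\ell(\theta^{(t+1)} ; X)\ge \ell(\theta^{(t)} ; X)\): each EM iteration can never decrease the observed-data log-likelihood, although it may converge to a local rather than global maximum.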

Summary of the EM Algorithm

EM is initialized at \(\theta^{(0)}\) and then alternates between two steps: E for expectation and M for maximization.

  1. E step: Compute \(Q\left({\theta} \mid {\theta}^{(t)}, \mathbf x\right)=\mathbf E_{\theta^{(t)}} \left[\ell_0(\theta ; \mathbf x, \mathbf Z)\right]\).

  2. M step: Maximize \(Q\left({\theta} \mid {\theta}^{(t)}, \mathbf x\right)\) with respect to \(\theta\), take \[\theta^{(t+1)}=\arg \max_{\theta}\, Q\left({\theta} \mid {\theta}^{(t)}, \mathbf x\right),\] and set \(t=t+1\). Repeat until a stopping criterion is met, e.g., until a fixed point is reached, \(\theta^{(t+1)}=\theta^{(t)}\); a generic sketch of this loop is given below.
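
In code, the general scheme is just a loop alternating these two steps. The following is a minimal sketch, assuming two user-supplied (hypothetical) functions e.step and m.step: e.step computes the expected quantities needed to form \(Q\left({\theta} \mid {\theta}^{(t)}, \mathbf x\right)\), and m.step returns its maximizer.

em.generic <- function(x, theta0, e.step, m.step, max.iter = 1000, tol = 1e-8) {
  # e.step(theta, x) and m.step(expectations, x) are problem-specific placeholders
  theta <- theta0
  for (t in seq_len(max.iter)) {
    expectations <- e.step(theta, x)         # E step: expectations under theta^(t)
    theta.new    <- m.step(expectations, x)  # M step: maximize Q(. | theta^(t), x)
    if (max(abs(theta.new - theta)) < tol) { # stop near a fixed point
      theta <- theta.new
      break
    }
    theta <- theta.new
  }
  theta
}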

Two-Component Gaussian Mixture

Gaussian Mixture Models

If you believe that each underlying data source follows a Gaussian distribution, a Gaussian mixture model is a natural choice for modeling the density of the data points.

If our observations \(X_j\) come from a mixture model with \(K\) mixture components, the marginal probability distribution of \(X_j\) is of the form: \[P(X_j = x) = \sum_{k=1}^K \pi_kP(X_j=x|Z_j=k)\] where \(Z_j \in \{1,\ldots,K\}\) is the latent variable representing the mixture component for \(X_j\), \(P(X_j|Z_j)\) is the mixture component, and \(\pi_k\) is the mixture proportion representing the probability that \(X_j\) belongs to the \(k\)-th mixture component.
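
For instance, when each mixture component \(P(X_j = x\vert Z_j=k)\) is Gaussian with mean \(\mu_k\) and standard deviation \(\sigma_k\) (the case considered below), the marginal density can be evaluated directly. The helper dmix is an illustrative name introduced here, not part of any package:

dmix <- function(x, prop, mu, sigma) {
  # marginal density: sum over components of pi_k * N(x; mu_k, sigma_k^2)
  Reduce(`+`, lapply(seq_along(prop),
                     function(k) prop[k] * dnorm(x, mean = mu[k], sd = sigma[k])))
}
# e.g. an equal-weight mixture of N(5, 1.5^2) and N(10, 2^2)
dmix(c(5, 7.5, 10), prop = c(0.5, 0.5), mu = c(5, 10), sigma = c(1.5, 2))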

EM Algorithm for Two-Component Gaussian Mixture

For simplicity, and without loss of generality, we consider a two-component Gaussian mixture model. Suppose there appear to be two separate underlying regimes; we therefore model \(X\) as a mixture of two Gaussian distributions: \[\begin{align}X_1 &\sim \mathcal N(\mu_1, \sigma_1^2),\\X_2 &\sim \mathcal N(\mu_2, \sigma_2^2),\\X&=(1-\delta)X_1 + \delta X_2,\end{align}\] where \(\delta \in \{0,1\}\) with \(P(\delta=1)=\pi\).

Comments:

  • In this example, \(\delta\) is the latent variable corresponding to \(Z\) in the general algorithm.

  • The complete data \(\mathbf y=(\mathbf x, \mathbf z)\) is \((\mathbf x, \delta)\) in this example.

Let \(\phi_{\theta}(x)\) denote the normal density with parameters \(\theta=(\mu, \sigma^2)\), then the density of \(X\) is

\[g_X(x)=\left(1-\pi\right)\phi_{\theta_1}(x)+\pi\phi_{\theta_2}(x)\]

The log-likelihood based on the \(N\) observed data points is

\[ \ell(\theta; X)=\sum_{j=1}^{N}\log\left[\left(1-\pi\right)\phi_{\theta_1}(x_j)+\pi \phi_{\theta_2}(x_j) \right] \]

If we knew the values of the \(\delta_j\)’s, the complete-data log-likelihood would be

\[ \ell_0(\theta; X, \delta)= \sum_{j=1}^{N}\left[\left(1-\delta_j\right)\log\phi_{\theta_1}(x_j)+\delta_j\log\phi_{\theta_2}(x_j)\right] + \sum_{j=1}^{N}\left[\left(1-\delta_j\right)\log(1-\pi)+\delta_j\log\pi\right] \]

Since the values of the \(\delta_j\)’s are actually unknown, we proceed in an iterative fashion, substituting for each \(\delta_j\) its expected value

\[\gamma_j(\theta)=\mathbf E(\delta_j\vert\theta, \mathbf x)=P(\delta_j=1\vert\theta, \mathbf x),\]

which is also called the responsibility of model \(2\) for observation \(j\). Furthermore,

\[ \gamma_j(\theta)=\frac{P(\delta_j=1)\,P(x_j\vert \theta, \delta_j=1)}{P(x_j\vert \theta)}=\frac{\pi \phi_{\theta_2}(x_j)}{g_X(x_j)}, \]

That is, \[ \gamma_j(\theta)=\frac{\pi \phi_{\theta_2}(x_j)}{(1-\pi)\phi_{\theta_1}(x_j)+\pi \phi_{\theta_2}(x_j)}. \]

Therefore, with \(\gamma_j=\gamma_j(\theta_0)\) denoting the responsibilities computed at the current value \(\theta_0\), the general \(Q\left({\theta} \mid {\theta}_0, \mathbf x\right)\) in this Gaussian mixture example has the form \[\begin{align}Q\left({\theta} \mid {\theta}_0, \mathbf x\right)&=\sum_{j=1}^{N}\left[\left(1-\gamma_j\right)\log\phi_{\theta_1}(x_j)+\gamma_j\log\phi_{\theta_2}(x_j)\right] + \sum_{j=1}^{N}\left[\left(1-\gamma_j\right)\log(1-\pi)+\gamma_j\log\pi\right]\\ &=\sum_{j=1}^{N}\left(1-\gamma_j\right)\left[-\frac{1}{2}\log(2\pi\sigma_1^2)-\frac{1}{2\sigma_1^2}(x_j-\mu_1)^2\right]+\sum_{j=1}^{N}\gamma_j\left[-\frac{1}{2}\log(2\pi\sigma_2^2)-\frac{1}{2\sigma_2^2}(x_j-\mu_2)^2\right]\\ &+\sum_{j=1}^{N}\left[\left(1-\gamma_j\right)\log(1-\pi)+\gamma_j\log\pi\right]\\ \end{align}\]

Maximizing \(Q\left({\theta} \mid {\theta}_0, \mathbf x\right)\) w.r.t the parameters, we obtain the values for the next iteration, which are \[\begin{align}\hat\pi&=\frac{\sum_{j=1}^{N}\gamma_j}{N},\\ \hat\mu_1&=\frac{\sum_{j=1}^{N}\left(1-\gamma_j\right)x_j}{\sum_{j=1}^{N}\left(1-\gamma_j\right)},\\ \hat\mu_2&=\frac{\sum_{j=1}^{N}\gamma_j x_j}{\sum_{j=1}^{N}\gamma_j},\\ \hat\sigma_1^2&=\frac{\sum_{j=1}^{N}\left(1-\gamma_j\right)\left(x_j- \hat\mu_1\right)^2}{\sum_{j=1}^{N}\left(1-\gamma_j\right)},\\ \hat\sigma_2^2&=\frac{\sum_{j=1}^{N}\gamma_j\left(x_j- \hat\mu_2\right)^2}{\sum_{j=1}^{N}\gamma_j}. \end{align}\]

Summary of the EM Algorithm for Two-Component Gaussian Mixture

EM is initialized at \(\theta^{(0)}\) and then alternates between two steps: E for expectation and M for maximization.

  1. E step: Compute the responsibilities \[ \hat\gamma_j=\gamma_j(\hat\theta)=\frac{\hat\pi \phi_{\hat\theta_2}(x_j)}{(1-\hat\pi)\phi_{\hat\theta_1}(x_j)+\hat\pi \phi_{\hat\theta_2}(x_j)}, \] for \(j=1,2,\ldots, N.\)

  2. M step: Compute \(\hat \theta=(\hat\pi,\hat\theta_1,\hat\theta_2)=(\hat\pi,\hat\mu_1,\hat\sigma_1^2,\hat\mu_2,\hat\sigma_2^2)\), i.e., \[\begin{align}\hat\pi&=\frac{\sum_{j=1}^{N}\hat\gamma_j}{N},\\ \hat\mu_1&=\frac{\sum_{j=1}^{N}\left(1-\hat\gamma_j\right)x_j}{\sum_{j=1}^{N}\left(1-\hat\gamma_j\right)},&\hat\sigma_1^2=\frac{\sum_{j=1}^{N}\left(1-\hat\gamma_j\right)\left(x_j- \hat\mu_1\right)^2}{\sum_{j=1}^{N}\left(1-\hat\gamma_j\right)} ,\\ \hat\mu_2&=\frac{\sum_{j=1}^{N}\hat\gamma_j x_j}{\sum_{j=1}^{N}\hat\gamma_j}, &\hat\sigma_2^2=\frac{\sum_{j=1}^{N}\hat\gamma_j\left(x_j- \hat\mu_2\right)^2}{\sum_{j=1}^{N}\hat\gamma_j}. \end{align}\]

  3. Iterate steps 1 and 2 until convergence (see the sketch below).
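
As a sketch of these updates in R, the function em.2gauss below estimates all of \((\pi,\mu_1,\sigma_1^2,\mu_2,\sigma_2^2)\); its name, initialization, and stopping rule are illustrative choices, while the implementation in the next section instead keeps the component parameters fixed and estimates only the mixture proportion.

em.2gauss <- function(x, pi0 = 0.5, mu0 = range(x), sigma0 = rep(sd(x), 2),
                      max.iter = 500, tol = 1e-8) {
  pi.hat <- pi0; mu <- mu0; sigma <- sigma0
  ll.old <- -Inf
  for (t in seq_len(max.iter)) {
    # E step: responsibilities of component 2 for each observation
    d1    <- dnorm(x, mean = mu[1], sd = sigma[1])
    d2    <- dnorm(x, mean = mu[2], sd = sigma[2])
    gamma <- pi.hat * d2 / ((1 - pi.hat) * d1 + pi.hat * d2)
    # M step: closed-form updates derived above (sigma stores standard deviations)
    pi.hat   <- mean(gamma)
    mu[1]    <- sum((1 - gamma) * x) / sum(1 - gamma)
    mu[2]    <- sum(gamma * x) / sum(gamma)
    sigma[1] <- sqrt(sum((1 - gamma) * (x - mu[1])^2) / sum(1 - gamma))
    sigma[2] <- sqrt(sum(gamma * (x - mu[2])^2) / sum(gamma))
    # monitor the observed-data log-likelihood until it stops increasing
    ll <- sum(log((1 - pi.hat) * dnorm(x, mu[1], sigma[1]) +
                  pi.hat * dnorm(x, mu[2], sigma[2])))
    if (ll - ll.old < tol) break
    ll.old <- ll
  }
  list(pi = pi.hat, mu = mu, sigma = sigma, loglik = ll)
}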

Implementation of Two-Component Gaussian Mixture in R

In this example, we will assume our mixture components are fully specified Gaussian distributions (i.e., the means and variances are known), and we are interested in finding the maximum likelihood estimates of the \(\pi_k\)’s for \(k=1,2\).

Assume we have \(2\) components: \[\begin{align} X \mid \delta_j = 0 &\sim N(5, 1.5^2), \\ X \mid \delta_j = 1 &\sim N(10, 2^2). \end{align}\]

The true mixture proportions will be \(P(\delta_j = 0) = 0.25\) and \(P(\delta_j = 1) = 0.75\).

First we simulate data from this mixture model,

set.seed(1234)
# mixture components
mu.true    = c(5, 10)
sigma.true = c(1.5, 2)

# delta_j (the latent variable Z), one draw per observation
Z = rbinom(10000, 1, 0.75)

# sample from mixture model (N=10000)
X <- rnorm(10000, mean=mu.true[Z+1], sd=sigma.true[Z+1])
hist(X, breaks=15)

Now we write a function to compute the log-likelihood for the incomplete (observed) data, assuming the parameters are known. This will be used to determine convergence: \[\ell(\theta) = \sum_{j=1}^N \log \left( \sum_{k=1}^2 \pi_k \underbrace{N(x_j;\mu_k, \sigma_k^2)}_{L[j,k]} \right )\]

compute.log.lik <- function(L, w) {
  # weight each component likelihood by its mixture proportion
  L[,1] = L[,1]*w[1]
  L[,2] = L[,2]*w[2]
  # sum over components, then sum the log marginal density over observations
  return(sum(log(rowSums(L))))
}

Since the mixture components are fully specified, for each sample \(X_j\) we can compute the likelihoods \(P(X_j | \delta_j=0)\) and \(P(X_j | \delta_j=1)\) and store these values in the columns of \(L\),

L = matrix(NA, nrow=length(X), ncol= 2)
# evaluate the likelihood
L[, 1] = dnorm(X, mean=mu.true[1], sd = sigma.true[1])
L[, 2] = dnorm(X, mean=mu.true[2], sd = sigma.true[2])

The mixture.EM function checks for convergence by computing the log-likelihood at each step; the E-step and M-step are implemented in the EM.iter function.

mixture.EM <- function(w.init, L) {
  
  w.curr <- w.init
  
  # log-likelihoods for each iteration
  log_liks <- c()
  ll       <- compute.log.lik(L, w.curr)
  log_liks <- c(log_liks, ll)
  delta.ll <- 1
  
  while(delta.ll > 1e-5) {
    w.curr   <- EM.iter(w.curr, L)
    ll       <- compute.log.lik(L, w.curr)
    log_liks <- c(log_liks, ll)
    delta.ll <- log_liks[length(log_liks)]  - log_liks[length(log_liks)-1]
  }
  return(list(w.curr, log_liks))
}

EM.iter <- function(w.curr, L, ...) {
  
  # E-step: compute responsibilities, gamma_j.
  z_ik <- L
  for(i in seq_len(ncol(L))) {
    z_ik[,i] <- w.curr[i]*z_ik[,i]
  }
  z_ik     <- z_ik / rowSums(z_ik)
  
  # M-step
  w.next   <- colSums(z_ik)/sum(z_ik)
  return(w.next)
}
# run EM
ee <- mixture.EM(w.init=c(0.5,0.5), L)
print(paste("Estimate = (", round(ee[[1]][1],2), ",", round(ee[[1]][2],2), ")", sep=""))
[1] "Estimate = (0.24,0.76)"

We inspect the evolution of the log-likelihood and note that it is strictly increasing,

plot(ee[[2]], ylab='incomplete log-likelihood', xlab='iteration')