What is GMM (Gaussian Mixture Model)? Complete Derivation of the EM Algorithm

Gaussian Mixture Model & EM Algorithm

What is GMM?

Definition: GMM (Gaussian Mixture Model)

A Gaussian Mixture Model (GMM) is a probabilistic model that represents the probability distribution of data as a weighted sum of multiple normal (Gaussian) distributions.

$$p(\boldsymbol{x}) = \sum_{m=1}^{M} \pi_m \, \mathcal{N}(\boldsymbol{x} \mid \boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$$

Here, $M$ is the number of component distributions, $\pi_m$ is the mixing coefficient ($\sum_m \pi_m = 1$), $\boldsymbol{\mu}_m$ is the mean vector, and $\boldsymbol{\Sigma}_m$ is the covariance matrix.
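As a quick numeric illustration of this definition, the sketch below evaluates $p(\boldsymbol{x})$ as the weighted sum of Gaussian densities. It uses only numpy; the two component parameters are arbitrary values chosen for the example.

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    # Density of the D-dimensional normal distribution N(x | mu, Sigma)
    D = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def gmm_pdf(x, pis, mus, Sigmas):
    # p(x) = sum_m pi_m * N(x | mu_m, Sigma_m)
    return sum(pi * gauss_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))

# Two 2-D components (illustrative parameters)
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.diag([2.0, 0.5])]
print(gmm_pdf(np.array([1.0, 1.0]), [0.6, 0.4], mus, Sigmas))
```

With $\pi_1 = 1$ the mixture reduces to a single normal density, which makes for an easy sanity check.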

Intuitive Understanding

Consider a data distribution with multiple peaks that cannot be adequately represented by a single normal distribution. A GMM represents each peak as an independent normal distribution and combines them through mixing coefficients $\pi_m$, thereby flexibly approximating complex distributions. The parameters are estimated using the EM algorithm (Expectation-Maximization).

Applications

  • Clustering: Unlike K-means, each data point can belong to multiple clusters with different probabilities (soft clustering)
  • Anomaly detection: Learn the density of normal data with a GMM and flag points in low-probability regions as anomalies
  • Speech recognition: Modeling the acoustic features of speakers
  • Image processing: Background/foreground separation, texture segmentation
  • Density estimation: As a parametric alternative to kernel density estimation

GMM vs K-means vs Kernel Density Estimation

| Feature | GMM | K-means | Kernel Density Estimation |
| --- | --- | --- | --- |
| Cluster assignment | Soft (probabilistic) | Hard (single cluster) | — (no clustering) |
| Cluster shape | Elliptical (arbitrary covariance) | Spherical only | — |
| Probabilistic model | Yes (generative model) | No | Yes (nonparametric) |
| Number of parameters | $O(MD^2)$ | $O(MD)$ | $O(ND)$ (depends on data size) |
| Estimation method | EM algorithm | Lloyd's algorithm | Bandwidth selection |

Introduction (Scope of This Article)

Building on the overview of GMM presented above, this article provides a complete derivation without skipping any intermediate steps of the parameter estimation via the EM algorithm. Specifically, for each of the mean vector, the covariance matrix, and the mixing coefficients, we rigorously derive the maximum likelihood estimator from the partial derivatives of the log-likelihood function using matrix calculus. For the covariance matrix, we adopt the approach of working through the precision matrix (the inverse), combining the derivative of the determinant with the derivative of the quadratic form. The mixing coefficients are estimated via constrained optimization using the method of Lagrange multipliers, and finally, we present a convergence proof of the EM algorithm based on the ELBO (Evidence Lower Bound) and the KL divergence.

Throughout this article, boldface denotes vectors or matrices, and for a matrix $\boldsymbol{A}$, we write $\boldsymbol{A}^{-T}=\left(\boldsymbol{A}^{-1}\right)^T$ for the transpose of its inverse.

Normal Distribution

The $D$-dimensional normal distribution, parameterized by a mean vector $\boldsymbol{\mu}\in\mathbb{R}^D$ and a nonsingular covariance matrix $\boldsymbol{\Sigma}\in\mathbb{R}^{D\times D}$, is given by

\begin{eqnarray} N(\boldsymbol{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right\} \label{Normal} \end{eqnarray}

Gaussian Mixture Model

We define the Gaussian mixture distribution as a linear combination of normal distributions $N(\boldsymbol{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})$ as follows.

\begin{eqnarray} p(\boldsymbol{x}) &=& \sum_{k=1}^K \pi_k N(\boldsymbol{x}|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \label{GMM} \end{eqnarray}

[Figure: 3D plot of a Gaussian mixture distribution]

The mixing coefficients $\pi_k$ correspond to probabilities and satisfy

\begin{eqnarray} 0\leq \pi_k\leq 1 \end{eqnarray}

and

\begin{eqnarray} \sum_{k=1}^K \pi_k &=& 1 \label{sumpi1} \end{eqnarray}

Log-Likelihood of the Mixture Model

Assuming the observed samples $\boldsymbol{x}_n\in\mathbb{R}^D,\ n=1,2,3,\cdots N$ are generated from the Gaussian mixture model in Eq.(\ref{GMM}), we seek the parameters $\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}$ that maximize (more precisely, find a local maximum of) the log-likelihood function

\begin{eqnarray} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& \log \prod_{n=1}^N p(\boldsymbol{x}_n) \\ &=& \sum_{n=1}^N \log p(\boldsymbol{x}_n) \\ &=& \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \end{eqnarray}

Here, $\boldsymbol{X}$ denotes the data matrix representing the sample set $\{\boldsymbol{x}_1, \boldsymbol{x}_2, \cdots \boldsymbol{x}_N\}$.

\begin{eqnarray} \boldsymbol{X} &=& \left( \begin{array}{ccc} & \boldsymbol{x}_1^T & \\ & \boldsymbol{x}_2^T & \\ & \vdots & \\ & \boldsymbol{x}_N^T & \\ \end{array} \right) \in \mathbb{R}^{N\times D} \end{eqnarray}

Our strategy for parameter estimation exploits the fact that at a local maximum of the log-likelihood function, for all $m=1,2,3,\cdots K$,

\begin{eqnarray} \left\{ \begin{array}{rcl} \displaystyle\frac{\partial}{\partial\boldsymbol{\mu}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& 0\\ \displaystyle\frac{\partial}{\partial\boldsymbol{\Sigma}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& 0\\ \displaystyle\frac{\partial}{\partial\pi_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& 0 \end{array} \right. \end{eqnarray}

Responsibilities

Let $z_n \in \{1, 2, \cdots, K\}$ be a latent variable indicating which normal distribution generated the sample $\boldsymbol{x}_n$, and let $P(z_n = m) = \pi_m$. The posterior probability that the sample $\boldsymbol{x}_n$ was generated by the $m$-th normal distribution can be computed using Bayes' theorem as

\begin{eqnarray} \gamma_{n,m} &=& \frac{P(\boldsymbol{x}_n | z_n = m) P(z_n = m)}{\displaystyle\sum_{k=1}^K P(\boldsymbol{x}_n | z_n = k) P(z_n = k)} \nonumber\\ &=& \frac{N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) \cdot \pi_m}{\displaystyle\sum_{k=1}^K N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \cdot \pi_k} \label{gamma} \end{eqnarray}

We call this quantity the responsibility.

Since the sample $\boldsymbol{x}_n$ was generated by one of the $K$ normal distributions, summing $\gamma_{n,m}$ over $m=1,2,3,\cdots K$ naturally yields a total probability of 1.

\begin{eqnarray} \require{cancel} \sum_{m=1}^K \gamma_{n,m} &=& \sum_{m=1}^K \frac{\pi_m N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{i=1}^K \pi_i N(\boldsymbol{x}_n|\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i)} \\ &=& \frac{\cancel{\displaystyle\sum_{m=1}^K \pi_m N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}}{\cancel{\displaystyle\sum_{m=1}^K \pi_m N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}} \\ &=& 1 \label{sumgamma} \end{eqnarray}
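Eq.(\ref{gamma}) and this sum-to-one property can be checked numerically. The sketch below is a minimal numpy implementation; it evaluates the densities directly rather than in log space, which a production implementation would prefer for numerical stability.

```python
import numpy as np

def gauss_pdf(x, mu, Sigma):
    # Density of the D-dimensional normal distribution N(x | mu, Sigma)
    D = len(mu)
    diff = x - mu
    return ((2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5)
            * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)))

def responsibilities(X, pis, mus, Sigmas):
    # gamma[n, m] = pi_m N(x_n | mu_m, Sigma_m) / sum_k pi_k N(x_n | mu_k, Sigma_k)
    weighted = np.array([[pi * gauss_pdf(x, mu, S)
                          for pi, mu, S in zip(pis, mus, Sigmas)]
                         for x in X])
    return weighted / weighted.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
gamma = responsibilities(X, [0.5, 0.5],
                         [np.zeros(2), np.ones(2)],
                         [np.eye(2), np.eye(2)])
print(gamma.sum(axis=1))  # every row sums to 1, as shown above
```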

Deriving $\boldsymbol{\mu}_m$ that Maximizes the Log-Likelihood

In this section, we differentiate the log-likelihood function with respect to the mean vector and apply the vector derivative of a quadratic form to derive the maximum likelihood estimator. The result shows that the mean vector of each cluster is the responsibility-weighted average of the samples.

The following derivation adopts the denominator layout convention: the derivative of a scalar function with respect to a vector is a column vector.

Differentiating the log-likelihood function with respect to $\boldsymbol{\mu}_m$, we obtain

\begin{eqnarray} \frac{\partial}{\partial\boldsymbol{\mu}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& \frac{\partial}{\partial\boldsymbol{\mu}_m} \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \\ &=& \sum_{n=1}^N \frac{\partial}{\partial\boldsymbol{\mu}_m} \log \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k) \\ && \text{In general, } \log'(f(\boldsymbol{x}))=\frac{f'(\boldsymbol{x})}{f(\boldsymbol{x})} \text{, so} \nonumber\\ &=& \sum_{n=1}^N \frac{\displaystyle\frac{\partial}{\partial\boldsymbol{\mu}_m} \displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} \\ && \text{Since } \frac{\partial}{\partial\boldsymbol{\mu}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)=0 \text{ when } k\neq m\text{,} \nonumber\\ &=& \sum_{n=1}^N \frac{\pi_m \displaystyle\frac{\partial}{\partial\boldsymbol{\mu}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} \label{dmu} \end{eqnarray}

From Eq.(\ref{Normal}), the normal distribution $N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)$ is

\begin{eqnarray} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) &=& (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}_m|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \nonumber \\ && \text{Swapping } \boldsymbol{x} \text{ and } \boldsymbol{\mu} \text{ in preparation for differentiation with respect to } \boldsymbol{\mu}_m\text{,} \nonumber\\ &=& (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}_m|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)\right\} \end{eqnarray}

Using this, the term after $\pi_m$ in the numerator of Eq.(\ref{dmu}) can be written as follows.

\begin{eqnarray} &&\frac{\partial}{\partial\boldsymbol{\mu}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) && \nonumber\\ &&\hspace{3em}= \frac{\partial}{\partial\boldsymbol{\mu}_m} (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}_m|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)\right\} \nonumber\\ &&\hspace{3em}= (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}_m|^{-\frac{1}{2}} \frac{\partial}{\partial\boldsymbol{\mu}_m} \exp\left\{-\frac{1}{2}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)\right\} \nonumber\\ &&\hspace{4em} \text{In general, } \exp'(f(\boldsymbol{x}))=\exp(f(\boldsymbol{x}))f'(\boldsymbol{x}) \text{, so} \nonumber\\ &&\hspace{3em}= (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}_m|^{-\frac{1}{2}} \exp\left\{-\frac{1}{2}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)\right\} \nonumber\\ &&\hspace{3em} \quad \times \frac{\partial}{\partial\boldsymbol{\mu}_m} \left\{-\frac{1}{2}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)\right\} \nonumber\\ &&\hspace{3em}= N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) \frac{\partial}{\partial\boldsymbol{\mu}_m} \left\{-\frac{1}{2}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)\right\} \nonumber\\ &&\hspace{3em}= -\frac{1}{2} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) \frac{\partial}{\partial\boldsymbol{\mu}_m} \left\{(\boldsymbol{\mu}_m-\boldsymbol{x}_n)^T\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{\mu}_m-\boldsymbol{x}_n)\right\} \\ &&\hspace{4em} \text{In general, } \displaystyle\frac{\partial\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}}{\partial\boldsymbol{x}}=(\boldsymbol{A}+\boldsymbol{A}^T)\boldsymbol{x} \text{, so}^* \nonumber\\ &&\hspace{3em}= -\frac{1}{2} 
N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) \left(\boldsymbol{\Sigma}_m^{-1}+\boldsymbol{\Sigma}_m^{-T}\right) (\boldsymbol{\mu}_m-\boldsymbol{x}_n) \\ &&\hspace{4em} \text{Since the covariance matrix } \boldsymbol{\Sigma}_m \text{ is symmetric, } \boldsymbol{\Sigma}_m^{-1} \text{ is also symmetric, so} \nonumber\\ &&\hspace{3em} \left(\boldsymbol{\Sigma}_m^{-1}+\boldsymbol{\Sigma}_m^{-T}\right)=2 \boldsymbol{\Sigma}_m^{-1} \nonumber\\ &&\hspace{3em}= - N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) \boldsymbol{\Sigma}_m^{-1} (\boldsymbol{\mu}_m-\boldsymbol{x}_n) \\ &&\hspace{3em}= N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) \boldsymbol{\Sigma}_m^{-1} (\boldsymbol{x}_n-\boldsymbol{\mu}_m) \end{eqnarray}

* Proof that $\displaystyle\frac{\partial\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x}}{\partial\boldsymbol{x}}=(\boldsymbol{A}+\boldsymbol{A}^T)\boldsymbol{x}$

Therefore, Eq.(\ref{dmu}) becomes

\begin{eqnarray} \frac{\partial}{\partial\boldsymbol{\mu}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& \sum_{n=1}^N \frac{\pi_m \displaystyle\frac{\partial}{\partial\boldsymbol{\mu}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} \nonumber\\ &=& \sum_{n=1}^N \frac{\pi_m N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} \boldsymbol{\Sigma}_m^{-1} (\boldsymbol{x}_n-\boldsymbol{\mu}_m) \end{eqnarray}
\begin{eqnarray} \frac{\partial}{\partial\boldsymbol{\mu}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& \sum_{n=1}^N \gamma_{n,m} \boldsymbol{\Sigma}_m^{-1} (\boldsymbol{x}_n-\boldsymbol{\mu}_m) \\ && \text{At a local maximum, the gradient is } \boldsymbol{0}\text{, so} \nonumber\\ &=& \boldsymbol{0} \end{eqnarray}

For this to hold, we need

\begin{eqnarray} \sum_{n=1}^N \gamma_{n,m} \boldsymbol{\Sigma}_m^{-1} \boldsymbol{x}_n &=& \sum_{n=1}^N \gamma_{n,m} \boldsymbol{\Sigma}_m^{-1} \boldsymbol{\mu}_m \end{eqnarray}

Multiplying both sides from the left by $\boldsymbol{\Sigma}_m$,

\begin{eqnarray} \sum_{n=1}^N \gamma_{n,m} \boldsymbol{x}_n &=& \sum_{n=1}^N \gamma_{n,m} \boldsymbol{\mu}_m \end{eqnarray}

Therefore,

\begin{eqnarray} \boldsymbol{\mu}_m &=& \frac{\displaystyle\sum_{n=1}^N \gamma_{n,m} \boldsymbol{x}_n}{\displaystyle\sum_{n=1}^N \gamma_{n,m}} \end{eqnarray}
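In other words, each $\boldsymbol{\mu}_m$ is the responsibility-weighted average of the samples. A minimal numpy sketch of this update follows; the hard 0/1 responsibilities below are just an illustrative special case.

```python
import numpy as np

def update_means(X, gamma):
    # mu_m = (sum_n gamma[n, m] x_n) / (sum_n gamma[n, m])
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]

X = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0]])
gamma = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(update_means(X, gamma))  # rows: [1, 0] and [10, 10]
```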

Deriving $\boldsymbol{\Sigma}_m$ that Maximizes the Log-Likelihood

In this section, we derive the maximum likelihood estimator of the covariance matrix. Since direct differentiation is difficult, we reparameterize the log-likelihood in terms of the precision matrix (the inverse), and combine the derivative of the log-determinant with the derivative of the outer-product quadratic form to obtain the estimator. The result is the responsibility-weighted mean of the outer products of the deviations.

Let the inverse of the covariance matrix $\boldsymbol{\Sigma}_m$ be

\begin{eqnarray} \boldsymbol{\Gamma}_m=\boldsymbol{\Sigma}_m^{-1} \label{G} \end{eqnarray}

Then a solution of

\begin{eqnarray} \frac{\partial}{\partial\boldsymbol{\Sigma}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}_m)\ =\ \boldsymbol{0} \end{eqnarray}

can be obtained by solving

\begin{eqnarray} \frac{\partial}{\partial\boldsymbol{\Gamma}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Gamma}_m^{-1})\ =\ \boldsymbol{0} \end{eqnarray}
\begin{eqnarray} \frac{\partial}{\partial\boldsymbol{\Gamma}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Gamma}_m^{-1}) &=& \frac{\partial}{\partial\boldsymbol{\Gamma}_m} \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k^{-1}) \\ &=& \sum_{n=1}^N \frac{\partial}{\partial\boldsymbol{\Gamma}_m} \log \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k^{-1}) \\ && \text{In general, } \log'(f(\boldsymbol{x}))=\frac{f'(\boldsymbol{x})}{f(\boldsymbol{x})} \text{, so} \nonumber\\ &=& \sum_{n=1}^N \frac{\displaystyle\frac{\partial}{\partial\boldsymbol{\Gamma}_m} \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k^{-1})}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k^{-1})} \\ && \text{Since } \frac{\partial}{\partial\boldsymbol{\Gamma}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)=0 \text{ when } k\neq m\text{,} \nonumber\\ &=& \sum_{n=1}^N \frac{\pi_m \displaystyle\frac{\partial}{\partial\boldsymbol{\Gamma}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Gamma}_m^{-1})}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k^{-1})} \label{dgamma} \end{eqnarray}

From Eq.(\ref{Normal}), the normal distribution $N(\boldsymbol{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})$ is

\begin{eqnarray} N(\boldsymbol{x}|\boldsymbol{\mu},\boldsymbol{\Sigma}) &=& (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Sigma}|^{-\frac{1}{2}}\exp\left\{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right\} \nonumber \end{eqnarray}

Therefore, the term after $\pi_m$ in the numerator of Eq.(\ref{dgamma}) can be written as follows.

\begin{eqnarray} && \frac{\partial}{\partial\boldsymbol{\Gamma}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Gamma}_m^{-1}) \nonumber\\ &=& (2\pi)^{-\frac{D}{2}} \frac{\partial}{\partial\boldsymbol{\Gamma}_m}|\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \label{dNdG} \end{eqnarray}

Since in general $\{f(\boldsymbol{x})g(\boldsymbol{x})\}'=f'(\boldsymbol{x})g(\boldsymbol{x})+f(\boldsymbol{x})g'(\boldsymbol{x})$, the expression after $\displaystyle\frac{\partial}{\partial\boldsymbol{\Gamma}_m}$ can be written as

\begin{eqnarray} && \left(\frac{\partial}{\partial\boldsymbol{\Gamma}_m} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}}\right) \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \nonumber\\ && + |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \left[\frac{\partial}{\partial\boldsymbol{\Gamma}_m} \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \right] \label{ddGm} \end{eqnarray}

The first part (inside the parentheses) is

\begin{eqnarray} \frac{\partial}{\partial\boldsymbol{\Gamma}_m} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} &=& \frac{1}{2} |\boldsymbol{\Gamma}_m|^{-\frac{1}{2}} \frac{\partial}{\partial\boldsymbol{\Gamma}_m} |\boldsymbol{\Gamma}_m| \\ && \text{In general, } \frac{\partial|\boldsymbol{X}|}{\partial\boldsymbol{X}}=|\boldsymbol{X}|\boldsymbol{X}^{-T} \text{, so}^* \nonumber\\ &=& \frac{1}{2} |\boldsymbol{\Gamma}_m|^{-\frac{1}{2}} |\boldsymbol{\Gamma}_m|\boldsymbol{\Gamma}_m^{-T} \\ &=& \frac{1}{2} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}}\boldsymbol{\Gamma}_m^{-T} \end{eqnarray}

* Proof that $\displaystyle\frac{\partial|\boldsymbol{X}|}{\partial\boldsymbol{X}}=|\boldsymbol{X}|\boldsymbol{X}^{-T}$

The second part (inside the brackets) is

\begin{eqnarray} &&\hspace{-2em}\displaystyle\frac{\partial}{\partial\boldsymbol{\Gamma}_m} \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \\ && \hspace{2em}\text{In general, } \exp'(f(\boldsymbol{x}))=\exp(f(\boldsymbol{x}))f'(\boldsymbol{x}) \text{, so} \nonumber\\ && \hspace{2em}=\exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \frac{\partial}{\partial\boldsymbol{\Gamma}_m} \left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \label{dexpdGm} \\ && \hspace{2em}\text{In general, } \displaystyle\frac{\partial\boldsymbol{a}^T \boldsymbol{X} \boldsymbol{a}}{\partial\boldsymbol{X}}=\boldsymbol{a}\boldsymbol{a}^T \text{, so}^* \nonumber\\ && \hspace{2em}=-\frac{1}{2} \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \end{eqnarray}

* Proof that $\displaystyle\frac{\partial\boldsymbol{a}^T \boldsymbol{X} \boldsymbol{a}}{\partial\boldsymbol{X}}=\boldsymbol{a}\boldsymbol{a}^T$

Therefore, writing $\boldsymbol{E}_m = \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\}$ for brevity, Eq.(\ref{ddGm}) becomes

\begin{eqnarray} && \hspace{-2em}\left(\frac{\partial}{\partial\boldsymbol{\Gamma}_m} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}}\right) \boldsymbol{E}_m + |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \left(\frac{\partial}{\partial\boldsymbol{\Gamma}_m} \boldsymbol{E}_m \right) \\ && \hspace{2em}=\frac{1}{2} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \boldsymbol{\Gamma}_m^{-T} \boldsymbol{E}_m - \frac{1}{2} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \boldsymbol{E}_m (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \\ && \hspace{2em}\text{The common factor } |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \boldsymbol{E}_m \text{ is a scalar and can be factored out:} \nonumber\\ && \hspace{2em}=\frac{1}{2} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \boldsymbol{E}_m \left\{ \boldsymbol{\Gamma}_m^{-T} - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \right\} \end{eqnarray}

Eq.(\ref{dNdG}) is obtained by multiplying Eq.(\ref{ddGm}) by $(2\pi)^{-\frac{D}{2}}$, so

\begin{eqnarray} && \hspace{-2em}\frac{\partial}{\partial\boldsymbol{\Gamma}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Gamma}_m^{-1}) \nonumber\\ && \hspace{2em}=(2\pi)^{-\frac{D}{2}} \frac{\partial}{\partial\boldsymbol{\Gamma}_m}|\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \nonumber\\ && \hspace{2em}\text{Substituting Eq.(\ref{ddGm}),} \nonumber\\ && \hspace{2em}=\frac{1}{2} (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \exp\left\{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)\right\} \nonumber\\ && \hspace{2em}\quad \times \left\{ \boldsymbol{\Gamma}_m^{-T} - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \right\} \nonumber\\ && \hspace{2em}\text{The } (2\pi)^{-\frac{D}{2}} |\boldsymbol{\Gamma}_m|^{\frac{1}{2}} \exp\{\cdots\} \text{ part above is } N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)\text{, so} \nonumber\\ && \hspace{2em}=\frac{1}{2} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) \left\{ \boldsymbol{\Gamma}_m^{-T} - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \right\} \label{ddGN} \end{eqnarray}

Therefore, Eq.(\ref{dgamma}) becomes

\begin{eqnarray} && \hspace{-2em}\frac{\partial}{\partial\boldsymbol{\Gamma}_m} \log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Gamma}_m^{-1}) \nonumber\\ && \hspace{2em}=\sum_{n=1}^N \frac{\pi_m \displaystyle\frac{\partial}{\partial\boldsymbol{\Gamma}_m} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Gamma}_m^{-1})}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k^{-1})} \nonumber\\ && \hspace{2em}\text{Substituting Eq.(\ref{ddGN}),} \nonumber\\ && \hspace{2em}=\frac{1}{2} \sum_{n=1}^N \frac{\pi_m N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Gamma}_k^{-1})} \left\{ \boldsymbol{\Gamma}_m^{-T} - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \right\} \\ && \hspace{2em}\text{Since } \boldsymbol{\Gamma}_m \text{ is the inverse of } \boldsymbol{\Sigma}_m\text{, } \boldsymbol{\Gamma}_m^{-T}=\boldsymbol{\Sigma}_m^T \nonumber\\ && \hspace{2em}=\frac{1}{2} \sum_{n=1}^N \frac{\pi_m N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} \left\{ \boldsymbol{\Sigma}_m^T - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \right\} \\ && \hspace{2em}\text{Using } \gamma_{n,m} \text{ from Eq.(\ref{gamma}),} \nonumber\\ && \hspace{2em}=\frac{1}{2} \sum_{n=1}^N \gamma_{n,m} \left\{ \boldsymbol{\Sigma}_m^T - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \right\} \\ && \hspace{2em}\text{Since } \boldsymbol{\Sigma}_m \text{ is symmetric, } \boldsymbol{\Sigma}_m^T=\boldsymbol{\Sigma}_m \nonumber\\ && \hspace{2em}=\frac{1}{2} \sum_{n=1}^N \gamma_{n,m} \left\{ \boldsymbol{\Sigma}_m - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \right\} \\ && \hspace{2em}\text{At a local maximum, the gradient is } 
\boldsymbol{0}\text{, so} \nonumber\\ && \hspace{2em}=\boldsymbol{0} \end{eqnarray}

Rearranging this condition gives

\begin{eqnarray} \sum_{n=1}^N \gamma_{n,m} \boldsymbol{\Sigma}_m &=& \sum_{n=1}^N \gamma_{n,m} (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \end{eqnarray}

from which we obtain

\begin{eqnarray} \boldsymbol{\Sigma}_m &=& \frac{1}{\displaystyle\sum_{n=1}^N \gamma_{n,m}} \sum_{n=1}^N \gamma_{n,m} (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T \end{eqnarray}
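This is the responsibility-weighted mean of the outer products of the deviations. A numpy sketch of the update follows; with all responsibility on a single component, it reduces to the ordinary biased sample covariance, which gives an easy check.

```python
import numpy as np

def update_covariance(X, gamma_m, mu_m):
    # Sigma_m = (1 / N_m) sum_n gamma[n, m] (x_n - mu_m)(x_n - mu_m)^T
    diff = X - mu_m
    return (gamma_m[:, None] * diff).T @ diff / gamma_m.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
# Unit responsibilities: the update becomes the biased sample covariance
Sigma = update_covariance(X, np.ones(len(X)), X.mean(axis=0))
print(Sigma)
```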

Deriving $\pi_m$ that Maximizes the Log-Likelihood

In this section, we derive the maximum likelihood estimator of the mixing coefficients. Since the mixing coefficients are subject to the equality constraint that they sum to 1, we formulate the problem as constrained optimization using the method of Lagrange multipliers and eliminate the multiplier to obtain a closed-form solution.

The mixing coefficients $\pi_m$ must satisfy the equality constraint $\displaystyle\sum_{m=1}^K \pi_m=1$. We therefore use the method of Lagrange multipliers, maximizing the log-likelihood function augmented by the term $\lambda\left(\displaystyle\sum_{k=1}^K \pi_k -1\right)$.

\begin{eqnarray} && \hspace{-2em}\frac{\partial}{\partial\pi_m} \left\{\log p(\boldsymbol{X}|\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) + \lambda\left(\sum_{k=1}^K \pi_k -1\right)\right\} \nonumber\\ && \hspace{2em}=\left\{\frac{\partial}{\partial\pi_m} \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\right\}+\lambda \\ && \hspace{2em}=\left\{\sum_{n=1}^N \frac{\partial}{\partial\pi_m} \log \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)\right\}+\lambda \\ && \hspace{2em}\text{In general, } \log'(f(\boldsymbol{x}))=\frac{f'(\boldsymbol{x})}{f(\boldsymbol{x})} \text{, so} \nonumber\\ && \hspace{2em}=\sum_{n=1}^N \frac{\displaystyle\frac{\partial}{\partial\pi_m} \sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} +\lambda \\ && \hspace{2em}\text{Since } \frac{\partial}{\partial\pi_m} \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)=0 \text{ when } k\neq m\text{,} \nonumber\\ && \hspace{2em}=\sum_{n=1}^N \frac{N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}+\lambda \\ && \hspace{2em}\text{Using } \gamma_{n,m} \text{ from Eq.(\ref{gamma}),} \nonumber\\ && \hspace{2em}=\frac{1}{\pi_m} \sum_{n=1}^N \gamma_{n,m}+\lambda \\ && \hspace{2em}=0 \end{eqnarray}

Multiplying both sides by $\pi_m$,

\begin{eqnarray} \sum_{n=1}^N \gamma_{n,m} + \pi_m \lambda &=& 0 \end{eqnarray}

gives

\begin{eqnarray} \pi_m &=& -\frac{1}{\lambda} \sum_{n=1}^N \gamma_{n,m} \label{pim} \end{eqnarray}

From Eq.(\ref{sumpi1}),

\begin{eqnarray} \sum_{m=1}^K \pi_m &=& 1 \end{eqnarray}

Considering this constraint,

\begin{eqnarray} \sum_{m=1}^K \pi_m &=& - \sum_{m=1}^K \frac{1}{\lambda} \sum_{n=1}^N \gamma_{n,m} \ =\ 1 \end{eqnarray}

Rearranging,

\begin{eqnarray} - \sum_{m=1}^K \frac{1}{\lambda} \sum_{n=1}^N \gamma_{n,m} &=& - \frac{1}{\lambda} \sum_{m=1}^K \sum_{n=1}^N \gamma_{n,m} \\ &=& - \frac{1}{\lambda} \sum_{n=1}^N \sum_{m=1}^K \gamma_{n,m} \\ && \text{From Eq.(\ref{sumgamma}), } \sum_{m=1}^K \gamma_{n,m}=1 \text{, so} \nonumber\\ &=& - \frac{1}{\lambda} \sum_{n=1}^N 1 \\ &=& - \frac{N}{\lambda} \\ &=& 1 \end{eqnarray}

which gives

\begin{eqnarray} \lambda &=& -N \end{eqnarray}

Substituting into Eq.(\ref{pim}),

\begin{eqnarray} \pi_m &=& -\frac{1}{\lambda} \sum_{n=1}^N \gamma_{n,m} \nonumber\\ &=& \frac{1}{N} \sum_{n=1}^N \gamma_{n,m} \end{eqnarray}
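So each mixing coefficient is simply the average responsibility of its component. As a one-line numpy sketch:

```python
import numpy as np

def update_weights(gamma):
    # pi_m = (1 / N) sum_n gamma[n, m]
    return gamma.mean(axis=0)

gamma = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
print(update_weights(gamma))  # sums to 1 because each row of gamma does
```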

EM Algorithm

Summarizing the results obtained so far,

\begin{eqnarray} \gamma_{n,m} &=& \frac{\pi_m N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} \nonumber \end{eqnarray}

the parameters that maximize the log-likelihood function subject to the above are

\begin{eqnarray} \left\{ \begin{array}{rcl} \boldsymbol{\mu}_m &=& \frac{\displaystyle\sum_{n=1}^N \gamma_{n,m} \boldsymbol{x}_n}{\displaystyle\sum_{n=1}^N \gamma_{n,m}}, \\ \boldsymbol{\Sigma}_m &=& \displaystyle\frac{1}{\displaystyle\sum_{n=1}^N \gamma_{n,m}} \displaystyle\sum_{n=1}^N \gamma_{n,m} (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T, \\ \pi_m &=& \displaystyle\frac{1}{N} \sum_{n=1}^N \gamma_{n,m} \end{array} \right.\nonumber \end{eqnarray}

Since $\displaystyle\sum_{n=1}^N \gamma_{n,m}$ can be interpreted as the estimated number of samples generated by the $m$-th normal distribution, we write it as $N_m$ for a more compact notation.

\begin{eqnarray} \left\{ \begin{array}{rcl} \boldsymbol{\mu}_m &=& \displaystyle\frac{1}{N_m} \sum_{n=1}^N \gamma_{n,m} \boldsymbol{x}_n,\\ \boldsymbol{\Sigma}_m &=& \displaystyle\frac{1}{N_m} \sum_{n=1}^N \gamma_{n,m} (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T,\\ \pi_m &=& \displaystyle\frac{N_m}{N} \end{array} \right. \end{eqnarray}

However, since the formula for $\gamma_{n,m}$ contains $\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m, \pi_m$, and the formulas for $\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m, \pi_m$ contain $\gamma_{n,m}$, this system of equations is nonlinear and cannot be solved in closed form. We therefore adopt the EM algorithm, which starts from initial values and alternates between the E-step and M-step, iteratively converging to a local optimum of the log-likelihood.

Procedure of the EM Algorithm
  1. Set initial values $\boldsymbol{\mu}_m^{(0)}, \boldsymbol{\Sigma}_m^{(0)}, \pi_m^{(0)}$ and let $i=0$.
  2. Using $\boldsymbol{\mu}_m^{(i)}, \boldsymbol{\Sigma}_m^{(i)}, \pi_m^{(i)}$, compute $\gamma_{n,m}^{(i)}$ and $N_m^{(i)}$ (E-step):
    \begin{eqnarray} \left\{ \begin{array}{rcl} \gamma_{n,m}^{(i)} &=& \frac{\pi_m^{(i)} N(\boldsymbol{x}_n|\boldsymbol{\mu}_m^{(i)},\boldsymbol{\Sigma}_m^{(i)})}{\displaystyle\sum_{k=1}^K \pi_k^{(i)} N(\boldsymbol{x}_n|\boldsymbol{\mu}_k^{(i)},\boldsymbol{\Sigma}_k^{(i)})} \\ N_m^{(i)} &=& \displaystyle\sum_{n=1}^N \gamma_{n,m}^{(i)} \end{array} \right. \end{eqnarray}
  3. Using $\gamma_{n,m}^{(i)}$ and $N_m^{(i)}$, update the parameters $\boldsymbol{\mu}_m^{(i+1)}, \boldsymbol{\Sigma}_m^{(i+1)}, \pi_m^{(i+1)}$ (M-step):
    \begin{eqnarray} \left\{ \begin{array}{rcl} \boldsymbol{\mu}_m^{(i+1)} &=& \displaystyle\frac{1}{N_m^{(i)}} \sum_{n=1}^N \gamma_{n,m}^{(i)} \boldsymbol{x}_n, \\ \boldsymbol{\Sigma}_m^{(i+1)} &=& \displaystyle\frac{1}{N_m^{(i)}} \sum_{n=1}^N \gamma_{n,m}^{(i)} (\boldsymbol{x}_n-\boldsymbol{\mu}_m^{(i)})(\boldsymbol{x}_n-\boldsymbol{\mu}_m^{(i)})^T, \\ \pi_m^{(i+1)} &=& \displaystyle\frac{N_m^{(i)}}{N} \end{array} \right. \label{update} \end{eqnarray}
  4. Evaluate the log-likelihood with the updated parameters $\boldsymbol{\mu}_m^{(i+1)}, \boldsymbol{\Sigma}_m^{(i+1)}, \pi_m^{(i+1)}$:
    \begin{eqnarray} \log p(\boldsymbol{X}|\boldsymbol{\pi}^{(i+1)},\boldsymbol{\mu}^{(i+1)},\boldsymbol{\Sigma}^{(i+1)}) &=& \sum_{n=1}^N \log \sum_{k=1}^M \pi_k^{(i+1)} \mathcal{N}(\boldsymbol{x}_n|\boldsymbol{\mu}_k^{(i+1)},\boldsymbol{\Sigma}_k^{(i+1)}) \end{eqnarray}
    If sufficiently converged, terminate. Otherwise, set $i\leftarrow i+1$ and return to Step 2.
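The four steps above can be sketched directly in NumPy/SciPy. This is a minimal illustrative implementation, not code from any particular library; the function name `em_gmm` and its defaults are assumptions for the example. The covariance update deliberately uses the old means $\boldsymbol{\mu}_m^{(i)}$, matching the update equations in Step 3.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, M, n_iter=100, tol=1e-6, seed=0):
    """EM for a GMM, following Steps 1-4 of the procedure above (sketch)."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Step 1: initial values (random data points as means, identity covariances)
    mu = X[rng.choice(N, size=M, replace=False)]
    Sigma = np.stack([np.eye(D)] * M)
    pi = np.full(M, 1.0 / M)
    lls = []
    for _ in range(n_iter):
        # Step 2 (E-step): responsibilities gamma_{n,m} and effective counts N_m
        dens = np.stack([pi[m] * multivariate_normal.pdf(X, mu[m], Sigma[m])
                         for m in range(M)], axis=1)          # shape (N, M)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        Nm = gamma.sum(axis=0)
        # Step 3 (M-step): update the parameters; the covariance update uses
        # the OLD means mu^(i), as in the equations in the text
        mu_old = mu
        mu = (gamma.T @ X) / Nm[:, None]
        Sigma = np.stack([(gamma[:, m, None] * (X - mu_old[m])).T
                          @ (X - mu_old[m]) / Nm[m] for m in range(M)])
        pi = Nm / N
        # Step 4: log-likelihood under the updated parameters; terminate
        # once the improvement is sufficiently small
        dens_new = np.stack([pi[m] * multivariate_normal.pdf(X, mu[m], Sigma[m])
                             for m in range(M)], axis=1)
        lls.append(np.log(dens_new.sum(axis=1)).sum())
        if len(lls) > 1 and lls[-1] - lls[-2] < tol:
            break
    return mu, Sigma, pi, lls
```

The returned list `lls` records the log-likelihood per iteration, which is convenient for checking the monotone (non-decreasing) behavior proved later in this article.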
Caution

The EM algorithm is not guaranteed to converge to the global optimum from an arbitrary initial point.

Variant Using the Updated $\boldsymbol{\mu}_m^{(i+1)}$ (ECM)

In the computation of $\boldsymbol{\Sigma}_m^{(i+1)}$ in Eq.(\ref{update}), one may use $\boldsymbol{\mu}_m^{(i+1)}$ (the updated value computed earlier in the same M-step) instead of $\boldsymbol{\mu}_m^{(i)}$ (the old value).

\begin{eqnarray} \boldsymbol{\Sigma}_m^{(i+1)} &=& \frac{1}{N_m^{(i)}} \sum_{n=1}^N \gamma_{n,m}^{(i)} (\boldsymbol{x}_n-\boldsymbol{\mu}_m^{\color{red}(i+1)})(\boldsymbol{x}_n-\boldsymbol{\mu}_m^{\color{red}(i+1)})^T \end{eqnarray}

This approach can be viewed as an instance of the ECM (Expectation Conditional Maximization) algorithm [Meng & Rubin, 1993], where the M-step sequentially performs conditional maximization over $\boldsymbol{\mu}_m \to \boldsymbol{\Sigma}_m \to \pi_m$.

The standard convergence proof of EM relies on the M-step maximizing all parameters simultaneously, so it does not directly apply to sequential updates. However, Meng & Rubin (1993) proved that if each conditional maximization step increases $Q$, then the log-likelihood is non-decreasing overall. Since GMM satisfies this condition, convergence is preserved.

In practice, using the updated $\boldsymbol{\mu}_m^{(i+1)}$ can sometimes lead to faster convergence. Major libraries such as scikit-learn adopt this approach.
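The variant amounts to computing the new means first and reusing them in the covariance update within the same M-step. A small self-contained sketch (function name and shapes are illustrative):

```python
import numpy as np

def m_step_ecm(X, gamma):
    """One ECM-style M-step: the covariance update reuses the NEW means."""
    Nm = gamma.sum(axis=0)                     # effective counts N_m
    mu = (gamma.T @ X) / Nm[:, None]           # conditional maximization over mu
    diff = X[:, None, :] - mu[None, :, :]      # (N, M, D), uses the updated mu
    # Weighted outer products summed over n, one (D, D) matrix per component
    Sigma = np.einsum('nm,nmd,nme->mde', gamma, diff, diff) / Nm[:, None, None]
    pi = Nm / X.shape[0]
    return mu, Sigma, pi
```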

Convergence of the EM Algorithm

In this section, we prove that the log-likelihood monotonically increases (more precisely, is non-decreasing) at each iteration of the EM algorithm. We decompose the log-likelihood into the ELBO (Evidence Lower Bound) and the KL divergence, and show that the E-step drives the KL divergence to zero while the M-step maximizes the ELBO, so the log-likelihood cannot decrease from one iteration to the next.

Lower Bound of the Log-Likelihood

Let $\boldsymbol{\theta}=\{\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}\}$ denote the set of parameters and $\boldsymbol{Z}=\{z_1, z_2, \cdots, z_N\}$ the set of latent variables. For any probability distribution $q(\boldsymbol{Z})$ over the latent variables $\boldsymbol{Z}$, the log-likelihood can be decomposed as follows.

\begin{eqnarray} \log p(\boldsymbol{X}|\boldsymbol{\theta}) &=& \log p(\boldsymbol{X}|\boldsymbol{\theta}) \underbrace{\sum_{\boldsymbol{Z}} q(\boldsymbol{Z})}_{=1} \\ &=& \sum_{\boldsymbol{Z}} q(\boldsymbol{Z}) \log p(\boldsymbol{X}|\boldsymbol{\theta}) \\ && \text{By the definition of conditional probability, } p(\boldsymbol{X},\boldsymbol{Z}|\boldsymbol{\theta}) = p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta}) p(\boldsymbol{X}|\boldsymbol{\theta}) \text{, so} \nonumber \\ &=& \sum_{\boldsymbol{Z}} q(\boldsymbol{Z}) \log \frac{p(\boldsymbol{X},\boldsymbol{Z}|\boldsymbol{\theta})}{p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})} \\ &=& \sum_{\boldsymbol{Z}} q(\boldsymbol{Z}) \log \frac{p(\boldsymbol{X},\boldsymbol{Z}|\boldsymbol{\theta})}{q(\boldsymbol{Z})} \cdot \frac{q(\boldsymbol{Z})}{p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})} \\ &=& \sum_{\boldsymbol{Z}} q(\boldsymbol{Z}) \log \frac{p(\boldsymbol{X},\boldsymbol{Z}|\boldsymbol{\theta})}{q(\boldsymbol{Z})} + \sum_{\boldsymbol{Z}} q(\boldsymbol{Z}) \log \frac{q(\boldsymbol{Z})}{p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})} \\ &=& \mathcal{L}(q,\boldsymbol{\theta}) + \mathrm{KL}(q \| p) \end{eqnarray}

Here, $\mathcal{L}(q,\boldsymbol{\theta})$ and $\mathrm{KL}(q \| p)$ are defined as follows.

\begin{eqnarray} \mathcal{L}(q,\boldsymbol{\theta}) &\triangleq& \sum_{\boldsymbol{Z}} q(\boldsymbol{Z}) \log \frac{p(\boldsymbol{X},\boldsymbol{Z}|\boldsymbol{\theta})}{q(\boldsymbol{Z})} \\ \mathrm{KL}(q \| p) &\triangleq& \sum_{\boldsymbol{Z}} q(\boldsymbol{Z}) \log \frac{q(\boldsymbol{Z})}{p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})} \end{eqnarray}

$\mathcal{L}(q,\boldsymbol{\theta})$ is called the Evidence Lower Bound (ELBO), also known as the negative variational free energy. The fundamental idea of the EM algorithm is to maximize this lower bound instead of the log-likelihood $\log p(\boldsymbol{X}|\boldsymbol{\theta})$, which is difficult to maximize directly.

$\mathrm{KL}(q \| p)$ is the Kullback-Leibler divergence (KL divergence) between $q(\boldsymbol{Z})$ and the posterior distribution $p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})$, which always satisfies $\mathrm{KL}(q \| p) \geq 0$ (with equality when $q(\boldsymbol{Z}) = p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})$). Therefore, $\mathcal{L}(q,\boldsymbol{\theta})$ is a lower bound on the log-likelihood.

\begin{eqnarray} \log p(\boldsymbol{X}|\boldsymbol{\theta}) &\geq& \mathcal{L}(q,\boldsymbol{\theta}) \end{eqnarray}

Direct maximization of $\log p(\boldsymbol{X}|\boldsymbol{\theta})$ is difficult because the sum over components sits inside the logarithm; maximizing the lower bound $\mathcal{L}(q,\boldsymbol{\theta})$ instead lets us increase the log-likelihood indirectly.
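The decomposition can be verified numerically. The sketch below uses a single observation from a two-component 1-D mixture and an arbitrary $q(\boldsymbol{Z})$; all parameter values are illustrative. For any choice of $q$, the ELBO and KL divergence sum exactly to the log-likelihood, and the ELBO sits below it.

```python
import numpy as np
from scipy.stats import norm

x = 1.0
pi_ = np.array([0.3, 0.7])     # mixing coefficients (illustrative)
mu_ = np.array([0.0, 2.0])     # component means
sd_ = np.array([1.0, 0.5])     # component standard deviations

joint = pi_ * norm.pdf(x, mu_, sd_)   # p(x, z=m | theta) for each component
log_px = np.log(joint.sum())          # log p(x | theta)
post = joint / joint.sum()            # posterior p(z | x, theta)

q = np.array([0.5, 0.5])              # any distribution q(z) works here
elbo = (q * np.log(joint / q)).sum()  # ELBO  L(q, theta)
kl = (q * np.log(q / post)).sum()     # KL(q || p(z | x, theta))

assert np.isclose(elbo + kl, log_px)  # the decomposition holds exactly
assert kl > 0 and elbo < log_px       # q != posterior, so the bound is strict
```

Setting `q = post` makes `kl` vanish and `elbo` equal `log_px`, which is exactly what the E-step does.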

Interpretation of the E-step

In the E-step, we fix the current parameters $\boldsymbol{\theta}^{(i)}$ and find the $q(\boldsymbol{Z})$ that maximizes the lower bound $\mathcal{L}(q,\boldsymbol{\theta}^{(i)})$.

In $\log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i)}) = \mathcal{L}(q,\boldsymbol{\theta}^{(i)}) + \mathrm{KL}(q \| p)$, the left-hand side does not depend on $q$, so maximizing $\mathcal{L}$ is equivalent to minimizing $\mathrm{KL}(q \| p)$. Since $\mathrm{KL}(q \| p) \geq 0$ with equality when $q(\boldsymbol{Z}) = p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta}^{(i)})$, the optimal solution of the E-step is

\begin{eqnarray} q^{(i)}(\boldsymbol{Z}) &=& p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta}^{(i)}) \end{eqnarray}

At this point, $\mathrm{KL}(q^{(i)} \| p) = 0$, and the lower bound equals the log-likelihood.

\begin{eqnarray} \mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i)}) &=& \log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i)}) \end{eqnarray}

Interpretation of the M-step

In the M-step, we fix $q^{(i)}(\boldsymbol{Z})$ obtained in the E-step and find the parameters $\boldsymbol{\theta}^{(i+1)}$ that maximize the lower bound $\mathcal{L}(q^{(i)},\boldsymbol{\theta})$.

\begin{eqnarray} \boldsymbol{\theta}^{(i+1)} &=& \arg\max_{\boldsymbol{\theta}} \mathcal{L}(q^{(i)},\boldsymbol{\theta}) \end{eqnarray}

By maximization, we have $\mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i+1)}) \geq \mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i)})$.

Monotonic Increase of the Log-Likelihood

From the above, tracking the change in log-likelihood at each iteration gives

\begin{eqnarray} \log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i+1)}) &=& \mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i+1)}) + \mathrm{KL}(q^{(i)} \| p(\cdot|\boldsymbol{X},\boldsymbol{\theta}^{(i+1)})) \\ && \text{Since the KL divergence is always nonnegative,} \nonumber \\ &\geq& \mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i+1)}) \\ && \text{Since the lower bound was maximized in the M-step,} \nonumber \\ &\geq& \mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i)}) \\ && \text{Since } \mathrm{KL}=0 \text{ was set in the E-step,} \nonumber \\ &=& \log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i)}) \end{eqnarray}

Therefore, the log-likelihood monotonically increases (is non-decreasing) at each iteration of the EM algorithm.

\begin{eqnarray} \log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i+1)}) &\geq& \log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i)}) \end{eqnarray}

If singular solutions are excluded by regularizing the covariance matrix (e.g., adding $\varepsilon\boldsymbol{I}$ to $\boldsymbol{\Sigma}_m$), the log-likelihood is bounded above, and hence the monotonically increasing sequence converges. However, there is no guarantee that the convergence point is the global optimum; convergence to a local optimum or a saddle point is possible.
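scikit-learn applies exactly this kind of regularization: `GaussianMixture` adds its `reg_covar` parameter (default `1e-6`) to the diagonal of each estimated covariance matrix, keeping it positive definite and ruling out collapsed (singular) components. A minimal usage sketch, with illustrative synthetic data:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two well-separated 2-D clusters (synthetic data for illustration)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),
               rng.normal(5.0, 1.0, (100, 2))])

# reg_covar adds eps*I to each covariance estimate during the M-step
gm = GaussianMixture(n_components=2, reg_covar=1e-6, random_state=0).fit(X)
```

After fitting, `gm.converged_` reports whether the log-likelihood improvement fell below the tolerance, and every matrix in `gm.covariances_` is positive definite by construction.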

References

  • C.M. Bishop, "Pattern Recognition and Machine Learning", Springer, 2006
  • X.L. Meng and D.B. Rubin, "Maximum Likelihood Estimation via the ECM Algorithm: A General Framework", Biometrika, 80(2), pp.267-278, 1993