What is GMM (Gaussian Mixture Model)? Complete Derivation of the EM Algorithm
What is GMM?
Definition: GMM (Gaussian Mixture Model)
A Gaussian Mixture Model (GMM) is a probabilistic model that represents the probability distribution of data as a weighted sum of multiple normal (Gaussian) distributions:

$$p(\boldsymbol{x}) = \sum_{m=1}^{K}\pi_m N(\boldsymbol{x}\mid\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)$$

Here, $K$ is the number of component distributions, $\pi_m$ is the mixing coefficient ($\sum_m \pi_m = 1$, $\pi_m \geq 0$), $\boldsymbol{\mu}_m$ is the mean vector, and $\boldsymbol{\Sigma}_m$ is the covariance matrix.
Intuitive Understanding
Consider a data distribution with multiple peaks that cannot be adequately represented by a single normal distribution. A GMM represents each peak as an independent normal distribution and combines them through mixing coefficients $\pi_m$, thereby flexibly approximating complex distributions. The parameters are estimated using the EM algorithm (Expectation-Maximization).
Applications
- Clustering: Unlike K-means, each data point can belong to multiple clusters with different probabilities (soft clustering)
- Anomaly detection: Learn the density of normal data with a GMM and flag points in low-probability regions as anomalies
- Speech recognition: Modeling the acoustic features of speakers
- Density estimation: As a parametric alternative to kernel density estimation
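As a quick illustration of the "weighted sum of Gaussians" idea, the minimal numpy sketch below (the two-component parameters are made-up values) evaluates a mixture density that is bimodal yet still integrates to 1:

```python
import numpy as np

def normal_pdf(x, mu, var):
    """Density of the univariate normal N(mu, var) at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def gmm_pdf(x, pi, mu, var):
    """Mixture density: p(x) = sum_m pi_m * N(x | mu_m, var_m)."""
    return sum(p * normal_pdf(x, m, v) for p, m, v in zip(pi, mu, var))

# Two well-separated components give a bimodal density that
# no single Gaussian can represent.
pi, mu, var = [0.4, 0.6], [-2.0, 3.0], [1.0, 1.0]
x = np.linspace(-6.0, 8.0, 1401)
density = gmm_pdf(x, pi, mu, var)

dx = x[1] - x[0]
print((density * dx).sum())  # total mass is numerically close to 1
```

Because each component integrates to 1 and the weights sum to 1, the mixture is automatically a valid density.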
GMM vs K-means vs Kernel Density Estimation
| Feature | GMM | K-means | Kernel Density Estimation |
|---|---|---|---|
| Cluster assignment | Soft (probabilistic) | Hard (single cluster) | — |
| Cluster shape | Elliptical (arbitrary covariance) | Spherical only | — |
| Probabilistic model | Yes (generative model) | No | Yes (nonparametric) |
| Number of parameters | $O(KD^2)$ | $O(KD)$ | $O(ND)$ (depends on data size) |
| Estimation method | EM algorithm | Lloyd's algorithm | Bandwidth selection |
Introduction (Scope of This Article)
Building on the overview of GMM presented above, this article provides a complete derivation of the parameter estimation via the EM algorithm, without skipping intermediate steps.
Specifically, for each of the mean vector, the covariance matrix, and the mixing coefficients, we rigorously derive the maximum likelihood estimator from the partial derivatives of the log-likelihood function using matrix calculus.
For the covariance matrix, we adopt the approach of working through the precision matrix (the inverse), combining the derivative of the determinant with the derivative of the quadratic form.
The mixing coefficients are estimated via constrained optimization using the method of Lagrange multipliers, and finally, we present a convergence proof of the EM algorithm based on the ELBO (Evidence Lower Bound) and the KL divergence.
Throughout this article, boldface denotes vectors or matrices, and for a matrix $\boldsymbol{A}$, we write $\boldsymbol{A}^{-T}=\left(\boldsymbol{A}^{-1}\right)^T$ for the transpose of its inverse.
Normal Distribution
The $D$-dimensional normal distribution, parameterized by a mean vector $\boldsymbol{\mu}\in\mathbb{R}^D$ and a nonsingular covariance matrix $\boldsymbol{\Sigma}\in\mathbb{R}^{D\times D}$, is given by

$$N(\boldsymbol{x}\mid\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left\{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}-\boldsymbol{\mu})\right\}$$
We define the Gaussian mixture distribution as a linear combination of normal distributions $N(\boldsymbol{x}|\boldsymbol{\mu},\boldsymbol{\Sigma})$ as follows.

$$p(\boldsymbol{x}) = \sum_{m=1}^{K}\pi_m N(\boldsymbol{x}\mid\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m),\qquad \sum_{m=1}^{K}\pi_m = 1,\quad \pi_m \geq 0 \label{GMM}$$
Assuming the observed samples $\boldsymbol{x}_n\in\mathbb{R}^D,\ n=1,2,\cdots,N$ are generated from the Gaussian mixture model in Eq.(\ref{GMM}), we seek the parameters $\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}$ that maximize (more precisely, find a local maximum of) the log-likelihood function

$$\log L(\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}) = \sum_{n=1}^{N}\log\sum_{m=1}^{K}\pi_m N(\boldsymbol{x}_n\mid\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m).$$
Let $z_n \in \{1, 2, \cdots, K\}$ be a latent variable indicating which normal distribution generated the sample $\boldsymbol{x}_n$, and let $P(z_n = m) = \pi_m$.
The posterior probability that the sample $\boldsymbol{x}_n$ was generated by the $m$-th normal distribution can be computed using Bayes' theorem as

$$\gamma_{n,m} \equiv P(z_n = m \mid \boldsymbol{x}_n) = \frac{\pi_m N(\boldsymbol{x}_n\mid\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^{K}\pi_k N(\boldsymbol{x}_n\mid\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)}.$$

Since the sample $\boldsymbol{x}_n$ was generated by one of the $K$ normal distributions, summing $\gamma_{n,m}$ over $m=1,2,\cdots,K$ naturally yields a total probability of 1.
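In numpy, the Bayes computation above can be sketched as follows (univariate components; the parameters are made-up illustrative values):

```python
import numpy as np

def responsibilities(X, pi, mu, var):
    """gamma[n, m] = pi_m N(x_n|mu_m, var_m) / sum_k pi_k N(x_n|mu_k, var_k)."""
    X = np.asarray(X)[:, None]                      # shape (N, 1)
    pi, mu, var = (np.asarray(a) for a in (pi, mu, var))
    dens = np.exp(-(X - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    num = pi * dens                                 # numerator, shape (N, K)
    return num / num.sum(axis=1, keepdims=True)     # normalize over m

X = [-2.0, 0.5, 3.0]
gamma = responsibilities(X, [0.5, 0.5], [-2.0, 3.0], [1.0, 1.0])
print(gamma.sum(axis=1))   # each row sums to 1
```

The point at $x=0.5$, equidistant from both component means, receives responsibility $0.5$ from each; the points at the component means are assigned almost entirely to their own component.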
Deriving $\boldsymbol{\mu}_m$ that Maximizes the Log-Likelihood
In this section, we differentiate the log-likelihood function with respect to the mean vector and apply the vector derivative of a quadratic form to derive the maximum likelihood estimator.
The result shows that the mean vector of each cluster is the responsibility-weighted average of the samples.
The following derivation adopts the denominator layout convention: the derivative of a scalar function with respect to a vector is a column vector.
Differentiating the log-likelihood function with respect to $\boldsymbol{\mu}_m$ and setting the result to zero, we obtain

$$\frac{\partial \log L}{\partial \boldsymbol{\mu}_m} = \sum_{n=1}^{N}\gamma_{n,m}\,\boldsymbol{\Sigma}_m^{-1}(\boldsymbol{x}_n-\boldsymbol{\mu}_m) = \boldsymbol{0} \quad\Longrightarrow\quad \boldsymbol{\mu}_m = \frac{\displaystyle\sum_{n=1}^{N}\gamma_{n,m}\,\boldsymbol{x}_n}{\displaystyle\sum_{n=1}^{N}\gamma_{n,m}}.$$
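The weighted-average form of this estimator is a one-liner in numpy. As a sanity check (a minimal sketch with synthetic data and responsibilities), uniform responsibilities collapse every cluster mean to the overall sample mean:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))          # N=6 samples, D=2
K = 3
gamma = np.full((6, K), 1.0 / K)     # uniform responsibilities (sanity check)

N_m = gamma.sum(axis=0)              # effective counts, N_m = sum_n gamma_{n,m}
mu = (gamma.T @ X) / N_m[:, None]    # mu_m = (1/N_m) sum_n gamma_{n,m} x_n

# With uniform responsibilities every cluster mean equals the sample mean.
print(np.allclose(mu, X.mean(axis=0)))  # True
```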
Deriving $\boldsymbol{\Sigma}_m$ that Maximizes the Log-Likelihood
In this section, we derive the maximum likelihood estimator of the covariance matrix.
Since direct differentiation is difficult, we reparameterize the log-likelihood in terms of the precision matrix (the inverse), and combine the derivative of the log-determinant with the derivative of the outer-product quadratic form to obtain the estimator.
The result is the responsibility-weighted mean of the outer products of the deviations.
Let the inverse of the covariance matrix $\boldsymbol{\Sigma}_m$ be the precision matrix

$$\boldsymbol{\Gamma}_m = \boldsymbol{\Sigma}_m^{-1},$$

so that $\log N(\boldsymbol{x}_n\mid\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m) = -\frac{D}{2}\log 2\pi + \frac{1}{2}\log|\boldsymbol{\Gamma}_m| - \frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)$.
Since in general $\{f(\boldsymbol{x})g(\boldsymbol{x})\}'=f'(\boldsymbol{x})g(\boldsymbol{x})+f(\boldsymbol{x})g'(\boldsymbol{x})$, applying $\displaystyle\frac{\partial}{\partial\boldsymbol{\Gamma}_m}$ with the identities $\frac{\partial}{\partial\boldsymbol{\Gamma}_m}\log|\boldsymbol{\Gamma}_m|=\boldsymbol{\Gamma}_m^{-T}=\boldsymbol{\Sigma}_m$ and $\frac{\partial}{\partial\boldsymbol{\Gamma}_m}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\boldsymbol{\Gamma}_m(\boldsymbol{x}_n-\boldsymbol{\mu}_m)=(\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T$, and setting the derivative of the log-likelihood to zero, we obtain

$$\sum_{n=1}^{N}\gamma_{n,m}\left\{\boldsymbol{\Sigma}_m - (\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T\right\} = \boldsymbol{0} \quad\Longrightarrow\quad \boldsymbol{\Sigma}_m = \frac{\displaystyle\sum_{n=1}^{N}\gamma_{n,m}(\boldsymbol{x}_n-\boldsymbol{\mu}_m)(\boldsymbol{x}_n-\boldsymbol{\mu}_m)^T}{\displaystyle\sum_{n=1}^{N}\gamma_{n,m}}.$$
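The weighted-outer-product form translates directly into an einsum. As a check (a numpy sketch with synthetic data), a single component with all responsibilities equal to one recovers the ordinary biased sample covariance:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                  # N=100 samples, D=2
gamma = np.ones((100, 1))                      # K=1, every responsibility = 1

N_m = gamma.sum(axis=0)                        # effective counts, shape (K,)
mu = (gamma.T @ X) / N_m[:, None]              # cluster means, shape (K, D)
d = X[:, None, :] - mu[None, :, :]             # deviations, shape (N, K, D)
# Sigma_m = (1/N_m) sum_n gamma_{n,m} (x_n - mu_m)(x_n - mu_m)^T
Sigma = np.einsum('nm,nmd,nme->mde', gamma, d, d) / N_m[:, None, None]

# With one component and unit responsibilities this is the biased sample covariance.
print(np.allclose(Sigma[0], np.cov(X.T, bias=True)))  # True
```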
Deriving $\pi_m$ that Maximizes the Log-Likelihood
In this section, we derive the maximum likelihood estimator of the mixing coefficients.
Since the mixing coefficients are subject to the equality constraint that they sum to 1, we formulate the problem as constrained optimization using the method of Lagrange multipliers and eliminate the multiplier to obtain a closed-form solution.
The mixing coefficients $\pi_m$ must satisfy the strong constraint $\displaystyle\sum_{m=1}^K \pi_m=1$. We therefore use the method of Lagrange multipliers to maximize the log-likelihood function augmented by the term $\lambda\left(\displaystyle\sum_{k=1}^K \pi_k -1\right)$.
Since $\displaystyle\sum_{n=1}^N \gamma_{n,m}$ can be interpreted as the estimated number of samples generated by the $m$-th normal distribution, we write it as $N_m$ for a more compact notation.
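Spelling out the elimination of the multiplier (a standard step, restated here for completeness): setting the derivative of the augmented objective with respect to $\pi_m$ to zero gives

$$\frac{\partial}{\partial \pi_m}\left\{\log L + \lambda\left(\sum_{k=1}^K \pi_k - 1\right)\right\} = \sum_{n=1}^N \frac{N(\boldsymbol{x}_n|\boldsymbol{\mu}_m,\boldsymbol{\Sigma}_m)}{\displaystyle\sum_{k=1}^K \pi_k N(\boldsymbol{x}_n|\boldsymbol{\mu}_k,\boldsymbol{\Sigma}_k)} + \lambda = \frac{N_m}{\pi_m} + \lambda = 0.$$

Multiplying by $\pi_m$ gives $N_m = -\lambda\pi_m$; summing over $m$ and using $\sum_m \pi_m = 1$ and $\sum_m N_m = N$ yields $\lambda = -N$, hence

$$\pi_m = \frac{N_m}{N}.$$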
However, since the formula for $\gamma_{n,m}$ contains $\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m, \pi_m$, and the formulas for $\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m, \pi_m$ contain $\gamma_{n,m}$, this system of equations is nonlinear and cannot be solved in closed form. We therefore adopt the EM algorithm, which starts from initial values and alternates between the E-step and M-step, iteratively converging toward the optimal solution.
Procedure of the EM Algorithm
1. Set initial values $\boldsymbol{\mu}_m^{(0)}, \boldsymbol{\Sigma}_m^{(0)}, \pi_m^{(0)}$ and let $i=0$.
2. Using $\boldsymbol{\mu}_m^{(i)}, \boldsymbol{\Sigma}_m^{(i)}, \pi_m^{(i)}$, compute $\gamma_{n,m}^{(i)}$ and $N_m^{(i)}=\displaystyle\sum_{n=1}^N \gamma_{n,m}^{(i)}$ (E-step).
3. Using $\gamma_{n,m}^{(i)}$ and $N_m^{(i)}$, compute the updated parameters $\boldsymbol{\mu}_m^{(i+1)}, \boldsymbol{\Sigma}_m^{(i+1)}, \pi_m^{(i+1)}$ (M-step).
4. If sufficiently converged, terminate. Otherwise, set $i\leftarrow i+1$ and return to Step 2.
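The procedure above can be sketched end-to-end in numpy (univariate case; the synthetic data, initial values, and fixed iteration count are made-up choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1D data: two Gaussian clusters
X = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(3.0, 1.0, 250)])
N, K = len(X), 2

def normal_pdf(x, mu, var):
    return np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def log_likelihood(X, pi, mu, var):
    return np.log(normal_pdf(X, mu, var) @ pi).sum()

# Step 1: initial values
pi, mu, var = np.full(K, 1.0 / K), np.array([-1.0, 1.0]), np.ones(K)

history = [log_likelihood(X, pi, mu, var)]
for _ in range(50):
    # E-step: responsibilities and effective counts
    gamma = pi * normal_pdf(X, mu, var)
    gamma /= gamma.sum(axis=1, keepdims=True)
    N_m = gamma.sum(axis=0)
    # M-step: closed-form updates derived above
    mu_new = (gamma * X[:, None]).sum(axis=0) / N_m
    var = (gamma * (X[:, None] - mu) ** 2).sum(axis=0) / N_m  # old mean, per Step 3
    mu, pi = mu_new, N_m / N
    history.append(log_likelihood(X, pi, mu, var))

# The log-likelihood never decreases across iterations
print(all(b >= a - 1e-9 for a, b in zip(history, history[1:])))
```

In practice one would replace the fixed iteration count with a convergence test on the change in log-likelihood, as in Step 4.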
Caution
The EM algorithm is not guaranteed to converge to the global optimum from an arbitrary initial point.
Variant Using the Updated $\boldsymbol{\mu}_m^{(i+1)}$ (ECM)
In the computation of $\boldsymbol{\Sigma}_m^{(i+1)}$ in Eq.(\ref{update}), one may use $\boldsymbol{\mu}_m^{(i+1)}$ (the updated value computed earlier in the same M-step) instead of $\boldsymbol{\mu}_m^{(i)}$ (the old value).
This approach can be viewed as an instance of the ECM (Expectation Conditional Maximization) algorithm [Meng & Rubin, 1993], where the M-step sequentially performs conditional maximization over $\boldsymbol{\mu}_m \to \boldsymbol{\Sigma}_m \to \pi_m$.
The standard convergence proof of EM relies on the M-step maximizing all parameters simultaneously, so it does not directly apply to sequential updates.
However, Meng & Rubin (1993) proved that if each conditional maximization step increases $Q$, then the log-likelihood is non-decreasing overall. Since GMM satisfies this condition, convergence is preserved.
In practice, using the updated $\boldsymbol{\mu}_m^{(i+1)}$ can sometimes lead to faster convergence.
Major libraries such as scikit-learn adopt this approach.
Convergence of the EM Algorithm
In this section, we prove that the log-likelihood monotonically increases (more precisely, is non-decreasing) at each iteration of the EM algorithm.
We decompose the log-likelihood into the ELBO (Evidence Lower Bound) and the KL divergence, and show that the E-step sets the KL divergence to zero while the M-step maximizes the ELBO, thereby ensuring the log-likelihood improves at each iteration.
Lower Bound of the Log-Likelihood
Let $\boldsymbol{\theta}=\{\boldsymbol{\pi},\boldsymbol{\mu},\boldsymbol{\Sigma}\}$ denote the set of parameters and $\boldsymbol{Z}=\{z_1, z_2, \cdots, z_N\}$ the set of latent variables.
For any probability distribution $q(\boldsymbol{Z})$ over the latent variables $\boldsymbol{Z}$, the log-likelihood can be decomposed as follows.

$$\log p(\boldsymbol{X}|\boldsymbol{\theta}) = \mathcal{L}(q,\boldsymbol{\theta}) + \mathrm{KL}(q\|p)$$

$$\mathcal{L}(q,\boldsymbol{\theta}) = \sum_{\boldsymbol{Z}} q(\boldsymbol{Z})\log\frac{p(\boldsymbol{X},\boldsymbol{Z}|\boldsymbol{\theta})}{q(\boldsymbol{Z})},\qquad \mathrm{KL}(q\|p) = -\sum_{\boldsymbol{Z}} q(\boldsymbol{Z})\log\frac{p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})}{q(\boldsymbol{Z})}$$
$\mathcal{L}(q,\boldsymbol{\theta})$ is called the Evidence Lower Bound (ELBO), also known as the negative variational free energy.
The fundamental idea of the EM algorithm is to maximize this lower bound instead of the log-likelihood $\log p(\boldsymbol{X}|\boldsymbol{\theta})$, which is difficult to maximize directly.
$\mathrm{KL}(q \| p)$ is the Kullback-Leibler divergence (KL divergence) between $q(\boldsymbol{Z})$ and the posterior distribution $p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})$, which always satisfies $\mathrm{KL}(q \| p) \geq 0$ (with equality when $q(\boldsymbol{Z}) = p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta})$).
Since $\mathrm{KL}(q \| p) \geq 0$, it follows that

$$\mathcal{L}(q,\boldsymbol{\theta}) \leq \log p(\boldsymbol{X}|\boldsymbol{\theta}),$$

i.e., $\mathcal{L}(q,\boldsymbol{\theta})$ is always less than or equal to the log-likelihood and is therefore a lower bound on it.
Direct maximization of $\log p(\boldsymbol{X}|\boldsymbol{\theta})$ is difficult (because the $\sum$ is inside the $\log$), but by maximizing the lower bound $\mathcal{L}(q,\boldsymbol{\theta})$, we can indirectly increase the log-likelihood.
This is the fundamental idea behind the EM algorithm.
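The decomposition is easy to verify numerically for a single observation (a numpy sketch; the mixture parameters and the choice of $q(z)$ are arbitrary made-up values):

```python
import numpy as np

def normal_pdf(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# One observation, K = 2 components
x = 0.7
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
var = np.array([1.0, 0.5])

joint = pi * normal_pdf(x, mu, var)      # p(x, z=m | theta)
evidence = joint.sum()                   # p(x | theta)
posterior = joint / evidence             # p(z | x, theta)

q = np.array([0.5, 0.5])                 # an arbitrary distribution over z
elbo = (q * np.log(joint / q)).sum()     # ELBO  L(q, theta)
kl = (q * np.log(q / posterior)).sum()   # KL(q || p) >= 0

# The two sides of the decomposition agree up to rounding
print(np.log(evidence), elbo + kl)
```

Changing $q$ moves mass between the ELBO and the KL term, but their sum stays fixed at $\log p(x|\boldsymbol{\theta})$; choosing $q$ equal to the posterior drives the KL term to zero.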
Interpretation of the E-step
In the E-step, we fix the current parameters $\boldsymbol{\theta}^{(i)}$ and find the $q(\boldsymbol{Z})$ that maximizes the lower bound $\mathcal{L}(q,\boldsymbol{\theta}^{(i)})$.
In $\log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i)}) = \mathcal{L}(q,\boldsymbol{\theta}^{(i)}) + \mathrm{KL}(q \| p)$, the left-hand side does not depend on $q$, so maximizing $\mathcal{L}$ is equivalent to minimizing $\mathrm{KL}(q \| p)$.
Since $\mathrm{KL}(q \| p) \geq 0$ with equality when $q(\boldsymbol{Z}) = p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta}^{(i)})$, the optimal solution of the E-step is

$$q^{(i)}(\boldsymbol{Z}) = p(\boldsymbol{Z}|\boldsymbol{X},\boldsymbol{\theta}^{(i)}).$$
Interpretation of the M-step
In the M-step, we fix $q^{(i)}(\boldsymbol{Z})$ obtained in the E-step and find the parameters $\boldsymbol{\theta}^{(i+1)}$ that maximize the lower bound $\mathcal{L}(q^{(i)},\boldsymbol{\theta})$. Combining the two steps,

$$\log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i+1)}) \geq \mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i+1)}) \geq \mathcal{L}(q^{(i)},\boldsymbol{\theta}^{(i)}) = \log p(\boldsymbol{X}|\boldsymbol{\theta}^{(i)}),$$

where the first inequality holds because $\mathrm{KL}\geq 0$, the second because the M-step maximizes $\mathcal{L}$ over $\boldsymbol{\theta}$, and the equality holds because the E-step makes the KL divergence zero. Hence the log-likelihood is non-decreasing at every iteration.
If singular solutions are excluded by regularizing the covariance matrix (e.g., adding $\varepsilon\boldsymbol{I}$ to $\boldsymbol{\Sigma}_m$), the log-likelihood is bounded above, and hence the monotonically increasing sequence converges.
However, there is no guarantee that the convergence point is the global optimum; convergence to a local optimum or a saddle point is possible.