Applications to Statistics

Notation convention
Formulas throughout this set of pages use the denominator layout. See Layout Conventions for details.

The role of matrix calculus in statistics

The core of statistics is "estimating parameters from data." Maximum likelihood estimation (MLE), maximum a posteriori (MAP) estimation, and Bayesian inference all begin by differentiating the log-likelihood $\ell(\boldsymbol{\theta})$ or posterior $p(\boldsymbol{\theta}|D)$ with respect to the parameters $\boldsymbol{\theta}$.

Ordinary differentiation suffices when parameters are scalars, but matrix calculus is essential once parameters become vectors, matrices, or covariance matrices. The Fisher information matrix, Cramér-Rao bound, score function, and REML gradient are all formulated through matrix calculus.

This hub organizes the principal applications of matrix calculus encountered in modern statistics into 6 themes. In total they cover more than 60 formulas, 18 of which belong to latent variable models (factor analysis, SEM, IRT).

1. Foundational distributions and linear models

The multivariate normal, Wishart, and multivariate regression are the most basic objects of statistical inference. Likelihood differentiation, closed-form MLE solutions, and expectation and variance computations are all covered here. The vec operator and related matrices (commutation, duplication, elimination), which serve as linearization tools for matrix calculus, are also collected in this theme.

Key formulas: gradients of the MVN log-likelihood with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$ → sample mean and sample covariance as MLE, gradients of the Wishart and inverse-Wishart log-densities, the multivariate regression OLS solution $\hat{\boldsymbol{B}} = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{Y}$, and the vec-Kronecker identity $\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$.
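A minimal NumPy sketch of two of these identities — the vec-Kronecker relation and the MVN gradient with respect to $\boldsymbol{\mu}$ — is given below; the shapes, random data, and variable names are illustrative assumptions, not part of the formula numbering.

```python
# Numerical check of two identities from this page, using NumPy.
# All shapes and data below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# --- vec-Kronecker identity: vec(A X B) = (B^T kron A) vec(X) ---
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))
vec = lambda M: M.reshape(-1, order="F")          # column-major "vec"
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)

# --- MVN log-likelihood gradient w.r.t. mu vanishes at the sample mean ---
n, p = 200, 3
data = rng.standard_normal((n, p))
Sigma = np.cov(data, rowvar=False, bias=True)      # ML covariance estimate
Sigma_inv = np.linalg.inv(Sigma)
mu_hat = data.mean(axis=0)
# d ell / d mu = Sigma^{-1} sum_i (x_i - mu); zero at mu = sample mean
grad_mu = Sigma_inv @ (data - mu_hat).sum(axis=0)
assert np.allclose(grad_mu, 0.0)
```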

Formula numbers: 19.19-29, 13.7/13.8/13.10/13.11 (15 formulas total)

→ Foundational distributions and linear models page

2. Latent variable models

A family of methods that estimate latent structures (factors, structural equations, ability) which cannot be observed directly. They share the structure "observed distribution = the product of the latent distribution and the conditional observation distribution, marginalized over the latents"; the resulting likelihoods are complex, making matrix calculus an essential tool.

Topics: factor analysis (the ML objective and gradients with respect to $\boldsymbol{\Lambda}, \boldsymbol{\Phi}, \boldsymbol{\Psi}$), structural equation modeling (SEM) (LISREL/RAM covariance structures and the general gradient formula $\partial F/\partial \theta = \text{tr}[(\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1})\partial \boldsymbol{\Sigma}/\partial \theta]$), and item response theory (IRT) (gradients of 2PL/3PL with respect to discrimination, difficulty, ability, and guessing, plus the Fisher information function).
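As a quick numerical illustration of the general SEM gradient formula quoted above, the following sketch checks it against a central finite difference on a toy one-factor model $\boldsymbol{\Sigma}(\boldsymbol{\theta}) = \boldsymbol{\lambda}\boldsymbol{\lambda}^\top + \text{diag}(\boldsymbol{\psi})$; the loadings, unique variances, and sample covariance are assumed values for illustration only.

```python
# Finite-difference check of dF_ML/dtheta = tr[(S^-1 - S^-1 S S^-1) dSigma/dtheta]
# on a toy one-factor model Sigma = lam lam^T + diag(psi). Purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
p = 4
lam = np.array([0.8, 0.7, 0.6, 0.5])        # factor loadings (assumed values)
psi = np.array([0.4, 0.5, 0.6, 0.7])        # unique variances (assumed values)
S = np.cov(rng.standard_normal((500, p)), rowvar=False)   # a sample covariance

def sigma(lam, psi):
    return np.outer(lam, lam) + np.diag(psi)

def F_ml(lam, psi):
    Sg = sigma(lam, psi)
    return (np.linalg.slogdet(Sg)[1] + np.trace(S @ np.linalg.inv(Sg))
            - np.linalg.slogdet(S)[1] - p)

# Analytic gradient with respect to the first loading lam[0]
Sg = sigma(lam, psi)
Sg_inv = np.linalg.inv(Sg)
e0 = np.eye(p)[0]
dSigma = np.outer(e0, lam) + np.outer(lam, e0)             # dSigma/dlam[0]
grad_analytic = np.trace((Sg_inv - Sg_inv @ S @ Sg_inv) @ dSigma)

# Central finite difference for comparison
eps = 1e-6
lam_p, lam_m = lam.copy(), lam.copy()
lam_p[0] += eps
lam_m[0] -= eps
grad_fd = (F_ml(lam_p, psi) - F_ml(lam_m, psi)) / (2 * eps)
print(grad_analytic, grad_fd)   # the two values should agree closely
```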

Formula numbers: 19.1-18 (18 formulas total — the largest page in this hub)

→ Latent variable models page

3. Variance components models

Models that capture hierarchical structure in data through a mix of fixed and random effects. Estimating the variance components (REML) involves differentiating $\log\det$ and $\text{tr}$ terms involving the covariance matrix $\boldsymbol{V}$. One of its applications, genomic selection (GBLUP / RR-BLUP), replaces the pedigree relationship matrix with a genomic relationship matrix $\boldsymbol{G}$ built from DNA markers.

Key formulas: Henderson's mixed-model equations, the REML gradient, the AI (average information) matrix and AI-REML update, the GBLUP breeding value $\hat{\boldsymbol{u}} = \boldsymbol{G}(\boldsymbol{G} + \lambda\boldsymbol{I})^{-1}\boldsymbol{y}$, and the Sherman-Morrison representation of leave-one-out cross-validation.
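The GBLUP formula above can be exercised end to end in a few lines. The sketch below assumes a VanRaden-style construction of $\boldsymbol{G}$ from centred 0/1/2 marker codes and a known variance ratio $\lambda$; these are illustrative assumptions, not a recommended workflow.

```python
# Minimal GBLUP sketch: u_hat = G (G + lambda I)^{-1} y.
# Marker coding, the VanRaden-style G, and lambda are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(2)
n_ind, n_mrk = 50, 200
M = rng.integers(0, 3, size=(n_ind, n_mrk)).astype(float)   # 0/1/2 genotypes

p = M.mean(axis=0) / 2.0                         # allele frequencies
Z = M - 2 * p                                    # centred marker matrix
G = Z @ Z.T / (2 * np.sum(p * (1 - p)))          # genomic relationship matrix

y = rng.standard_normal(n_ind)                   # phenotypes (mean-centred)
lam = 1.0                                        # sigma_e^2 / sigma_u^2, assumed known
u_hat = G @ np.linalg.solve(G + lam * np.eye(n_ind), y)   # genomic breeding values
print(u_hat[:5])
```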

Formula numbers: 10.28-34 + 10.41-46 (13 formulas total)

→ Variance components models page

4. Spatial correlation models

For spatial data where observation independence cannot be assumed, the correlation structure is modeled by a covariance function (variogram). Maximum-likelihood estimation differentiates with respect to the variogram parameters (nugget, sill, range). Kriging shares the same BLUP structure as the GBLUP from variance-components models.

Key formulas: the semivariogram $\gamma(\boldsymbol{h}) = \frac{1}{2}\text{Var}[Z(\boldsymbol{s}) - Z(\boldsymbol{s}+\boldsymbol{h})]$, the spherical variogram model, the ordinary kriging equations (with unbiasedness constraint), the kriging variance $\sigma_K^2 = C(0) - \boldsymbol{\lambda}^\top \boldsymbol{c}_0 - \mu$, and the gradient of the geostatistical likelihood.
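The ordinary kriging equations and kriging variance above translate directly into code. The following sketch assumes a zero-nugget spherical covariance with made-up sill, range, and point locations.

```python
# Ordinary kriging at one target location, written directly from the system and
# variance quoted above. Sill, range, and the point layout are illustrative assumptions.
import numpy as np

def spherical_cov(h, sill=1.0, rng_=10.0):
    """Covariance C(h) implied by a spherical variogram with zero nugget."""
    h = np.asarray(h, dtype=float)
    c = sill * (1 - 1.5 * h / rng_ + 0.5 * (h / rng_) ** 3)
    return np.where(h < rng_, c, 0.0)

# Observation locations, values, and the prediction location (assumed data)
s = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [5.0, 5.0]])
z = np.array([1.2, 0.8, 1.5, 0.9])
s0 = np.array([2.0, 2.0])

D = np.linalg.norm(s[:, None, :] - s[None, :, :], axis=-1)   # pairwise distances
C = spherical_cov(D)                                          # covariance matrix
c0 = spherical_cov(np.linalg.norm(s - s0, axis=1))            # covariances to s0

# Ordinary kriging system with the unbiasedness (sum-to-one) constraint
n = len(z)
A = np.zeros((n + 1, n + 1))
A[:n, :n] = C
A[:n, n] = 1.0
A[n, :n] = 1.0
b = np.append(c0, 1.0)
sol = np.linalg.solve(A, b)
weights, mu = sol[:n], sol[n]

z_hat = weights @ z                                  # kriging prediction
var_k = spherical_cov(0.0) - weights @ c0 - mu       # kriging variance
print(z_hat, var_k)
```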

Formula numbers: 10.35-40 (6 formulas total)

→ Spatial correlation models page

5. Neural statistics

The problem of estimating stimulus parameters from neuronal firing-rate data. The central theme is the Fisher information matrix under a Poisson firing model, which gives the theoretical precision limit (Cramér-Rao lower bound) of population coding.

Key formulas: single-neuron Fisher information $I(\theta) = [f'(\theta)]^2/f(\theta)$, the Fisher information for a correlated population $I(\theta) = \boldsymbol{f}'(\theta)^\top \boldsymbol{Q}^{-1} \boldsymbol{f}'(\theta)$, and the Cramér-Rao lower bound $\text{Var}(\hat{\theta}) \geq 1/I(\theta)$.
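A small numerical example of the population Fisher information and the resulting Cramér-Rao bound is given below, assuming Gaussian tuning curves and a simple uniformly correlated noise covariance $\boldsymbol{Q}$; both are assumptions made purely for illustration.

```python
# Fisher information for a correlated population and the Cramér-Rao bound,
# following the formulas above. Tuning curves and Q are illustrative assumptions.
import numpy as np

theta = 0.3                                   # stimulus value of interest
centers = np.linspace(-1.0, 1.0, 9)           # preferred stimuli of 9 neurons
width, rmax = 0.4, 20.0

f = rmax * np.exp(-(theta - centers) ** 2 / (2 * width ** 2))         # tuning curves
f_prime = rmax * (centers - theta) / width ** 2 * np.exp(
    -(theta - centers) ** 2 / (2 * width ** 2))                        # f'(theta)

# A simple correlated noise covariance: variances f_i with uniform correlation rho
rho = 0.2
Q = rho * np.outer(np.sqrt(f), np.sqrt(f)) + (1 - rho) * np.diag(f)

I_pop = f_prime @ np.linalg.solve(Q, f_prime)     # I(theta) = f'^T Q^{-1} f'
crlb = 1.0 / I_pop                                # Var(theta_hat) >= 1/I(theta)
print(I_pop, crlb)
```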

Formula numbers: 10.24-27 (4 formulas total)

→ Neural statistics page

6. Probabilistic models and information geometry

Distances between probability distributions (KL divergence, optimal transport) and likelihood derivatives of Gaussian processes are advanced applications of matrix calculus. They also form the foundations of information geometry and connect directly to VAEs, Wasserstein GANs, Gaussian-process regression, and probabilistic graphical models.

Key formulas: the KL divergence between Gaussians (the KL term in VAEs), Sinkhorn distance (entropy-regularized Wasserstein), the gradient of the GP log-marginal likelihood $\frac{1}{2}\text{tr}[(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \boldsymbol{K}^{-1})\partial \boldsymbol{K}/\partial \theta_i]$, and the message updates of belief propagation (BP).
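The GP log-marginal-likelihood gradient above is easy to verify numerically. The sketch below checks it against a finite difference for the length-scale of an RBF kernel; the kernel choice, noise level, and data are all illustrative assumptions.

```python
# Finite-difference check of d log p / d theta = 1/2 tr[(alpha alpha^T - K^{-1}) dK/dtheta],
# with alpha = K^{-1} y, for the length-scale of an RBF kernel (assumed setup).
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 5, size=20))
y = np.sin(x) + 0.1 * rng.standard_normal(20)
noise = 1e-2                                   # fixed noise variance (assumed)

def kernel(x, ell):
    d2 = (x[:, None] - x[None, :]) ** 2
    return np.exp(-0.5 * d2 / ell ** 2)

def log_marginal(ell):
    K = kernel(x, ell) + noise * np.eye(len(x))
    alpha = np.linalg.solve(K, y)
    return (-0.5 * y @ alpha - 0.5 * np.linalg.slogdet(K)[1]
            - 0.5 * len(x) * np.log(2 * np.pi))

ell = 1.0
K = kernel(x, ell) + noise * np.eye(len(x))
K_inv = np.linalg.inv(K)
alpha = K_inv @ y
dK = kernel(x, ell) * ((x[:, None] - x[None, :]) ** 2) / ell ** 3   # dK/d ell
grad_analytic = 0.5 * np.trace((np.outer(alpha, alpha) - K_inv) @ dK)

eps = 1e-6
grad_fd = (log_marginal(ell + eps) - log_marginal(ell - eps)) / (2 * eps)
print(grad_analytic, grad_fd)   # should agree to several decimal places
```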

Formula numbers: 19.30-33 (4 formulas total)

→ Probabilistic models and information geometry page

Key formula highlights

Representative formulas spanning the 6 pages. See each page for full proofs.

  • MVN log-likelihood (additive constant omitted): $\ell = -\dfrac{n}{2}\log|\boldsymbol{\Sigma}| - \dfrac{1}{2}\sum_i (\boldsymbol{x}_i - \boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\boldsymbol{x}_i - \boldsymbol{\mu})$
    (Foundational 19.19)
  • Fisher information (Poisson firing): $I(\theta) = [f'(\theta)]^2/f(\theta)$
    (Neural 10.24)
  • REML gradient: $\dfrac{\partial \log L_R}{\partial \theta} = -\dfrac{1}{2}\left[\text{tr}(\boldsymbol{P}\partial\boldsymbol{V}/\partial\theta) - \boldsymbol{y}^\top \boldsymbol{P}\partial\boldsymbol{V}/\partial\theta\,\boldsymbol{P}\boldsymbol{y}\right]$
    (Variance components 10.32)
  • SEM general gradient formula: $\dfrac{\partial F_{\text{ML}}}{\partial \theta_i} = \text{tr}[(\boldsymbol{\Sigma}^{-1} - \boldsymbol{\Sigma}^{-1}\boldsymbol{S}\boldsymbol{\Sigma}^{-1})\partial\boldsymbol{\Sigma}/\partial \theta_i]$
    (Latent 19.6)
  • GBLUP breeding value: $\hat{\boldsymbol{u}} = \boldsymbol{G}(\boldsymbol{G} + \lambda\boldsymbol{I})^{-1}\boldsymbol{y}$
    (Variance components 10.43)
  • GP log-marginal-likelihood gradient: $\dfrac{\partial \log p}{\partial \theta_i} = \dfrac{1}{2}\text{tr}[(\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \boldsymbol{K}^{-1})\partial\boldsymbol{K}/\partial \theta_i]$
    (Probabilistic 19.32)

To look up related formulas, see also the Statistical Inference Key Formula Cheat Sheet.