Introduction to Vector/Matrix Calculus
A Gateway to High-Dimensional Optimization
Why Matrix Calculus Is Needed
In modern science and engineering, there is a rapidly growing need to model complex systems mathematically and optimize them computationally. At the heart of this lies matrix calculus. Here, matrix calculus refers to the collective concept of differentiating functions whose variables are scalars, vectors, or matrices. Specifically, it encompasses the gradient vector (differentiating a scalar function with respect to a vector), the Jacobian matrix (differentiating a vector-valued function with respect to a vector), and the derivative of a scalar function with respect to a matrix.
Consider, for example, the training of a neural network. For a model with millions to billions of parameters, one must minimize the loss function $$ L(\boldsymbol{W}_1, \boldsymbol{W}_2, \ldots, \boldsymbol{W}_n) \tag{1} $$ where each $\boldsymbol{W}_i$ is a matrix, and the gradient with respect to all these parameters must be computed. Computing partial derivatives one by one by hand would not finish in any practical amount of time.
The theory of matrix calculus allows us to handle such high-dimensional optimization problems systematically. Specifically:
- Machine learning: Backpropagation in neural networks, gradient computation of loss functions
- Statistics: Maximum likelihood estimation, parameter updates in Gaussian mixture models (GMM), computation of the Fisher information matrix
- Control engineering: Linearization of nonlinear systems, derivation of Kalman filters
- Robotics: Forward and inverse kinematics via Jacobian matrices
- Physics simulations: Energy minimization, computation of deformation gradient tensors
- Computational chemistry: Molecular structure optimization, gradient methods in quantum chemistry
In these fields, operations such as differentiating a scalar function with respect to a vector or matrix, or differentiating a vector-valued function with respect to a vector, are routinely required.
Benefits of Mastering Matrix Calculus
By learning matrix calculus, one gains the following powerful capabilities:
1. Solving High-Dimensional Optimization Efficiently
For minimizing a single-variable function, one simply finds the point where $\displaystyle\frac{df}{dx} = 0$. But what about a function with 10,000-dimensional parameters? Using vector differentiation, one can compute the gradient vector $$ \nabla f = \frac{\partial f}{\partial \boldsymbol{x}} \tag{2} $$ all at once and optimize efficiently using algorithms such as gradient descent.
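As a minimal sketch of this workflow (the quadratic objective, step size, and iteration count below are illustrative choices, not from the text), gradient descent needs only the ability to evaluate the full gradient vector at each step:

```python
import numpy as np

# Minimize f(x) = 1/2 x^T A x - b^T x by gradient descent.
# The whole gradient is computed at once as the vector A x - b.
rng = np.random.default_rng(0)
n = 100
Q = rng.standard_normal((n, n))
A = Q @ Q.T + n * np.eye(n)          # symmetric positive definite
b = rng.standard_normal(n)

x = np.zeros(n)
alpha = 1.0 / np.linalg.norm(A, 2)   # step size 1 / lambda_max, safely convergent
for _ in range(500):
    grad = A @ x - b                 # the full gradient vector in one expression
    x = x - alpha * grad

x_star = np.linalg.solve(A, b)       # exact minimizer, for comparison
print(np.linalg.norm(x - x_star))    # should be near zero
```

No component-by-component differentiation appears anywhere: the update touches all 100 coordinates at once.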
2. Expressing Complex Formulas Concisely
For example, the gradient of the quadratic form $$ f(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} \tag{3} $$ for an $n$-dimensional vector $\boldsymbol{x}$ can be obtained instantly as $$ \frac{\partial f}{\partial \boldsymbol{x}} = (\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x} \tag{4} $$ using matrix calculus formulas. There is no need to compute $\displaystyle\frac{\partial f}{\partial x_i}$ component by component.
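Formula (4) can be sanity-checked numerically. The sketch below compares the closed form against central differences for a small, deliberately non-symmetric $\boldsymbol{A}$ (the sizes and random values are illustrative):

```python
import numpy as np

# Check d/dx (x^T A x) = (A + A^T) x against central differences.
rng = np.random.default_rng(1)
n = 5
A = rng.standard_normal((n, n))    # deliberately non-symmetric
x = rng.standard_normal(n)

f = lambda v: v @ A @ v
analytic = (A + A.T) @ x

eps = 1e-6
numeric = np.array([
    (f(x + eps * e) - f(x - eps * e)) / (2 * eps)
    for e in np.eye(n)
])
print(np.max(np.abs(analytic - numeric)))   # near zero: f is quadratic, so
                                            # central differences are exact up to rounding
```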
3. Applying the Chain Rule
For complex systems where multiple functions are composed, such as multi-layer neural networks, the chain rule is indispensable. In matrix calculus, the derivative of a composite function $\boldsymbol{z} = \boldsymbol{f}(\boldsymbol{g}(\boldsymbol{x}))$ can be computed (in denominator layout) as $$ \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} \cdot \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{g}} \tag{5} $$ i.e., as a product of intermediate Jacobian matrices. This allows layer-by-layer computation of gradients through deep computation graphs.
4. Understanding How Automatic Differentiation Works
Machine learning frameworks such as PyTorch, TensorFlow, and JAX provide "automatic differentiation." This is a technique that automatically applies the chain rule in the form of matrix calculus. With an understanding of matrix calculus, one can clearly see what the framework is doing behind the scenes and why the computation graph takes a particular form.
5. Reading State-of-the-Art Papers and Textbooks
In papers on machine learning, statistics, and control engineering, matrix calculus notation is used as a matter of course. For example, attention mechanisms such as the Transformer's are defined and analyzed directly in matrix form, the Adam optimizer paper states its update rules in terms of first- and second-moment estimates of the gradient, and Kalman filter derivations manipulate the error covariance matrix of the state estimate. Without the ability to read such formulas, a deep understanding, let alone improvement, of these algorithms is out of reach.
6. Deriving Gradients for New Models on Your Own
When designing a loss function or model not yet implemented in an existing framework, knowledge of matrix calculus allows one to derive the gradient independently. For example, if one wants to add a custom regularization term $$ R(\boldsymbol{W}) = \mathrm{tr}(\boldsymbol{W}^\top \boldsymbol{L} \boldsymbol{W}) \tag{6} $$ the derivative $\displaystyle\frac{\partial R}{\partial \boldsymbol{W}} = (\boldsymbol{L} + \boldsymbol{L}^\top)\boldsymbol{W}$ can be obtained immediately. This is an indispensable skill when developing original algorithms in research or practice.
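The same kind of numerical check works for formula (6); the sizes and the random $\boldsymbol{L}$ and $\boldsymbol{W}$ below are illustrative:

```python
import numpy as np

# Check dR/dW = (L + L^T) W for R(W) = tr(W^T L W) by central differences.
rng = np.random.default_rng(2)
m, k = 4, 3
L = rng.standard_normal((m, m))    # non-symmetric on purpose
W = rng.standard_normal((m, k))

R = lambda M: np.trace(M.T @ L @ M)
analytic = (L + L.T) @ W

eps = 1e-6
numeric = np.zeros_like(W)
for i in range(m):
    for j in range(k):
        E = np.zeros_like(W); E[i, j] = eps
        numeric[i, j] = (R(W + E) - R(W - E)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))   # near zero
```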
Fundamental Ideas
The derivative of a single-variable function $f(x)$, written $\displaystyle\frac{df}{dx}$, is a scalar. On the other hand, what does it mean to differentiate a multivariable function $f(x_0, x_1, \ldots, x_{n-1})$ with respect to a vector $\boldsymbol{x}$?
Intuitively, it becomes the vector obtained by arranging the partial derivatives $$ \frac{\partial f}{\partial x_0}, \quad \frac{\partial f}{\partial x_1}, \quad \ldots, \quad \frac{\partial f}{\partial x_{n-1}} \tag{7} $$ This is the gradient vector $\nabla f$.
Here a problem arises: should this vector be a column vector or a row vector? Different conventions are used across mathematics, physics, and engineering, known respectively as "numerator layout" and "denominator layout."
- Denominator layout: The gradient is expressed as a column vector. Predominant in optimization and statistics. The gradient descent update $\boldsymbol{x} \leftarrow \boldsymbol{x} - \alpha \nabla f$ can be written directly.
- Numerator layout: The gradient is expressed as a row vector. Predominant in control engineering and robotics. The chain rule for Jacobian matrices takes a natural form.
Side-by-Side Comparison
The following table shows how the same derivative is expressed under each layout. In every case, the two results are transposes of each other.
| Expression | Numerator layout | Denominator layout |
|---|---|---|
| $\dfrac{\partial y}{\partial \boldsymbol{x}}$ ($y$: scalar, $\boldsymbol{x} \in \mathbb{R}^n$) | $1 \times n$ row vector | $n \times 1$ column vector |
| $\dfrac{\partial \boldsymbol{y}}{\partial x}$ ($\boldsymbol{y} \in \mathbb{R}^m$, $x$: scalar) | $m \times 1$ column vector | $1 \times m$ row vector |
| $\dfrac{\partial \boldsymbol{a}^\top \boldsymbol{z}}{\partial \boldsymbol{z}}$ ($\boldsymbol{a}$: const) | $\boldsymbol{a}^\top$ | $\boldsymbol{a}$ |
| $\dfrac{\partial \boldsymbol{M}\boldsymbol{z}}{\partial \boldsymbol{z}}$ ($\boldsymbol{M}$: const) | $\boldsymbol{M}$ | $\boldsymbol{M}^\top$ |
| $\dfrac{\partial \boldsymbol{z}^\top \boldsymbol{M} \boldsymbol{z}}{\partial \boldsymbol{z}}$ | $\boldsymbol{z}^\top (\boldsymbol{M} + \boldsymbol{M}^\top)$ | $(\boldsymbol{M} + \boldsymbol{M}^\top)\boldsymbol{z}$ |
In each layout, the rows of the result are indexed by the object the layout is named after: in numerator layout, the number of rows equals the dimension of the numerator (the function being differentiated); in denominator layout, it equals the dimension of the denominator (the variable of differentiation).
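The table's $\boldsymbol{M}\boldsymbol{z}$ row can be checked numerically (sizes and values are illustrative): the numerator-layout Jacobian comes out as $\boldsymbol{M}$ and the denominator-layout one as $\boldsymbol{M}^\top$.

```python
import numpy as np

# Numerator-layout Jacobian of f(z) = M z is M itself; the
# denominator-layout result is its transpose M^T.
rng = np.random.default_rng(3)
m, n = 3, 4
M = rng.standard_normal((m, n))
z = rng.standard_normal(n)

eps = 1e-6
# Rows indexed by the output component, columns by the input component:
J_num = np.array([
    (M @ (z + eps * e) - M @ (z - eps * e)) / (2 * eps)
    for e in np.eye(n)
]).T                                # numerator layout, shape (m, n)
J_den = J_num.T                     # denominator layout, shape (n, m)

print(np.allclose(J_num, M), np.allclose(J_den, M.T))
```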
Chain Rule Order
The two layouts also differ in the order of multiplication when applying the chain rule. Consider the composite function $y = g(h(k(\boldsymbol{x})))$.
- Numerator layout: $\dfrac{\partial y}{\partial \boldsymbol{x}} = \dfrac{\partial y}{\partial g} \dfrac{\partial g}{\partial h} \dfrac{\partial h}{\partial k} \dfrac{\partial k}{\partial \boldsymbol{x}}$ — written left-to-right from the outermost function
- Denominator layout: $\dfrac{\partial y}{\partial \boldsymbol{x}} = \dfrac{\partial k}{\partial \boldsymbol{x}} \dfrac{\partial h}{\partial k} \dfrac{\partial g}{\partial h} \dfrac{\partial y}{\partial g}$ — written left-to-right from the innermost function
In 1-D both orders give the same scalar product, but for vectors and matrices the shapes must align, so the multiplication order matters. This is a common source of confusion when mixing references that use different layouts.
Unfortunately, many references do not explicitly state which notation they adopt. As a result, comparing multiple references can lead to confusion when the same formula yields results that are transposes of each other. Moreover, since matrix multiplication is not commutative, carelessly citing formulas from references using different notations will cause calculations to disagree.
Notation Trends by Field
Different fields tend to favor different notations. In general:
- Denominator layout: Predominant in fields that heavily use gradient descent and energy optimization, such as optimization, statistics, machine learning, and chemistry
- Numerator layout: Predominant in fields that heavily use the chain rule for Jacobian matrices, such as control engineering, robotics, astronomy, and earth science
- Tensor index notation: Used in fields where the distinction between covariant and contravariant components is important, such as general relativity, differential geometry, and materials engineering
A comprehensive table summarizing notation trends across approximately 60 sub-fields, including computer science, physics, engineering, economics, and life sciences, can be found on the following reference page.
Layout Conventions by Field — An exhaustive summary of the typical layouts (denominator, numerator, mixed) used in each field.
Which Notation to Choose
As discussed above, both layouts are widely used. Each has its advantages, and neither is "correct." What matters is choosing a consistent notation that suits the context.
When Denominator Layout Is Appropriate
- Optimization problems: The gradient becomes a column vector, and the update formula $\boldsymbol{x} \leftarrow \boldsymbol{x} - \alpha \nabla f$ can be written directly
- Statistics and machine learning: The gradient $\nabla_{\boldsymbol{\theta}} L$ with respect to the parameter vector $\boldsymbol{\theta}$ is naturally treated as a column vector
- Hessian computation: The second derivative is intuitively expressed as a symmetric matrix
When Numerator Layout Is Appropriate
- Chain rule for Jacobians: The derivative of a composite function is naturally expressed as $\displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}} \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}$
- Control engineering and robotics: Frequently used in state-space representation and kinematics
- Coordinate transformations: Deformation gradient tensors and velocity gradient tensors take a natural form
Practical Advice
When writing your own code or papers:
- Check the notation used in the references or framework you are following (PyTorch and NumPy use something close to denominator layout)
- State explicitly at the beginning of your document which layout you use
- Do not switch layouts midway (mixing them causes calculation errors)
- When citing formulas from other references, always check the layout and transpose if necessary
The precise definitions of differentiation results under each layout (matrix shapes, index correspondence, etc.) are discussed in detail in the Formula Sheet, in "1. Notation and Definitions" and "Appendix A. Correspondence with Numerator Layout."
Policy of This Series: Denominator Layout Throughout
For differentiation involving scalars, vectors, and matrices, the difference between the two layouts is just a transpose, and the choice matters little. However, when the differentiation result is a tensor of order 4 or higher, such as in the derivative of a matrix-to-matrix mapping, the situation is different.
In numerator layout, applying the chain rule requires index permutation (a generalization of transposition). In contrast, denominator layout allows adjacent indices to contract directly, making the expressions natural. See "Differences in the Chain Rule" below for details.
Looking at the trends across fields, denominator layout and numerator layout are roughly equally prevalent, so prevalence alone does not settle the choice. This series adopts denominator layout throughout, prioritizing consistency with tensor calculus and with the chain rule for higher-order tensors.
This choice also offers the following advantages:
- Consistent with the layout effectively adopted by deep learning frameworks (PyTorch, JAX, TensorFlow)
- Compatible with reverse-mode automatic differentiation (VJP)
- Expressions are less likely to become complicated when dealing with higher-order tensors
Conversion methods for consulting references written in numerator layout are discussed in the Formula Sheet, "Appendix A. Correspondence with Numerator Layout."
When the Layout Difference Becomes Problematic
For vector and scalar differentiation, the difference between denominator and numerator layout is only a transpose, which is relatively easy to handle. However, when differentiating with respect to matrices or applying the chain rule to composite functions, the differences become pronounced.
Example 1: Gradient of a Vector
Consider the gradient of a scalar function $f: \mathbb{R}^n \to \mathbb{R}$.
- Denominator layout: $\displaystyle\frac{\partial f}{\partial \boldsymbol{x}} = \begin{pmatrix} \frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{n \times 1}$ (column vector)
- Numerator layout: $\displaystyle\frac{\partial f}{\partial \boldsymbol{x}} = \begin{pmatrix} \frac{\partial f}{\partial x_1} & \cdots & \frac{\partial f}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{1 \times n}$ (row vector)
In this case, the two are simply transposes of each other. When writing the gradient descent update $\boldsymbol{x} \leftarrow \boldsymbol{x} - \alpha \nabla f$, the denominator layout allows $\nabla f$ to be used directly as a column vector, while the numerator layout requires a transpose.
Example 2: Jacobian Matrix and Chain Rule
Consider the Jacobian matrix of a vector-valued function $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$. In component form, it is $\displaystyle\frac{\partial f_i}{\partial x_j}$.
- Denominator layout: $\displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_1} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_1}{\partial x_n} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{n \times m}$
- Numerator layout: $\displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} = \begin{pmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{pmatrix} \in \mathbb{R}^{m \times n}$
For the composite function $\boldsymbol{z} = \boldsymbol{g}(\boldsymbol{f}(\boldsymbol{x}))$, the chain rule takes different multiplication orders depending on the layout.
- Denominator layout: $\displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \cdot \frac{\partial \boldsymbol{g}}{\partial \boldsymbol{f}}$ (propagation from right to left)
- Numerator layout: $\displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{g}}{\partial \boldsymbol{f}} \cdot \frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}$ (propagation from left to right)
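The two multiplication orders can be verified on a small composite function (the choice $\boldsymbol{f}(\boldsymbol{x}) = \tanh(\boldsymbol{A}\boldsymbol{x})$, $\boldsymbol{g}(\boldsymbol{u}) = \boldsymbol{B}\boldsymbol{u}$ and all sizes are illustrative):

```python
import numpy as np

# Chain rule for z = g(f(x)) with f(x) = tanh(A x), g(u) = B u.
# Numerator layout multiplies outer-to-inner; the denominator-layout
# result is the same product in reverse order, transposed factor by factor.
rng = np.random.default_rng(4)
n, m, p = 4, 3, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((p, m))
x = rng.standard_normal(n)

u = np.tanh(A @ x)
Jf_num = (1 - u**2)[:, None] * A    # numerator-layout Jacobian of f, shape (m, n)
Jg_num = B                          # numerator-layout Jacobian of g, shape (p, m)

J_num = Jg_num @ Jf_num             # outer times inner, shape (p, n)
J_den = Jf_num.T @ Jg_num.T         # inner times outer, shape (n, p)

# Finite-difference cross-check of the numerator-layout result:
eps = 1e-6
F = lambda v: B @ np.tanh(A @ v)
J_fd = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
                 for e in np.eye(n)]).T
print(np.allclose(J_den, J_num.T), np.allclose(J_num, J_fd, atol=1e-5))
```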
Example 3: Matrix-to-Matrix Mapping (4th-Order Tensor)
When a matrix $\boldsymbol{Y} \in \mathbb{R}^{p \times q}$ is a function of a matrix $\boldsymbol{X} \in \mathbb{R}^{m \times n}$, the derivative $\displaystyle\frac{\partial Y_{ij}}{\partial X_{kl}}$ has four indices. Arranging these yields a 4th-order tensor.
- Denominator layout: Index order is $(k, l, i, j)$, i.e., $\mathbb{R}^{m \times n \times p \times q}$ (input indices $\to$ output indices)
- Numerator layout: Index order is $(i, j, k, l)$, i.e., $\mathbb{R}^{p \times q \times m \times n}$ (output indices $\to$ input indices)
As a concrete example, consider $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$ ($\boldsymbol{A} \in \mathbb{R}^{p \times m}$). In component form, $Y_{ij} = \sum_k A_{ik} X_{kj}$, so $$ \frac{\partial Y_{ij}}{\partial X_{kl}} = A_{ik} \delta_{jl} $$ where $\delta_{jl}$ is the Kronecker delta.
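This component formula can be checked numerically; the sketch below builds the 4th-order tensor with `np.einsum`, using the output-first index order $(i, j, k, l)$ to match the formula (sizes and values are illustrative):

```python
import numpy as np

# Build dY_ij/dX_kl = A_ik * delta_jl for Y = A X and compare it
# against central differences, entry by entry.
rng = np.random.default_rng(5)
p, m, n = 2, 3, 4
A = rng.standard_normal((p, m))
X = rng.standard_normal((m, n))

# T[i, j, k, l] = A[i, k] * delta[j, l]
T = np.einsum('ik,jl->ijkl', A, np.eye(n))      # shape (p, n, m, n)

eps = 1e-6
numeric = np.zeros((p, n, m, n))
for k in range(m):
    for l in range(n):
        E = np.zeros_like(X); E[k, l] = eps
        numeric[:, :, k, l] = (A @ (X + E) - A @ (X - E)) / (2 * eps)
print(np.max(np.abs(T - numeric)))   # near zero: the map is linear in X
```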
Example 4: Contraction in the Chain Rule
Consider the composition $\boldsymbol{X} \to \boldsymbol{Y} \to z$ (where $z$ is a scalar). By the chain rule, $$ \frac{\partial z}{\partial X_{kl}} = \sum_{i,j} \frac{\partial z}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{kl}} $$ The summation (contraction) is taken over the indices $(i, j)$ of $\boldsymbol{Y}$.
In denominator layout, the indices of $\displaystyle\frac{\partial z}{\partial \boldsymbol{Y}}$ are $(i, j)$, and the trailing indices of $\displaystyle\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}$ (indexed $(k, l, i, j)$) are also $(i, j)$, so placing $\displaystyle\frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}}$ on the left and $\displaystyle\frac{\partial z}{\partial \boldsymbol{Y}}$ on the right contracts adjacent indices directly, just like matrix multiplication.
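A short NumPy sketch of this contraction, using the illustrative choices $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}$ and $z = \tfrac{1}{2}\sum_{i,j} Y_{ij}^2$, for which the closed form $\partial z/\partial \boldsymbol{X} = \boldsymbol{A}^\top \boldsymbol{Y}$ is available for comparison:

```python
import numpy as np

# Contract dz/dY_ij with dY_ij/dX_kl for Y = A X, z = sum(Y**2) / 2,
# and compare with the closed form dz/dX = A^T Y.
rng = np.random.default_rng(7)
p, m, n = 2, 3, 4
A = rng.standard_normal((p, m))
X = rng.standard_normal((m, n))
Y = A @ X

G = Y                                        # dz/dY for z = sum(Y**2) / 2
T = np.einsum('ik,jl->ijkl', A, np.eye(n))   # dY_ij/dX_kl = A_ik * delta_jl
grad_X = np.einsum('ij,ijkl->kl', G, T)      # the chain-rule contraction over (i, j)

print(np.allclose(grad_X, A.T @ Y))
```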
Contraction in Numerator Layout
In numerator layout, the index ordering differs from denominator layout, so index permutation is needed before contraction. Let us examine this concretely.
Under the numerator layout definition, the output indices come first and the input indices come second:
$$ \left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{num}}_{ijkl} = \frac{\partial Y_{ij}}{\partial X_{kl}} $$
In denominator layout, the input indices come first:
$$ \left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{den}}_{klij} = \frac{\partial Y_{ij}}{\partial X_{kl}} $$
The two are related by the index permutation $(i,j,k,l) \to (k,l,i,j)$:
$$ \left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{den}}_{klij} = \left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{num}}_{ijkl} $$
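In NumPy terms (the tensor here is a random stand-in for a derivative, not one computed from a function), this permutation is a single `np.transpose` with axes `(2, 3, 0, 1)`:

```python
import numpy as np

# den[k, l, i, j] = num[i, j, k, l]: move the last two axes to the front.
rng = np.random.default_rng(6)
p, q, m, n = 2, 3, 4, 5
T_num = rng.standard_normal((p, q, m, n))    # stand-in numerator-layout derivative
T_den = np.transpose(T_num, (2, 3, 0, 1))    # denominator layout, shape (m, n, p, q)

# Spot-check one entry of the index relation den[k,l,i,j] == num[i,j,k,l]:
print(T_den.shape, T_den[1, 2, 0, 1] == T_num[0, 1, 1, 2])
```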
Differences in the Chain Rule
Consider the composition $z = f(\boldsymbol{Y})$, $\boldsymbol{Y} = g(\boldsymbol{X})$, and apply the chain rule.
Denominator layout:
$$ \left( \frac{\partial z}{\partial \boldsymbol{X}} \right)^{\text{den}}_{kl} = \sum_{i,j} \left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{den}}_{klij} \left( \frac{\partial z}{\partial \boldsymbol{Y}} \right)^{\text{den}}_{ij} $$
Here, the trailing indices $(i,j)$ of $\displaystyle\left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{den}}_{klij}$ match the indices $(i,j)$ of $\displaystyle\left( \frac{\partial z}{\partial \boldsymbol{Y}} \right)^{\text{den}}_{ij}$, so adjacent indices contract directly, exactly as in matrix multiplication.
Numerator layout:
$$ \left( \frac{\partial z}{\partial \boldsymbol{Y}} \right)^{\text{num}}_{ij} = \frac{\partial z}{\partial Y_{ij}} $$
$$ \left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{num}}_{ijkl} = \frac{\partial Y_{ij}}{\partial X_{kl}} $$
Applying the chain rule:
$$ \left( \frac{\partial z}{\partial \boldsymbol{X}} \right)^{\text{num}}_{kl} = \sum_{i,j} \frac{\partial z}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial X_{kl}} = \sum_{i,j} \left( \frac{\partial z}{\partial \boldsymbol{Y}} \right)^{\text{num}}_{ij} \left( \frac{\partial \boldsymbol{Y}}{\partial \boldsymbol{X}} \right)^{\text{num}}_{ijkl} $$
At first glance this looks identical, but in numerator layout the contracted indices $(i,j)$ occupy the leading, output-side slots of the 4th-order tensor, so the sum no longer follows the adjacent-index pattern of denominator layout; one must track the index correspondence explicitly (or permute indices first) when composing longer chains.
Comparison with the Matrix Case
In the case of 2nd-order tensors (matrices), this index rearrangement is simply a transpose. For example, in the chain rule for a vector $\boldsymbol{y} = f(\boldsymbol{x})$:
Denominator layout: $$ \frac{\partial z}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \frac{\partial z}{\partial \boldsymbol{y}} $$ (Jacobian matrix $\times$ column vector)
Numerator layout: $$ \frac{\partial z}{\partial \boldsymbol{x}} = \frac{\partial z}{\partial \boldsymbol{y}} \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} $$ (row vector $\times$ Jacobian matrix)
In the matrix case, one can convert between the two by swapping the multiplication order and taking the transpose.
However, for tensors of order 4 or higher, a simple "transpose" is insufficient; a general index permutation is needed. For example, in the 4th-order tensor above, the permutation is $(i,j,k,l) \to (k,l,i,j)$, which is not a simple transpose (swapping two indices) but rather an operation that swaps the first and second halves of the four indices.
For even higher-order tensors (e.g., 6th-order), the permutation patterns become increasingly complex, making the expressions harder to follow. For this reason, denominator layout is often preferred when working with higher-order tensors.
Relationship with Deep Learning Frameworks
In deep learning frameworks such as PyTorch, JAX, and TensorFlow,
the gradients returned by .backward() or grad() for a scalar loss function always have the same shape as the input.
This is consistent with the denominator layout convention.
For example, if the input is a $(3, 4)$ matrix and the output is a scalar, the gradient is also returned as a $(3, 4)$ matrix. This is precisely the denominator layout definition: "differentiating a scalar with respect to a matrix yields a matrix of the same shape as the input."
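This behavior can be imitated with plain NumPy finite differences (the loss function below is an illustrative choice, and no actual framework is involved): the gradient array naturally carries the shape of the input.

```python
import numpy as np

# Finite-difference gradient of a scalar function of a (3, 4) matrix.
# The result has the same (3, 4) shape as the input, mirroring what
# frameworks' backward() / grad() return under denominator layout.
def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

X = np.arange(12, dtype=float).reshape(3, 4)
loss = lambda M: np.sum(M**2)          # scalar loss
G = num_grad(loss, X)

print(G.shape)                         # same shape as the input
assert np.allclose(G, 2 * X, atol=1e-5)  # matches the analytic gradient 2X
```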
A generalized discussion of matrix calculus from the tensor perspective can be found in Introduction to Tensor Calculus (coming soon). The relationship with automatic differentiation is covered in detail in Automatic Differentiation and Optimization (coming soon).
References and Related Articles
- Matrix calculus — Wikipedia
- The Matrix Cookbook — Petersen & Pedersen (2012)
- Gaussian Mixture Models (GMM) and EM Algorithm