Vector/Matrix Calculus Formula Sheet
Matrix Calculus Formulas
This document is a collection of formulas for multivariable function derivatives (vector calculus, matrix calculus) used in machine learning, statistics, optimization theory, control engineering, signal processing, and econometrics.
1. Overview
While multivariate analysis and tensor calculus can also handle derivatives, matrix calculus offers the following advantages as an independent framework:
- Index-free notation (index-free calculus)
- Algebraic manipulation via the vec operator and Kronecker product
- Closed-form gradient derivation for matrix functions in machine learning and statistics
Matrix calculus lies at the intersection of analysis (multivariate differentiation), linear algebra (matrix operations), and tensor analysis (multilinear algebra). In finite-dimensional Euclidean spaces, the Gâteaux and Fréchet derivatives coincide for the continuously differentiable functions treated here, and the derivatives in this formula sheet are consistent with the Fréchet derivative[1].
There are two main approaches to matrix calculus. One is the coordinate-free approach that treats derivatives tensorially without coordinates, and the other is the component-wise approach that explicitly writes out components. This formula sheet takes the latter approach, providing concrete component expressions for practical computation. From the coordinate-free perspective, matrix calculus is a special case of tensor calculus (differentiation with respect to second-order tensors). In that setting, gradients and Hessians are treated as covariant tensors. For the coordinate-free formulation, see Introduction to Tensor Calculus.
Throughout this series, the standard Euclidean inner product is assumed unless otherwise stated. This metric identifies the tangent and cotangent spaces, allowing the gradient to be treated as a vector. For the matrix space $\mathbb{R}^{m \times n}$, the Frobenius inner product $\langle \boldsymbol{A}, \boldsymbol{B} \rangle_F = \mathrm{tr}(\boldsymbol{A}^\top \boldsymbol{B})$ identifies the dual space $(\mathbb{R}^{m \times n})^*$ with $\mathbb{R}^{m \times n}$.
This series is restricted to functions on finite-dimensional Euclidean spaces $\mathbb{R}^n$. Extensions to Banach and Hilbert spaces are discussed in the Proof Collection, Chapter 1.
A mapping $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ is differentiable at $\boldsymbol{X}$ if there exists a linear mapping $Df(\boldsymbol{X}): \mathbb{R}^{m \times n} \to \mathbb{R}$ such that $$\lim_{\|\boldsymbol{H}\|_F \to 0} \frac{|f(\boldsymbol{X}+\boldsymbol{H}) - f(\boldsymbol{X}) - Df(\boldsymbol{X})[\boldsymbol{H}]|}{\|\boldsymbol{H}\|_F} = 0$$ The matrix $\nabla f(\boldsymbol{X}) \in \mathbb{R}^{m \times n}$ satisfying $Df(\boldsymbol{X})[\boldsymbol{H}] = \langle \nabla f(\boldsymbol{X}), \boldsymbol{H} \rangle_F$ via the Frobenius inner product is called the gradient of $f$.
The differential $Df(\boldsymbol{X})$ is a coordinate-independent linear functional (an element of the cotangent space $T_{\boldsymbol{X}}^* M$), and via the Riesz representation theorem it corresponds to the gradient $\nabla f(\boldsymbol{X})$ (an element of the tangent space $T_{\boldsymbol{X}} M$) through the inner product. The concrete form of the gradient depends on the choice of metric (inner product).
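As a numerical sanity check of this definition, the following minimal sketch (assuming NumPy; the function $f(\boldsymbol{X}) = \mathrm{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})$ and the random matrices are illustrative choices, not part of the formula sheet) compares a finite-difference directional derivative $Df(\boldsymbol{X})[\boldsymbol{H}]$ with the Frobenius inner product $\langle \nabla f(\boldsymbol{X}), \boldsymbol{H} \rangle_F$, using the gradient $(\boldsymbol{A}+\boldsymbol{A}^\top)\boldsymbol{X}$ from formula 5.22.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 3))
H = rng.standard_normal((4, 3))        # arbitrary perturbation direction

f = lambda X: np.trace(X.T @ A @ X)
grad = (A + A.T) @ X                   # gradient of tr(X^T A X), formula 5.22

eps = 1e-6
dir_deriv_fd = (f(X + eps * H) - f(X - eps * H)) / (2 * eps)  # Df(X)[H]
dir_deriv_ip = np.sum(grad * H)                               # <grad, H>_F

print(dir_deriv_fd, dir_deriv_ip)      # the two values agree to ~1e-8
```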
1.1 Notation and Definitions
When expressing derivatives of multivariable functions as matrices or vectors, there are two conventions: "denominator layout" and "numerator layout." This document adopts the denominator layout. For the differences between the two and field-specific conventions, see the Matrix Calculus Notation Guide.
In the denominator layout, the derivative result is defined as a matrix where "the dimension of the variable in the denominator corresponds to rows, and the dimension of the variable in the numerator corresponds to columns." This convention is widely used in machine learning, statistics, optimization theory, and econometrics, and has advantages such as the gradient vector naturally being a column vector.
The denominator layout is mainstream in the statistics and econometrics literature. On the other hand, the differential-based treatment in the standard textbook by Magnus & Neudecker, some engineering texts, and machine learning frameworks (e.g., PyTorch's autograd) work in the numerator layout. The two are related by transposition, and the formulas in this document can be converted to numerator layout by transposing. See Appendix A for details.
- Scalars: $a, b, c, \ldots$ or $x, y, z, u, v, w$ (lowercase italic)
- Vectors: $\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{c}, \ldots$ or $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}$ (lowercase bold)
- Matrices: $\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}, \ldots$ or $\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}$ (uppercase bold)
- Logarithm: $\log$ denotes the natural logarithm (base $e$); base-$a$ logarithm is written as $\log_a$
- Single-entry matrix: $\boldsymbol{J}^{ij}$ is the matrix with 1 at position $(i,j)$ and 0 elsewhere
- Gradient: $\nabla f$ or $\displaystyle\frac{\partial f}{\partial \boldsymbol{x}}$ (derivative of scalar $f$ w.r.t. vector $\boldsymbol{x}$, a column vector)
- Jacobian: $\displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}$ (derivative of vector $\boldsymbol{y}$ w.r.t. vector $\boldsymbol{x}$, a matrix)
- Hessian: $\displaystyle\frac{\partial^2 f}{\partial \boldsymbol{x} \partial \boldsymbol{x}^\top}$ or $\boldsymbol{H}$ (second derivative of scalar $f$, a symmetric matrix)
- Indexing: 0-based ($x_0, x_1, \ldots, x_{N-1}$ where $i = 0, \ldots, N-1$)
1.1.1 Derivative of Scalar by Vector
The derivative of a scalar $y$ with respect to an $N$-dimensional vector $\boldsymbol{x}$ is a column vector: $$\frac{\partial y}{\partial \boldsymbol{x}} = \left( \frac{\partial y}{\partial x_0}, \frac{\partial y}{\partial x_1}, \ldots, \frac{\partial y}{\partial x_{N-1}} \right)^\top \in \mathbb{R}^{N}$$
In the denominator layout, the derivative of a scalar-valued function with respect to a vector is defined as a column vector. This is so that the gradient vector can be directly used as the update direction in optimization.
1.1.2 Derivative of Vector by Scalar
In the denominator layout, the derivative of a vector $\boldsymbol{y}$ (column vector) with respect to a scalar $x$ is a row vector: $$\frac{\partial \boldsymbol{y}}{\partial x} = \left( \frac{\partial y_0}{\partial x}, \frac{\partial y_1}{\partial x}, \ldots, \frac{\partial y_{M-1}}{\partial x} \right) \in \mathbb{R}^{1 \times M}$$ This follows the rule "the numerator's index determines the columns, the denominator's index determines the rows": the numerator $\boldsymbol{y}$'s index $j$ runs along the columns, and since the denominator $x$ is a scalar with no index, the result is a $1 \times M$ row vector.
Here $\boldsymbol{y} = (y_0, y_1, \ldots, y_{M-1})^\top$ is an $M$-dimensional column vector. Following the denominator layout rule "denominator dimension × numerator dimension," $\partial \boldsymbol{y}/\partial x \in \mathbb{R}^{1 \times M}$ ($x$'s dimension 1 × $\boldsymbol{y}$'s dimension $M$).
1.1.3 Derivative of Vector by Vector
The derivative of an $M$-dimensional vector $\boldsymbol{y}$ with respect to an $N$-dimensional vector $\boldsymbol{x}$ is an $N \times M$ matrix whose $(i, j)$ entry is $\displaystyle\frac{\partial y_j}{\partial x_i}$.
1.1.4 Derivative of Scalar by Matrix
The derivative of a scalar function $f(\boldsymbol{X})$ with respect to an $m \times n$ matrix $\boldsymbol{X}$ is an $m \times n$ matrix whose $(i,j)$ entry is $\displaystyle\frac{\partial f}{\partial X_{ij}}$.
By this definition, the gradient matrix $\displaystyle\frac{\partial f}{\partial \boldsymbol{X}}$ has the same size as the original matrix $\boldsymbol{X}$. This is convenient for optimization algorithms like gradient descent, where $\boldsymbol{X} \leftarrow \boldsymbol{X} - \alpha \displaystyle\frac{\partial f}{\partial \boldsymbol{X}}$ can be written naturally.
1.2 Jacobian Matrix and Chain Rule
Consider the derivative of an $M$-dimensional vector-valued function $\boldsymbol{y}(\boldsymbol{x})$ with respect to an $N$-dimensional vector $\boldsymbol{x}$.
1.2.1 Definition of the Jacobian
In the denominator layout, the Jacobian is an $N \times M$ matrix: $$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \begin{pmatrix} \displaystyle\frac{\partial y_0}{\partial x_0} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_0} \\ \vdots & \ddots & \vdots \\ \displaystyle\frac{\partial y_0}{\partial x_{N-1}} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_{N-1}} \end{pmatrix}$$
That is, the $(i, j)$ entry is $\displaystyle\frac{\partial y_j}{\partial x_i}$. This means the "denominator" $\boldsymbol{x}$'s index determines the row, and the "numerator" $\boldsymbol{y}$'s index determines the column.
In the denominator layout, the Jacobian of a vector-valued function $\boldsymbol{y}: \mathbb{R}^N \to \mathbb{R}^M$ is $$\displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \in \mathbb{R}^{N \times M}$$ ("denominator dimension × numerator dimension"). This is the transpose of the Jacobian $\in \mathbb{R}^{M \times N}$ defined in the numerator layout.
1.2.2 Relation to Scalar Derivatives
When $\boldsymbol{y}$ is 1-dimensional (scalar $y$), the Jacobian reduces to an $N \times 1$ column vector, which equals the gradient: $$\frac{\partial y}{\partial \boldsymbol{x}} = \nabla y = \left( \frac{\partial y}{\partial x_0}, \ldots, \frac{\partial y}{\partial x_{N-1}} \right)^\top$$
1.2.3 Chain Rule
Consider the derivative of a composite function $\boldsymbol{z}(\boldsymbol{y}(\boldsymbol{x}))$. Here $\boldsymbol{x}$ is $N$-dimensional, $\boldsymbol{y}$ is $M$-dimensional, and $\boldsymbol{z}$ is $L$-dimensional.
1.2.3.1 Vector Chain Rule
Differentiating the $l$-th component $z_l$ of $\boldsymbol{z}$ with respect to the $i$-th component $x_i$ of $\boldsymbol{x}$, the usual multivariate chain rule gives: $$\frac{\partial z_l}{\partial x_i} = \sum_{j=0}^{M-1} \frac{\partial z_l}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$
In the denominator layout, $\displaystyle\left(\displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{il} = \displaystyle\frac{\partial z_l}{\partial x_i}$, so: $$\left(\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{il} = \sum_{j=0}^{M-1} \left(\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}\right)_{ij} \left(\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}}\right)_{jl}$$
This is precisely the definition of matrix multiplication: $$\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}}$$
The dimensions are $(N \times M) \cdot (M \times L) = N \times L$, as expected.
In the denominator layout, the rows and columns of the Jacobian correspond to the denominator and numerator dimensions, respectively. Therefore, the order of matrix multiplication in the chain rule is determined by dimensional consistency; reversing the order generally produces dimensionally incompatible factors.
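The ordering can be verified numerically. The sketch below (assuming NumPy; the linear maps $\boldsymbol{A}$ and $\boldsymbol{B}$ are arbitrary illustrative choices) builds the denominator-layout Jacobians of $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}$ and $\boldsymbol{z} = \boldsymbol{B}\boldsymbol{y}$ and checks that $\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}}$ agrees with a finite-difference Jacobian whose $(i, l)$ entry is $\partial z_l/\partial x_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L = 4, 3, 2
A = rng.standard_normal((M, N))   # y = A x,  y in R^M
B = rng.standard_normal((L, M))   # z = B y,  z in R^L

y = lambda x: A @ x
z = lambda x: B @ y(x)

# Denominator-layout Jacobians: dy/dx = A^T (N x M), dz/dy = B^T (M x L)
dz_dx_chain = A.T @ B.T           # (N x M)(M x L) = N x L

# Finite-difference Jacobian with (i, l) entry dz_l / dx_i
x0 = rng.standard_normal(N)
eps = 1e-6
dz_dx_fd = np.empty((N, L))
for i in range(N):
    e = np.zeros(N); e[i] = eps
    dz_dx_fd[i] = (z(x0 + e) - z(x0 - e)) / (2 * eps)

print(np.max(np.abs(dz_dx_chain - dz_dx_fd)))   # ~1e-10
```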
1.2.3.2 Scalar Output Case
When $z$ is a scalar ($L = 1$), $\displaystyle\frac{\partial z}{\partial \boldsymbol{y}}$ is an $M \times 1$ column vector (gradient), and the chain rule reads: $$\frac{\partial z}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \frac{\partial z}{\partial \boldsymbol{y}}$$
The dimensions are $(N \times M) \cdot (M \times 1) = N \times 1$, yielding the gradient with respect to $\boldsymbol{x}$.
1.2.3.3 Element-wise Functions
Let $f$ be a scalar function and consider $\boldsymbol{y} = (f(u_0), f(u_1), \ldots, f(u_{M-1}))^\top$ where $f$ is applied to each element of $\boldsymbol{u}$. Since $y_j = f(u_j)$: $$\frac{\partial y_j}{\partial u_i} = f'(u_i)\,\delta_{ij}$$
Therefore: $$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{u}} = \mathrm{diag}\bigl(f'(u_0), f'(u_1), \ldots, f'(u_{M-1})\bigr)$$
Combined with the chain rule: $$\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} \frac{\partial \boldsymbol{y}}{\partial \boldsymbol{u}} = \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}}\, \mathrm{diag}\bigl(f'(\boldsymbol{u})\bigr)$$
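A small numerical check of the element-wise rule combined with the chain rule (a sketch assuming NumPy, with $f = \tanh$ and $\boldsymbol{u} = \boldsymbol{A}\boldsymbol{x}$ as illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 3
A = rng.standard_normal((M, N))
x0 = rng.standard_normal(N)

u = lambda x: A @ x
y = lambda x: np.tanh(u(x))       # element-wise f = tanh

# Denominator layout: dy/dx = (du/dx) diag(f'(u)) = A^T diag(1 - tanh(u)^2)
dy_dx = A.T @ np.diag(1.0 - np.tanh(u(x0)) ** 2)

eps = 1e-6
dy_dx_fd = np.empty((N, M))
for i in range(N):
    e = np.zeros(N); e[i] = eps
    dy_dx_fd[i] = (y(x0 + e) - y(x0 - e)) / (2 * eps)

print(np.max(np.abs(dy_dx - dy_dx_fd)))   # small, ~1e-9
```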
2. Derivative of Scalar by Vector
Formulas for differentiating a scalar function $f$ with respect to a vector $\boldsymbol{x}$. Here $a$ is a scalar constant, $\boldsymbol{a}, \boldsymbol{b}$ are constant vectors, and $\boldsymbol{A}$ is a constant matrix. See Proof Collection, Chapter 2 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial a}{\partial \boldsymbol{x}} = \boldsymbol{0}$ | $a$ is constant | 2.1 |
| $\displaystyle\frac{\partial (\boldsymbol{a}^\top \boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{a}$ | | 2.2 |
| $\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{a})}{\partial \boldsymbol{x}} = \boldsymbol{a}$ | | 2.2 |
| $\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{x})}{\partial \boldsymbol{x}} = 2\boldsymbol{x}$ | | 2.3 |
| $\displaystyle\frac{\partial (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{A}^\top \boldsymbol{b}$ | Bilinear form | 2.4 |
| $\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = (\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x}$ | Quadratic form | 2.5 |
| $\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = 2\boldsymbol{A} \boldsymbol{x}$ | $\boldsymbol{A}$ symmetric | 2.5 |
| $\displaystyle\frac{\partial \|\boldsymbol{x} - \boldsymbol{a}\|}{\partial \boldsymbol{x}} = \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|}$ | 2-norm | 2.6 |
| $\displaystyle\frac{\partial \|\boldsymbol{x} - \boldsymbol{a}\|^2}{\partial \boldsymbol{x}} = 2(\boldsymbol{x} - \boldsymbol{a})$ | Squared 2-norm | 2.7 |
| $\displaystyle\frac{\partial (uv)}{\partial \boldsymbol{x}} = \displaystyle u \displaystyle\frac{\partial v}{\partial \boldsymbol{x}} + v \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ | Product rule | 2.9 |
| $\displaystyle\frac{\partial (\boldsymbol{u}^\top \boldsymbol{v})}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} \boldsymbol{v} + \displaystyle\frac{\partial \boldsymbol{v}}{\partial \boldsymbol{x}} \boldsymbol{u}$ | Inner product rule | 2.8 |
| $\displaystyle\frac{\partial (f + g)}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial f}{\partial \boldsymbol{x}} + \displaystyle\frac{\partial g}{\partial \boldsymbol{x}}$ | Sum rule | 2.10 |
| $\displaystyle\frac{\partial (cf)}{\partial \boldsymbol{x}} = \displaystyle c \displaystyle\frac{\partial f}{\partial \boldsymbol{x}}$ | Scalar multiplication | 2.11 |
| $\displaystyle\frac{\partial (u/v)}{\partial \boldsymbol{x}} = \displaystyle\frac{1}{v^2}\left( v \displaystyle\frac{\partial u}{\partial \boldsymbol{x}} - u \displaystyle\frac{\partial v}{\partial \boldsymbol{x}} \right)$ | Quotient rule | 2.12 |
| $\displaystyle\frac{\partial (1/u)}{\partial \boldsymbol{x}} = \displaystyle -\displaystyle\frac{1}{u^2} \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ | Reciprocal | 2.13 |
| $\displaystyle\frac{\partial u^n}{\partial \boldsymbol{x}} = \displaystyle n u^{n-1} \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ | Power rule | 2.14 |
| $\displaystyle\frac{\partial e^u}{\partial \boldsymbol{x}} = \displaystyle e^u \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ | Exponential | 2.15 |
| $\displaystyle\frac{\partial \log u}{\partial \boldsymbol{x}} = \displaystyle \displaystyle\frac{1}{u} \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ | Logarithm | 2.16 |
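Formulas of this kind are easy to verify by finite differences. The sketch below (assuming NumPy; the dimensions and random data are illustrative) checks the quadratic-form gradient (2.5) and the 2-norm gradient (2.6):

```python
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function of a vector."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
a = rng.standard_normal(n)
x = rng.standard_normal(n)

# Quadratic form: d(x^T A x)/dx = (A + A^T) x   (formula 2.5)
print(np.allclose(num_grad(lambda x: x @ A @ x, x), (A + A.T) @ x, atol=1e-5))

# 2-norm: d||x - a||/dx = (x - a)/||x - a||     (formula 2.6)
print(np.allclose(num_grad(lambda x: np.linalg.norm(x - a), x),
                  (x - a) / np.linalg.norm(x - a), atol=1e-5))
```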
3. Derivative of Vector by Vector
Formulas for differentiating a vector function $\boldsymbol{y}$ with respect to a vector $\boldsymbol{x}$. The result is a Jacobian matrix ($N \times M$). See Proof Collection, Chapter 3 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{x}} = \boldsymbol{I}$ | Identity | 3.1 |
| $\displaystyle\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{x}} = \boldsymbol{O}$ | Constant vector | 3.3 |
| $\displaystyle\frac{\partial (\boldsymbol{A}\boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{A}^\top$ | Linear transformation | 3.2 |
| $\displaystyle\frac{\partial (\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b})}{\partial \boldsymbol{x}} = \boldsymbol{A}^\top$ | Affine transformation | 3.4 |
| $\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{A})}{\partial \boldsymbol{x}} = \boldsymbol{A}$ | Transposed linear | 3.5 |
| $\displaystyle\frac{\partial (\boldsymbol{u} + \boldsymbol{v})}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} + \displaystyle\frac{\partial \boldsymbol{v}}{\partial \boldsymbol{x}}$ | Sum rule | 3.6 |
| $\displaystyle\frac{\partial (v \boldsymbol{u})}{\partial \boldsymbol{x}} = \displaystyle v \displaystyle\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} + \displaystyle\frac{\partial v}{\partial \boldsymbol{x}} \boldsymbol{u}^\top$ | Product rule (scalar × vector) | 3.7 |
| $\displaystyle\frac{\partial (\boldsymbol{x} \odot \boldsymbol{x})}{\partial \boldsymbol{x}} = 2\text{diag}(\boldsymbol{x})$ | Element-wise square | 3.8 |
| $\displaystyle\frac{\partial (\boldsymbol{x} \odot \boldsymbol{y})}{\partial \boldsymbol{z}} = \displaystyle\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{z}}\,\text{diag}(\boldsymbol{y}) + \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}}\,\text{diag}(\boldsymbol{x})$ | Hadamard product ($\boldsymbol{x} = \boldsymbol{x}(\boldsymbol{z})$, $\boldsymbol{y} = \boldsymbol{y}(\boldsymbol{z})$) | 3.12 |
| $\displaystyle\frac{d(\boldsymbol{x} \times \boldsymbol{y})}{dt} = \displaystyle\frac{d\boldsymbol{x}}{dt} \times \boldsymbol{y} + \boldsymbol{x} \times \displaystyle\frac{d\boldsymbol{y}}{dt}$ | Time derivative of cross product | 3.9 |
| $\displaystyle\frac{d\|\boldsymbol{x}(t)\|}{dt} = \displaystyle\frac{\boldsymbol{x}}{\|\boldsymbol{x}\|} \cdot \displaystyle\frac{d\boldsymbol{x}}{dt}$ | Time derivative of 2-norm | 3.10 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{u}} \begin{pmatrix} f(u_0) \\ \vdots \\ f(u_{N-1}) \end{pmatrix} = \text{diag}\begin{pmatrix} f'(u_0) \\ \vdots \\ f'(u_{N-1}) \end{pmatrix}$ | Element-wise function | 3.11 |
| $\displaystyle\frac{\partial \text{softmax}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top$ | Softmax ($\boldsymbol{p} = \text{softmax}(\boldsymbol{x})$) | 3.13 |
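As an example, the softmax Jacobian (3.13) can be checked against finite differences (a minimal sketch assuming NumPy; the input vector is an arbitrary example):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))     # shifted for numerical stability
    return z / z.sum()

rng = np.random.default_rng(4)
x = rng.standard_normal(5)
p = softmax(x)

jac_formula = np.diag(p) - np.outer(p, p)   # formula 3.13 / 6.2

eps = 1e-6
jac_fd = np.empty((5, 5))
for i in range(5):
    e = np.zeros(5); e[i] = eps
    jac_fd[i] = (softmax(x + e) - softmax(x - e)) / (2 * eps)

print(np.max(np.abs(jac_formula - jac_fd)))   # ~1e-10 (this Jacobian is symmetric)
```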
4. Basic Matrix Derivative Formulas
Basic formulas for differentiating a scalar function $f(\boldsymbol{X})$ with respect to a matrix $\boldsymbol{X}$. Here $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{a}, \boldsymbol{b}$ are constant vectors. See Proof Collection, Chapter 4 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \boldsymbol{a} \boldsymbol{b}^\top$ | Bilinear form | 4.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b}) = \boldsymbol{b} \boldsymbol{a}^\top$ | Transposed bilinear form | 4.2 |
| $\displaystyle\frac{\partial \boldsymbol{X}}{\partial X_{ij}} = \boldsymbol{J}^{ij}$ | Component derivative | 4.3 |
| $\displaystyle\frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{im} A_{nj} = (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij}$ | Product component derivative | 4.4 |
| $\displaystyle\frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{in} A_{mj} = (\boldsymbol{J}^{nm}\boldsymbol{A})_{ij}$ | Transposed product component | 4.5 |
5. Trace Derivatives
Derivative formulas for scalar functions involving the trace $\text{tr}(\cdot)$. See Proof Collection, Chapter 5 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}) = \boldsymbol{I}$ | Trace | 5.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top$ | Trace | 5.2 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{A}^\top$ | Trace | 5.3 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A}$ | Trace | 5.4 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}$ | Trace | 5.5 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top$ | Trace | 5.6 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top$ | Quadratic trace | 5.7 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$ | Quadratic form trace | 5.8 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{X}^\top) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$ | Quadratic form trace | 5.9 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$ | Quadratic form trace | 5.10 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$ | Quadratic form trace | 5.11 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{X}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$ | Quadratic form trace | 5.12 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$ | Quadratic form trace | 5.13 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = \boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top + \boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top$ | Quadratic form trace | 5.14 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = 2\boldsymbol{X}$ | Frobenius norm | 5.15 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = 2\boldsymbol{X}$ | Frobenius norm | 5.16 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top$ | Quadratic form trace | 5.17 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C}) = \boldsymbol{B}\boldsymbol{X}\boldsymbol{C} + \boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{C}^\top$ | Quadratic form trace | 5.18 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C}) = \boldsymbol{A}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$ | Quadratic form trace | 5.19 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})^\top] = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})\boldsymbol{B}^\top$ | Quadratic form trace | 5.20 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X} \otimes \boldsymbol{X}) = 2\text{tr}(\boldsymbol{X})\boldsymbol{I}$ | Kronecker product trace | 5.21 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = (\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{X}$ | Quadratic form trace | 5.22 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{A}^\top \boldsymbol{B}^\top$ | Product trace | 5.23 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}) = \boldsymbol{B}\boldsymbol{A}$ | Transposed product trace | 5.24 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A} \otimes \boldsymbol{X}) = \text{tr}(\boldsymbol{A})\boldsymbol{I}$ | Kronecker product trace | 5.25 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A}) = -\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{X}^{-\top}$ | Inverse trace | 5.26 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^k) = k(\boldsymbol{X}^{k-1})^\top$ | Higher-order trace | 5.27 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) = \displaystyle\sum_{r=0}^{k-1} (\boldsymbol{X}^r \boldsymbol{A} \boldsymbol{X}^{k-r-1})^\top$ | Higher-order trace | 5.28 |
| \begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) \notag\\&= \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X} \notag\\&\quad + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X} + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top \notag\end{align} | Higher-order quadratic form | 5.29 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -\boldsymbol{X}^{-\top}\boldsymbol{A}^\top\boldsymbol{B}^\top\boldsymbol{X}^{-\top}$ | Inverse trace | 5.30 |
| \begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{A}] \notag\\&= -\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{A}+\boldsymbol{A}^\top)(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\end{align}($\boldsymbol{C}$: symmetric) | Quadratic form inverse trace | 5.31 |
| \begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] \notag\\&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\\&\quad +2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\end{align}($\boldsymbol{B}, \boldsymbol{C}$: symmetric) | Quadratic form inverse trace | 5.32 |
| \begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] \notag\\&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X} \notag\\&\qquad \times(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} +2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\end{align}($\boldsymbol{B}, \boldsymbol{C}$: symmetric) | Quadratic form inverse trace | 5.33 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\exp(\boldsymbol{X})) = \exp(\boldsymbol{X})^\top$ | Exponential trace | 5.34 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\log(\boldsymbol{X})) = \boldsymbol{X}^{-\top}$ | Logarithm trace | 5.35 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sqrt{\boldsymbol{X}}) = \displaystyle\frac{1}{2}(\boldsymbol{X}^{-1/2})^\top$ ($\boldsymbol{X}$: positive definite) | Square root trace | 5.36 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sin(\boldsymbol{X})) = \cos(\boldsymbol{X})^\top$ | Sine trace | 5.37 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cos(\boldsymbol{X})) = -\sin(\boldsymbol{X})^\top$ | Cosine trace | 5.38 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tan(\boldsymbol{X})) = \sec^2(\boldsymbol{X})^\top$ | Tangent trace | 5.39 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arcsin(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ | Arcsine trace | 5.40 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arccos(\boldsymbol{X})) = -((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ | Arccosine trace | 5.41 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arctan(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$ | Arctangent trace | 5.42 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sinh(\boldsymbol{X})) = \cosh(\boldsymbol{X})^\top$ | Hyperbolic sine trace | 5.43 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cosh(\boldsymbol{X})) = \sinh(\boldsymbol{X})^\top$ | Hyperbolic cosine trace | 5.44 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tanh(\boldsymbol{X})) = \text{sech}^2(\boldsymbol{X})^\top$ | Hyperbolic tangent trace | 5.45 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arcsinh}(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2})^\top$ | Inverse hyperbolic sine trace | 5.46 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arccosh}(\boldsymbol{X})) = ((\boldsymbol{X}^2-\boldsymbol{I})^{-1/2})^\top$ | Inverse hyperbolic cosine trace | 5.47 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arctanh}(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1})^\top$ | Inverse hyperbolic tangent trace | 5.48 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\exp(\boldsymbol{X})) = (\boldsymbol{A}\exp(\boldsymbol{X}))^\top$ | Matrix-coefficient exponential trace | 5.50 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sin(\boldsymbol{X})) = (\boldsymbol{A}\cos(\boldsymbol{X}))^\top$ | Matrix-coefficient sine trace | 5.49 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cos(\boldsymbol{X})) = -(\boldsymbol{A}\sin(\boldsymbol{X}))^\top$ | Matrix-coefficient cosine trace | 5.51 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tan(\boldsymbol{X})) = (\boldsymbol{A}\sec^2(\boldsymbol{X}))^\top$ | Matrix-coefficient tangent trace | 5.52 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arcsin(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ | Matrix-coefficient arcsine trace | 5.53 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arccos(\boldsymbol{X})) = -(\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ | Matrix-coefficient arccosine trace | 5.54 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arctan(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$ | Matrix-coefficient arctangent trace | 5.55 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X})) = (\boldsymbol{A}\cosh(\boldsymbol{X}))^\top$ | Matrix-coefficient hyp. sine trace | 5.56 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X})) = (\boldsymbol{A}\sinh(\boldsymbol{X}))^\top$ | Matrix-coefficient hyp. cosine trace | 5.57 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = (\boldsymbol{A}\text{sech}^2(\boldsymbol{X}))^\top$ | Matrix-coefficient hyp. tangent trace | 5.58 |
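Any of these trace identities can be spot-checked by finite differences. The sketch below (assuming NumPy; matrix sizes and values are illustrative) verifies formula 5.19:

```python
import numpy as np

def num_grad_mat(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X); E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(5)
m, n = 4, 3
A = rng.standard_normal((m, m)); B = rng.standard_normal((n, n))
C = rng.standard_normal((m, m)); X = rng.standard_normal((m, n))

# Formula 5.19: d tr(A X B X^T C)/dX = A^T C^T X B^T + C A X B
f = lambda X: np.trace(A @ X @ B @ X.T @ C)
print(np.allclose(num_grad_mat(f, X),
                  A.T @ C.T @ X @ B.T + C @ A @ X @ B, atol=1e-5))
```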
6. Hadamard Product and Activation Functions
Derivative formulas for the element-wise (Hadamard) product and activation functions commonly used in machine learning. See Proof Collection, Chapter 6 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial (\boldsymbol{x} \odot \boldsymbol{y})}{\partial \boldsymbol{z}} = \displaystyle\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{z}}\,\text{diag}(\boldsymbol{y}) + \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}}\,\text{diag}(\boldsymbol{x})$ | Hadamard product | 6.1 |
| $\displaystyle\frac{\partial \text{softmax}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top$ ($\boldsymbol{p} = \text{softmax}(\boldsymbol{x})$) | Softmax Jacobian | 6.2 |
| $\displaystyle\frac{\partial \sigma(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x})))$ ($\sigma(x) = \displaystyle\frac{1}{1+e^{-x}}$) | Sigmoid | 6.3 |
| $\displaystyle\frac{\partial \tanh(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(1 - \tanh^2(\boldsymbol{x}))$ | Tanh | 6.4 |
| $\displaystyle\frac{\partial \text{ReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\mathbf{1}_{x_i > 0})$ ($\text{ReLU}(x) = \max(0, x)$) | ReLU (subgradient at $x = 0$) | 6.5 |
| $\displaystyle\frac{\partial \text{LeakyReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\mathbf{1}_{x_i > 0} + \alpha \cdot \mathbf{1}_{x_i \leq 0})$ ($\text{LeakyReLU}(x) = \max(\alpha x, x)$) | Leaky ReLU | 6.6 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} (-\boldsymbol{y}^\top \log \boldsymbol{p}) = \boldsymbol{p} - \boldsymbol{y}$ ($\boldsymbol{p} = \text{softmax}(\boldsymbol{x})$) | Cross-entropy loss (softmax + CE) | 6.7 |
| $\displaystyle\frac{\partial}{\partial x} \text{BCE}(y, \sigma(x)) = \sigma(x) - y$ (BCE = $-y\log\sigma(x) - (1-y)\log(1-\sigma(x))$) | Binary cross-entropy (sigmoid + BCE) | 6.8 |
| $\displaystyle\frac{\partial \text{GELU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\Phi(\boldsymbol{x}) + \boldsymbol{x} \odot \phi(\boldsymbol{x}))$ ($\text{GELU}(x) = x \cdot \Phi(x)$) | GELU $\Phi$: standard normal CDF $\phi$: standard normal PDF | 6.9 |
| $\displaystyle\frac{\partial \text{Swish}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\sigma(\boldsymbol{x}) + \boldsymbol{x} \odot \sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x})))$ ($\text{Swish}(x) = x \cdot \sigma(x)$) | Swish (SiLU) | 6.10 |
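The combined softmax + cross-entropy gradient (6.7) can be checked numerically (a sketch assuming NumPy; the one-hot target $\boldsymbol{y}$ and the logits are illustrative):

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(6)
x = rng.standard_normal(4)
y = np.array([0.0, 1.0, 0.0, 0.0])         # one-hot target

loss = lambda x: -y @ np.log(softmax(x))   # cross-entropy on the softmax output

eps = 1e-6
grad_fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])

print(np.allclose(grad_fd, softmax(x) - y, atol=1e-5))   # formula 6.7
```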
7. Determinant Derivatives
Derivative formulas for the determinant $\det(\boldsymbol{X})$ and related functions. See Proof Collection, Chapter 7 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}| = |\boldsymbol{X}| \boldsymbol{X}^{-\top}$ | Determinant | 7.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log|\boldsymbol{X}| = \boldsymbol{X}^{-\top}$ | Log-determinant | 7.2 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^n| = n|\boldsymbol{X}^n| \boldsymbol{X}^{-\top}$ | Power of determinant | 7.3 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}| = |\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}| \boldsymbol{X}^{-\top}$ ($\boldsymbol{A}, \boldsymbol{B}$: square invertible) | Product determinant | 7.7 |
| $\displaystyle\sum_{k} \displaystyle\frac{\partial |\boldsymbol{X}|}{\partial X_{ik}} X_{jk} = \delta_{ij} |\boldsymbol{X}|$ | Cofactor expansion property | 7.5 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| = 2|\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| \boldsymbol{X}^{-\top}$ ($\boldsymbol{X}$: square invertible) | Quadratic form determinant | 7.8.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| = 2|\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| \boldsymbol{A}\boldsymbol{X}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X})^{-1}$ ($\boldsymbol{X}$: non-square, $\boldsymbol{A}$: symmetric, $\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}$: invertible) | Quadratic form determinant | 7.8.2 |
| \begin{align}\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| &= |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}|(\boldsymbol{A}\boldsymbol{X}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X})^{-1} \notag\\&\quad + \boldsymbol{A}^\top \boldsymbol{X}(\boldsymbol{X}^\top \boldsymbol{A}^\top \boldsymbol{X})^{-1}) \notag\end{align}($\boldsymbol{X}$: non-square, $\boldsymbol{A}$: general, $\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}$: invertible) | Quadratic form determinant | 7.8.3 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log |\boldsymbol{X}^\top \boldsymbol{X}| = 2(\boldsymbol{X}^{+})^\top$ | Gram matrix log-determinant | 7.9.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}^{+}} \log |\boldsymbol{X}^\top \boldsymbol{X}| = -2\boldsymbol{X}^\top$ | Pseudo-inverse derivative | 7.9.2 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log |\det(\boldsymbol{X})| = (\boldsymbol{X}^{-1})^\top = (\boldsymbol{X}^\top)^{-1}$ | Log absolute determinant | 7.10 |
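The log-determinant gradient (7.2) can be verified with finite differences (a sketch assuming NumPy; the shift $n\boldsymbol{I}$ merely keeps the example matrix well conditioned with positive determinant):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # well conditioned, det > 0

f = lambda X: np.linalg.slogdet(X)[1]             # log|X|

eps = 1e-6
G = np.zeros((n, n))
for idx in np.ndindex(n, n):
    E = np.zeros((n, n)); E[idx] = eps
    G[idx] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(G, np.linalg.inv(X).T, atol=1e-5))   # formula 7.2
```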
8. Inverse Matrix Derivatives
Derivative formulas for functions involving the inverse $\boldsymbol{X}^{-1}$. See Proof Collection, Chapter 8 for proofs.
8.1 Regular Inverse Matrix Derivatives
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial (\boldsymbol{X}^{-1})_{kl}}{\partial X_{ij}} = -(\boldsymbol{X}^{-1})_{ki}(\boldsymbol{X}^{-1})_{jl}$ | Inverse component derivative | 8.2 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top \boldsymbol{X}^{-1} \boldsymbol{b} = -\boldsymbol{X}^{-\top} \boldsymbol{a} \boldsymbol{b}^\top \boldsymbol{X}^{-\top}$ | Quadratic form with inverse | 8.3 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^{-1}| = -|\boldsymbol{X}^{-1}|(\boldsymbol{X}^{-1})^\top$ | Determinant of inverse | 8.4 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -(\boldsymbol{X}^{-1}\boldsymbol{B}\boldsymbol{A}\boldsymbol{X}^{-1})^\top$ | Trace with inverse | 8.5 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}((\boldsymbol{X}+\boldsymbol{A})^{-1}) = -((\boldsymbol{X}+\boldsymbol{A})^{-1}(\boldsymbol{X}+\boldsymbol{A})^{-1})^\top$ | Trace of sum inverse | 8.6 |
| $\displaystyle\frac{\partial J}{\partial \boldsymbol{A}} = -\boldsymbol{A}^{-\top} \displaystyle\frac{\partial J}{\partial \boldsymbol{W}} \boldsymbol{A}^{-\top}$ (where $\boldsymbol{W} = \boldsymbol{A}^{-1}$) | Inverse chain rule | 8.7 |
| $\displaystyle\frac{\partial}{\partial A_{ij}} (\boldsymbol{I} - \boldsymbol{A})^{-1} = \boldsymbol{L} \boldsymbol{E}_{ij} \boldsymbol{L}$ ($\boldsymbol{L} = (\boldsymbol{I} - \boldsymbol{A})^{-1}$: Leontief inverse) ($\boldsymbol{E}_{ij}$: matrix with 1 at $(i,j)$ only) | Leontief inverse derivative (input-output analysis) | 8.8 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{A}} \text{tr}((\boldsymbol{I} - \boldsymbol{A})^{-1}) = ((\boldsymbol{I} - \boldsymbol{A})^{-1}(\boldsymbol{I} - \boldsymbol{A})^{-1})^\top$ | Leontief inverse trace | 8.9 |
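A finite-difference check of the inverse bilinear form (8.3), as a sketch assuming NumPy with an illustrative, safely invertible $\boldsymbol{X}$:

```python
import numpy as np

rng = np.random.default_rng(8)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)   # keep X safely invertible
a, b = rng.standard_normal(n), rng.standard_normal(n)

f = lambda X: a @ np.linalg.inv(X) @ b

eps = 1e-6
G = np.zeros((n, n))
for idx in np.ndindex(n, n):
    E = np.zeros((n, n)); E[idx] = eps
    G[idx] = (f(X + E) - f(X - E)) / (2 * eps)

Xit = np.linalg.inv(X).T
print(np.allclose(G, -Xit @ np.outer(a, b) @ Xit, atol=1e-5))   # formula 8.3
```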
8.2 Moore-Penrose Pseudoinverse Derivatives
Derivatives of the Moore-Penrose pseudoinverse $\boldsymbol{X}^+ \in \mathbb{R}^{n \times m}$ ($\boldsymbol{X} \in \mathbb{R}^{m \times n}$). Used in robotics (redundant manipulators), least squares, and signal processing. $\boldsymbol{X}^+$ satisfies $\boldsymbol{X}\boldsymbol{X}^+\boldsymbol{X} = \boldsymbol{X}$, $\boldsymbol{X}^+\boldsymbol{X}\boldsymbol{X}^+ = \boldsymbol{X}^+$, etc.
| Formula | Notes | Proof |
|---|---|---|
| \begin{align}d\boldsymbol{X}^+ &= -\boldsymbol{X}^+ (d\boldsymbol{X}) \boldsymbol{X}^+ + \boldsymbol{X}^{+\top}\boldsymbol{X}^\top (d\boldsymbol{X})^\top (\boldsymbol{I} - \boldsymbol{X}\boldsymbol{X}^+) \notag\\&\quad + (\boldsymbol{I} - \boldsymbol{X}^+\boldsymbol{X})(d\boldsymbol{X})^\top \boldsymbol{X}^{+\top}\boldsymbol{X}^+ \notag\end{align}(full rank, $m \le n$) | Golub-Pereyra formula | 8.10 |
| $d\boldsymbol{X}^+ = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}(d\boldsymbol{X})^\top(\boldsymbol{I} - \boldsymbol{X}\boldsymbol{X}^+) - \boldsymbol{X}^+(d\boldsymbol{X})\boldsymbol{X}^+$ (full column rank, $m \ge n$) | Left inverse type | 8.11 |
| $d\boldsymbol{X}^+ = (\boldsymbol{I} - \boldsymbol{X}^+\boldsymbol{X})(d\boldsymbol{X})^\top(\boldsymbol{X}\boldsymbol{X}^\top)^{-1} - \boldsymbol{X}^+(d\boldsymbol{X})\boldsymbol{X}^+$ (full row rank, $m \le n$) | Right inverse type | 8.12 |
| \begin{align}\frac{d\boldsymbol{X}^+}{dt} &= -\boldsymbol{X}^+ \dot{\boldsymbol{X}} \boldsymbol{X}^+ + \boldsymbol{X}^{+\top}\boldsymbol{X}^\top \dot{\boldsymbol{X}}^\top (\boldsymbol{I} - \boldsymbol{X}\boldsymbol{X}^+) \notag\\&\quad + (\boldsymbol{I} - \boldsymbol{X}^+\boldsymbol{X})\dot{\boldsymbol{X}}^\top \boldsymbol{X}^{+\top}\boldsymbol{X}^+ \notag\end{align}(time derivative) | Used for robot Jacobian time derivative | 8.13 |
| $\boldsymbol{X}^+ = \boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{X}^\top)^{-1}$ (full row rank) | Right inverse | — |
| $\boldsymbol{X}^+ = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top$ (full column rank) | Left inverse | — |
In robotics, the pseudoinverse $\boldsymbol{J}^+$ of the Jacobian $\boldsymbol{J}(\boldsymbol{q})$ is used to solve inverse kinematics: $\dot{\boldsymbol{q}} = \boldsymbol{J}^+ \dot{\boldsymbol{x}}$. For redundant manipulators (joints $n$ > workspace dimension $m$), $\boldsymbol{J}^+$ gives the minimum-norm solution.
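The minimum-norm property can be illustrated with a generic full-row-rank matrix standing in for a robot Jacobian (a sketch assuming NumPy; it is not tied to any particular manipulator):

```python
import numpy as np

rng = np.random.default_rng(9)
m, n = 3, 6                      # workspace dim < joint dim (redundant arm)
J = rng.standard_normal((m, n))  # stand-in for a robot Jacobian J(q)
x_dot = rng.standard_normal(m)

q_dot = np.linalg.pinv(J) @ x_dot      # minimum-norm solution q_dot = J^+ x_dot

print(np.allclose(J @ q_dot, x_dot))   # satisfies the task constraint
# Any other solution q_dot + v, with v in the null space of J, has a larger norm:
v = (np.eye(n) - np.linalg.pinv(J) @ J) @ rng.standard_normal(n)
print(np.linalg.norm(q_dot) <= np.linalg.norm(q_dot + v))
```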
9. Eigenvalue/Eigenvector Derivatives
Derivative formulas for eigenvalues $\lambda_i$ and eigenvectors $\boldsymbol{v}_i$. See Proof Collection, Chapter 9 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \sum_i \lambda_i(\boldsymbol{X}) = \boldsymbol{I}$ | Sum of eigenvalues | 9.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \prod_i \lambda_i(\boldsymbol{X}) = \det(\boldsymbol{X}) \boldsymbol{X}^{-\top}$ | Product of eigenvalues | 9.2 |
| $\partial \lambda_i = \boldsymbol{v}_i^\top \partial\boldsymbol{A} \, \boldsymbol{v}_i$ ($\boldsymbol{A}$: real symmetric) | Eigenvalue derivative | 9.3 |
| $\partial \boldsymbol{v}_i = (\lambda_i \boldsymbol{I} - \boldsymbol{A})^+ \partial\boldsymbol{A} \, \boldsymbol{v}_i$ ($\boldsymbol{A}$: real symmetric) | Eigenvector derivative | 9.4 |
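The first-order eigenvalue perturbation formula (9.3) can be checked numerically for a random symmetric matrix (a sketch assuming NumPy; it tracks the largest eigenvalue, which is generically simple):

```python
import numpy as np

rng = np.random.default_rng(10)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2   # real symmetric
E = rng.standard_normal((n, n)); E = (E + E.T) / 2   # symmetric perturbation

lam, V = np.linalg.eigh(A)
i = n - 1                                            # largest eigenvalue (assumed simple)

eps = 1e-6
lam_plus = np.linalg.eigh(A + eps * E)[0][i]
lam_minus = np.linalg.eigh(A - eps * E)[0][i]
dlam_fd = (lam_plus - lam_minus) / (2 * eps)

print(dlam_fd, V[:, i] @ E @ V[:, i])    # should agree (formula 9.3)
```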
10. Quadratic Form Derivatives
Derivative formulas for vector and matrix quadratic forms. See Proof Collection, Chapter 10 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{a}) = \boldsymbol{a} \boldsymbol{a}^\top$ | Matrix quadratic form | 10.1 |
| $\displaystyle\frac{\partial}{\partial X_{ij}} \left(\sum_{k,l} X_{kl}\right)^2 = 2 \displaystyle\sum_{k,l} X_{kl}$ | Squared component sum | 10.2 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{b}^\top \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{c}) = \boldsymbol{X} (\boldsymbol{b} \boldsymbol{c}^\top + \boldsymbol{c} \boldsymbol{b}^\top)$ | Gram matrix bilinear form | 10.3 |
| \begin{align}&\frac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{B}\boldsymbol{x}+\boldsymbol{b})^\top \boldsymbol{C} (\boldsymbol{D}\boldsymbol{x}+\boldsymbol{d}) \notag\\&= \boldsymbol{B}^\top \boldsymbol{C} (\boldsymbol{D}\boldsymbol{x}+\boldsymbol{d}) + \boldsymbol{D}^\top \boldsymbol{C}^\top (\boldsymbol{B}\boldsymbol{x}+\boldsymbol{b}) \notag\end{align} | General quadratic form | 10.4 |
| $\displaystyle\frac{\partial (\boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{X})}{\partial X_{ij}} = \boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} \boldsymbol{B} \boldsymbol{X}$ | Matrix quadratic form component | 10.5 |
| $\displaystyle\frac{\partial \boldsymbol{x}^\top \boldsymbol{B} \boldsymbol{x}}{\partial \boldsymbol{x}} = (\boldsymbol{B} + \boldsymbol{B}^\top)\boldsymbol{x}$ | Vector quadratic form | 10.6 |
| $\displaystyle\frac{\partial \boldsymbol{b}^\top \boldsymbol{X}^\top \boldsymbol{D} \boldsymbol{X} \boldsymbol{c}}{\partial \boldsymbol{X}} = \boldsymbol{D}^\top \boldsymbol{X} \boldsymbol{b} \boldsymbol{c}^\top + \boldsymbol{D} \boldsymbol{X} \boldsymbol{c} \boldsymbol{b}^\top$ | Generalized Gram bilinear form | 10.7 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{X}\boldsymbol{b} + \boldsymbol{c})^\top \boldsymbol{D} (\boldsymbol{X}\boldsymbol{b} + \boldsymbol{c}) = (\boldsymbol{D} + \boldsymbol{D}^\top)(\boldsymbol{X}\boldsymbol{b} + \boldsymbol{c})\boldsymbol{b}^\top$ | Affine quadratic form | 10.8 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x} - \boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{s}) = 2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{s})$ ($\boldsymbol{W}$: symmetric) | Symmetric quadratic ($\boldsymbol{x}$ deriv.) | 10.9 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{s}} (\boldsymbol{x} - \boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{s}) = -2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{s})$ ($\boldsymbol{W}$: symmetric) | Symmetric quadratic ($\boldsymbol{s}$ deriv.) | 10.10 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s}) = 2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})$ ($\boldsymbol{W}$: symmetric) | Affine symmetric ($\boldsymbol{x}$ deriv.) | 10.11 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{s}} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s}) = -2\boldsymbol{A}^\top\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})$ ($\boldsymbol{W}$: symmetric) | Affine symmetric ($\boldsymbol{s}$ deriv.) | 10.12 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{A}} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s}) = -2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})\boldsymbol{s}^\top$ ($\boldsymbol{W}$: symmetric) | Affine symmetric ($\boldsymbol{A}$ deriv.) | 10.13 |
11. Matrix Powers and Composite Functions
Derivative formulas for functions involving matrix powers $\boldsymbol{X}^n$, composite functions, and the Rayleigh quotient. See Proof Collection, Chapter 11 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial (\boldsymbol{X}^n)_{kl}}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r})_{kl}$ | Matrix power component | 11.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top \boldsymbol{X}^n \boldsymbol{b} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r)^\top \boldsymbol{a} \boldsymbol{b}^\top (\boldsymbol{X}^{n-1-r})^\top$ | Bilinear form of power | 11.2 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top (\boldsymbol{X}^n)^\top \boldsymbol{X}^n \boldsymbol{b} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r)^\top \boldsymbol{X}^n (\boldsymbol{b}\boldsymbol{a}^\top + \boldsymbol{a}\boldsymbol{b}^\top) (\boldsymbol{X}^{n-1-r})^\top$ | Gram form of power | 11.3 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \boldsymbol{s}(\boldsymbol{x})^\top \boldsymbol{A} \boldsymbol{r}(\boldsymbol{x}) = \displaystyle\frac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}} \boldsymbol{A} \boldsymbol{r} + \displaystyle\frac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}} \boldsymbol{A}^\top \boldsymbol{s}$ | Composite bilinear form | 11.4 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \displaystyle\frac{(\boldsymbol{A}\boldsymbol{x})^\top (\boldsymbol{A}\boldsymbol{x})}{(\boldsymbol{B}\boldsymbol{x})^\top (\boldsymbol{B}\boldsymbol{x})} = \displaystyle 2\displaystyle\frac{\boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}}{\boldsymbol{x}^\top \boldsymbol{B}^\top\boldsymbol{B}\boldsymbol{x}} - 2\displaystyle\frac{\boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x} \cdot \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}}{(\boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x})^2}$ | Rayleigh quotient | 11.5 |
| $\nabla_{\boldsymbol{x}} f = (\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{x} + \boldsymbol{b}$ ($f(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b}^\top \boldsymbol{x}$) | Gradient | 11.6a |
| $\displaystyle\frac{\partial^2 f}{\partial \boldsymbol{x} \partial \boldsymbol{x}^\top} = \boldsymbol{A} + \boldsymbol{A}^\top$ ($f(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b}^\top \boldsymbol{x}$) | Hessian | 11.6b |
11.1 Matrix Exponential Derivatives
Derivatives of the matrix exponential $e^{\boldsymbol{A}} = \sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{A}^k}{k!}$ (Fréchet derivative). Important in Lie groups/algebras, differential equations, and control theory. The sensitivity of $e^{\boldsymbol{A}}$ to perturbations is measured by a condition number defined through the operator norm $\|L(\boldsymbol{A}, \cdot)\|$ of the Fréchet derivative operator $L$ (11.10).
| Formula | Notes | Proof |
|---|---|---|
| $D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] = \displaystyle\int_0^1 e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}} ds$ (Fréchet derivative in direction $\boldsymbol{E}$) | Fréchet derivative of matrix exponential | 11.7 |
| $\displaystyle\frac{\partial}{\partial t} e^{t\boldsymbol{A}} = \boldsymbol{A} e^{t\boldsymbol{A}} = e^{t\boldsymbol{A}} \boldsymbol{A}$ | Scalar parameter derivative | 11.8 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{A}} \text{tr}(e^{\boldsymbol{A}}) = (e^{\boldsymbol{A}})^\top$ | Trace of matrix exponential | 11.9 |
11.2 Matrix Square Root Gradient
Gradient of the matrix square root $\boldsymbol{A}^{1/2}$ of a positive definite matrix $\boldsymbol{A}$ (satisfying $\boldsymbol{A} = \boldsymbol{A}^{1/2}\boldsymbol{A}^{1/2}$).
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial L}{\partial \boldsymbol{A}} = \boldsymbol{X}$, where $\boldsymbol{X}$ solves the Sylvester equation $\boldsymbol{S}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{S} = \bar{\boldsymbol{S}}$ ($\boldsymbol{S} = \boldsymbol{A}^{1/2}$) | $L$: scalar loss, $\bar{\boldsymbol{S}}$: upstream gradient $\displaystyle\frac{\partial L}{\partial \boldsymbol{S}}$ | 11.10 |
12. Norm Derivatives
Derivative formulas for vector and matrix norms. See Proof Collection, Chapter 12 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|_2 = \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2}$ | 2-norm derivative | 12.1 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} = \displaystyle\frac{\boldsymbol{I}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} - \displaystyle\frac{(\boldsymbol{x} - \boldsymbol{a})(\boldsymbol{x} - \boldsymbol{a})^\top}{\|\boldsymbol{x} - \boldsymbol{a}\|_2^3}$ | Normalized vector derivative | 12.2 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\boldsymbol{x}$ | Squared 2-norm | 12.3 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X}$ | Squared Frobenius norm | 12.4 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F = \displaystyle\frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F}$ | Frobenius norm | 12.5 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A})$ | Squared Frobenius of difference | 12.6 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}\|_F^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B})$ | Regression residual (left multiply) | 12.7 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F^2 = 2(\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B})\boldsymbol{A}^\top$ | Regression residual (right multiply) | 12.8 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{w}} \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$ | Regression weight gradient | 12.9 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{W}} \displaystyle\frac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W}$ | L2 regularization (weight decay) | 12.10 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \text{sign}(\boldsymbol{W})$ | L1 regularization (subgradient) $[-1, 1]$ at $W_{ij} = 0$ | 12.11 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{\alpha}}\left(\displaystyle\frac{1}{2}\|\boldsymbol{x} - \boldsymbol{D}\boldsymbol{\alpha}\|^2 + \lambda\|\boldsymbol{\alpha}\|_1\right) = \boldsymbol{D}^\top(\boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}) + \lambda \cdot \text{sign}(\boldsymbol{\alpha})$ | LASSO (L1-regularized regression), subgradient | 12.12 |
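The least-squares gradient (12.9) can be verified by finite differences (a sketch assuming NumPy with illustrative data):

```python
import numpy as np

rng = np.random.default_rng(11)
m, n = 8, 3
X = rng.standard_normal((m, n))
y = rng.standard_normal(m)
w = rng.standard_normal(n)

f = lambda w: np.sum((X @ w - y) ** 2)    # squared residual ||Xw - y||^2

eps = 1e-6
grad_fd = np.array([(f(w + eps * e) - f(w - eps * e)) / (2 * eps)
                    for e in np.eye(n)])

print(np.allclose(grad_fd, 2 * X.T @ (X @ w - y), atol=1e-5))   # formula 12.9
```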
13. Structured Matrix Derivatives
Derivative formulas for matrices with structure such as symmetric, diagonal, and Toeplitz matrices. See Proof Collection, Chapter 13 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{df}{dA_{ij}} = \displaystyle\text{tr}\left[\left(\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\right)^\top \boldsymbol{S}^{ij}\right]$ ($\boldsymbol{A}$: structured matrix) | Structured matrix derivative (general) | 13.1 |
| $\displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}} = \boldsymbol{J}^{ij}$ ($\boldsymbol{A}$: general matrix) | Structured matrix (general) | 13.2 |
| $\displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}} = \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} - \delta_{ij}\boldsymbol{J}^{ij}$ ($\boldsymbol{A}$: symmetric) | Structured matrix (symmetric) | 13.3 |
| \begin{align}&\frac{\partial f}{\partial \boldsymbol{A}} = \frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{sym}} = \frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}} + \left(\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right)^\top \notag\\&\quad - \text{diag}\left(\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right) \notag\end{align}(symmetric $\boldsymbol{A}$) | Derivative w.r.t. symmetric matrix | 13.4 |
Here $\boldsymbol{S}^{ij} = \displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}}$ is the structure matrix, representing how the entire matrix changes when $A_{ij}$ is varied.
13.1 Vec Operator and Related Matrices
The vec operator converts a matrix to a column vector, along with the commutation matrix and duplication matrix. These are fundamental tools for treating matrix derivatives as linear transformations.
| Formula | Notes | Proof |
|---|---|---|
| $\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$ | Vectorization of matrix product | 13.5 |
| $\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A}^\top)$ ($\boldsymbol{A}$: $m \times n$) | Commutation matrix | 13.6 |
| $\boldsymbol{K}_{mn}(\boldsymbol{A} \otimes \boldsymbol{B}) = (\boldsymbol{B} \otimes \boldsymbol{A})\boldsymbol{K}_{pq}$ ($\boldsymbol{A}$: $m \times p$, $\boldsymbol{B}$: $n \times q$) | Kronecker product reordering | 13.7 |
| $\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A})$ ($\boldsymbol{A}$: $n \times n$ symmetric) | Duplication matrix | 13.8 |
| $\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A}) = \text{vech}(\boldsymbol{A})$ ($\boldsymbol{A}$: $n \times n$ symmetric) | Elimination matrix | 13.9 |
| $\boldsymbol{L}_n \boldsymbol{D}_n = \boldsymbol{I}_{n(n+1)/2}$ | Elimination-duplication relation | 13.10 |
| $\displaystyle\frac{\partial\,\text{vec}(\boldsymbol{X})}{\partial\,\text{vec}(\boldsymbol{X})^\top} = \boldsymbol{I}_{mn}$ ($\boldsymbol{X}$: $m \times n$) | Vectorization derivative | 13.11 |
Here $\text{vec}(\boldsymbol{A})$ stacks the columns of $\boldsymbol{A}$ into a single vector, $\text{vech}(\boldsymbol{A})$ vectorizes the lower triangular part (including diagonal) of a symmetric matrix, and $\otimes$ denotes the Kronecker product.
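The vectorization identity (13.5) is straightforward to confirm numerically (a sketch assuming NumPy; note that the column-stacking vec corresponds to `ravel(order='F')`):

```python
import numpy as np

rng = np.random.default_rng(12)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.ravel(order='F')        # stack columns (column-major vec)

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)            # formula 13.5
print(np.allclose(lhs, rhs))
```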
13.2 Cholesky Decomposition Gradient
Gradient of the Cholesky decomposition $\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$ ($\boldsymbol{L}$ lower triangular) for positive definite $\boldsymbol{A}$. Important for Gaussian processes and covariance matrix computations.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial f}{\partial \boldsymbol{A}} = \boldsymbol{L}^{-\top}\,\text{tril}(\boldsymbol{L}^\top \bar{\boldsymbol{L}})\,\boldsymbol{L}^{-1}$ ($\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$, $f$: scalar loss) | $\bar{\boldsymbol{L}} = \partial f/\partial \boldsymbol{L}$: upstream gradient, $\text{tril}$: lower triangular part | 13.12 |
| $\displaystyle\frac{\partial \log|\boldsymbol{A}|}{\partial \boldsymbol{A}} = \boldsymbol{A}^{-\top}$ (computed via Cholesky) | $\log|\boldsymbol{A}| = 2\sum_i \log L_{ii}$ | 13.13 |
14. Matrix Chain Rule
When a matrix $\boldsymbol{U} = f(\boldsymbol{X})$ is a function of matrix $\boldsymbol{X}$, and there is a scalar function $g(\boldsymbol{U})$, these are the composite function derivative formulas. See Proof Collection, Chapter 14 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial g(\boldsymbol{U})}{\partial X_{ij}} = \displaystyle\sum_{k,l} \displaystyle\frac{\partial g}{\partial U_{kl}} \displaystyle\frac{\partial U_{kl}}{\partial X_{ij}}$ ($\boldsymbol{U} = f(\boldsymbol{X})$) | Matrix chain rule (component form) | 14.1 |
| $\displaystyle\frac{\partial g(\boldsymbol{U})}{\partial X_{ij}} = \displaystyle\text{tr}\left[\left(\displaystyle\frac{\partial g}{\partial \boldsymbol{U}}\right)^\top \displaystyle\frac{\partial \boldsymbol{U}}{\partial X_{ij}}\right]$ ($\boldsymbol{U} = f(\boldsymbol{X})$) | Matrix chain rule (trace form) | 14.2 |
15. Special Matrix Derivatives
Specific derivative formulas for symmetric, diagonal, and Toeplitz matrices. See Proof Collection, Chapter 15 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial \text{tr}(\boldsymbol{A}\boldsymbol{X})}{\partial \boldsymbol{X}} = \boldsymbol{A} + \boldsymbol{A}^\top - (\boldsymbol{A} \circ \boldsymbol{I})$ ($\boldsymbol{X}$: symmetric) | Symmetric matrix trace derivative | 15.1 |
| $\displaystyle\frac{\partial |\boldsymbol{X}|}{\partial \boldsymbol{X}} = |\boldsymbol{X}|(2\boldsymbol{X}^{-1} - (\boldsymbol{X}^{-1} \circ \boldsymbol{I}))$ ($\boldsymbol{X}$: symmetric) | Symmetric matrix determinant derivative | 15.2 |
| $\displaystyle\frac{\partial \log|\boldsymbol{X}|}{\partial \boldsymbol{X}} = 2\boldsymbol{X}^{-1} - (\boldsymbol{X}^{-1} \circ \boldsymbol{I})$ ($\boldsymbol{X}$: symmetric) | Symmetric matrix log-det derivative | 15.3 |
| $\displaystyle\frac{\partial \text{tr}(\boldsymbol{A}\boldsymbol{X})}{\partial \boldsymbol{X}} = \boldsymbol{A} \circ \boldsymbol{I}$ ($\boldsymbol{X}$: diagonal) | Diagonal matrix trace derivative | 15.4 |
| $\displaystyle\frac{\partial \text{tr}(\boldsymbol{A}\boldsymbol{T})}{\partial \boldsymbol{T}} = \boldsymbol{\alpha}(\boldsymbol{A})$ ($\boldsymbol{T}$: Toeplitz) | Toeplitz matrix trace derivative | 15.5 |
| $\displaystyle\frac{\partial c(\boldsymbol{A})}{\partial \boldsymbol{A}} = \displaystyle\frac{1}{\lambda_{\min}}\boldsymbol{v}_{\max}\boldsymbol{v}_{\max}^\top - \displaystyle\frac{c(\boldsymbol{A})}{\lambda_{\min}}\boldsymbol{v}_{\min}\boldsymbol{v}_{\min}^\top$ ($\boldsymbol{A}$: symmetric positive definite) | Condition number derivative | 15.6 |
Here $\boldsymbol{A} \circ \boldsymbol{I}$ is the Hadamard product that retains only the diagonal elements, $\boldsymbol{\alpha}(\boldsymbol{A})$ is the Toeplitz-structured matrix whose entries are the sums along the corresponding diagonals of $\boldsymbol{A}^\top$ (entries on the same diagonal of $\boldsymbol{T}$ are a single parameter), and $c(\boldsymbol{A}) = \lambda_{\max}/\lambda_{\min}$ is the condition number.
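Formula 15.3 can be checked by finite differences, provided the off-diagonal entries $X_{ij}$ and $X_{ji}$ are perturbed together, since for a symmetric matrix they are a single variable. A minimal numpy sketch (the random positive definite $\boldsymbol{X}$ is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((4, 4))
X = M @ M.T + 4 * np.eye(4)                  # symmetric positive definite

logdet = lambda S: np.linalg.slogdet(S)[1]
Xinv = np.linalg.inv(X)
formula = 2 * Xinv - Xinv * np.eye(4)        # 15.3: 2 X^{-1} - (X^{-1} o I)

eps = 1e-6
G = np.zeros_like(X)
for i in range(4):
    for j in range(4):
        dX = np.zeros_like(X)
        dX[i, j] += eps                      # X_ij and X_ji are the same variable,
        if i != j:
            dX[j, i] += eps                  # so both entries move together
        G[i, j] = (logdet(X + dX) - logdet(X - dX)) / (2 * eps)

assert np.allclose(G, formula, atol=1e-5)
```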
16. Complex Matrix Derivatives
Wirtinger derivatives for functions involving complex conjugates, and derivative formulas for complex traces. See Proof Collection, Chapter 16 for proofs.
| Formula | Notes | Proof |
|---|---|---|
| $\displaystyle\frac{\partial f}{\partial z} = \displaystyle\frac{1}{2}\left(\displaystyle\frac{\partial f}{\partial \Re z} - i\displaystyle\frac{\partial f}{\partial \Im z}\right)$, $\quad\displaystyle\frac{\partial f}{\partial z^*} = \displaystyle\frac{1}{2}\left(\displaystyle\frac{\partial f}{\partial \Re z} + i\displaystyle\frac{\partial f}{\partial \Im z}\right)$ | Wirtinger derivatives | 16.1 |
| $\nabla f(\boldsymbol{z}) = \displaystyle 2\displaystyle\frac{\partial f(\boldsymbol{z})}{\partial \boldsymbol{z}^*}$ ($f$: real-valued) | Complex gradient vector | 16.2 |
| $\displaystyle\frac{\partial g}{\partial z} = \displaystyle\frac{\partial g}{\partial f}\displaystyle\frac{\partial f}{\partial z} + \displaystyle\frac{\partial g}{\partial f^*}\displaystyle\frac{\partial f^*}{\partial z}$ (composite function) | Complex chain rule | 16.3 |
| $\displaystyle\frac{\partial \text{Tr}(\boldsymbol{X}^*)}{\partial \Re\boldsymbol{X}} = \boldsymbol{I}$ | Complex conjugate trace derivative | 16.4 |
| $\displaystyle\frac{\partial \text{Tr}(\boldsymbol{A}\boldsymbol{X}^H)}{\partial \Re\boldsymbol{X}} = \boldsymbol{A}$ | Hermitian trace derivative | 16.6 |
| $\displaystyle\frac{\partial \text{Tr}(\boldsymbol{X}\boldsymbol{X}^H)}{\partial \Re\boldsymbol{X}} = 2\Re\boldsymbol{X}$ | Frobenius norm derivative | 16.8 |
| $\displaystyle\frac{\partial \text{Tr}(\boldsymbol{X}\boldsymbol{X}^H)}{\partial \boldsymbol{X}} = \boldsymbol{X}^*$ | Wirtinger derivative | 16.9 |
| $\nabla\|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X}$ | Complex Frobenius norm gradient | 16.10 |
| $\displaystyle\frac{\partial \det(\boldsymbol{X}^H\boldsymbol{A}\boldsymbol{X})}{\partial \boldsymbol{X}^*} = \det(\boldsymbol{X}^H\boldsymbol{A}\boldsymbol{X})\boldsymbol{A}\boldsymbol{X}(\boldsymbol{X}^H\boldsymbol{A}\boldsymbol{X})^{-1}$ | Complex determinant derivative | 16.11 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}}\displaystyle\frac{(\boldsymbol{A}\boldsymbol{x})^H(\boldsymbol{A}\boldsymbol{x})}{(\boldsymbol{B}\boldsymbol{x})^H(\boldsymbol{B}\boldsymbol{x})}$ (complex Rayleigh quotient) | Complex Rayleigh quotient derivative | 16.12 |
| $\displaystyle\frac{\partial (a - \boldsymbol{x}^H \boldsymbol{b})^2}{\partial \boldsymbol{x}} = -2\bar{\boldsymbol{b}}(a - \boldsymbol{x}^H \boldsymbol{b})^*$ | Complex quadratic form derivative | 16.13 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{z}^*}(\boldsymbol{w}^H\boldsymbol{z}) = \boldsymbol{0}$, $\quad\displaystyle\frac{\partial}{\partial \boldsymbol{z}^*}(\boldsymbol{z}^H\boldsymbol{w}) = \boldsymbol{w}$ | Inner product Wirtinger derivative | 16.57 |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{z}^*}(\boldsymbol{z}^H\boldsymbol{A}\boldsymbol{z}) = \boldsymbol{A}\boldsymbol{z}$ ($\boldsymbol{A}$: Hermitian) | Hermitian quadratic form Wirtinger derivative | 16.57 |
Here $\boldsymbol{X}^H = (\boldsymbol{X}^*)^\top$ is the Hermitian transpose, $\boldsymbol{X}^*$ is the element-wise complex conjugate, and $\bar{\boldsymbol{b}}$ is the complex conjugate of $\boldsymbol{b}$.
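Combining 16.2 with the Hermitian quadratic-form row gives $\nabla(\boldsymbol{z}^H\boldsymbol{A}\boldsymbol{z}) = 2\boldsymbol{A}\boldsymbol{z}$ for Hermitian $\boldsymbol{A}$. A minimal numpy sketch comparing this against finite differences over the real and imaginary parts (the random Hermitian $\boldsymbol{A}$ is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 3
B = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
A = (B + B.conj().T) / 2                     # Hermitian, so f below is real-valued
z = rng.standard_normal(n) + 1j * rng.standard_normal(n)

f = lambda z: (z.conj() @ A @ z).real        # f(z) = z^H A z

# Real gradient packed as a complex vector: df/dRe(z_k) + i df/dIm(z_k) = 2 df/dz_k^*
eps = 1e-6
grad = np.zeros(n, dtype=complex)
for k in range(n):
    e = np.zeros(n); e[k] = 1.0
    d_re = (f(z + eps * e) - f(z - eps * e)) / (2 * eps)
    d_im = (f(z + 1j * eps * e) - f(z - 1j * eps * e)) / (2 * eps)
    grad[k] = d_re + 1j * d_im

# 16.2 with df/dz* = A z: the gradient should equal 2 A z
assert np.allclose(grad, 2 * A @ z, atol=1e-5)
```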
Proofs
Detailed proofs for all formulas can be found in the Matrix Calculus Proof Collection.
Appendix A. Correspondence with Numerator Layout
A.1 Shape of the Gradient Vector
For the gradient of a scalar $f$ with respect to a vector $\boldsymbol{x} \in \mathbb{R}^n$:
- Numerator layout: $\nabla f = \displaystyle\frac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^{1 \times n}$ (row vector)
- Denominator layout (this document): $\nabla f = \displaystyle\frac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^{n \times 1}$ (column vector)
In optimization, when "moving in the gradient direction," the denominator layout allows directly adding $-\nabla f$, while the numerator layout requires the transpose $(\nabla f)^T$.
A.2 Jacobian Matrix Definition
For the Jacobian of a vector-valued function $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$:
- Numerator layout: $\boldsymbol{J} = \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \in \mathbb{R}^{m \times n}$ ($(i,j)$ entry is $\displaystyle\frac{\partial f_i}{\partial x_j}$)
- Denominator layout (this document): $\boldsymbol{J} = \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \in \mathbb{R}^{n \times m}$ ($(i,j)$ entry is $\displaystyle\frac{\partial f_j}{\partial x_i}$)
The two are related by transposition: $\boldsymbol{J}_{\text{denom}} = \boldsymbol{J}_{\text{numer}}^T$
A.3 Chain Rule Form
For the derivative of a composite function $\boldsymbol{g}(\boldsymbol{f}(\boldsymbol{x}))$:
- Numerator layout: $\displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{f}} \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}$ (multiply from left)
- Denominator layout (this document): $\displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{f}}$ (multiply from right)
When implementing neural network backpropagation, verify which convention is being used and set the matrix product order correctly.
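A small numpy sketch of A.2 and A.3 (the linear map $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{W}\boldsymbol{x}$ and $g(\boldsymbol{u}) = \|\boldsymbol{u}\|^2$ are illustrative assumptions): the two conventions produce the same numbers with transposed shapes and opposite multiplication order.

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((4, 3))
x = rng.standard_normal(3)
u = W @ x                                     # f(x) = W x, g(u) = ||u||^2

J_numer = W                                   # numerator layout:  (i,j) = df_i/dx_j
J_denom = W.T                                 # denominator layout: (i,j) = df_j/dx_i  (= J_numer^T)

u_bar_row = (2 * u).reshape(1, -1)            # dg/du as a row vector (numerator layout)
u_bar_col = (2 * u).reshape(-1, 1)            # dg/du as a column vector (denominator layout)

g_numer = u_bar_row @ J_numer                 # A.3: multiply dg/df from the left
g_denom = J_denom @ u_bar_col                 # A.3: multiply dg/df from the right
assert np.allclose(g_numer.T, g_denom)        # the two conventions agree up to transposition
assert np.allclose(g_denom.ravel(), 2 * W.T @ W @ x)
```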
A.4 Key Formula Correspondence Table
| Formula | Denominator Layout (this document) | Numerator Layout |
|---|---|---|
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{a}^T \boldsymbol{x})$ | $\boldsymbol{a}$ | $\boldsymbol{a}^T$ |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})$ | $(\boldsymbol{A} + \boldsymbol{A}^T)\boldsymbol{x}$ | $\boldsymbol{x}^T(\boldsymbol{A} + \boldsymbol{A}^T)$ |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}}(\boldsymbol{a}^T \boldsymbol{X} \boldsymbol{b})$ | $\boldsymbol{a}\boldsymbol{b}^T$ | $\boldsymbol{b}\boldsymbol{a}^T$ |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}}\mathrm{tr}(\boldsymbol{A}\boldsymbol{X})$ | $\boldsymbol{A}^T$ | $\boldsymbol{A}$ |
| $\displaystyle\frac{\partial}{\partial \boldsymbol{X}}\log|\boldsymbol{X}|$ | $(\boldsymbol{X}^{-1})^T = \boldsymbol{X}^{-T}$ | $\boldsymbol{X}^{-1}$ |
When cross-referencing with other literature, first check whether the gradient vector is a row or column vector, then apply the formulas accordingly.
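As a quick finite-difference check of two rows of the table in the denominator layout used here (a minimal numpy sketch; the random test matrices are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 3))
a, b = rng.standard_normal(3), rng.standard_normal(3)

eps = 1e-6
def num_grad(f, X):
    # Denominator-layout gradient: (i,j) entry is df/dX_ij
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            dX = np.zeros_like(X); dX[i, j] = eps
            G[i, j] = (f(X + dX) - f(X - dX)) / (2 * eps)
    return G

assert np.allclose(num_grad(lambda X: np.trace(A @ X), X), A.T, atol=1e-6)      # tr(AX) row
assert np.allclose(num_grad(lambda X: a @ X @ b, X), np.outer(a, b), atol=1e-6) # a^T X b row
```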
Applied Formulas
Applications of this formula sheet to various fields are summarized below. See the Proof Collection for detailed proofs.
Machine Learning & Information Science
- Machine Learning Applications: neural networks, deep learning, reinforcement learning, NLP
- Financial Engineering Applications: portfolio optimization, Sharpe ratio, bordered Hessian

Natural Sciences & Engineering
- Statistics Applications: mixture models, BLUP/REML, kriging, factor analysis, SEM, IRT
- Engineering Applications: control theory, robotics, FEM, mechanics of materials, and 8 other fields
- Astronomy Applications: orbital mechanics, two-body problem, perturbation theory, aberration, redshift
- Geophysics Applications: seismic tomography, travel-time partial derivatives, sensitivity kernels
- Biology Applications: Lotka-Volterra, SIR model, Wright-Fisher, phylogenetics
- Pose & Rotation Applications: SO(3), quaternions, Euler angles, inertia tensor
- Molecular Dynamics Applications: Lennard-Jones, harmonic oscillator, Coulomb, bond angles
References & Related Articles
Related Articles
- Matrix Calculus Proof Collection (detailed proofs for this formula sheet)
- Introduction to Matrix Calculus (basic concepts and field-specific notation)
- Introduction to Tensor Calculus (generalization of matrix calculus)
- Automatic Differentiation and Optimization (practical applications)
Key References
- Magnus, J. R. & Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd ed. Wiley. — Standard textbook on matrix calculus. Uses denominator layout.
- Petersen, K. B. & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark. — Widely referenced formula collection.
- Absil, P.-A., Mahony, R. & Sepulchre, R. (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press. — Optimization on manifolds and matrix calculus.
- Edelman, A., Arias, T. A. & Smith, S. T. (1998). The Geometry of Algorithms with Orthogonality Constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353. — Geometric foundations for optimization with orthogonality constraints.
- Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: A Survey. J. Mach. Learn. Res. 18(153), 1–43. — Comprehensive survey on automatic differentiation.
- Matrix calculus - Wikipedia
Notes
- Formal definition of the Fréchet derivative: a mapping $f: \mathbb{R}^n \to \mathbb{R}^m$ is Fréchet differentiable at $\boldsymbol{x}$ if there exists a linear mapping $Df(\boldsymbol{x}): \mathbb{R}^n \to \mathbb{R}^m$ such that $\displaystyle\lim_{\|\boldsymbol{h}\| \to 0} \frac{\|f(\boldsymbol{x}+\boldsymbol{h}) - f(\boldsymbol{x}) - Df(\boldsymbol{x})[\boldsymbol{h}]\|}{\|\boldsymbol{h}\|} = 0$.