Vector/Matrix Calculus Formula Sheet

Matrix Calculus Formulas

This document is a collection of formulas for multivariable function derivatives (vector calculus, matrix calculus) used in machine learning, statistics, optimization theory, control engineering, signal processing, and econometrics.

1. Overview

Why matrix calculus?
While multivariate analysis and tensor calculus can also handle derivatives, matrix calculus offers the following advantages as an independent framework:
  • Index-free notation (index-free calculus)
  • Algebraic manipulation via the vec operator and Kronecker product
  • Closed-form gradient derivation for matrix functions in machine learning and statistics
It enables more practical computation than coordinate-free tensor calculus and is widely used in applied fields.
Mathematical context
Matrix calculus lies at the intersection of analysis (multivariate differentiation), linear algebra (matrix operations), and tensor analysis (multilinear algebra). In finite-dimensional Euclidean spaces, the Gâteaux derivative and Fréchet derivative coincide, and the derivatives in this formula sheet are consistent with the Fréchet derivative[1].
Approach of this formula sheet
There are two main approaches to matrix calculus. One is the coordinate-free approach that treats derivatives tensorially without coordinates, and the other is the component-wise approach that explicitly writes out components. This formula sheet takes the latter approach, providing concrete component expressions for practical computation. From the coordinate-free perspective, matrix calculus is a special case of tensor calculus (differentiation with respect to second-order tensors). In that setting, gradients and Hessians are treated as covariant tensors. For the coordinate-free formulation, see Introduction to Tensor Calculus.
Inner product and metric assumptions
Throughout this series, the standard Euclidean inner product is assumed unless otherwise stated. This metric identifies the tangent and cotangent spaces, allowing the gradient to be treated as a vector. For the matrix space $\mathbb{R}^{m \times n}$, the Frobenius inner product $\langle \boldsymbol{A}, \boldsymbol{B} \rangle_F = \mathrm{tr}(\boldsymbol{A}^\top \boldsymbol{B})$ identifies the dual space $(\mathbb{R}^{m \times n})^*$ with $\mathbb{R}^{m \times n}$.
Scope
This series is restricted to functions on finite-dimensional Euclidean spaces $\mathbb{R}^n$. Extensions to Banach and Hilbert spaces are discussed in the Proof Collection, Chapter 1.
Definition (Matrix Derivative)
A mapping $f: \mathbb{R}^{m \times n} \to \mathbb{R}$ is differentiable at $\boldsymbol{X}$ if there exists a linear mapping $Df(\boldsymbol{X}): \mathbb{R}^{m \times n} \to \mathbb{R}$ such that $$\lim_{\|\boldsymbol{H}\|_F \to 0} \frac{|f(\boldsymbol{X}+\boldsymbol{H}) - f(\boldsymbol{X}) - Df(\boldsymbol{X})[\boldsymbol{H}]|}{\|\boldsymbol{H}\|_F} = 0$$ The matrix $\nabla f(\boldsymbol{X}) \in \mathbb{R}^{m \times n}$ satisfying $Df(\boldsymbol{X})[\boldsymbol{H}] = \langle \nabla f(\boldsymbol{X}), \boldsymbol{H} \rangle_F$ via the Frobenius inner product is called the gradient of $f$.
Figure 1. Relationship between differential, gradient, and Jacobian. The differential $Df(\boldsymbol{X}) \in T^*_{\boldsymbol{X}} M$ (cotangent) is mapped to the gradient $\nabla f(\boldsymbol{X}) \in T_{\boldsymbol{X}} M$ (tangent, metric-dependent) by the Riesz isomorphism induced by the Riemannian metric $\langle \cdot,\, \cdot \rangle_g$ (Euclidean case: Frobenius inner product), so that $Df(\boldsymbol{X})[\boldsymbol{H}] = \langle \nabla f(\boldsymbol{X}),\, \boldsymbol{H} \rangle_g$. For vector-valued $\boldsymbol{f}\colon \mathbb{R}^n \to \mathbb{R}^m$, $Df \equiv \boldsymbol{J}_f \in \mathbb{R}^{m \times n}$ (Jacobian).

The differential $Df(\boldsymbol{X})$ is a coordinate-independent linear functional (an element of the cotangent space $T_{\boldsymbol{X}}^* M$), and via the Riesz representation theorem it corresponds to the gradient $\nabla f(\boldsymbol{X})$ (an element of the tangent space $T_{\boldsymbol{X}} M$) through the inner product. The concrete form of the gradient depends on the choice of metric (inner product).
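The defining identity $Df(\boldsymbol{X})[\boldsymbol{H}] = \langle \nabla f(\boldsymbol{X}), \boldsymbol{H} \rangle_F$ can be checked numerically. Below is a minimal NumPy sketch (NumPy itself is an assumption, as in all code examples in this sheet) for $f(\boldsymbol{X}) = \frac{1}{2}\mathrm{tr}(\boldsymbol{X}^\top\boldsymbol{X})$, whose gradient is $\boldsymbol{X}$; the differential is approximated by a central difference in an arbitrary direction $\boldsymbol{H}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(X) = tr(X^T X) / 2; its gradient under the Frobenius inner product is X itself.
f = lambda X: 0.5 * np.trace(X.T @ X)
X = rng.standard_normal((3, 4))
H = rng.standard_normal((3, 4))        # arbitrary perturbation direction

eps = 1e-6
Df_H = (f(X + eps * H) - f(X - eps * H)) / (2 * eps)   # Gateaux derivative Df(X)[H]
inner = np.trace(X.T @ H)                              # <grad f(X), H>_F with grad f(X) = X

print(np.allclose(Df_H, inner))  # True
```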

1.1 Notation and Definitions

When expressing derivatives of multivariable functions as matrices or vectors, there are two conventions: "denominator layout" and "numerator layout." This document adopts the denominator layout. For the differences between the two and field-specific conventions, see the Matrix Calculus Notation Guide.

In the denominator layout, the derivative result is defined as a matrix where "the dimension of the variable in the denominator corresponds to rows, and the dimension of the variable in the numerator corresponds to columns." This convention is widely used in machine learning, statistics, optimization theory, and econometrics, and has advantages such as the gradient vector naturally being a column vector.

International status of the denominator layout
The denominator layout is mainstream in the statistics, machine learning, and econometrics literature, where the gradient conveniently has the same shape as the variable. On the other hand, the standard textbook by Magnus & Neudecker, some engineering texts, and machine learning frameworks when reporting full Jacobians (e.g., PyTorch's autograd) use the numerator layout. The two are related by transposition, and the formulas in this document can be converted to numerator layout by transposing. See Appendix A for details.
Symbol conventions
  • Scalars: $a, b, c, \ldots$ or $x, y, z, u, v, w$ (lowercase italic)
  • Vectors: $\boldsymbol{a}, \boldsymbol{b}, \boldsymbol{c}, \ldots$ or $\boldsymbol{x}, \boldsymbol{y}, \boldsymbol{z}$ (lowercase bold)
  • Matrices: $\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}, \ldots$ or $\boldsymbol{X}, \boldsymbol{Y}, \boldsymbol{Z}$ (uppercase bold)
  • Logarithm: $\log$ denotes the natural logarithm (base $e$); base-$a$ logarithm is written as $\log_a$
  • Single-entry matrix: $\boldsymbol{J}^{ij}$ is the matrix with 1 at position $(i,j)$ and 0 elsewhere
  • Gradient: $\nabla f$ or $\displaystyle\frac{\partial f}{\partial \boldsymbol{x}}$ (derivative of scalar $f$ w.r.t. vector $\boldsymbol{x}$, a column vector)
  • Jacobian: $\displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}$ (derivative of vector $\boldsymbol{y}$ w.r.t. vector $\boldsymbol{x}$, a matrix)
  • Hessian: $\displaystyle\frac{\partial^2 f}{\partial \boldsymbol{x} \partial \boldsymbol{x}^\top}$ or $\boldsymbol{H}$ (second derivative of scalar $f$, a symmetric matrix)
  • Indexing: 0-based ($x_0, x_1, \ldots, x_{N-1}$ where $i = 0, \ldots, N-1$)

1.1.1 Derivative of Scalar by Vector

The derivative of a scalar $y$ with respect to an $N$-dimensional vector $\boldsymbol{x}$ is a column vector.

\begin{eqnarray} \displaystyle\frac{\partial y}{\partial \boldsymbol{x}} &\triangleq& \boldsymbol{\nabla} y = \left( \begin{array}{c} \displaystyle\frac{\partial}{\partial x_0} \\ \displaystyle\frac{\partial}{\partial x_1} \\ \displaystyle\frac{\partial}{\partial x_2} \\ \vdots \\ \displaystyle\frac{\partial}{\partial x_{N-1}} \end{array} \right) y = \left( \begin{array}{c} \displaystyle\frac{\partial y}{\partial x_0} \\ \displaystyle\frac{\partial y}{\partial x_1} \\ \displaystyle\frac{\partial y}{\partial x_2} \\ \vdots \\ \displaystyle\frac{\partial y}{\partial x_{N-1}} \end{array} \right) \end{eqnarray}
Shape of the gradient vector
In the denominator layout, the derivative of a scalar-valued function with respect to a vector is defined as a column vector. This is so that the gradient vector can be directly used as the update direction in optimization.

1.1.2 Derivative of Vector by Scalar

In the denominator layout, the derivative of a vector $\boldsymbol{y}$ (column vector) with respect to a scalar $x$ is a row vector. This follows the rule "the numerator's index determines the columns, the denominator's index determines the rows." The numerator $\boldsymbol{y}$'s index $j$ is arranged in the column direction, and since the denominator $x$ has no index, the result is a $1 \times M$ row vector.

\begin{eqnarray} \displaystyle\frac{\partial \boldsymbol{y}}{\partial x} &\triangleq& \left( \begin{array}{ccccc} \displaystyle\frac{\partial y_0}{\partial x} & \displaystyle\frac{\partial y_1}{\partial x} & \displaystyle\frac{\partial y_2}{\partial x} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x} \end{array} \right) \end{eqnarray}

Here $\boldsymbol{y} = (y_0, y_1, \ldots, y_{M-1})^\top$ is an $M$-dimensional column vector. Following the denominator layout rule "denominator dimension × numerator dimension," $\partial \boldsymbol{y}/\partial x \in \mathbb{R}^{1 \times M}$ ($x$'s dimension 1 × $\boldsymbol{y}$'s dimension $M$).

1.1.3 Derivative of Vector by Vector

\begin{eqnarray} \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} &\triangleq& \boldsymbol{\nabla} \boldsymbol{y}^\top \\ &=& \left( \begin{array}{c} \displaystyle\frac{\partial}{\partial x_0} \\ \displaystyle\frac{\partial}{\partial x_1} \\ \displaystyle\frac{\partial}{\partial x_2} \\ \vdots \\ \displaystyle\frac{\partial}{\partial x_{N-1}} \end{array} \right) \left( \begin{array}{ccccc} y_0 &y_1 &y_2 &\cdots &y_{M-1} \end{array} \right) \\ &=& \left( \begin{array}{ccccc} \displaystyle\frac{\partial y_0}{\partial x_0} & \displaystyle\frac{\partial y_1}{\partial x_0} & \displaystyle\frac{\partial y_2}{\partial x_0} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_0}\\ \displaystyle\frac{\partial y_0}{\partial x_1} & \displaystyle\frac{\partial y_1}{\partial x_1} & \displaystyle\frac{\partial y_2}{\partial x_1} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_1}\\ \displaystyle\frac{\partial y_0}{\partial x_2} & \displaystyle\frac{\partial y_1}{\partial x_2} & \displaystyle\frac{\partial y_2}{\partial x_2} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_2}\\ \vdots \\ \displaystyle\frac{\partial y_0}{\partial x_{N-1}} & \displaystyle\frac{\partial y_1}{\partial x_{N-1}} & \displaystyle\frac{\partial y_2}{\partial x_{N-1}} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_{N-1}} \end{array} \right) \label{dvfdvx} \end{eqnarray}

This is an $N \times M$ matrix, and its $(i, j)$ entry is $\displaystyle\frac{\partial y_j}{\partial x_i}$.
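As a concrete check of this layout, the sketch below builds the $N \times M$ denominator-layout Jacobian by central finite differences for a small hypothetical map $\boldsymbol{y}(\boldsymbol{x})$ (chosen only for illustration) and compares it with the analytic entries $\partial y_j / \partial x_i$.

```python
import numpy as np

# y(x) = (x0*x1, x0 + x1, x1^2): R^2 -> R^3, so the denominator-layout
# Jacobian dy/dx is 2 x 3 with (i, j) entry dy_j / dx_i.
def y(x):
    return np.array([x[0] * x[1], x[0] + x[1], x[1] ** 2])

x = np.array([1.5, -0.5])
N, M = x.size, y(x).size
J = np.zeros((N, M))                  # denominator layout: N x M
eps = 1e-6
for i in range(N):
    e = np.zeros(N); e[i] = eps
    J[i] = (y(x + e) - y(x - e)) / (2 * eps)   # row i holds dy_j/dx_i for all j

# Analytic Jacobian in the same layout
J_exact = np.array([[x[1], 1.0, 0.0],
                    [x[0], 1.0, 2 * x[1]]])
print(np.allclose(J, J_exact))  # True
```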

1.1.4 Derivative of Scalar by Matrix

The derivative of a scalar function $f(\boldsymbol{X})$ with respect to an $m \times n$ matrix $\boldsymbol{X}$ is an $m \times n$ matrix whose $(i,j)$ entry is $\displaystyle\frac{\partial f}{\partial X_{ij}}$.

\begin{eqnarray} \displaystyle\frac{\partial f}{\partial \boldsymbol{X}} &\triangleq& \left( \begin{array}{cccc} \displaystyle\frac{\partial f}{\partial X_{00}} & \displaystyle\frac{\partial f}{\partial X_{01}} & \cdots & \displaystyle\frac{\partial f}{\partial X_{0,n-1}} \\ \displaystyle\frac{\partial f}{\partial X_{10}} & \displaystyle\frac{\partial f}{\partial X_{11}} & \cdots & \displaystyle\frac{\partial f}{\partial X_{1,n-1}} \\ \vdots & \vdots & \ddots & \vdots \\ \displaystyle\frac{\partial f}{\partial X_{m-1,0}} & \displaystyle\frac{\partial f}{\partial X_{m-1,1}} & \cdots & \displaystyle\frac{\partial f}{\partial X_{m-1,n-1}} \end{array} \right) \end{eqnarray}

By this definition, the gradient matrix $\displaystyle\frac{\partial f}{\partial \boldsymbol{X}}$ has the same size as the original matrix $\boldsymbol{X}$. This is convenient for optimization algorithms like gradient descent, where $\boldsymbol{X} \leftarrow \boldsymbol{X} - \alpha \displaystyle\frac{\partial f}{\partial \boldsymbol{X}}$ can be written naturally.

1.2 Jacobian Matrix and Chain Rule

Consider the derivative of an $M$-dimensional vector-valued function $\boldsymbol{y}(\boldsymbol{x})$ with respect to an $N$-dimensional vector $\boldsymbol{x}$.

\begin{eqnarray} \boldsymbol{y} = \begin{pmatrix} y_0 \\ y_1 \\ \vdots \\ y_{M-1} \end{pmatrix}, \quad \boldsymbol{x} = \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} \end{eqnarray}

1.2.1 Definition of the Jacobian

In the denominator layout, the Jacobian is an $N \times M$ matrix:

\begin{eqnarray} \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} = \boldsymbol{J}^\top = \begin{pmatrix} \displaystyle\frac{\partial y_0}{\partial x_0} & \displaystyle\frac{\partial y_1}{\partial x_0} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_0} \\[1em] \displaystyle\frac{\partial y_0}{\partial x_1} & \displaystyle\frac{\partial y_1}{\partial x_1} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_1} \\[1em] \vdots & \vdots & \ddots & \vdots \\[0.5em] \displaystyle\frac{\partial y_0}{\partial x_{N-1}} & \displaystyle\frac{\partial y_1}{\partial x_{N-1}} & \cdots & \displaystyle\frac{\partial y_{M-1}}{\partial x_{N-1}} \end{pmatrix} \end{eqnarray}

That is, the $(i, j)$ entry is $\displaystyle\frac{\partial y_j}{\partial x_i}$. This means the "denominator" $\boldsymbol{x}$'s index determines the row, and the "numerator" $\boldsymbol{y}$'s index determines the column.

Size of the Jacobian
In the denominator layout, the Jacobian of a vector-valued function $\boldsymbol{y}: \mathbb{R}^N \to \mathbb{R}^M$ is $$\displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \in \mathbb{R}^{N \times M}$$ ("denominator dimension × numerator dimension"). This is the transpose of the Jacobian $\in \mathbb{R}^{M \times N}$ defined in the numerator layout.

1.2.2 Relation to Scalar Derivatives

When $\boldsymbol{y}$ is 1-dimensional (scalar $y$), the Jacobian reduces to an $N \times 1$ column vector, which equals the gradient:

\begin{eqnarray} \displaystyle\frac{\partial y}{\partial \boldsymbol{x}} = \nabla y = \left( \begin{array}{c} \displaystyle\frac{\partial y}{\partial x_0} \\ \displaystyle\frac{\partial y}{\partial x_1} \\ \vdots \\ \displaystyle\frac{\partial y}{\partial x_{N-1}} \end{array} \right) \end{eqnarray}

1.2.3 Chain Rule

Consider the derivative of a composite function $\boldsymbol{z}(\boldsymbol{y}(\boldsymbol{x}))$. Here $\boldsymbol{x}$ is $N$-dimensional, $\boldsymbol{y}$ is $M$-dimensional, and $\boldsymbol{z}$ is $L$-dimensional.

1.2.3.1 Vector Chain Rule

Differentiating the $l$-th component $z_l$ of $\boldsymbol{z}$ with respect to the $i$-th component $x_i$ of $\boldsymbol{x}$, the usual multivariate chain rule gives:

\begin{eqnarray} \displaystyle\frac{\partial z_l}{\partial x_i} &=& \displaystyle\sum_{m=0}^{M-1} \displaystyle\frac{\partial z_l}{\partial y_m} \displaystyle\frac{\partial y_m}{\partial x_i} \end{eqnarray}

In the denominator layout, $\displaystyle\left(\displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{il} = \displaystyle\frac{\partial z_l}{\partial x_i}$, so:

\begin{eqnarray} \left(\displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}}\right)_{il} &=& \displaystyle\sum_{m=0}^{M-1} \displaystyle\frac{\partial y_m}{\partial x_i} \displaystyle\frac{\partial z_l}{\partial y_m} \\ &=& \displaystyle\sum_{m=0}^{M-1} \left(\displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}}\right)_{im} \left(\displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}}\right)_{ml} \end{eqnarray}

This is precisely the definition of matrix multiplication:

\begin{eqnarray} \displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{x}} &=& \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \displaystyle\frac{\partial \boldsymbol{z}}{\partial \boldsymbol{y}} \end{eqnarray}

The dimensions are $(N \times M) \cdot (M \times L) = N \times L$, as expected.

Order of multiplication in the chain rule
In the denominator layout, the rows and columns of the Jacobian correspond to the denominator and numerator dimensions, respectively. Therefore, the order of matrix multiplication in the chain rule is uniquely determined by dimensional consistency. Note that reversing the order makes the product undefined.
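For linear maps the multiplication order is easy to verify numerically. The sketch below (dimensions are illustrative) chains $\boldsymbol{y} = \boldsymbol{A}\boldsymbol{x}$ and $\boldsymbol{z} = \boldsymbol{B}\boldsymbol{y}$, so in the denominator layout $\partial\boldsymbol{z}/\partial\boldsymbol{x} = (\boldsymbol{B}\boldsymbol{A})^\top$; reversing the factors would be a $(M \times L)(N \times M)$ product, which is undefined here.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, L = 4, 3, 2
A = rng.standard_normal((M, N))   # y = A x
B = rng.standard_normal((L, M))   # z = B y

# Denominator-layout Jacobians: dy/dx = A^T (N x M), dz/dy = B^T (M x L)
dy_dx = A.T
dz_dy = B.T

# Chain rule: dz/dx = (dy/dx)(dz/dy), an N x L matrix
dz_dx = dy_dx @ dz_dy
print(np.allclose(dz_dx, (B @ A).T))  # True: z = (BA) x, so dz/dx = (BA)^T
```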
1.2.3.2 Scalar Output Case

When $z$ is a scalar ($L = 1$), $\displaystyle\frac{\partial z}{\partial \boldsymbol{y}}$ is an $M \times 1$ column vector (gradient):

\begin{eqnarray} \displaystyle\frac{\partial z}{\partial \boldsymbol{x}} &=& \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} \displaystyle\frac{\partial z}{\partial \boldsymbol{y}} \end{eqnarray}

The dimensions are $(N \times M) \cdot (M \times 1) = N \times 1$, yielding the gradient with respect to $\boldsymbol{x}$.

1.2.3.3 Element-wise Functions

Let $f$ be a scalar function and consider $\boldsymbol{y} = (f(u_0), f(u_1), \ldots, f(u_{M-1}))^\top$ where $f$ is applied to each element of $\boldsymbol{u}$. Since $y_j = f(u_j)$:

\begin{eqnarray} \displaystyle\frac{\partial y_j}{\partial u_k} &=& \begin{cases} f'(u_j), & j = k \\ 0, & j \neq k \end{cases} \end{eqnarray}

Therefore:

\begin{eqnarray} \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{u}} &=& \text{diag}(f'(u_0), f'(u_1), \ldots, f'(u_{M-1})) \end{eqnarray}

Combined with the chain rule:

\begin{eqnarray} \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{x}} &=& \displaystyle\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} \text{diag}(f'(u_0), f'(u_1), \ldots, f'(u_{M-1})) \end{eqnarray}
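As an illustration, the sketch below pushes $\boldsymbol{u} = \boldsymbol{A}\boldsymbol{x}$ through an element-wise sigmoid (a hypothetical choice of $f$) and checks the resulting $N \times M$ Jacobian $\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}}\,\text{diag}(f'(\boldsymbol{u})) = \boldsymbol{A}^\top \text{diag}(\sigma'(\boldsymbol{u}))$ against finite differences.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 4, 3
A = rng.standard_normal((M, N))
x = rng.standard_normal(N)

sigma = lambda t: 1.0 / (1.0 + np.exp(-t))   # element-wise f
u = A @ x                                    # u(x) = A x, so du/dx = A^T
J = A.T @ np.diag(sigma(u) * (1 - sigma(u)))   # (du/dx) diag(f'(u))

# Finite-difference check of the N x M denominator-layout Jacobian
eps = 1e-6
J_fd = np.zeros((N, M))
for i in range(N):
    e = np.zeros(N); e[i] = eps
    J_fd[i] = (sigma(A @ (x + e)) - sigma(A @ (x - e))) / (2 * eps)
print(np.allclose(J, J_fd, atol=1e-7))  # True
```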

2. Derivative of Scalar by Vector

Formulas for differentiating a scalar function $f$ with respect to a vector $\boldsymbol{x}$. Here $a$ is a scalar constant, $\boldsymbol{a}, \boldsymbol{b}$ are constant vectors, and $\boldsymbol{A}$ is a constant matrix. See Proof Collection, Chapter 2 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial a}{\partial \boldsymbol{x}} = \boldsymbol{0}$ $a$ is constant 2.1
$\displaystyle\frac{\partial (\boldsymbol{a}^\top \boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{a}$ 2.2
$\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{a})}{\partial \boldsymbol{x}} = \boldsymbol{a}$ 2.2
$\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{x})}{\partial \boldsymbol{x}} = 2\boldsymbol{x}$ 2.3
$\displaystyle\frac{\partial (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{A}^\top \boldsymbol{b}$ Bilinear form 2.4
$\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = (\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x}$ Quadratic form 2.5
$\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x})}{\partial \boldsymbol{x}} = 2\boldsymbol{A} \boldsymbol{x}$ $\boldsymbol{A}$ symmetric 2.5
$\displaystyle\frac{\partial \|\boldsymbol{x} - \boldsymbol{a}\|}{\partial \boldsymbol{x}} = \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|}$ 2-norm 2.6
$\displaystyle\frac{\partial \|\boldsymbol{x} - \boldsymbol{a}\|^2}{\partial \boldsymbol{x}} = 2(\boldsymbol{x} - \boldsymbol{a})$ Squared 2-norm 2.7
$\displaystyle\frac{\partial (uv)}{\partial \boldsymbol{x}} = \displaystyle u \displaystyle\frac{\partial v}{\partial \boldsymbol{x}} + v \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ Product rule 2.9
$\displaystyle\frac{\partial (\boldsymbol{u}^\top \boldsymbol{v})}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} \boldsymbol{v} + \displaystyle\frac{\partial \boldsymbol{v}}{\partial \boldsymbol{x}} \boldsymbol{u}$ Inner product rule 2.8
$\displaystyle\frac{\partial (f + g)}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial f}{\partial \boldsymbol{x}} + \displaystyle\frac{\partial g}{\partial \boldsymbol{x}}$ Sum rule 2.10
$\displaystyle\frac{\partial (cf)}{\partial \boldsymbol{x}} = \displaystyle c \displaystyle\frac{\partial f}{\partial \boldsymbol{x}}$ Scalar multiplication 2.11
$\displaystyle\frac{\partial (u/v)}{\partial \boldsymbol{x}} = \displaystyle\frac{1}{v^2}\left( v \displaystyle\frac{\partial u}{\partial \boldsymbol{x}} - u \displaystyle\frac{\partial v}{\partial \boldsymbol{x}} \right)$ Quotient rule 2.12
$\displaystyle\frac{\partial (1/u)}{\partial \boldsymbol{x}} = \displaystyle -\displaystyle\frac{1}{u^2} \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ Reciprocal 2.13
$\displaystyle\frac{\partial u^n}{\partial \boldsymbol{x}} = \displaystyle n u^{n-1} \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ Power rule 2.14
$\displaystyle\frac{\partial e^u}{\partial \boldsymbol{x}} = \displaystyle e^u \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ Exponential 2.15
$\displaystyle\frac{\partial \log u}{\partial \boldsymbol{x}} = \displaystyle \displaystyle\frac{1}{u} \displaystyle\frac{\partial u}{\partial \boldsymbol{x}}$ Logarithm 2.16
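Formulas like these can be spot-checked by finite differences. The sketch below (the helper num_grad is ad hoc, not part of the formula sheet) verifies the quadratic-form gradient (2.5) and the 2-norm gradient (2.6).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
a = rng.standard_normal(n)
x = rng.standard_normal(n)

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar function of a vector."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

# 2.5: d(x^T A x)/dx = (A + A^T) x
print(np.allclose(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x, atol=1e-6))
# 2.6: d||x - a||/dx = (x - a)/||x - a||
print(np.allclose(num_grad(lambda v: np.linalg.norm(v - a), x),
                  (x - a) / np.linalg.norm(x - a), atol=1e-6))
```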

3. Derivative of Vector by Vector

Formulas for differentiating a vector function $\boldsymbol{y}$ with respect to a vector $\boldsymbol{x}$. The result is a Jacobian matrix ($N \times M$). See Proof Collection, Chapter 3 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{x}} = \boldsymbol{I}$ Identity 3.1
$\displaystyle\frac{\partial \boldsymbol{a}}{\partial \boldsymbol{x}} = \boldsymbol{O}$ Constant vector 3.3
$\displaystyle\frac{\partial (\boldsymbol{A}\boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{A}^\top$ Linear transformation 3.2
$\displaystyle\frac{\partial (\boldsymbol{A}\boldsymbol{x} + \boldsymbol{b})}{\partial \boldsymbol{x}} = \boldsymbol{A}^\top$ Affine transformation 3.4
$\displaystyle\frac{\partial (\boldsymbol{x}^\top \boldsymbol{A})}{\partial \boldsymbol{x}} = \boldsymbol{A}$ Transposed linear 3.5
$\displaystyle\frac{\partial (\boldsymbol{u} + \boldsymbol{v})}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} + \displaystyle\frac{\partial \boldsymbol{v}}{\partial \boldsymbol{x}}$ Sum rule 3.6
$\displaystyle\frac{\partial (v \boldsymbol{u})}{\partial \boldsymbol{x}} = \displaystyle v \displaystyle\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} + \displaystyle\frac{\partial v}{\partial \boldsymbol{x}} \boldsymbol{u}^\top$ Product rule (scalar × vector) 3.7
$\displaystyle\frac{\partial (\boldsymbol{x} \odot \boldsymbol{x})}{\partial \boldsymbol{x}} = 2\text{diag}(\boldsymbol{x})$ Element-wise square 3.8
$\displaystyle\frac{\partial (\boldsymbol{x} \odot \boldsymbol{y})}{\partial \boldsymbol{z}} = \displaystyle\text{diag}(\boldsymbol{x}) \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}} + \text{diag}(\boldsymbol{y}) \displaystyle\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{z}}$ Hadamard product ($\boldsymbol{x} = \boldsymbol{x}(\boldsymbol{z}),\ \boldsymbol{y} = \boldsymbol{y}(\boldsymbol{z})$) 3.12
$\displaystyle\frac{d(\boldsymbol{x} \times \boldsymbol{y})}{dt} = \displaystyle\frac{d\boldsymbol{x}}{dt} \times \boldsymbol{y} + \boldsymbol{x} \times \displaystyle\frac{d\boldsymbol{y}}{dt}$ Time derivative of cross product 3.9
$\displaystyle\frac{d\|\boldsymbol{x}(t)\|}{dt} = \displaystyle\frac{\boldsymbol{x}}{\|\boldsymbol{x}\|} \cdot \displaystyle\frac{d\boldsymbol{x}}{dt}$ Time derivative of 2-norm 3.10
$\displaystyle\frac{\partial}{\partial \boldsymbol{u}} \begin{pmatrix} f(u_0) \\ \vdots \\ f(u_{N-1}) \end{pmatrix} = \text{diag}\begin{pmatrix} f'(u_0) \\ \vdots \\ f'(u_{N-1}) \end{pmatrix}$ Element-wise function 3.11
$\displaystyle\frac{\partial \text{softmax}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top$ Softmax ($\boldsymbol{p} = \text{softmax}(\boldsymbol{x})$) 3.13
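The softmax Jacobian (3.13) is a convenient test case because $\text{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top$ is symmetric, so the result is the same in either layout. A minimal NumPy sketch:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())      # shift for numerical stability
    return z / z.sum()

x = np.array([0.2, -1.0, 3.0, 0.5])
p = softmax(x)
J = np.diag(p) - np.outer(p, p)  # diag(p) - p p^T

eps = 1e-6
J_fd = np.zeros((x.size, x.size))
for i in range(x.size):
    e = np.zeros(x.size); e[i] = eps
    J_fd[i] = (softmax(x + e) - softmax(x - e)) / (2 * eps)
print(np.allclose(J, J_fd, atol=1e-8))  # True
```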

4. Basic Matrix Derivative Formulas

Basic formulas for differentiating a scalar function $f(\boldsymbol{X})$ with respect to a matrix $\boldsymbol{X}$. Here $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{a}, \boldsymbol{b}$ are constant vectors. See Proof Collection, Chapter 4 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \boldsymbol{a} \boldsymbol{b}^\top$ Bilinear form 4.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b}) = \boldsymbol{b} \boldsymbol{a}^\top$ Transposed bilinear form 4.2
$\displaystyle\frac{\partial \boldsymbol{X}}{\partial X_{ij}} = \boldsymbol{J}^{ij}$ Component derivative 4.3
$\displaystyle\frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{im} A_{nj} = (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij}$ Product component derivative 4.4
$\displaystyle\frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{in} A_{mj} = (\boldsymbol{J}^{nm}\boldsymbol{A})_{ij}$ Transposed product component 4.5

5. Trace Derivatives

Derivative formulas for scalar functions involving the trace $\text{tr}(\cdot)$. See Proof Collection, Chapter 5 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}) = \boldsymbol{I}$ Trace 5.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top$ Trace 5.2
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{A}^\top$ Trace 5.3
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A}$ Trace 5.4
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}$ Trace 5.5
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top$ Trace 5.6
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top$ Quadratic trace 5.7
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$ Quadratic form trace 5.8
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{X}^\top) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$ Quadratic form trace 5.9
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$ Quadratic form trace 5.10
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$ Quadratic form trace 5.11
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{X}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$ Quadratic form trace 5.12
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$ Quadratic form trace 5.13
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = \boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top + \boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top$ Quadratic form trace 5.14
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = 2\boldsymbol{X}$ Squared Frobenius norm 5.15
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = 2\boldsymbol{X}$ Squared Frobenius norm 5.16
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top$ Quadratic form trace 5.17
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C}) = \boldsymbol{B}\boldsymbol{X}\boldsymbol{C} + \boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{C}^\top$ Quadratic form trace 5.18
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C}) = \boldsymbol{A}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$ Quadratic form trace 5.19
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})^\top] = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})\boldsymbol{B}^\top$ Quadratic form trace 5.20
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X} \otimes \boldsymbol{X}) = 2\text{tr}(\boldsymbol{X})\boldsymbol{I}$ Kronecker product trace 5.21
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = (\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{X}$ Quadratic form trace 5.22
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{A}^\top \boldsymbol{B}^\top$ Product trace 5.23
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}) = \boldsymbol{B}\boldsymbol{A}$ Transposed product trace 5.24
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A} \otimes \boldsymbol{X}) = \text{tr}(\boldsymbol{A})\boldsymbol{I}$ Kronecker product trace 5.25
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A}) = -\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{X}^{-\top}$ Inverse trace 5.26
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^k) = k(\boldsymbol{X}^{k-1})^\top$ Higher-order trace 5.27
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) = \displaystyle\sum_{r=0}^{k-1} (\boldsymbol{X}^r \boldsymbol{A} \boldsymbol{X}^{k-r-1})^\top$ Higher-order trace 5.28
\begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) \notag\\&= \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X} \notag\\&\quad + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X} + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top \notag\end{align} Higher-order quadratic form 5.29
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -\boldsymbol{X}^{-\top}\boldsymbol{A}^\top\boldsymbol{B}^\top\boldsymbol{X}^{-\top}$ Inverse trace 5.30
\begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{A}] \notag\\&= -\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{A}+\boldsymbol{A}^\top)(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\end{align} ($\boldsymbol{C}$: symmetric) Quadratic form inverse trace 5.31
\begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] \notag\\&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\\&\quad +2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\end{align} ($\boldsymbol{B}, \boldsymbol{C}$: symmetric) Quadratic form inverse trace 5.32
\begin{align}&\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] \notag\\&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X} \notag\\&\qquad \times(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} +2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag\end{align} ($\boldsymbol{B}, \boldsymbol{C}$: symmetric) Quadratic form inverse trace 5.33
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\exp(\boldsymbol{X})) = \exp(\boldsymbol{X})^\top$ Exponential trace 5.34
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\log(\boldsymbol{X})) = \boldsymbol{X}^{-\top}$ Logarithm trace 5.35
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sqrt{\boldsymbol{X}}) = \displaystyle\frac{1}{2}(\boldsymbol{X}^{-1/2})^\top$ ($\boldsymbol{X}$: positive definite) Square root trace 5.36
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sin(\boldsymbol{X})) = \cos(\boldsymbol{X})^\top$ Sine trace 5.37
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cos(\boldsymbol{X})) = -\sin(\boldsymbol{X})^\top$ Cosine trace 5.38
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tan(\boldsymbol{X})) = \sec^2(\boldsymbol{X})^\top$ Tangent trace 5.39
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arcsin(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ Arcsine trace 5.40
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arccos(\boldsymbol{X})) = -((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ Arccosine trace 5.41
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arctan(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$ Arctangent trace 5.42
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sinh(\boldsymbol{X})) = \cosh(\boldsymbol{X})^\top$ Hyperbolic sine trace 5.43
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cosh(\boldsymbol{X})) = \sinh(\boldsymbol{X})^\top$ Hyperbolic cosine trace 5.44
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tanh(\boldsymbol{X})) = \text{sech}^2(\boldsymbol{X})^\top$ Hyperbolic tangent trace 5.45
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arcsinh}(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2})^\top$ Inverse hyperbolic sine trace 5.46
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arccosh}(\boldsymbol{X})) = ((\boldsymbol{X}^2-\boldsymbol{I})^{-1/2})^\top$ Inverse hyperbolic cosine trace 5.47
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arctanh}(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1})^\top$ Inverse hyperbolic tangent trace 5.48
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\exp(\boldsymbol{X})) = (\boldsymbol{A}\exp(\boldsymbol{X}))^\top$ Matrix-coefficient exponential trace 5.50
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sin(\boldsymbol{X})) = (\boldsymbol{A}\cos(\boldsymbol{X}))^\top$ Matrix-coefficient sine trace 5.49
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cos(\boldsymbol{X})) = -(\boldsymbol{A}\sin(\boldsymbol{X}))^\top$ Matrix-coefficient cosine trace 5.51
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tan(\boldsymbol{X})) = (\boldsymbol{A}\sec^2(\boldsymbol{X}))^\top$ Matrix-coefficient tangent trace 5.52
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arcsin(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ Matrix-coefficient arcsine trace 5.53
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arccos(\boldsymbol{X})) = -(\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$ Matrix-coefficient arccosine trace 5.54
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arctan(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$ Matrix-coefficient arctangent trace 5.55
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X})) = (\boldsymbol{A}\cosh(\boldsymbol{X}))^\top$ Matrix-coefficient hyp. sine trace 5.56
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X})) = (\boldsymbol{A}\sinh(\boldsymbol{X}))^\top$ Matrix-coefficient hyp. cosine trace 5.57
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = (\boldsymbol{A}\text{sech}^2(\boldsymbol{X}))^\top$ Matrix-coefficient hyp. tangent trace 5.58

Note on 5.49–5.58: by 5.28, differentiating $\text{tr}(\boldsymbol{A}\boldsymbol{X}^k)$ yields $\sum_{r=0}^{k-1} (\boldsymbol{X}^r \boldsymbol{A} \boldsymbol{X}^{k-r-1})^\top$, which collapses to $k(\boldsymbol{A}\boldsymbol{X}^{k-1})^\top$ only when $\boldsymbol{A}$ commutes with $\boldsymbol{X}$. The matrix-coefficient formulas therefore hold as stated under that commutativity assumption (e.g., $\boldsymbol{A}$ a polynomial in $\boldsymbol{X}$); for general $\boldsymbol{A}$, apply 5.28 to the power series term by term.
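The sketch below spot-checks three representative trace formulas (5.2, 5.22, and 5.26) by finite differences; the shift toward a multiple of the identity is an assumption made only to keep $\boldsymbol{X}$ well-conditioned and invertible.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.standard_normal((n, n))
X = rng.standard_normal((n, n)) + n * np.eye(n)   # well-conditioned, invertible

def num_grad(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function of a matrix."""
    G = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# 5.2: d tr(AX)/dX = A^T
print(np.allclose(num_grad(lambda Z: np.trace(A @ Z), X), A.T, atol=1e-6))
# 5.22 (= 5.8): d tr(X^T A X)/dX = (A + A^T) X
print(np.allclose(num_grad(lambda Z: np.trace(Z.T @ A @ Z), X),
                  (A + A.T) @ X, atol=1e-5))
# 5.26: d tr(X^{-1} A)/dX = -X^{-T} A^T X^{-T}
Xinv = np.linalg.inv(X)
print(np.allclose(num_grad(lambda Z: np.trace(np.linalg.inv(Z) @ A), X),
                  -Xinv.T @ A.T @ Xinv.T, atol=1e-5))
```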

6. Hadamard Product and Activation Functions

Derivative formulas for the element-wise (Hadamard) product and activation functions commonly used in machine learning. See Proof Collection, Chapter 6 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial (\boldsymbol{x} \odot \boldsymbol{y})}{\partial \boldsymbol{z}} = \text{diag}(\boldsymbol{x}) \displaystyle\frac{\partial \boldsymbol{y}}{\partial \boldsymbol{z}} + \text{diag}(\boldsymbol{y}) \displaystyle\frac{\partial \boldsymbol{x}}{\partial \boldsymbol{z}}$ Hadamard product 6.1
$\displaystyle\frac{\partial \text{softmax}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top$
($\boldsymbol{p} = \text{softmax}(\boldsymbol{x})$)
Softmax Jacobian 6.2
$\displaystyle\frac{\partial \sigma(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x})))$
($\sigma(x) = \displaystyle\frac{1}{1+e^{-x}}$)
Sigmoid 6.3
$\displaystyle\frac{\partial \tanh(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(1 - \tanh^2(\boldsymbol{x}))$ Tanh 6.4
$\displaystyle\frac{\partial \text{ReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\mathbf{1}_{x_i > 0})$
($\text{ReLU}(x) = \max(0, x)$)
ReLU (subgradient at $x = 0$) 6.5
$\displaystyle\frac{\partial \text{LeakyReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\mathbf{1}_{x_i > 0} + \alpha \cdot \mathbf{1}_{x_i \leq 0})$
($\text{LeakyReLU}(x) = \max(\alpha x, x)$)
Leaky ReLU 6.6
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} (-\boldsymbol{y}^\top \log \boldsymbol{p}) = \boldsymbol{p} - \boldsymbol{y}$
($\boldsymbol{p} = \text{softmax}(\boldsymbol{x})$; $\boldsymbol{y}$: one-hot or probability vector, $\sum_i y_i = 1$)
Cross-entropy loss (softmax + CE) 6.7
$\displaystyle\frac{\partial}{\partial x} \text{BCE}(y, \sigma(x)) = \sigma(x) - y$
(BCE = $-y\log\sigma(x) - (1-y)\log(1-\sigma(x))$)
Binary cross-entropy (sigmoid + BCE) 6.8
$\displaystyle\frac{\partial \text{GELU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\Phi(\boldsymbol{x}) + \boldsymbol{x} \odot \phi(\boldsymbol{x}))$
($\text{GELU}(x) = x \cdot \Phi(x)$; $\Phi$: standard normal CDF, $\phi$: standard normal PDF)
GELU 6.9
$\displaystyle\frac{\partial \text{Swish}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \text{diag}(\sigma(\boldsymbol{x}) + \boldsymbol{x} \odot \sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x})))$
($\text{Swish}(x) = x \cdot \sigma(x)$)
Swish (SiLU) 6.10
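The softmax + cross-entropy gradient (6.7) is worth a numerical check because the full Jacobian chain collapses to the simple residual $\boldsymbol{p} - \boldsymbol{y}$. A minimal NumPy sketch with a one-hot target:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

x = np.array([1.0, -0.5, 2.0])
y = np.array([0.0, 1.0, 0.0])          # one-hot target (sums to 1)

loss = lambda v: -y @ np.log(softmax(v))
p = softmax(x)

eps = 1e-6
g_fd = np.array([(loss(x + eps * e) - loss(x - eps * e)) / (2 * eps)
                 for e in np.eye(x.size)])
print(np.allclose(g_fd, p - y, atol=1e-8))  # gradient collapses to p - y
```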

7. Determinant Derivatives

Derivative formulas for the determinant $\det(\boldsymbol{X})$ and related functions. See Proof Collection, Chapter 7 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}| = |\boldsymbol{X}| \boldsymbol{X}^{-\top}$ Determinant 7.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log|\boldsymbol{X}| = \boldsymbol{X}^{-\top}$ Log-determinant 7.2
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^n| = n|\boldsymbol{X}^n| \boldsymbol{X}^{-\top}$ Power of determinant 7.3
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}| = |\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}| \boldsymbol{X}^{-\top}$
($\boldsymbol{A}, \boldsymbol{B}$: square invertible)
Product determinant 7.7
$\displaystyle\sum_{k} \displaystyle\frac{\partial |\boldsymbol{X}|}{\partial X_{ik}} X_{jk} = \delta_{ij} |\boldsymbol{X}|$ Cofactor expansion property 7.5
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| = 2|\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| \boldsymbol{X}^{-\top}$
($\boldsymbol{X}$: square invertible)
Quadratic form determinant 7.8.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| = 2|\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| \boldsymbol{A}\boldsymbol{X}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X})^{-1}$
($\boldsymbol{X}$: non-square, $\boldsymbol{A}$: symmetric, $\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}$: invertible)
Quadratic form determinant 7.8.2
\begin{align}\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}| &= |\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}|(\boldsymbol{A}\boldsymbol{X}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X})^{-1} \notag\\&\quad + \boldsymbol{A}^\top \boldsymbol{X}(\boldsymbol{X}^\top \boldsymbol{A}^\top \boldsymbol{X})^{-1}) \notag\end{align}
($\boldsymbol{X}$: non-square, $\boldsymbol{A}$: general, $\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}$: invertible)
Quadratic form determinant 7.8.3
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log |\boldsymbol{X}^\top \boldsymbol{X}| = 2(\boldsymbol{X}^{+})^\top$ Gram matrix log-determinant 7.9.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}^{+}} \log |\boldsymbol{X}^\top \boldsymbol{X}| = -2\boldsymbol{X}^\top$ Pseudo-inverse derivative 7.9.2
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log |\det(\boldsymbol{X})| = (\boldsymbol{X}^{-1})^\top = (\boldsymbol{X}^\top)^{-1}$ Log absolute determinant 7.10
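The log-determinant gradient (7.10) appears in Gaussian likelihoods and is easy to verify. The sketch below uses np.linalg.slogdet for numerical stability and a diagonal shift (an assumption, just to keep $\boldsymbol{X}$ safely invertible).

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
X = rng.standard_normal((n, n)) + n * np.eye(n)

def num_grad(f, X, eps=1e-6):
    G = np.zeros_like(X)
    for i in range(n):
        for j in range(n):
            E = np.zeros_like(X); E[i, j] = eps
            G[i, j] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

# 7.10: d log|det X| / dX = X^{-T}
f = lambda Z: np.linalg.slogdet(Z)[1]   # log of |det Z|
print(np.allclose(num_grad(f, X), np.linalg.inv(X).T, atol=1e-6))
```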

8. Inverse Matrix Derivatives

Derivative formulas for functions involving the inverse $\boldsymbol{X}^{-1}$. See Proof Collection, Chapter 8 for proofs.

8.1 Regular Inverse Matrix Derivatives

Formula Notes Proof
$\displaystyle\frac{\partial (\boldsymbol{X}^{-1})_{kl}}{\partial X_{ij}} = -(\boldsymbol{X}^{-1})_{ki}(\boldsymbol{X}^{-1})_{jl}$ Inverse component derivative 8.2
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top \boldsymbol{X}^{-1} \boldsymbol{b} = -\boldsymbol{X}^{-\top} \boldsymbol{a} \boldsymbol{b}^\top \boldsymbol{X}^{-\top}$ Bilinear form with inverse 8.3
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}^{-1}| = -|\boldsymbol{X}^{-1}|(\boldsymbol{X}^{-1})^\top$ Determinant of inverse 8.4
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -(\boldsymbol{X}^{-1}\boldsymbol{B}\boldsymbol{A}\boldsymbol{X}^{-1})^\top$ Trace with inverse 8.5
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}((\boldsymbol{X}+\boldsymbol{A})^{-1}) = -((\boldsymbol{X}+\boldsymbol{A})^{-1}(\boldsymbol{X}+\boldsymbol{A})^{-1})^\top$ Trace of sum inverse 8.6
$\displaystyle\frac{\partial J}{\partial \boldsymbol{A}} = -\boldsymbol{A}^{-\top} \displaystyle\frac{\partial J}{\partial \boldsymbol{W}} \boldsymbol{A}^{-\top}$
(where $\boldsymbol{W} = \boldsymbol{A}^{-1}$)
Inverse chain rule 8.7
$\displaystyle\frac{\partial}{\partial A_{ij}} (\boldsymbol{I} - \boldsymbol{A})^{-1} = \boldsymbol{L} \boldsymbol{E}_{ij} \boldsymbol{L}$
($\boldsymbol{L} = (\boldsymbol{I} - \boldsymbol{A})^{-1}$: Leontief inverse; $\boldsymbol{E}_{ij}$: matrix with 1 at $(i,j)$ only)
Leontief inverse derivative (input-output analysis) 8.8
$\displaystyle\frac{\partial}{\partial \boldsymbol{A}} \text{tr}((\boldsymbol{I} - \boldsymbol{A})^{-1}) = ((\boldsymbol{I} - \boldsymbol{A})^{-1}(\boldsymbol{I} - \boldsymbol{A})^{-1})^\top$ Leontief inverse trace 8.9
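The component formula (8.2) is the building block for the rest of the table; it says a unit perturbation of $X_{ij}$ changes $\boldsymbol{X}^{-1}$ by $-\boldsymbol{X}^{-1}\boldsymbol{J}^{ij}\boldsymbol{X}^{-1}$. A minimal NumPy check for one index pair:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 3
X = rng.standard_normal((n, n)) + n * np.eye(n)   # safely invertible (assumption)
Xinv = np.linalg.inv(X)

# 8.2: d(X^{-1})_{kl} / dX_{ij} = -(X^{-1})_{ki} (X^{-1})_{jl}
i, j = 1, 2
eps = 1e-6
E = np.zeros((n, n)); E[i, j] = eps
dXinv_fd = (np.linalg.inv(X + E) - np.linalg.inv(X - E)) / (2 * eps)
dXinv = -np.outer(Xinv[:, i], Xinv[j, :])   # entry (k,l) = -(X^{-1})_{ki}(X^{-1})_{jl}
print(np.allclose(dXinv_fd, dXinv, atol=1e-6))
```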

8.2 Moore-Penrose Pseudoinverse Derivatives

Derivatives of the Moore-Penrose pseudoinverse $\boldsymbol{X}^+ \in \mathbb{R}^{n \times m}$ ($\boldsymbol{X} \in \mathbb{R}^{m \times n}$). Used in robotics (redundant manipulators), least squares, and signal processing. $\boldsymbol{X}^+$ satisfies $\boldsymbol{X}\boldsymbol{X}^+\boldsymbol{X} = \boldsymbol{X}$, $\boldsymbol{X}^+\boldsymbol{X}\boldsymbol{X}^+ = \boldsymbol{X}^+$, etc.

Formula Notes Proof
\begin{align}d\boldsymbol{X}^+ &= -\boldsymbol{X}^+ (d\boldsymbol{X}) \boldsymbol{X}^+ + \boldsymbol{X}^{+\top}\boldsymbol{X}^\top (d\boldsymbol{X})^\top (\boldsymbol{I} - \boldsymbol{X}\boldsymbol{X}^+) \notag\\&\quad + (\boldsymbol{I} - \boldsymbol{X}^+\boldsymbol{X})(d\boldsymbol{X})^\top \boldsymbol{X}^{+\top}\boldsymbol{X}^+ \notag\end{align}
(full rank, $m \le n$)
Golub-Pereyra formula 8.10
$d\boldsymbol{X}^+ = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}(d\boldsymbol{X})^\top(\boldsymbol{I} - \boldsymbol{X}\boldsymbol{X}^+) - \boldsymbol{X}^+(d\boldsymbol{X})\boldsymbol{X}^+$
($m \ge n$, full column rank)
Left inverse type 8.11
$d\boldsymbol{X}^+ = (\boldsymbol{I} - \boldsymbol{X}^+\boldsymbol{X})(d\boldsymbol{X})^\top(\boldsymbol{X}\boldsymbol{X}^\top)^{-1} - \boldsymbol{X}^+(d\boldsymbol{X})\boldsymbol{X}^+$
($m \le n$, full row rank)
Right inverse type 8.12
\begin{align}\frac{d\boldsymbol{X}^+}{dt} &= -\boldsymbol{X}^+ \dot{\boldsymbol{X}} \boldsymbol{X}^+ + \boldsymbol{X}^{+\top}\boldsymbol{X}^\top \dot{\boldsymbol{X}}^\top (\boldsymbol{I} - \boldsymbol{X}\boldsymbol{X}^+) \notag\\&\quad + (\boldsymbol{I} - \boldsymbol{X}^+\boldsymbol{X})\dot{\boldsymbol{X}}^\top \boldsymbol{X}^{+\top}\boldsymbol{X}^+ \notag\end{align}
(time derivative)
Used for robot Jacobian time derivative 8.13
$\boldsymbol{X}^+ = \boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{X}^\top)^{-1}$
(full row rank)
Right inverse
$\boldsymbol{X}^+ = (\boldsymbol{X}^\top\boldsymbol{X})^{-1}\boldsymbol{X}^\top$
(full column rank)
Left inverse

In robotics, the pseudoinverse $\boldsymbol{J}^+$ of the Jacobian $\boldsymbol{J}(\boldsymbol{q})$ is used to solve inverse kinematics: $\dot{\boldsymbol{q}} = \boldsymbol{J}^+ \dot{\boldsymbol{x}}$. For redundant manipulators (joints $n$ > workspace dimension $m$), $\boldsymbol{J}^+$ gives the minimum-norm solution.
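The minimum-norm property is easy to see numerically. Below is a minimal NumPy sketch with a randomly generated (hypothetical) $3 \times 7$ Jacobian standing in for a redundant manipulator: the pseudoinverse solution reproduces the task-space velocity, and adding any null-space motion can only increase the joint-velocity norm.

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 3, 7                       # workspace dim < joint dim (redundant arm)
J = rng.standard_normal((m, n))   # stand-in Jacobian at the current configuration
x_dot = np.array([0.1, 0.0, -0.2])

q_dot = np.linalg.pinv(J) @ x_dot      # minimum-norm joint velocity
print(np.allclose(J @ q_dot, x_dot))   # reproduces the task-space velocity

# Any null-space motion still satisfies J q = x_dot but increases the norm
z = rng.standard_normal(n)
q_alt = q_dot + (np.eye(n) - np.linalg.pinv(J) @ J) @ z
print(np.allclose(J @ q_alt, x_dot),
      np.linalg.norm(q_dot) <= np.linalg.norm(q_alt))  # True True
```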

9. Eigenvalue/Eigenvector Derivatives

Derivative formulas for eigenvalues $\lambda_i$ and eigenvectors $\boldsymbol{v}_i$. See Proof Collection, Chapter 9 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \sum_i \lambda_i(\boldsymbol{X}) = \boldsymbol{I}$ Sum of eigenvalues 9.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \prod_i \lambda_i(\boldsymbol{X}) = \det(\boldsymbol{X}) \boldsymbol{X}^{-\top}$ Product of eigenvalues 9.2
$\partial \lambda_i = \boldsymbol{v}_i^\top \partial\boldsymbol{A} \, \boldsymbol{v}_i$
($\boldsymbol{A}$: real symmetric)
Eigenvalue derivative 9.3
$\partial \boldsymbol{v}_i = (\lambda_i \boldsymbol{I} - \boldsymbol{A})^+ \partial\boldsymbol{A} \, \boldsymbol{v}_i$
($\boldsymbol{A}$: real symmetric)
Eigenvector derivative 9.4
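The first-order eigenvalue perturbation (9.3) can be checked with np.linalg.eigh, which returns eigenvalues in ascending order; the check assumes the eigenvalues are simple (almost surely true for a random symmetric matrix) so the ordering is stable under a small perturbation. The eigenvector formula (9.4) is harder to test directly because of sign ambiguity, so only 9.3 is verified here.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 5
A = rng.standard_normal((n, n)); A = (A + A.T) / 2      # real symmetric
E = rng.standard_normal((n, n)); E = (E + E.T) / 2      # symmetric perturbation direction

lam, V = np.linalg.eigh(A)
i = 2
eps = 1e-6
lam_p = np.linalg.eigh(A + eps * E)[0]
lam_m = np.linalg.eigh(A - eps * E)[0]

# 9.3: dlambda_i = v_i^T dA v_i
dlam_fd = (lam_p[i] - lam_m[i]) / (2 * eps)
print(np.allclose(dlam_fd, V[:, i] @ E @ V[:, i], atol=1e-6))  # True
```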

10. Quadratic Form Derivatives

Derivative formulas for vector and matrix quadratic forms. See Proof Collection, Chapter 10 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{a}) = \boldsymbol{a} \boldsymbol{a}^\top$ Matrix quadratic form 10.1
$\displaystyle\frac{\partial}{\partial X_{ij}} \left(\sum_{k,l} X_{kl}\right)^2 = 2 \displaystyle\sum_{k,l} X_{kl}$ Squared component sum 10.2
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{b}^\top \boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{c}) = \boldsymbol{X} (\boldsymbol{b} \boldsymbol{c}^\top + \boldsymbol{c} \boldsymbol{b}^\top)$ Gram matrix bilinear form 10.3
\begin{align}&\frac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{B}\boldsymbol{x}+\boldsymbol{b})^\top \boldsymbol{C} (\boldsymbol{D}\boldsymbol{x}+\boldsymbol{d}) \notag\\&= \boldsymbol{B}^\top \boldsymbol{C} (\boldsymbol{D}\boldsymbol{x}+\boldsymbol{d}) + \boldsymbol{D}^\top \boldsymbol{C}^\top (\boldsymbol{B}\boldsymbol{x}+\boldsymbol{b}) \notag\end{align} General quadratic form 10.4
$\displaystyle\frac{\partial (\boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{X})}{\partial X_{ij}} = \boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} \boldsymbol{B} \boldsymbol{X}$ Matrix quadratic form component 10.5
$\displaystyle\frac{\partial \boldsymbol{x}^\top \boldsymbol{B} \boldsymbol{x}}{\partial \boldsymbol{x}} = (\boldsymbol{B} + \boldsymbol{B}^\top)\boldsymbol{x}$ Vector quadratic form 10.6
$\displaystyle\frac{\partial \boldsymbol{b}^\top \boldsymbol{X}^\top \boldsymbol{D} \boldsymbol{X} \boldsymbol{c}}{\partial \boldsymbol{X}} = \boldsymbol{D}^\top \boldsymbol{X} \boldsymbol{b} \boldsymbol{c}^\top + \boldsymbol{D} \boldsymbol{X} \boldsymbol{c} \boldsymbol{b}^\top$ Generalized Gram bilinear form 10.7
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{X}\boldsymbol{b} + \boldsymbol{c})^\top \boldsymbol{D} (\boldsymbol{X}\boldsymbol{b} + \boldsymbol{c}) = (\boldsymbol{D} + \boldsymbol{D}^\top)(\boldsymbol{X}\boldsymbol{b} + \boldsymbol{c})\boldsymbol{b}^\top$ Affine quadratic form 10.8
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x} - \boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{s}) = 2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{s})$
($\boldsymbol{W}$: symmetric)
Symmetric quadratic ($\boldsymbol{x}$ deriv.) 10.9
$\displaystyle\frac{\partial}{\partial \boldsymbol{s}} (\boldsymbol{x} - \boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{s}) = -2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{s})$
($\boldsymbol{W}$: symmetric)
Symmetric quadratic ($\boldsymbol{s}$ deriv.) 10.10
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s}) = 2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})$
($\boldsymbol{W}$: symmetric)
Affine symmetric ($\boldsymbol{x}$ deriv.) 10.11
$\displaystyle\frac{\partial}{\partial \boldsymbol{s}} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s}) = -2\boldsymbol{A}^\top\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})$
($\boldsymbol{W}$: symmetric)
Affine symmetric ($\boldsymbol{s}$ deriv.) 10.12
$\displaystyle\frac{\partial}{\partial \boldsymbol{A}} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})^\top \boldsymbol{W} (\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s}) = -2\boldsymbol{W}(\boldsymbol{x} - \boldsymbol{A}\boldsymbol{s})\boldsymbol{s}^\top$
($\boldsymbol{W}$: symmetric)
Affine symmetric ($\boldsymbol{A}$ deriv.) 10.13

11. Matrix Powers and Composite Functions

Derivative formulas for functions involving matrix powers $\boldsymbol{X}^n$, composite functions, and the Rayleigh quotient. See Proof Collection, Chapter 11 for proofs.

Formula Notes Proof
$\displaystyle\frac{\partial (\boldsymbol{X}^n)_{kl}}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r})_{kl}$ Matrix power component 11.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top \boldsymbol{X}^n \boldsymbol{b} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r)^\top \boldsymbol{a} \boldsymbol{b}^\top (\boldsymbol{X}^{n-1-r})^\top$ Bilinear form of power 11.2
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top (\boldsymbol{X}^n)^\top \boldsymbol{X}^n \boldsymbol{b} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r)^\top \boldsymbol{X}^n (\boldsymbol{b}\boldsymbol{a}^\top + \boldsymbol{a}\boldsymbol{b}^\top) (\boldsymbol{X}^{n-1-r})^\top$ Gram form of power 11.3
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \boldsymbol{s}(\boldsymbol{x})^\top \boldsymbol{A} \boldsymbol{r}(\boldsymbol{x}) = \displaystyle\frac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}} \boldsymbol{A} \boldsymbol{r} + \displaystyle\frac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}} \boldsymbol{A}^\top \boldsymbol{s}$ Composite bilinear form (denominator-layout Jacobians, cf. 2.8) 11.4
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \displaystyle\frac{(\boldsymbol{A}\boldsymbol{x})^\top (\boldsymbol{A}\boldsymbol{x})}{(\boldsymbol{B}\boldsymbol{x})^\top (\boldsymbol{B}\boldsymbol{x})} = \displaystyle 2\displaystyle\frac{\boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}}{\boldsymbol{x}^\top \boldsymbol{B}^\top\boldsymbol{B}\boldsymbol{x}} - 2\displaystyle\frac{\boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x} \cdot \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}}{(\boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x})^2}$ Rayleigh quotient 11.5
$\nabla_{\boldsymbol{x}} f = (\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{x} + \boldsymbol{b}$
($f(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b}^\top \boldsymbol{x}$)
Gradient 11.6a
$\displaystyle\frac{\partial^2 f}{\partial \boldsymbol{x} \partial \boldsymbol{x}^\top} = \boldsymbol{A} + \boldsymbol{A}^\top$
($f(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b}^\top \boldsymbol{x}$)
Hessian 11.6b

11.1 Matrix Exponential Derivatives

Derivatives of the matrix exponential $e^{\boldsymbol{A}} = \sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{A}^k}{k!}$ (Fréchet derivative). Important in Lie groups/algebras, differential equations, and control theory. The numerical sensitivity of $e^{\boldsymbol{A}}$ is governed by the norm of the Fréchet derivative operator $\|L(\boldsymbol{A}, \cdot)\|$, which defines the condition number $\kappa(e^{\boldsymbol{A}})$ (11.10).

Formula Notes Proof
$D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] = \displaystyle\int_0^1 e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}} ds$
(Fréchet derivative in direction $\boldsymbol{E}$)
Fréchet derivative of matrix exponential 11.7
$\displaystyle\frac{\partial}{\partial t} e^{t\boldsymbol{A}} = \boldsymbol{A} e^{t\boldsymbol{A}} = e^{t\boldsymbol{A}} \boldsymbol{A}$ Scalar parameter derivative 11.8
$\displaystyle\frac{\partial}{\partial \boldsymbol{A}} \text{tr}(e^{\boldsymbol{A}}) = (e^{\boldsymbol{A}})^\top$ Trace of matrix exponential 11.9
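Both 11.7 and 11.8 can be checked numerically. The sketch below assumes SciPy is available and uses scipy.linalg.expm together with scipy.linalg.expm_frechet (which returns the pair $(e^{\boldsymbol{A}}, L(\boldsymbol{A}, \boldsymbol{E}))$); the scaling of $\boldsymbol{A}$ is only to keep the finite differences well-behaved.

```python
import numpy as np
from scipy.linalg import expm, expm_frechet

rng = np.random.default_rng(9)
n = 4
A = rng.standard_normal((n, n)) / 2   # scaled down for a well-behaved check

# 11.8: d/dt e^{tA} = A e^{tA}, checked at t = 0.7 by central differences
t, eps = 0.7, 1e-5
dfd = (expm((t + eps) * A) - expm((t - eps) * A)) / (2 * eps)
print(np.allclose(dfd, A @ expm(t * A), atol=1e-6))

# 11.7: Frechet derivative in direction E vs. finite differences
E = rng.standard_normal((n, n))
_, L = expm_frechet(A, E)             # L = D_A e^A [E]
print(np.allclose(L, (expm(A + eps * E) - expm(A - eps * E)) / (2 * eps),
                  atol=1e-5))
```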

11.2 Matrix Square Root Gradient

Gradient of the matrix square root $\boldsymbol{A}^{1/2}$ of a positive definite matrix $\boldsymbol{A}$ (satisfying $\boldsymbol{A} = \boldsymbol{A}^{1/2}\boldsymbol{A}^{1/2}$).

Formula Notes Proof
$\displaystyle\frac{\partial L}{\partial \boldsymbol{A}} = \boldsymbol{X}$, where $\boldsymbol{X}$ solves the Sylvester equation $\boldsymbol{S}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{S} = \bar{\boldsymbol{S}}$
($\boldsymbol{S} = \boldsymbol{A}^{1/2}$; $\bar{\boldsymbol{S}} = \partial L/\partial \boldsymbol{S}$: upstream gradient)
Solution of Sylvester equation 11.10
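A minimal backward-pass sketch, assuming SciPy is available: scipy.linalg.solve_sylvester solves $\boldsymbol{S}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{S} = \bar{\boldsymbol{S}}$ directly, and the result is checked against finite differences of the scalar loss $L(\boldsymbol{A}) = \langle \bar{\boldsymbol{S}}, \boldsymbol{A}^{1/2} \rangle_F$ (a hypothetical loss chosen so that its upstream gradient is exactly $\bar{\boldsymbol{S}}$).

```python
import numpy as np
from scipy.linalg import solve_sylvester, sqrtm

rng = np.random.default_rng(10)
n = 3
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)          # positive definite
S = sqrtm(A).real                    # S = A^{1/2}
S_bar = rng.standard_normal((n, n))  # upstream gradient dL/dS

# Backward pass: solve S X + X S = S_bar for X = dL/dA
X = solve_sylvester(S, S, S_bar)

# Finite-difference check of L(A) = <S_bar, A^{1/2}>_F
def L(M):
    return np.sum(S_bar * sqrtm(M).real)

eps = 1e-5
G = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        E = np.zeros((n, n)); E[i, j] = eps
        G[i, j] = (L(A + E) - L(A - E)) / (2 * eps)
print(np.allclose(G, X, atol=1e-5))  # True
```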

12. Norm Derivatives

Derivative formulas for vector and matrix norms. See Proof Collection, Chapter 12 for proofs.

Formula | Notes | Proof
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|_2 = \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2}$ | 2-norm derivative | 12.1
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} = \displaystyle\frac{\boldsymbol{I}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} - \displaystyle\frac{(\boldsymbol{x} - \boldsymbol{a})(\boldsymbol{x} - \boldsymbol{a})^\top}{\|\boldsymbol{x} - \boldsymbol{a}\|_2^3}$ | Normalized vector derivative | 12.2
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\boldsymbol{x}$ | Squared 2-norm | 12.3
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X}$ | Squared Frobenius norm | 12.4
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F = \displaystyle\frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F}$ | Frobenius norm | 12.5
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A})$ | Squared Frobenius norm of difference | 12.6
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}\|_F^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B})$ | Regression residual (left multiply) | 12.7
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F^2 = 2(\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B})\boldsymbol{A}^\top$ | Regression residual (right multiply) | 12.8
$\displaystyle\frac{\partial}{\partial \boldsymbol{w}} \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$ | Regression weight gradient | 12.9
$\displaystyle\frac{\partial}{\partial \boldsymbol{W}} \displaystyle\frac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W}$ | L2 regularization (weight decay) | 12.10
$\displaystyle\frac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \text{sign}(\boldsymbol{W})$ | L1 regularization (subgradient; any value in $[-1, 1]$ at $W_{ij} = 0$) | 12.11
$\displaystyle\frac{\partial}{\partial \boldsymbol{\alpha}}\left(\displaystyle\frac{1}{2}\|\boldsymbol{x} - \boldsymbol{D}\boldsymbol{\alpha}\|^2 + \lambda\|\boldsymbol{\alpha}\|_1\right) = \boldsymbol{D}^\top(\boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}) + \lambda \cdot \text{sign}(\boldsymbol{\alpha})$ | LASSO gradient (L1-regularized regression, subgradient) | 12.12
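The least-squares gradient 12.9 underlies most linear-model fitting; below is a minimal finite-difference sanity check on arbitrary random data:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 3))
y = rng.standard_normal(50)
w = rng.standard_normal(3)

grad = 2 * X.T @ (X @ w - y)                  # formula 12.9

# central finite differences, one coordinate at a time
loss = lambda w: np.sum((X @ w - y) ** 2)
h = 1e-6
fd = np.array([(loss(w + h * e) - loss(w - h * e)) / (2 * h) for e in np.eye(3)])

print(np.max(np.abs(grad - fd)))              # ~1e-6 or smaller
```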

13. Structured Matrix Derivatives

Derivative formulas for matrices with structure such as symmetric, diagonal, and Toeplitz matrices. See Proof Collection, Chapter 13 for proofs.

Formula | Notes | Proof
$\displaystyle\frac{df}{dA_{ij}} = \text{tr}\left[\left(\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\right)^\top \boldsymbol{S}^{ij}\right]$ ($\boldsymbol{A}$: structured matrix) | Structured matrix derivative (general) | 13.1
$\displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}} = \boldsymbol{J}^{ij}$ ($\boldsymbol{A}$: general matrix) | Structure matrix (general) | 13.2
$\displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}} = \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} - \delta_{ij}\boldsymbol{J}^{ij}$ ($\boldsymbol{A}$: symmetric) | Structure matrix (symmetric) | 13.3
$\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{sym}} = \displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}} + \left(\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right)^\top - \text{diag}\left(\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right)$ ($\boldsymbol{A}$: symmetric) | Derivative w.r.t. symmetric matrix | 13.4

Here $\boldsymbol{S}^{ij} = \displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}}$ is the structure matrix, representing how the entire matrix changes when $A_{ij}$ is varied, and $\boldsymbol{J}^{ij}$ is the single-entry matrix with 1 in position $(i,j)$ and zeros elsewhere.
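In code, formula 13.4 amounts to a one-line symmetrization of an unconstrained gradient. A minimal helper (the function name is illustrative), checked against the trace rule 15.1 below:

```python
import numpy as np

def sym_grad(G):
    """Gradient w.r.t. a symmetric matrix from the unconstrained
    gradient G (formula 13.4): G + G^T - diag(G)."""
    return G + G.T - np.diag(np.diag(G))

# Check against 15.1: d tr(AX)/dX = A + A^T - (A o I) for symmetric X,
# starting from the unconstrained gradient A^T.
rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4))
expected = A + A.T - np.diag(np.diag(A))
print(np.allclose(sym_grad(A.T), expected))   # True
```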

13.1 Vec Operator and Related Matrices

This subsection covers the vec operator, which converts a matrix into a column vector, along with the commutation and duplication matrices. These are fundamental tools for treating matrix derivatives as linear transformations.

Formula | Notes | Proof
$\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$ | Vectorization of matrix product | 13.5
$\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A}^\top)$ ($\boldsymbol{A}$: $m \times n$) | Commutation matrix | 13.6
$\boldsymbol{K}_{mn}(\boldsymbol{A} \otimes \boldsymbol{B}) = (\boldsymbol{B} \otimes \boldsymbol{A})\boldsymbol{K}_{pq}$ ($\boldsymbol{A}$: $m \times p$, $\boldsymbol{B}$: $n \times q$) | Kronecker product reordering | 13.7
$\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A})$ ($\boldsymbol{A}$: $n \times n$ symmetric) | Duplication matrix | 13.8
$\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A}) = \text{vech}(\boldsymbol{A})$ ($\boldsymbol{A}$: $n \times n$ symmetric) | Elimination matrix | 13.9
$\boldsymbol{L}_n \boldsymbol{D}_n = \boldsymbol{I}_{n(n+1)/2}$ | Elimination-duplication relation | 13.10
$\displaystyle\frac{\partial\,\text{vec}(\boldsymbol{X})}{\partial\,\text{vec}(\boldsymbol{X})^\top} = \boldsymbol{I}_{mn}$ ($\boldsymbol{X}$: $m \times n$) | Vectorization derivative | 13.11

Here $\text{vec}(\boldsymbol{A})$ stacks the columns of $\boldsymbol{A}$ into a single vector, $\text{vech}(\boldsymbol{A})$ vectorizes the lower triangular part (including diagonal) of a symmetric matrix, and $\otimes$ denotes the Kronecker product.
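Identity 13.5 can be verified directly in NumPy; note that column-stacking vec corresponds to `order="F"` reshaping:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.reshape(-1, order="F")      # column-stacking vec operator

lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)                # (B^T kron A) vec(X), formula 13.5
print(np.allclose(lhs, rhs))                  # True
```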

13.2 Cholesky Decomposition Gradient

Gradient of the Cholesky decomposition $\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$ ($\boldsymbol{L}$ lower triangular) for positive definite $\boldsymbol{A}$. Important for Gaussian processes and covariance matrix computations.

Formula | Notes | Proof
$\displaystyle\frac{\partial L}{\partial \boldsymbol{A}} = \boldsymbol{L}^{-\top}\,\text{tril}(\boldsymbol{L}^\top \bar{\boldsymbol{L}})\,\boldsymbol{L}^{-1}$ ($\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$; $\bar{\boldsymbol{L}}$: upstream gradient; $\text{tril}$: lower triangular part) | Reverse-mode Cholesky gradient | 13.12
$\displaystyle\frac{\partial \log|\boldsymbol{A}|}{\partial \boldsymbol{A}} = \boldsymbol{A}^{-\top}$ (via Cholesky: $\log|\boldsymbol{A}| = 2\sum_i \log L_{ii}$) | Log-determinant via Cholesky | 13.13
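Formula 13.13 is the numerically stable way to evaluate $\log|\boldsymbol{A}|$ for a positive definite matrix; a minimal check against `np.linalg.slogdet`:

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((5, 5))
A = B @ B.T + 5 * np.eye(5)                   # symmetric positive definite

L = np.linalg.cholesky(A)                     # A = L L^T, L lower triangular
logdet = 2 * np.sum(np.log(np.diag(L)))       # formula 13.13

print(np.isclose(logdet, np.linalg.slogdet(A)[1]))  # True
# the gradient is then d log|A| / dA = A^{-T}
```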

14. Matrix Chain Rule

When a matrix $\boldsymbol{U} = f(\boldsymbol{X})$ is a function of a matrix $\boldsymbol{X}$ and $g(\boldsymbol{U})$ is a scalar function, the following formulas give the derivative of the composition. See Proof Collection, Chapter 14 for proofs.

Formula | Notes | Proof
$\displaystyle\frac{\partial g(\boldsymbol{U})}{\partial X_{ij}} = \displaystyle\sum_{k,l} \displaystyle\frac{\partial g}{\partial U_{kl}} \displaystyle\frac{\partial U_{kl}}{\partial X_{ij}}$ ($\boldsymbol{U} = f(\boldsymbol{X})$) | Matrix chain rule (component form) | 14.1
$\displaystyle\frac{\partial g(\boldsymbol{U})}{\partial X_{ij}} = \text{tr}\left[\left(\displaystyle\frac{\partial g}{\partial \boldsymbol{U}}\right)^\top \displaystyle\frac{\partial \boldsymbol{U}}{\partial X_{ij}}\right]$ ($\boldsymbol{U} = f(\boldsymbol{X})$) | Matrix chain rule (trace form) | 14.2
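A small demonstration of the trace form 14.2, for the illustrative choice $g(\boldsymbol{U}) = \|\boldsymbol{U}\|_F^2$ with $\boldsymbol{U} = \boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$ (chosen here as an example): the entrywise trace formula reproduces the closed-form gradient $2\boldsymbol{A}^\top\boldsymbol{U}\boldsymbol{B}^\top$.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 3))

U = A @ X @ B                       # U = f(X)
dg_dU = 2 * U                       # g(U) = ||U||_F^2

G = np.zeros_like(X)
for i in range(X.shape[0]):
    for j in range(X.shape[1]):
        J = np.zeros_like(X)
        J[i, j] = 1.0               # dU/dX_ij = A J^{ij} B
        G[i, j] = np.trace(dg_dU.T @ (A @ J @ B))   # trace form 14.2

print(np.allclose(G, 2 * A.T @ U @ B.T))  # closed form agrees: True
```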

15. Special Matrix Derivatives

Specific derivative formulas for symmetric, diagonal, and Toeplitz matrices. See Proof Collection, Chapter 15 for proofs.

Formula | Notes | Proof
$\displaystyle\frac{\partial \text{tr}(\boldsymbol{A}\boldsymbol{X})}{\partial \boldsymbol{X}} = \boldsymbol{A} + \boldsymbol{A}^\top - (\boldsymbol{A} \circ \boldsymbol{I})$ ($\boldsymbol{X}$: symmetric) | Symmetric matrix trace derivative | 15.1
$\displaystyle\frac{\partial |\boldsymbol{X}|}{\partial \boldsymbol{X}} = |\boldsymbol{X}|(2\boldsymbol{X}^{-1} - (\boldsymbol{X}^{-1} \circ \boldsymbol{I}))$ ($\boldsymbol{X}$: symmetric) | Symmetric matrix determinant derivative | 15.2
$\displaystyle\frac{\partial \log|\boldsymbol{X}|}{\partial \boldsymbol{X}} = 2\boldsymbol{X}^{-1} - (\boldsymbol{X}^{-1} \circ \boldsymbol{I})$ ($\boldsymbol{X}$: symmetric) | Symmetric matrix log-det derivative | 15.3
$\displaystyle\frac{\partial \text{tr}(\boldsymbol{A}\boldsymbol{X})}{\partial \boldsymbol{X}} = \boldsymbol{A} \circ \boldsymbol{I}$ ($\boldsymbol{X}$: diagonal) | Diagonal matrix trace derivative | 15.4
$\displaystyle\frac{\partial \text{tr}(\boldsymbol{A}\boldsymbol{T})}{\partial \boldsymbol{T}} = \boldsymbol{\alpha}(\boldsymbol{A})$ ($\boldsymbol{T}$: Toeplitz) | Toeplitz matrix trace derivative | 15.5
$\displaystyle\frac{\partial c(\boldsymbol{A})}{\partial \boldsymbol{A}} = \displaystyle\frac{1}{\lambda_{\min}}\boldsymbol{v}_{\max}\boldsymbol{v}_{\max}^\top - \displaystyle\frac{c(\boldsymbol{A})}{\lambda_{\min}}\boldsymbol{v}_{\min}\boldsymbol{v}_{\min}^\top$ ($\boldsymbol{A}$: symmetric positive definite) | Condition number derivative | 15.6

Here $\boldsymbol{A} \circ \boldsymbol{I}$ is the Hadamard product with the identity, which retains only the diagonal elements; $\boldsymbol{\alpha}(\boldsymbol{A})$ is the matrix whose entries are the sums along the corresponding diagonals of $\boldsymbol{A}^\top$; and $c(\boldsymbol{A}) = \lambda_{\max}/\lambda_{\min}$ is the condition number, with $\boldsymbol{v}_{\max}, \boldsymbol{v}_{\min}$ the unit eigenvectors of $\lambda_{\max}, \lambda_{\min}$.
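Formula 15.3 differs from the unconstrained result $\boldsymbol{X}^{-\top}$ because off-diagonal entries of a symmetric matrix move in pairs; a finite-difference sketch that respects the symmetry constraint:

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
X = B @ B.T + 4 * np.eye(4)                    # symmetric positive definite

Xi = np.linalg.inv(X)
G = 2 * Xi - np.diag(np.diag(Xi))              # formula 15.3

# perturb X_ij and X_ji together so the matrix stays symmetric
h, i, j = 1e-6, 0, 2
E = np.zeros_like(X)
E[i, j] = E[j, i] = h
fd = (np.linalg.slogdet(X + E)[1] - np.linalg.slogdet(X - E)[1]) / (2 * h)

print(np.isclose(G[i, j], fd))                 # True
```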

16. Complex Matrix Derivatives

Wirtinger derivatives for functions involving complex conjugates, and derivative formulas for complex traces. See Proof Collection, Chapter 16 for proofs.

Formula | Notes | Proof
$\displaystyle\frac{\partial f}{\partial z},\ \displaystyle\frac{\partial f}{\partial z^*} = \displaystyle\frac{1}{2}\left(\displaystyle\frac{\partial f}{\partial \Re z} \mp i\displaystyle\frac{\partial f}{\partial \Im z}\right)$ | Wirtinger derivatives | 16.1
$\nabla f(\boldsymbol{z}) = 2\displaystyle\frac{\partial f(\boldsymbol{z})}{\partial \boldsymbol{z}^*}$ ($f$: real-valued) | Complex gradient vector | 16.2
$\displaystyle\frac{\partial g}{\partial z} = \displaystyle\frac{\partial g}{\partial f}\displaystyle\frac{\partial f}{\partial z} + \displaystyle\frac{\partial g}{\partial f^*}\displaystyle\frac{\partial f^*}{\partial z}$ (composite function) | Complex chain rule | 16.3
$\displaystyle\frac{\partial \text{Tr}(\boldsymbol{X}^*)}{\partial \Re\boldsymbol{X}} = \boldsymbol{I}$ | Complex conjugate trace derivative | 16.4
$\displaystyle\frac{\partial \text{Tr}(\boldsymbol{A}\boldsymbol{X}^H)}{\partial \Re\boldsymbol{X}} = \boldsymbol{A}$ | Hermitian trace derivative | 16.6
$\displaystyle\frac{\partial \text{Tr}(\boldsymbol{X}\boldsymbol{X}^H)}{\partial \Re\boldsymbol{X}} = 2\Re\boldsymbol{X}$ | Frobenius norm derivative | 16.8
$\displaystyle\frac{\partial \text{Tr}(\boldsymbol{X}\boldsymbol{X}^H)}{\partial \boldsymbol{X}} = \boldsymbol{X}^*$ | Wirtinger derivative | 16.9
$\nabla\|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X}$ | Complex Frobenius norm gradient | 16.10
$\displaystyle\frac{\partial \det(\boldsymbol{X}^H\boldsymbol{A}\boldsymbol{X})}{\partial \boldsymbol{X}^*} = \det(\boldsymbol{X}^H\boldsymbol{A}\boldsymbol{X})\,\boldsymbol{A}\boldsymbol{X}(\boldsymbol{X}^H\boldsymbol{A}\boldsymbol{X})^{-1}$ | Complex determinant derivative | 16.11
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}}\displaystyle\frac{(\boldsymbol{A}\boldsymbol{x})^H(\boldsymbol{A}\boldsymbol{x})}{(\boldsymbol{B}\boldsymbol{x})^H(\boldsymbol{B}\boldsymbol{x})}$ (complex Rayleigh quotient) | Complex Rayleigh quotient derivative | 16.12
$\displaystyle\frac{\partial (a - \boldsymbol{x}^H \boldsymbol{b})^2}{\partial \boldsymbol{x}} = -2\bar{\boldsymbol{b}}(a - \boldsymbol{x}^H \boldsymbol{b})^*$ | Complex quadratic form derivative | 16.13
$\displaystyle\frac{\partial}{\partial \boldsymbol{z}^*}(\boldsymbol{w}^H\boldsymbol{z}) = \boldsymbol{0}$, $\quad\displaystyle\frac{\partial}{\partial \boldsymbol{z}^*}(\boldsymbol{z}^H\boldsymbol{w}) = \boldsymbol{w}$ | Inner product Wirtinger derivative | 16.57
$\displaystyle\frac{\partial}{\partial \boldsymbol{z}^*}(\boldsymbol{z}^H\boldsymbol{A}\boldsymbol{z}) = \boldsymbol{A}\boldsymbol{z}$ ($\boldsymbol{A}$: Hermitian) | Hermitian quadratic form Wirtinger derivative | 16.57

Here $\boldsymbol{X}^H = (\boldsymbol{X}^*)^\top$ is the Hermitian transpose, $\boldsymbol{X}^*$ is the element-wise complex conjugate, and $\bar{\boldsymbol{b}}$ is the complex conjugate of $\boldsymbol{b}$.
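A numerical sketch connecting 16.2 with the Hermitian quadratic-form rule: for real-valued $f(\boldsymbol{z}) = \boldsymbol{z}^H\boldsymbol{A}\boldsymbol{z}$ the real gradient equals $2\,\partial f/\partial \boldsymbol{z}^* = 2\boldsymbol{A}\boldsymbol{z}$ (the test data is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(8)
M = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
A = M + M.conj().T                             # Hermitian
z = rng.standard_normal(3) + 1j * rng.standard_normal(3)

wirtinger = A @ z                              # d/dz* (z^H A z)

# f is real-valued, so grad f = df/dRe(z) + i df/dIm(z) = 2 df/dz*
f = lambda z: (z.conj() @ A @ z).real
h = 1e-6
grad = np.array([
    (f(z + h * e) - f(z - h * e)) / (2 * h)
    + 1j * (f(z + 1j * h * e) - f(z - 1j * h * e)) / (2 * h)
    for e in np.eye(3)
])

print(np.allclose(grad, 2 * wirtinger))        # True
```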

Proofs

Detailed proofs for all formulas can be found in the Matrix Calculus Proof Collection.

Appendix A. Correspondence with Numerator Layout

A.1 Shape of the Gradient Vector

For the gradient of a scalar $f$ with respect to a vector $\boldsymbol{x} \in \mathbb{R}^n$:

  • Numerator layout: $\nabla f = \displaystyle\frac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^{1 \times n}$ (row vector)
  • Denominator layout (this document): $\nabla f = \displaystyle\frac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^{n \times 1}$ (column vector)

In optimization, a descent step $\boldsymbol{x} \leftarrow \boldsymbol{x} - \eta\nabla f$ can be written directly in the denominator layout, since the gradient has the same shape as $\boldsymbol{x}$; the numerator layout requires the transpose $(\nabla f)^T$.
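A minimal illustration with a quadratic $f(\boldsymbol{x}) = \boldsymbol{x}^\top\boldsymbol{A}\boldsymbol{x}$ (the matrix is arbitrary): in the denominator layout the gradient already has the shape of $\boldsymbol{x}$, so the update needs no transpose.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
x = np.array([1.0, -1.0])

grad = (A + A.T) @ x        # denominator-layout gradient: same shape as x
x = x - 0.1 * grad          # step along -grad, added directly
```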

A.2 Jacobian Matrix Definition

For the Jacobian of a vector-valued function $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$:

  • Numerator layout: $\boldsymbol{J} = \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \in \mathbb{R}^{m \times n}$ ($(i,j)$ entry is $\displaystyle\frac{\partial f_i}{\partial x_j}$)
  • Denominator layout (this document): $\boldsymbol{J} = \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \in \mathbb{R}^{n \times m}$ ($(i,j)$ entry is $\displaystyle\frac{\partial f_j}{\partial x_i}$)

The two are related by transposition: $\boldsymbol{J}_{\text{denom}} = \boldsymbol{J}_{\text{numer}}^T$

A.3 Chain Rule Form

For the derivative of a composite function $\boldsymbol{g}(\boldsymbol{f}(\boldsymbol{x}))$:

  • Numerator layout: $\displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{f}} \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}}$ (multiply from left)
  • Denominator layout (this document): $\displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{x}} = \displaystyle\frac{\partial \boldsymbol{f}}{\partial \boldsymbol{x}} \displaystyle\frac{\partial \boldsymbol{g}}{\partial \boldsymbol{f}}$ (multiply from right)

When implementing neural network backpropagation, verify which convention is being used and set the matrix product order correctly.
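A sketch of the ordering difference for two linear layers (shapes and names are illustrative): numerator-layout Jacobians multiply from the left, denominator-layout Jacobians from the right, and the two results are transposes of each other.

```python
import numpy as np

rng = np.random.default_rng(9)
W1 = rng.standard_normal((4, 3))   # f(x) = W1 x
W2 = rng.standard_normal((2, 4))   # g(f) = W2 f

# Numerator layout: J_f = W1 (4x3), J_g = W2 (2x4); multiply from the left
J_num = W2 @ W1                    # dg/dx, 2x3

# Denominator layout stores the transposed Jacobians; multiply from the right
J_den = W1.T @ W2.T                # dg/dx, 3x2

print(np.allclose(J_den, J_num.T))  # True
```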

A.4 Key Formula Correspondence Table

Formula | Denominator Layout (this document) | Numerator Layout
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{a}^T \boldsymbol{x})$ | $\boldsymbol{a}$ | $\boldsymbol{a}^T$
$\displaystyle\frac{\partial}{\partial \boldsymbol{x}}(\boldsymbol{x}^T \boldsymbol{A} \boldsymbol{x})$ | $(\boldsymbol{A} + \boldsymbol{A}^T)\boldsymbol{x}$ | $\boldsymbol{x}^T(\boldsymbol{A} + \boldsymbol{A}^T)$
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}}(\boldsymbol{a}^T \boldsymbol{X} \boldsymbol{b})$ | $\boldsymbol{a}\boldsymbol{b}^T$ | $\boldsymbol{b}\boldsymbol{a}^T$
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}}\mathrm{tr}(\boldsymbol{A}\boldsymbol{X})$ | $\boldsymbol{A}^T$ | $\boldsymbol{A}$
$\displaystyle\frac{\partial}{\partial \boldsymbol{X}}\log|\boldsymbol{X}|$ | $(\boldsymbol{X}^{-1})^T = \boldsymbol{X}^{-T}$ | $\boldsymbol{X}^{-1}$

When cross-referencing with other literature, first check whether the gradient vector is a row or column vector, then apply the formulas accordingly.

Applied Formulas

Applications of this formula sheet to various fields are summarized below. See the Proof Collection for detailed proofs.

Machine Learning & Information Science

  • Machine Learning Applications: neural networks, deep learning, reinforcement learning, NLP
  • Financial Engineering Applications: portfolio optimization, Sharpe ratio, bordered Hessian

Natural Sciences & Engineering

  • Statistics Applications: mixture models, BLUP/REML, kriging, factor analysis, SEM, IRT
  • Engineering Applications: control theory, robotics, FEM, mechanics of materials, and 8 other fields
  • Astronomy Applications: orbital mechanics, two-body problem, perturbation theory, aberration, redshift
  • Geophysics Applications: seismic tomography, travel-time partial derivatives, sensitivity kernels
  • Biology Applications: Lotka-Volterra, SIR model, Wright-Fisher, phylogenetics
  • Pose & Rotation Applications: SO(3), quaternions, Euler angles, inertia tensor
  • Molecular Dynamics Applications: Lennard-Jones, harmonic oscillator, Coulomb, bond angles

References & Related Articles

Key References

  • Magnus, J. R. & Neudecker, H. (2019). Matrix Differential Calculus with Applications in Statistics and Econometrics, 3rd ed. Wiley. — Standard textbook on matrix calculus. Uses denominator layout.
  • Petersen, K. B. & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark. — Widely referenced formula collection.
  • Absil, P.-A., Mahony, R. & Sepulchre, R. (2008). Optimization Algorithms on Matrix Manifolds. Princeton University Press. — Optimization on manifolds and matrix calculus.
  • Edelman, A., Arias, T. A. & Smith, S. T. (1998). The Geometry of Algorithms with Orthogonality Constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353. — Geometric foundations for optimization with orthogonality constraints.
  • Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. (2018). Automatic Differentiation in Machine Learning: A Survey. J. Mach. Learn. Res. 18(153), 1–43. — Comprehensive survey on automatic differentiation.
  • Matrix calculus - Wikipedia

Notes

  1. Formal definition of the Fréchet derivative: a mapping $f: \mathbb{R}^n \to \mathbb{R}^m$ is Fréchet differentiable at $\boldsymbol{x}$ if there exists a linear mapping $Df(\boldsymbol{x}): \mathbb{R}^n \to \mathbb{R}^m$ such that $\displaystyle\lim_{\|\boldsymbol{h}\| \to 0} \frac{\|f(\boldsymbol{x}+\boldsymbol{h}) - f(\boldsymbol{x}) - Df(\boldsymbol{x})[\boldsymbol{h}]\|}{\|\boldsymbol{h}\|} = 0$.