Proofs Chapter 4: Basic Formulas of Matrix Calculus

This chapter proves the basic formulas of matrix calculus. The differentiation rules for matrix products, transposes, and bilinear forms serve as the foundation for all subsequent chapters. These formulas are routinely used in deep learning backpropagation, Kalman filter derivation, and Riccati equation analysis in control theory. Each proof starts from the component representation of matrices and summarizes the result in matrix form.

Prerequisites: Chapter 2, Chapter 3. Chapters that use results from this chapter: Chapter 5 (Trace), Chapter 7 (Determinant), Chapter 8 (Inverse).

4. Basic Formulas of Matrix Calculus

Assumptions for this chapter
Unless otherwise stated, all formulas in this chapter hold under the following conditions:
  • All formulas use the denominator layout convention (a short example follows this list)
  • The derivative of a scalar $f$ with respect to a matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ yields $\frac{\partial f}{\partial \boldsymbol{X}} \in \mathbb{R}^{M \times N}$ (same size)
  • All functions are differentiable on an open set
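
For concreteness, here is what the layout convention means in the simplest case: for a constant $\boldsymbol{a} \in \mathbb{R}^M$ and a variable $\boldsymbol{x} \in \mathbb{R}^M$, the scalar $f(\boldsymbol{x}) = \boldsymbol{a}^\top \boldsymbol{x}$ has derivative $\frac{\partial f}{\partial \boldsymbol{x}} = \boldsymbol{a} \in \mathbb{R}^M$ under denominator layout, a column vector of the same shape as $\boldsymbol{x}$; numerator layout would instead give the row vector $\boldsymbol{a}^\top$.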

4.1 Bilinear Form

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \boldsymbol{a} \boldsymbol{b}^\top$
Conditions: $\boldsymbol{a} \in \mathbb{R}^M$ is a constant $M$-dimensional vector, $\boldsymbol{b} \in \mathbb{R}^N$ is a constant $N$-dimensional vector, $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ is an $M \times N$ matrix variable, $\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b} \in \mathbb{R}$ is a scalar
Proof

We verify the structure of the matrix-vector product. Since $\boldsymbol{a}^\top$ is a $1 \times M$ row vector, $\boldsymbol{X}$ is an $M \times N$ matrix, and $\boldsymbol{b}$ is an $N \times 1$ column vector:

\begin{equation} \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b} : (1 \times M) \cdot (M \times N) \cdot (N \times 1) = 1 \times 1 \label{eq:4-1-1} \end{equation}

Thus the result is a scalar.

First, computing $\boldsymbol{a}^\top \boldsymbol{X}$ gives a $1 \times N$ row vector:

\begin{equation} (\boldsymbol{a}^\top \boldsymbol{X})_j = \sum_{i=0}^{M-1} a_i X_{ij} \label{eq:4-1-2} \end{equation}

Next, computing $(\boldsymbol{a}^\top \boldsymbol{X}) \boldsymbol{b}$ gives a scalar (inner product of a row vector and a column vector):

\begin{equation} (\boldsymbol{a}^\top \boldsymbol{X}) \boldsymbol{b} = \sum_{j=0}^{N-1} (\boldsymbol{a}^\top \boldsymbol{X})_j \cdot b_j \label{eq:4-1-3} \end{equation}

Substituting \eqref{eq:4-1-2} into \eqref{eq:4-1-3}:

\begin{equation} \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b} = \sum_{j=0}^{N-1} \left( \sum_{i=0}^{M-1} a_i X_{ij} \right) b_j \label{eq:4-1-4} \end{equation}

Since the sums are finite, we can swap the order of summation:

\begin{equation} \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b} = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} a_i X_{ij} b_j \label{eq:4-1-5} \end{equation}

We take the partial derivative of this scalar with respect to the $(m, n)$ entry $X_{mn}$. Since $a_i$ and $b_j$ are constants:

\begin{equation} \frac{\partial}{\partial X_{mn}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} a_i \frac{\partial X_{ij}}{\partial X_{mn}} b_j \label{eq:4-1-6} \end{equation}

$X_{ij}$ and $X_{mn}$ are the same variable only when $(i, j) = (m, n)$; otherwise they are independent. Using the Kronecker delta:

\begin{equation} \frac{\partial X_{ij}}{\partial X_{mn}} = \delta_{im} \delta_{jn} \label{eq:4-1-7} \end{equation}

Substituting \eqref{eq:4-1-7} into \eqref{eq:4-1-6}:

\begin{equation} \frac{\partial}{\partial X_{mn}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} a_i \delta_{im} \delta_{jn} b_j \label{eq:4-1-8} \end{equation}

Since $\delta_{im} = 1$ only when $i = m$, we get $\sum_{i=0}^{M-1} a_i \delta_{im} = a_m$. Similarly, $\sum_{j=0}^{N-1} \delta_{jn} b_j = b_n$. Therefore:

\begin{equation} \frac{\partial}{\partial X_{mn}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = a_m b_n \label{eq:4-1-9} \end{equation}

The $M \times N$ matrix whose $(m, n)$ entry is $a_m b_n$ is the outer product $\boldsymbol{a} \boldsymbol{b}^\top$:

\begin{equation} (\boldsymbol{a} \boldsymbol{b}^\top)_{mn} = a_m b_n \label{eq:4-1-10} \end{equation}

Thus we obtain the final result in matrix form:

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \boldsymbol{a} \boldsymbol{b}^\top \label{eq:4-1-11} \end{equation}

Remark: This formula is fundamental to differentiating bilinear forms. The expression $\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}$ is linear in each of $\boldsymbol{a}$, $\boldsymbol{b}$, and $\boldsymbol{X}$. The result $\boldsymbol{a} \boldsymbol{b}^\top$ is a rank-1 matrix: every column is a multiple of $\boldsymbol{a}$ and every row is a multiple of $\boldsymbol{b}^\top$, so the gradient varies along a single direction in matrix space.
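
As an informal sanity check (not part of the proof), the formula can be verified numerically with a central difference. The NumPy sketch below is illustrative; the sizes, seed, and names are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4
a = rng.standard_normal(M)
b = rng.standard_normal(N)
X = rng.standard_normal((M, N))

# Analytic gradient from Formula 4.1: d/dX (a^T X b) = a b^T
grad_analytic = np.outer(a, b)

# Central-difference gradient, one entry at a time
f = lambda Y: a @ Y @ b
eps = 1e-6
grad_fd = np.zeros_like(X)
for m in range(M):
    for n in range(N):
        E = np.zeros_like(X)
        E[m, n] = eps  # perturb only X_{mn}, as in \eqref{eq:4-1-6}
        grad_fd[m, n] = (f(X + E) - f(X - E)) / (2 * eps)

assert np.allclose(grad_analytic, grad_fd)
```

Because $\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}$ is linear in $\boldsymbol{X}$, the central difference agrees with the analytic gradient up to floating-point rounding.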

4.2 Bilinear Form with Transpose

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b}) = \boldsymbol{b} \boldsymbol{a}^\top$
Conditions: $\boldsymbol{a} \in \mathbb{R}^N$ is a constant $N$-dimensional vector, $\boldsymbol{b} \in \mathbb{R}^M$ is a constant $M$-dimensional vector, $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ is an $M \times N$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{N \times M}$
Proof

We verify the structure of the matrix-vector product. Since $\boldsymbol{a}^\top$ is $1 \times N$, $\boldsymbol{X}^\top$ is $N \times M$, and $\boldsymbol{b}$ is $M \times 1$:

\begin{equation} \boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} : (1 \times N) \cdot (N \times M) \cdot (M \times 1) = 1 \times 1 \label{eq:4-2-1} \end{equation}

Thus the result is a scalar.

The transpose of a scalar equals itself:

\begin{equation} \boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} = (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b})^\top \label{eq:4-2-2} \end{equation}

Applying the transpose-of-product rule $(\boldsymbol{ABC})^\top = \boldsymbol{C}^\top \boldsymbol{B}^\top \boldsymbol{A}^\top$:

\begin{equation} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b})^\top = \boldsymbol{b}^\top (\boldsymbol{X}^\top)^\top (\boldsymbol{a}^\top)^\top \label{eq:4-2-3} \end{equation}

Using $(\boldsymbol{X}^\top)^\top = \boldsymbol{X}$ and $(\boldsymbol{a}^\top)^\top = \boldsymbol{a}$:

\begin{equation} \boldsymbol{b}^\top (\boldsymbol{X}^\top)^\top (\boldsymbol{a}^\top)^\top = \boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a} \label{eq:4-2-4} \end{equation}

Combining \eqref{eq:4-2-2}–\eqref{eq:4-2-4}:

\begin{equation} \boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} = \boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a} \label{eq:4-2-5} \end{equation}

By Formula 4.1, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{c}^\top \boldsymbol{X} \boldsymbol{d}) = \boldsymbol{c} \boldsymbol{d}^\top$. Matching the right-hand side of \eqref{eq:4-2-5} with $\boldsymbol{c} = \boldsymbol{b}$, $\boldsymbol{d} = \boldsymbol{a}$:

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a}) = \boldsymbol{b} \boldsymbol{a}^\top \label{eq:4-2-6} \end{equation}

Since $\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} = \boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a}$ by \eqref{eq:4-2-5}, we obtain the final result:

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b}) = \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a}) = \boldsymbol{b} \boldsymbol{a}^\top \label{eq:4-2-7} \end{equation}

Remark: This formula is a variant of Formula 4.1. Replacing $\boldsymbol{X}$ with $\boldsymbol{X}^\top$ swaps the roles of $\boldsymbol{a}$ and $\boldsymbol{b}$, changing the result from $\boldsymbol{a} \boldsymbol{b}^\top$ to $\boldsymbol{b} \boldsymbol{a}^\top$. This reflects the transpose operation's effect of interchanging left and right.
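
The same kind of numerical spot-check works here. The sketch below (NumPy, with illustrative sizes) verifies both the scalar identity \eqref{eq:4-2-5} and one entry of the gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4
a = rng.standard_normal(N)   # a is N-dimensional here
b = rng.standard_normal(M)   # b is M-dimensional
X = rng.standard_normal((M, N))

# Scalar identity (4-2-5): a^T X^T b == b^T X a
assert np.isclose(a @ X.T @ b, b @ X @ a)

# Gradient from Formula 4.2: d/dX (a^T X^T b) = b a^T (shape M x N, same as X)
grad = np.outer(b, a)
eps = 1e-6
E = np.zeros_like(X)
E[1, 2] = eps                # spot-check the (1, 2) entry
fd = (a @ (X + E).T @ b - a @ (X - E).T @ b) / (2 * eps)
assert np.isclose(grad[1, 2], fd)
```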

4.3 Single-Entry Matrix and Component Derivative

Formula: $\displaystyle\frac{\partial \boldsymbol{X}}{\partial X_{ij}} = \boldsymbol{J}^{ij}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ is an $M \times N$ matrix variable, $X_{ij}$ is the $(i,j)$ entry of $\boldsymbol{X}$, $\boldsymbol{J}^{ij} \in \mathbb{R}^{M \times N}$ is the matrix with 1 at position $(i,j)$ and 0 elsewhere
Proof

We recall the definition of the single-entry matrix $\boldsymbol{J}^{ij}$. It is the $M \times N$ matrix with 1 at position $(i,j)$ and 0 at all other positions:

\begin{equation} (\boldsymbol{J}^{ij})_{kl} = \begin{cases} 1 & \text{if } k = i \text{ and } l = j \\ 0 & \text{otherwise} \end{cases} \label{eq:4-3-1} \end{equation}

Expressed using the Kronecker delta:

\begin{equation} (\boldsymbol{J}^{ij})_{kl} = \delta_{ki} \delta_{lj} \label{eq:4-3-2} \end{equation}

Consider differentiating the matrix $\boldsymbol{X}$ with respect to its entry $X_{ij}$. The partial derivative of a matrix means differentiating each entry and assembling the results as a matrix:

\begin{equation} \left( \frac{\partial \boldsymbol{X}}{\partial X_{ij}} \right)_{kl} = \frac{\partial X_{kl}}{\partial X_{ij}} \label{eq:4-3-3} \end{equation}

Since each entry of the matrix is an independent variable, $\frac{\partial X_{kl}}{\partial X_{ij}}$ equals 1 when $(k, l) = (i, j)$ and 0 otherwise. In terms of the Kronecker delta:

\begin{equation} \frac{\partial X_{kl}}{\partial X_{ij}} = \delta_{ki} \delta_{lj} \label{eq:4-3-4} \end{equation}

Comparing \eqref{eq:4-3-2} and \eqref{eq:4-3-4}:

\begin{equation} \frac{\partial X_{kl}}{\partial X_{ij}} = \delta_{ki} \delta_{lj} = (\boldsymbol{J}^{ij})_{kl} \label{eq:4-3-5} \end{equation}

Since \eqref{eq:4-3-5} holds for all $(k, l)$, the matrix equality follows:

\begin{equation} \frac{\partial \boldsymbol{X}}{\partial X_{ij}} = \boldsymbol{J}^{ij} \label{eq:4-3-6} \end{equation}

Remark: $\boldsymbol{J}^{ij}$ is called a single-entry matrix and forms a basis for the space of matrices. Any matrix $\boldsymbol{X}$ can be written as $\boldsymbol{X} = \sum_{i,j} X_{ij} \boldsymbol{J}^{ij}$. This formula is fundamental for expressing matrix product derivatives in component form.
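
As a brief illustration of the remark, the following NumPy sketch (the helper `J` is hypothetical, not from the text) builds single-entry matrices and confirms the basis decomposition $\boldsymbol{X} = \sum_{i,j} X_{ij} \boldsymbol{J}^{ij}$:

```python
import numpy as np

M, N = 2, 3

def J(i, j, shape=(M, N)):
    """Single-entry matrix J^{ij}: 1 at (i, j), 0 elsewhere (eq. 4-3-1)."""
    E = np.zeros(shape)
    E[i, j] = 1.0
    return E

# Basis decomposition from the remark: X = sum_{i,j} X_ij * J^{ij}
X = np.arange(M * N, dtype=float).reshape(M, N)
recon = sum(X[i, j] * J(i, j) for i in range(M) for j in range(N))
assert np.array_equal(X, recon)
```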

4.4 Component Derivative of Matrix Product

Formula: $\displaystyle\frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{im} A_{nj} = (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{M \times K}$ is an $M \times K$ matrix variable, $\boldsymbol{A} \in \mathbb{R}^{K \times N}$ is a $K \times N$ constant matrix, $\boldsymbol{X}\boldsymbol{A} \in \mathbb{R}^{M \times N}$, $\boldsymbol{J}^{mn} \in \mathbb{R}^{M \times K}$ is the matrix with 1 at position $(m,n)$
Proof

Writing the $(i, j)$ entry of the matrix product by definition:

\begin{equation} (\boldsymbol{X}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} X_{ik} A_{kj} \label{eq:4-4-1} \end{equation}

We differentiate this sum with respect to $X_{mn}$. Since $A_{kj}$ is a constant:

\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \frac{\partial X_{ik}}{\partial X_{mn}} A_{kj} \label{eq:4-4-2} \end{equation}

$X_{ik}$ and $X_{mn}$ are the same variable only when $(i, k) = (m, n)$. Using the Kronecker delta:

\begin{equation} \frac{\partial X_{ik}}{\partial X_{mn}} = \delta_{im} \delta_{kn} \label{eq:4-4-3} \end{equation}

Substituting \eqref{eq:4-4-3} into \eqref{eq:4-4-2}:

\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \delta_{im} \delta_{kn} A_{kj} \label{eq:4-4-4} \end{equation}

Since $\delta_{im}$ does not depend on $k$, it factors out, and $\sum_{k=0}^{K-1} \delta_{kn} A_{kj} = A_{nj}$:

\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{im} A_{nj} \label{eq:4-4-5} \end{equation}

We show this equals $(\boldsymbol{J}^{mn}\boldsymbol{A})_{ij}$. The $(i, j)$ entry of $\boldsymbol{J}^{mn}\boldsymbol{A}$ is:

\begin{equation} (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} (\boldsymbol{J}^{mn})_{ik} A_{kj} \label{eq:4-4-6} \end{equation}

By definition of $\boldsymbol{J}^{mn}$, substituting $(\boldsymbol{J}^{mn})_{ik} = \delta_{im} \delta_{kn}$:

\begin{equation} (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} \delta_{im} \delta_{kn} A_{kj} = \delta_{im} A_{nj} \label{eq:4-4-7} \end{equation}

Comparing \eqref{eq:4-4-5} and \eqref{eq:4-4-7}, we obtain the final result:

\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{im} A_{nj} = (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij} \label{eq:4-4-8} \end{equation}

Remark: This formula shows how the $(i, j)$ entry of the result changes when differentiating with respect to a specific entry $X_{mn}$. The factor $\delta_{im}$ means the result is 0 when $i \neq m$. This reflects the fact that changing $X_{mn}$ only affects row $m$ of the product $\boldsymbol{X}\boldsymbol{A}$.

Figure 1: A change in $X_{mn}$ affects only row $m$ of $\boldsymbol{XA}$ (effect of $\delta_{im}$)
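
The row-locality shown in Figure 1 is easy to confirm numerically. In the sketch below (NumPy; sizes and indices are illustrative), perturbing a single entry $X_{mn}$ changes only row $m$ of $\boldsymbol{X}\boldsymbol{A}$, exactly by $\boldsymbol{J}^{mn}\boldsymbol{A}$; since $\boldsymbol{X}\boldsymbol{A}$ is linear in $\boldsymbol{X}$, the finite difference is exact up to rounding:

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 3, 4, 5
X = rng.standard_normal((M, K))
A = rng.standard_normal((K, N))
m, n = 1, 2

# Perturb X_{mn} only and difference the products
eps = 1e-6
Xp = X.copy()
Xp[m, n] += eps
diff = (Xp @ A - X @ A) / eps

# Formula 4.4: the derivative is J^{mn} A, i.e. only row m changes, by row n of A
Jmn = np.zeros((M, K))
Jmn[m, n] = 1.0
assert np.allclose(diff, Jmn @ A)
assert np.allclose(diff[m], A[n])
```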

4.5 Component Derivative of Transposed Matrix Product

Formula: $\displaystyle\frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{in} A_{mj} = (\boldsymbol{J}^{nm}\boldsymbol{A})_{ij}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{K \times M}$ is a $K \times M$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{M \times K}$, $\boldsymbol{A} \in \mathbb{R}^{K \times N}$ is a $K \times N$ constant matrix, $\boldsymbol{X}^\top\boldsymbol{A} \in \mathbb{R}^{M \times N}$, $\boldsymbol{J}^{nm} \in \mathbb{R}^{M \times K}$ is the matrix with 1 at position $(n,m)$
Proof

We recall the entries of the transposed matrix. The $(i, k)$ entry of $\boldsymbol{X}^\top$ equals the $(k, i)$ entry of $\boldsymbol{X}$:

\begin{equation} (\boldsymbol{X}^\top)_{ik} = X_{ki} \label{eq:4-5-1} \end{equation}

Writing the $(i, j)$ entry of $\boldsymbol{X}^\top\boldsymbol{A}$ by definition:

\begin{equation} (\boldsymbol{X}^\top\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} (\boldsymbol{X}^\top)_{ik} A_{kj} = \sum_{k=0}^{K-1} X_{ki} A_{kj} \label{eq:4-5-2} \end{equation}

We differentiate this sum with respect to $X_{mn}$. Since $A_{kj}$ is a constant:

\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \frac{\partial X_{ki}}{\partial X_{mn}} A_{kj} \label{eq:4-5-3} \end{equation}

$X_{ki}$ and $X_{mn}$ are the same variable only when $(k, i) = (m, n)$. Using the Kronecker delta:

\begin{equation} \frac{\partial X_{ki}}{\partial X_{mn}} = \delta_{km} \delta_{in} \label{eq:4-5-4} \end{equation}

Substituting \eqref{eq:4-5-4} into \eqref{eq:4-5-3}:

\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \delta_{km} \delta_{in} A_{kj} \label{eq:4-5-5} \end{equation}

Since $\delta_{in}$ does not depend on $k$, it factors out, and $\sum_{k=0}^{K-1} \delta_{km} A_{kj} = A_{mj}$:

\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{in} A_{mj} \label{eq:4-5-6} \end{equation}

We show this equals $(\boldsymbol{J}^{nm}\boldsymbol{A})_{ij}$. Since $\boldsymbol{J}^{nm} \in \mathbb{R}^{M \times K}$ has 1 only at position $(n, m)$:

\begin{equation} (\boldsymbol{J}^{nm})_{ik} = \delta_{in} \delta_{km} \label{eq:4-5-7} \end{equation}

Computing the $(i, j)$ entry of $\boldsymbol{J}^{nm}\boldsymbol{A}$:

\begin{equation} (\boldsymbol{J}^{nm}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} \delta_{in} \delta_{km} A_{kj} = \delta_{in} A_{mj} \label{eq:4-5-8} \end{equation}

Comparing \eqref{eq:4-5-6} and \eqref{eq:4-5-8}, we obtain the final result:

\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{in} A_{mj} = (\boldsymbol{J}^{nm}\boldsymbol{A})_{ij} \label{eq:4-5-9} \end{equation}

Remark: Compared to Formula 4.4, the transpose changes the index correspondence. In 4.4 the factor is $\delta_{im}$, while in 4.5 it becomes $\delta_{in}$. Also, $\boldsymbol{J}^{mn}$ becomes $\boldsymbol{J}^{nm}$. This reflects the transpose operation swapping the roles of rows and columns. A change in $X_{mn}$ affects position $(n, m)$ of $\boldsymbol{X}^\top$, which in turn affects only row $n$ of $\boldsymbol{X}^\top\boldsymbol{A}$.

Figure 2: A change in $X_{mn}$ affects position $(n,m)$ in $\boldsymbol{X}^\top$, thus only row $n$ of $\boldsymbol{X}^\top\boldsymbol{A}$ (effect of $\delta_{in}$)
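
Analogously, a numerical check of Formula 4.5 (NumPy sketch with illustrative sizes) confirms that perturbing $X_{mn}$ changes only row $n$ of $\boldsymbol{X}^\top\boldsymbol{A}$, by row $m$ of $\boldsymbol{A}$:

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N = 4, 3, 5
X = rng.standard_normal((K, M))   # X is K x M here, so X^T A is M x N
A = rng.standard_normal((K, N))
m, n = 2, 1                       # perturb X_{mn}: row index m < K, column index n < M

eps = 1e-6
Xp = X.copy()
Xp[m, n] += eps
diff = (Xp.T @ A - X.T @ A) / eps

# Formula 4.5: the derivative is J^{nm} A, i.e. only row n changes, by row m of A
Jnm = np.zeros((M, K))
Jnm[n, m] = 1.0
assert np.allclose(diff, Jnm @ A)
assert np.allclose(diff[n], A[m])
```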
