Proofs Chapter 4: Basic Formulas of Matrix Calculus
This chapter proves the basic formulas of matrix calculus. The differentiation rules for matrix products, transposes, and bilinear forms serve as the foundation for all subsequent chapters. These formulas are routinely used in deep learning backpropagation, Kalman filter derivation, and Riccati equation analysis in control theory. Each proof starts from the component representation of matrices and summarizes the result in matrix form.
Prerequisites: Chapter 2, Chapter 3. Chapters that use results from this chapter: Chapter 5 (Trace), Chapter 7 (Determinant), Chapter 8 (Inverse).
4. Basic Formulas of Matrix Calculus
Unless otherwise stated, all formulas in this chapter hold under the following conditions:
- All formulas use the denominator layout convention
- The derivative of a scalar $f$ with respect to a matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ yields $\frac{\partial f}{\partial \boldsymbol{X}} \in \mathbb{R}^{M \times N}$ (the same shape as $\boldsymbol{X}$; a brief shape check in code follows this list)
- All functions are differentiable on an open set
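To make the layout convention concrete, here is a minimal NumPy sketch (our own illustration, not part of the text; the helper name `num_grad` and the test function are arbitrary) that computes a central-difference gradient and confirms it has the same shape as $\boldsymbol{X}$:

```python
import numpy as np

# Minimal finite-difference helper (illustrative; not from the text).
def num_grad(f, X, eps=1e-6):
    """Central-difference gradient of a scalar function f at X.

    Under the denominator layout, the gradient has the same shape as X.
    """
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = eps                      # perturb one entry at a time
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 4))           # M = 3, N = 4
f = lambda A: float(np.sum(A * A))        # arbitrary scalar function of X
print(num_grad(f, X).shape)               # (3, 4), same shape as X
```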
4.1 Bilinear Form
Formula 4.1: for constant vectors $\boldsymbol{a} \in \mathbb{R}^{M}$ and $\boldsymbol{b} \in \mathbb{R}^{N}$, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \boldsymbol{a} \boldsymbol{b}^\top$.
Proof
We first check the dimensions of the product. Since $\boldsymbol{a}^\top$ is a $1 \times M$ row vector, $\boldsymbol{X}$ is an $M \times N$ matrix, and $\boldsymbol{b}$ is an $N \times 1$ column vector:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b} : (1 \times M) \cdot (M \times N) \cdot (N \times 1) = 1 \times 1 \label{eq:4-1-1} \end{equation}
Thus the result is a scalar.
First, computing $\boldsymbol{a}^\top \boldsymbol{X}$ gives a $1 \times N$ row vector:
\begin{equation} (\boldsymbol{a}^\top \boldsymbol{X})_j = \sum_{i=0}^{M-1} a_i X_{ij} \label{eq:4-1-2} \end{equation}
Next, computing $(\boldsymbol{a}^\top \boldsymbol{X}) \boldsymbol{b}$ gives a scalar (inner product of a row vector and a column vector):
\begin{equation} (\boldsymbol{a}^\top \boldsymbol{X}) \boldsymbol{b} = \sum_{j=0}^{N-1} (\boldsymbol{a}^\top \boldsymbol{X})_j \cdot b_j \label{eq:4-1-3} \end{equation}
Substituting \eqref{eq:4-1-2} into \eqref{eq:4-1-3}:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b} = \sum_{j=0}^{N-1} \left( \sum_{i=0}^{M-1} a_i X_{ij} \right) b_j \label{eq:4-1-4} \end{equation}
Since the sums are finite, we can swap the order of summation:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b} = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} a_i X_{ij} b_j \label{eq:4-1-5} \end{equation}
We take the partial derivative of this scalar with respect to the $(m, n)$ entry $X_{mn}$. Since $a_i$ and $b_j$ are constants:
\begin{equation} \frac{\partial}{\partial X_{mn}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} a_i \frac{\partial X_{ij}}{\partial X_{mn}} b_j \label{eq:4-1-6} \end{equation}
$X_{ij}$ and $X_{mn}$ are the same variable only when $(i, j) = (m, n)$; otherwise they are independent. Using the Kronecker delta:
\begin{equation} \frac{\partial X_{ij}}{\partial X_{mn}} = \delta_{im} \delta_{jn} \label{eq:4-1-7} \end{equation}
Substituting \eqref{eq:4-1-7} into \eqref{eq:4-1-6}:
\begin{equation} \frac{\partial}{\partial X_{mn}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} a_i \delta_{im} \delta_{jn} b_j \label{eq:4-1-8} \end{equation}
Since $\delta_{im} = 1$ only when $i = m$, we get $\sum_{i=0}^{M-1} a_i \delta_{im} = a_m$. Similarly, $\sum_{j=0}^{N-1} \delta_{jn} b_j = b_n$. Therefore:
\begin{equation} \frac{\partial}{\partial X_{mn}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = a_m b_n \label{eq:4-1-9} \end{equation}
The $M \times N$ matrix whose $(m, n)$ entry is $a_m b_n$ is the outer product $\boldsymbol{a} \boldsymbol{b}^\top$:
\begin{equation} (\boldsymbol{a} \boldsymbol{b}^\top)_{mn} = a_m b_n \label{eq:4-1-10} \end{equation}
Thus we obtain the final result in matrix form:
\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}) = \boldsymbol{a} \boldsymbol{b}^\top \label{eq:4-1-11} \end{equation}
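As a numerical sanity check of Formula 4.1, the following NumPy sketch (an illustration with arbitrary sizes and seed, not part of the proof) compares a central-difference gradient of $\boldsymbol{a}^\top \boldsymbol{X} \boldsymbol{b}$ with the outer product $\boldsymbol{a} \boldsymbol{b}^\top$:

```python
import numpy as np

rng = np.random.default_rng(0)
M, N = 3, 4
a = rng.standard_normal(M)                # constant vector a
b = rng.standard_normal(N)                # constant vector b
X = rng.standard_normal((M, N))

f = lambda X_: float(a @ X_ @ b)          # scalar a^T X b

# Central-difference gradient, one entry at a time
eps = 1e-6
G = np.zeros_like(X)
for idx in np.ndindex(M, N):
    E = np.zeros_like(X)
    E[idx] = eps
    G[idx] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(G, np.outer(a, b)))     # True: gradient equals a b^T
```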
4.2 Bilinear Form with Transpose
Formula 4.2: for constant vectors $\boldsymbol{a} \in \mathbb{R}^{N}$ and $\boldsymbol{b} \in \mathbb{R}^{M}$, and $\boldsymbol{X} \in \mathbb{R}^{M \times N}$, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b}) = \boldsymbol{b} \boldsymbol{a}^\top$.
Proof
We first check the dimensions of the product. Since $\boldsymbol{a}^\top$ is $1 \times N$, $\boldsymbol{X}^\top$ is $N \times M$, and $\boldsymbol{b}$ is $M \times 1$:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} : (1 \times N) \cdot (N \times M) \cdot (M \times 1) = 1 \times 1 \label{eq:4-2-1} \end{equation}
Thus the result is a scalar.
The transpose of a scalar equals itself:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} = (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b})^\top \label{eq:4-2-2} \end{equation}
Applying the transpose-of-product rule $(\boldsymbol{ABC})^\top = \boldsymbol{C}^\top \boldsymbol{B}^\top \boldsymbol{A}^\top$:
\begin{equation} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b})^\top = \boldsymbol{b}^\top (\boldsymbol{X}^\top)^\top (\boldsymbol{a}^\top)^\top \label{eq:4-2-3} \end{equation}
Using $(\boldsymbol{X}^\top)^\top = \boldsymbol{X}$ and $(\boldsymbol{a}^\top)^\top = \boldsymbol{a}$:
\begin{equation} \boldsymbol{b}^\top (\boldsymbol{X}^\top)^\top (\boldsymbol{a}^\top)^\top = \boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a} \label{eq:4-2-4} \end{equation}
Combining \eqref{eq:4-2-2}–\eqref{eq:4-2-4}:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} = \boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a} \label{eq:4-2-5} \end{equation}
By Formula 4.1, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{c}^\top \boldsymbol{X} \boldsymbol{d}) = \boldsymbol{c} \boldsymbol{d}^\top$. Matching the right-hand side of \eqref{eq:4-2-5} with $\boldsymbol{c} = \boldsymbol{b}$, $\boldsymbol{d} = \boldsymbol{a}$:
\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a}) = \boldsymbol{b} \boldsymbol{a}^\top \label{eq:4-2-6} \end{equation}
Since $\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b} = \boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a}$ by \eqref{eq:4-2-5}, we obtain the final result:
\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{a}^\top \boldsymbol{X}^\top \boldsymbol{b}) = \frac{\partial}{\partial \boldsymbol{X}} (\boldsymbol{b}^\top \boldsymbol{X} \boldsymbol{a}) = \boldsymbol{b} \boldsymbol{a}^\top \label{eq:4-2-7} \end{equation}
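The same kind of numerical check works here. This sketch (illustration only; sizes and seed are arbitrary) verifies both the gradient $\boldsymbol{b} \boldsymbol{a}^\top$ and the scalar identity \eqref{eq:4-2-5}:

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 3, 4
a = rng.standard_normal(N)                # note: a now has length N
b = rng.standard_normal(M)                # and b has length M
X = rng.standard_normal((M, N))

f = lambda X_: float(a @ X_.T @ b)        # scalar a^T X^T b

eps = 1e-6
G = np.zeros_like(X)
for idx in np.ndindex(M, N):
    E = np.zeros_like(X)
    E[idx] = eps
    G[idx] = (f(X + E) - f(X - E)) / (2 * eps)

print(np.allclose(G, np.outer(b, a)))     # True: gradient equals b a^T
print(np.isclose(a @ X.T @ b, b @ X @ a)) # True: identity (4-2-5)
```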
4.3 Single-Entry Matrix and Component Derivative
Formula 4.3: $\displaystyle\frac{\partial \boldsymbol{X}}{\partial X_{ij}} = \boldsymbol{J}^{ij}$, where $\boldsymbol{J}^{ij} \in \mathbb{R}^{M \times N}$ is the single-entry matrix.
Proof
We recall the definition of the single-entry matrix $\boldsymbol{J}^{ij}$. It is the $M \times N$ matrix with 1 at position $(i,j)$ and 0 at all other positions:
\begin{equation} (\boldsymbol{J}^{ij})_{kl} = \begin{cases} 1 & \text{if } k = i \text{ and } l = j \\ 0 & \text{otherwise} \end{cases} \label{eq:4-3-1} \end{equation}
Expressed using the Kronecker delta:
\begin{equation} (\boldsymbol{J}^{ij})_{kl} = \delta_{ki} \delta_{lj} \label{eq:4-3-2} \end{equation}
Consider differentiating the matrix $\boldsymbol{X}$ with respect to its entry $X_{ij}$. The partial derivative of a matrix means differentiating each entry and assembling the results as a matrix:
\begin{equation} \left( \frac{\partial \boldsymbol{X}}{\partial X_{ij}} \right)_{kl} = \frac{\partial X_{kl}}{\partial X_{ij}} \label{eq:4-3-3} \end{equation}
Since each entry of the matrix is an independent variable, $\frac{\partial X_{kl}}{\partial X_{ij}}$ equals 1 only when $(k, l) = (i, j)$. In terms of the Kronecker delta:
\begin{equation} \frac{\partial X_{kl}}{\partial X_{ij}} = \delta_{ki} \delta_{lj} \label{eq:4-3-4} \end{equation}
Comparing \eqref{eq:4-3-2} and \eqref{eq:4-3-4}:
\begin{equation} \frac{\partial X_{kl}}{\partial X_{ij}} = \delta_{ki} \delta_{lj} = (\boldsymbol{J}^{ij})_{kl} \label{eq:4-3-5} \end{equation}
Since \eqref{eq:4-3-5} holds for all $(k, l)$, the matrix equality follows:
\begin{equation} \frac{\partial \boldsymbol{X}}{\partial X_{ij}} = \boldsymbol{J}^{ij} \label{eq:4-3-6} \end{equation}
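The single-entry matrix is easy to check numerically. The sketch below (illustration only) perturbs one entry of $\boldsymbol{X}$ and confirms that the difference quotient is exactly $\boldsymbol{J}^{ij}$:

```python
import numpy as np

M, N = 3, 4
i, j = 1, 2

# Single-entry matrix J^{ij}: 1 at (i, j), 0 elsewhere
J = np.zeros((M, N))
J[i, j] = 1.0

rng = np.random.default_rng(2)
X = rng.standard_normal((M, N))

# Perturbing X_ij changes X by exactly eps * J^{ij},
# so the difference quotient recovers J^{ij} (exact up to rounding).
eps = 1e-6
Xp = X.copy()
Xp[i, j] += eps
print(np.allclose((Xp - X) / eps, J))     # True: dX/dX_ij = J^{ij}
```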
4.4 Component Derivative of Matrix Product
Formula 4.4: for $\boldsymbol{X} \in \mathbb{R}^{M \times K}$ and a constant $\boldsymbol{A} \in \mathbb{R}^{K \times N}$, $\displaystyle\frac{\partial (\boldsymbol{X}\boldsymbol{A})}{\partial X_{mn}} = \boldsymbol{J}^{mn}\boldsymbol{A}$.
Proof
Writing the $(i, j)$ entry of the matrix product by definition:
\begin{equation} (\boldsymbol{X}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} X_{ik} A_{kj} \label{eq:4-4-1} \end{equation}
We differentiate this sum with respect to $X_{mn}$. Since $A_{kj}$ is a constant:
\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \frac{\partial X_{ik}}{\partial X_{mn}} A_{kj} \label{eq:4-4-2} \end{equation}
$X_{ik}$ and $X_{mn}$ are the same variable only when $(i, k) = (m, n)$. Using the Kronecker delta:
\begin{equation} \frac{\partial X_{ik}}{\partial X_{mn}} = \delta_{im} \delta_{kn} \label{eq:4-4-3} \end{equation}
Substituting \eqref{eq:4-4-3} into \eqref{eq:4-4-2}:
\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \delta_{im} \delta_{kn} A_{kj} \label{eq:4-4-4} \end{equation}
Since $\delta_{im}$ does not depend on $k$, it factors out, and $\sum_{k=0}^{K-1} \delta_{kn} A_{kj} = A_{nj}$:
\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{im} A_{nj} \label{eq:4-4-5} \end{equation}
We show this equals $(\boldsymbol{J}^{mn}\boldsymbol{A})_{ij}$. The $(i, j)$ entry of $\boldsymbol{J}^{mn}\boldsymbol{A}$ is:
\begin{equation} (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} (\boldsymbol{J}^{mn})_{ik} A_{kj} \label{eq:4-4-6} \end{equation}
By definition of $\boldsymbol{J}^{mn}$, substituting $(\boldsymbol{J}^{mn})_{ik} = \delta_{im} \delta_{kn}$:
\begin{equation} (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} \delta_{im} \delta_{kn} A_{kj} = \delta_{im} A_{nj} \label{eq:4-4-7} \end{equation}
Comparing \eqref{eq:4-4-5} and \eqref{eq:4-4-7}, we obtain the final result:
\begin{equation} \frac{\partial (\boldsymbol{X}\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{im} A_{nj} = (\boldsymbol{J}^{mn}\boldsymbol{A})_{ij} \label{eq:4-4-8} \end{equation}
Since this holds for every $(i, j)$, in matrix form $\displaystyle\frac{\partial (\boldsymbol{X}\boldsymbol{A})}{\partial X_{mn}} = \boldsymbol{J}^{mn}\boldsymbol{A}$.
Figure 1: A change in $X_{mn}$ affects only row $m$ of $\boldsymbol{XA}$ (effect of $\delta_{im}$)
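The following sketch (illustration with arbitrary dimensions, not from the text) confirms both the formula and the row-locality shown in Figure 1: the difference quotient of $\boldsymbol{X}\boldsymbol{A}$ with respect to $X_{mn}$ equals $\boldsymbol{J}^{mn}\boldsymbol{A}$ and vanishes outside row $m$:

```python
import numpy as np

rng = np.random.default_rng(3)
M, K, N = 3, 4, 5
X = rng.standard_normal((M, K))
A = rng.standard_normal((K, N))
m, n = 1, 2

# Single-entry matrix J^{mn}, same shape as X
J = np.zeros((M, K))
J[m, n] = 1.0

# Difference quotient of the full product XA in the direction X_mn.
# XA is linear in X, so this is exact up to rounding.
eps = 1e-6
Xp = X.copy(); Xp[m, n] += eps
D = ((Xp @ A) - (X @ A)) / eps

print(np.allclose(D, J @ A))                   # True: d(XA)/dX_mn = J^{mn} A
print(np.allclose(D[np.arange(M) != m], 0.0))  # True: only row m changes
```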
4.5 Component Derivative of Transposed Matrix Product
Formula 4.5: for $\boldsymbol{X} \in \mathbb{R}^{K \times M}$ and a constant $\boldsymbol{A} \in \mathbb{R}^{K \times N}$, $\displaystyle\frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})}{\partial X_{mn}} = \boldsymbol{J}^{nm}\boldsymbol{A}$, where $\boldsymbol{J}^{nm} \in \mathbb{R}^{M \times K}$.
Proof
We recall the entries of the transposed matrix. The $(i, k)$ entry of $\boldsymbol{X}^\top$ equals the $(k, i)$ entry of $\boldsymbol{X}$:
\begin{equation} (\boldsymbol{X}^\top)_{ik} = X_{ki} \label{eq:4-5-1} \end{equation}
Writing the $(i, j)$ entry of $\boldsymbol{X}^\top\boldsymbol{A}$ by definition:
\begin{equation} (\boldsymbol{X}^\top\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} (\boldsymbol{X}^\top)_{ik} A_{kj} = \sum_{k=0}^{K-1} X_{ki} A_{kj} \label{eq:4-5-2} \end{equation}
We differentiate this sum with respect to $X_{mn}$. Since $A_{kj}$ is a constant:
\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \frac{\partial X_{ki}}{\partial X_{mn}} A_{kj} \label{eq:4-5-3} \end{equation}
$X_{ki}$ and $X_{mn}$ are the same variable only when $(k, i) = (m, n)$. Using the Kronecker delta:
\begin{equation} \frac{\partial X_{ki}}{\partial X_{mn}} = \delta_{km} \delta_{in} \label{eq:4-5-4} \end{equation}
Substituting \eqref{eq:4-5-4} into \eqref{eq:4-5-3}:
\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \sum_{k=0}^{K-1} \delta_{km} \delta_{in} A_{kj} \label{eq:4-5-5} \end{equation}
Since $\delta_{in}$ does not depend on $k$, it factors out, and $\sum_{k=0}^{K-1} \delta_{km} A_{kj} = A_{mj}$:
\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{in} A_{mj} \label{eq:4-5-6} \end{equation}
We show this equals $(\boldsymbol{J}^{nm}\boldsymbol{A})_{ij}$. Since $\boldsymbol{J}^{nm} \in \mathbb{R}^{M \times K}$ has 1 only at position $(n, m)$:
\begin{equation} (\boldsymbol{J}^{nm})_{ik} = \delta_{in} \delta_{km} \label{eq:4-5-7} \end{equation}
Computing the $(i, j)$ entry of $\boldsymbol{J}^{nm}\boldsymbol{A}$:
\begin{equation} (\boldsymbol{J}^{nm}\boldsymbol{A})_{ij} = \sum_{k=0}^{K-1} \delta_{in} \delta_{km} A_{kj} = \delta_{in} A_{mj} \label{eq:4-5-8} \end{equation}
Comparing \eqref{eq:4-5-6} and \eqref{eq:4-5-8}, we obtain the final result:
\begin{equation} \frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})_{ij}}{\partial X_{mn}} = \delta_{in} A_{mj} = (\boldsymbol{J}^{nm}\boldsymbol{A})_{ij} \label{eq:4-5-9} \end{equation}
Since this holds for every $(i, j)$, in matrix form $\displaystyle\frac{\partial (\boldsymbol{X}^\top\boldsymbol{A})}{\partial X_{mn}} = \boldsymbol{J}^{nm}\boldsymbol{A}$.
Figure 2: A change in $X_{mn}$ affects position $(n,m)$ in $\boldsymbol{X}^\top$, thus only row $n$ of $\boldsymbol{X}^\top\boldsymbol{A}$ (effect of $\delta_{in}$)
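As with Formula 4.4, a short sketch (illustration only; note the transposed index order of $\boldsymbol{J}^{nm}$) confirms the formula and the row-locality shown in Figure 2:

```python
import numpy as np

rng = np.random.default_rng(4)
K, M, N = 4, 3, 5
X = rng.standard_normal((K, M))           # so X^T is M x K
A = rng.standard_normal((K, N))
m, n = 1, 2                               # m indexes rows of X, n its columns

# J^{nm} has the shape of X^T (M x K), with its single 1 at (n, m)
J = np.zeros((M, K))
J[n, m] = 1.0

# Difference quotient of X^T A in the direction X_mn (exact: linear in X)
eps = 1e-6
Xp = X.copy(); Xp[m, n] += eps
D = ((Xp.T @ A) - (X.T @ A)) / eps

print(np.allclose(D, J @ A))                   # True: d(X^T A)/dX_mn = J^{nm} A
print(np.allclose(D[np.arange(M) != n], 0.0))  # True: only row n changes
```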