Proofs Chapter 13: Derivatives of Structured Matrices

In this chapter we prove differentiation formulas for structured matrices (symmetric, Toeplitz, triangular, etc.). In practice, matrices rarely have arbitrary form; they typically possess symmetry, banded structure, or other constraints. Differentiation that accounts for structure is important both for computational efficiency and numerical stability, arising in covariance matrix estimation (symmetric positive definite), efficient estimation exploiting Toeplitz structure in time series, and numerically stable optimization via Cholesky decomposition. We present systematic derivations using the vec operator and Kronecker products.

Prerequisites: Chapter 5 (Derivatives of the Trace), Chapter 4 (Basic Matrix Differentiation Formulas). Related chapters: Chapter 15 (Derivatives of Special Matrices).

13. Derivatives of Structured Matrices

Assumptions for This Chapter
Unless stated otherwise, all formulas in this chapter hold under the following conditions:
  • All formulas use the denominator layout convention
  • Structured matrices have constraints between entries, reducing the number of independent variables
  • $\circ$ denotes the Hadamard (element-wise) product

We derive differentiation formulas for matrices with special structure (symmetric, Toeplitz, triangular, etc.). In structured matrices, constraints between entries reduce the number of independent variables, so standard matrix differentiation formulas cannot be applied directly.

Fundamentals of Structured Matrices

We first develop the general framework for differentiating structured matrices and then apply it to the cases of general and symmetric matrices.

13.1 General Formula for Differentiation via Structure Matrices

Formula: $\displaystyle\frac{df}{dA_{ij}} = \sum_{kl} \displaystyle\frac{\partial f}{\partial A_{kl}} \displaystyle\frac{\partial A_{kl}}{\partial A_{ij}} = \text{tr}\left[\left(\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\right)^\top \boldsymbol{S}^{ij}\right]$
Conditions: $f(\boldsymbol{A})$ is a scalar function, $\boldsymbol{S}^{ij} = \displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}}$ is the structure matrix
Proof

Consider a scalar function $f(\boldsymbol{A})$. When $\boldsymbol{A}$ is a structured matrix, there are constraints between its entries, so changing an independent variable $A_{ij}$ causes other entries to change as well.

Applying the chain rule (1.26), we consider how $f$ changes through all entries $A_{kl}$ that vary when $A_{ij}$ is perturbed.

\begin{equation}\frac{df}{dA_{ij}} = \sum_{k,l} \frac{\partial f}{\partial A_{kl}} \frac{\partial A_{kl}}{\partial A_{ij}} \label{eq:13-1-1}\end{equation}

We define the structure matrix $\boldsymbol{S}^{ij}$. It represents "how the entire matrix $\boldsymbol{A}$ changes when $A_{ij}$ is perturbed by one unit."

\begin{equation}\boldsymbol{S}^{ij} = \frac{\partial \boldsymbol{A}}{\partial A_{ij}} \label{eq:13-1-2}\end{equation}

The $(k,l)$ entry of $\boldsymbol{S}^{ij}$ is given by

\begin{equation}(\boldsymbol{S}^{ij})_{kl} = \frac{\partial A_{kl}}{\partial A_{ij}} \label{eq:13-1-3}\end{equation}

We rewrite $\eqref{eq:13-1-1}$ in matrix form. Introducing the gradient matrix $\boldsymbol{G} = \displaystyle\frac{\partial f}{\partial \boldsymbol{A}}$, we have $G_{kl} = \displaystyle\frac{\partial f}{\partial A_{kl}}$.

\begin{equation}\frac{df}{dA_{ij}} = \sum_{k,l} G_{kl} (\boldsymbol{S}^{ij})_{kl} \label{eq:13-1-4}\end{equation}

The element-wise sum of products of two matrices can be expressed as a trace. In general, $\sum_{k,l} P_{kl} Q_{kl} = \text{tr}(\boldsymbol{P}^\top \boldsymbol{Q})$.

\begin{equation}\sum_{k,l} G_{kl} (\boldsymbol{S}^{ij})_{kl} = \text{tr}(\boldsymbol{G}^\top \boldsymbol{S}^{ij}) \label{eq:13-1-5}\end{equation}

Substituting $\eqref{eq:13-1-5}$ into $\eqref{eq:13-1-4}$, we obtain the final result.

\begin{equation}\frac{df}{dA_{ij}} = \text{tr}\left[\left(\frac{\partial f}{\partial \boldsymbol{A}}\right)^\top \boldsymbol{S}^{ij}\right] \label{eq:13-1-6}\end{equation}

Remark: Equation $\eqref{eq:13-1-6}$ provides a general framework for differentiating structured matrices. The form of $\boldsymbol{S}^{ij}$ is determined by the type of structure (symmetric, triangular, Toeplitz, etc.).
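The framework can be checked numerically. The sketch below (NumPy; the helper names are ours) verifies $\eqref{eq:13-1-6}$ by finite differences for $f(\boldsymbol{A}) = \text{tr}(\boldsymbol{A}^3)$ on a symmetric $\boldsymbol{A}$, using the symmetric structure matrix derived in 13.3 as the example structure:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n))
A = (A + A.T) / 2                   # symmetric test matrix

f = lambda M: np.trace(M @ M @ M)   # scalar function f(A) = tr(A^3)
G = 3 * (A @ A).T                   # gradient treating A as unstructured (denominator layout)

def S(i, j):
    """Structure matrix for symmetric A: J^ij + J^ji - delta_ij J^ij."""
    M = np.zeros((n, n))
    M[i, j] = 1.0
    M[j, i] = 1.0
    return M

h = 1e-6
for i in range(n):
    for j in range(i + 1):          # independent variables: lower triangle, i >= j
        fd = (f(A + h * S(i, j)) - f(A - h * S(i, j))) / (2 * h)
        # df/dA_ij = tr(G^T S^ij), eq. (13-1-6)
        assert np.isclose(fd, np.trace(G.T @ S(i, j)), atol=1e-4)
```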

13.2 Structure Matrix for a General Matrix

Formula: $\boldsymbol{S}^{ij} = \displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}} = \boldsymbol{J}^{ij}$
Conditions: $\boldsymbol{A}$ is a general matrix with no special structure, $\boldsymbol{J}^{ij}$ is the single-entry matrix with a 1 in position $(i,j)$
Proof

Consider a general matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$. In a general matrix, all entries are independent and there are no constraints between them.

We differentiate the entry $A_{kl}$ of $\boldsymbol{A}$ with respect to the independent variable $A_{ij}$. Since changing $A_{ij}$ does not affect any other entry, we have

\begin{equation}\frac{\partial A_{kl}}{\partial A_{ij}} = \begin{cases} 1 & (k = i \text{ and } l = j) \\ 0 & (\text{otherwise}) \end{cases} \label{eq:13-2-1}\end{equation}

We express $\eqref{eq:13-2-1}$ using the Kronecker delta.

\begin{equation}\frac{\partial A_{kl}}{\partial A_{ij}} = \delta_{ik} \delta_{jl} \label{eq:13-2-2}\end{equation}

where $\delta_{ik}$ equals 1 when $i = k$ and 0 otherwise.

We define the single-entry matrix $\boldsymbol{J}^{ij}$ as the matrix whose only nonzero entry is a 1 in position $(i,j)$.

\begin{equation}(\boldsymbol{J}^{ij})_{kl} = \delta_{ik} \delta_{jl} \label{eq:13-2-3}\end{equation}

Comparing $\eqref{eq:13-2-2}$ and $\eqref{eq:13-2-3}$, they coincide.

\begin{equation}\frac{\partial A_{kl}}{\partial A_{ij}} = (\boldsymbol{J}^{ij})_{kl} \label{eq:13-2-4}\end{equation}

Since $\eqref{eq:13-2-4}$ holds for all $(k,l)$, we obtain the final result in matrix form.

\begin{equation}\boldsymbol{S}^{ij} = \frac{\partial \boldsymbol{A}}{\partial A_{ij}} = \boldsymbol{J}^{ij} \label{eq:13-2-5}\end{equation}

Remark: For a general matrix, the structure matrix $\boldsymbol{S}^{ij}$ is simply the single-entry matrix $\boldsymbol{J}^{ij}$. This is the simplest case, reflecting the absence of structural constraints.
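As a small sanity check, the trace identity $\text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{ij}) = G_{ij}$, which is used repeatedly below, can be verified directly (a NumPy sketch; names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 4
G = rng.standard_normal((m, n))

def J(i, j, shape):
    """Single-entry matrix: 1 at position (i, j), zeros elsewhere."""
    M = np.zeros(shape)
    M[i, j] = 1.0
    return M

# For an unstructured A the structure matrix is J^ij, and
# tr(G^T J^ij) = G_ij picks out one entry of the gradient.
for i in range(m):
    for j in range(n):
        assert np.isclose(np.trace(G.T @ J(i, j, (m, n))), G[i, j])
```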

13.3 Structure Matrix for a Symmetric Matrix

Formula: $\boldsymbol{S}^{ij} = \displaystyle\frac{\partial \boldsymbol{A}}{\partial A_{ij}} = \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} - \delta_{ij}\boldsymbol{J}^{ij}$
Conditions: $\boldsymbol{A}$ is a symmetric matrix ($\boldsymbol{A} = \boldsymbol{A}^\top$)
Proof

Consider a symmetric matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$. The symmetry condition requires that for all $i, j$:

\begin{equation}A_{ij} = A_{ji} \label{eq:13-3-1}\end{equation}

We analyze how the entry $A_{kl}$ of $\boldsymbol{A}$ changes when the independent variable $A_{ij}$ (with $i \geq j$, taking the lower triangular part as the independent variables) is perturbed, by considering two cases.

[Case 1] When $i \neq j$ (off-diagonal entry):

Perturbing $A_{ij}$ also changes $A_{ji}$ by the constraint $\eqref{eq:13-3-1}$.

\begin{equation}\frac{\partial A_{kl}}{\partial A_{ij}} = \begin{cases} 1 & (k = i, l = j) \\ 1 & (k = j, l = i) \\ 0 & (\text{otherwise}) \end{cases} \label{eq:13-3-2}\end{equation}

[Case 2] When $i = j$ (diagonal entry):

Perturbing $A_{ii}$ trivially satisfies $A_{ii} = A_{ii}$, and no other entry changes.

\begin{equation}\frac{\partial A_{kl}}{\partial A_{ii}} = \begin{cases} 1 & (k = i, l = i) \\ 0 & (\text{otherwise}) \end{cases} \label{eq:13-3-3}\end{equation}

We write $\eqref{eq:13-3-2}$ and $\eqref{eq:13-3-3}$ in a unified form using the Kronecker delta.

\begin{equation}\frac{\partial A_{kl}}{\partial A_{ij}} = \delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk} - \delta_{ij}\delta_{ik}\delta_{jl} \label{eq:13-3-4}\end{equation}

We verify each term in $\eqref{eq:13-3-4}$:

  • First term $\delta_{ik}\delta_{jl}$: equals 1 when $(k,l) = (i,j)$
  • Second term $\delta_{il}\delta_{jk}$: equals 1 when $(k,l) = (j,i)$
  • Third term $-\delta_{ij}\delta_{ik}\delta_{jl}$: equals $-1$ when $i = j$ and $(k,l) = (i,i)$ (correcting the double count of diagonal entries)

We express $\eqref{eq:13-3-4}$ in matrix form using the definition $(\boldsymbol{J}^{pq})_{kl} = \delta_{pk}\delta_{ql}$.

\begin{equation}(\boldsymbol{S}^{ij})_{kl} = (\boldsymbol{J}^{ij})_{kl} + (\boldsymbol{J}^{ji})_{kl} - \delta_{ij}(\boldsymbol{J}^{ij})_{kl} \label{eq:13-3-5}\end{equation}

Since $\eqref{eq:13-3-5}$ holds for all $(k,l)$, we obtain the final result.

\begin{equation}\boldsymbol{S}^{ij} = \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} - \delta_{ij}\boldsymbol{J}^{ij} \label{eq:13-3-6}\end{equation}

Remark: Splitting $\eqref{eq:13-3-6}$ by cases: when $i \neq j$, $\boldsymbol{S}^{ij} = \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji}$ (two positions change simultaneously); when $i = j$, $\boldsymbol{S}^{ii} = \boldsymbol{J}^{ii}$ (only the diagonal entry changes).
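The unified delta formula $\eqref{eq:13-3-4}$ can be checked entry by entry against the case analysis (a small NumPy sketch; names are ours):

```python
import numpy as np

n = 3
d = lambda a, b: 1.0 if a == b else 0.0   # Kronecker delta

# Build S^ij from the entry formula (13-3-4)
S = lambda i, j: np.array([[d(i, k) * d(j, l) + d(i, l) * d(j, k)
                            - d(i, j) * d(i, k) * d(j, l)
                            for l in range(n)] for k in range(n)])

# Off-diagonal variable: two entries change simultaneously (Case 1)
assert np.array_equal(S(0, 1), S(0, 1).T) and S(0, 1).sum() == 2
# Diagonal variable: only the diagonal entry changes (Case 2)
assert S(1, 1).sum() == 1 and S(1, 1)[1, 1] == 1
```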

13.4 Differentiation with Respect to a Symmetric Matrix

Formula: $\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{sym}} = \displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}} + \left(\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right)^\top - \text{diag}\left(\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right)$
Conditions: $f(\boldsymbol{A})$ is a scalar function, $\boldsymbol{A}$ is a symmetric matrix
Proof

Consider a scalar function $f(\boldsymbol{A})$. When $\boldsymbol{A}$ is symmetric, the independent variables are only the lower triangular part (including the diagonal).

From 13.1, $\eqref{eq:13-1-6}$, the differentiation formula using the structure matrix is

\begin{equation}\frac{df}{dA_{ij}} = \text{tr}\left[\left(\frac{\partial f}{\partial \boldsymbol{A}}\right)^\top \boldsymbol{S}^{ij}\right] \label{eq:13-4-1}\end{equation}

Let $\boldsymbol{G} = \displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\big|_{\text{gen}}$ denote the gradient computed as if $\boldsymbol{A}$ were a general matrix. Using the trace identity $\text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{pq}) = G_{pq}$:

\begin{equation}\text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{pq}) = \sum_{k,l} G_{kl} (\boldsymbol{J}^{pq})_{kl} = \sum_{k,l} G_{kl} \delta_{pk}\delta_{ql} = G_{pq} \label{eq:13-4-2}\end{equation}

From 13.3, $\eqref{eq:13-3-6}$, the structure matrix for a symmetric matrix is $\boldsymbol{S}^{ij} = \boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} - \delta_{ij}\boldsymbol{J}^{ij}$. Substituting into $\eqref{eq:13-4-1}$:

\begin{equation}\frac{df}{dA_{ij}} = \text{tr}\left[\boldsymbol{G}^\top (\boldsymbol{J}^{ij} + \boldsymbol{J}^{ji} - \delta_{ij}\boldsymbol{J}^{ij})\right] \label{eq:13-4-3}\end{equation}

By linearity of the trace, we expand $\eqref{eq:13-4-3}$.

\begin{equation}\frac{df}{dA_{ij}} = \text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{ij}) + \text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{ji}) - \delta_{ij}\text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{ij}) \label{eq:13-4-4}\end{equation}

Using $\eqref{eq:13-4-2}$ to evaluate each term:

\begin{equation}\text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{ij}) = G_{ij} \label{eq:13-4-5}\end{equation}

\begin{equation}\text{tr}(\boldsymbol{G}^\top \boldsymbol{J}^{ji}) = G_{ji} \label{eq:13-4-6}\end{equation}

Substituting $\eqref{eq:13-4-5}$ and $\eqref{eq:13-4-6}$ into $\eqref{eq:13-4-4}$:

\begin{equation}\frac{df}{dA_{ij}} = G_{ij} + G_{ji} - \delta_{ij} G_{ij} \label{eq:13-4-7}\end{equation}

We write $\eqref{eq:13-4-7}$ in matrix form. The $(i,j)$ entry of the symmetric derivative $\displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\big|_{\text{sym}}$ is given by $\eqref{eq:13-4-7}$.

\begin{equation}\left(\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{sym}}\right)_{ij} = G_{ij} + G_{ji} - \delta_{ij} G_{ij} \label{eq:13-4-8}\end{equation}

We express each term in $\eqref{eq:13-4-8}$ as a matrix:

  • $G_{ij}$ is the $(i,j)$ entry of $\boldsymbol{G}$
  • $G_{ji}$ is the $(i,j)$ entry of $\boldsymbol{G}^\top$
  • $\delta_{ij} G_{ij}$ is the $(i,j)$ entry of the diagonal matrix $\text{diag}(\boldsymbol{G})$

Therefore, we obtain the final result.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{sym}} = \boldsymbol{G} + \boldsymbol{G}^\top - \text{diag}(\boldsymbol{G}) \label{eq:13-4-9}\end{equation}

Substituting $\boldsymbol{G} = \displaystyle\frac{\partial f}{\partial \boldsymbol{A}}\big|_{\text{gen}}$:

\begin{equation}\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{sym}} = \frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}} + \left(\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right)^\top - \text{diag}\left(\frac{\partial f}{\partial \boldsymbol{A}}\bigg|_{\text{gen}}\right) \label{eq:13-4-10}\end{equation}

Remark: Some references present $\frac{1}{2}(\boldsymbol{G} + \boldsymbol{G}^\top)$ as "the gradient for symmetric matrices," but this is merely a symmetrization and is not the constrained gradient $\eqref{eq:13-4-10}$: its off-diagonal entries are $\frac{1}{2}(G_{ij} + G_{ji})$, half of the $G_{ij} + G_{ji}$ required by $\eqref{eq:13-4-10}$, while its diagonal entries $G_{ii}$ happen to coincide. For constrained optimization (e.g., covariance matrix estimation), one should use $\eqref{eq:13-4-10}$.
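A quick numerical check of $\eqref{eq:13-4-10}$ (a NumPy sketch with our own helper names), using $f(\boldsymbol{A}) = \text{tr}(\boldsymbol{A}^2)$ and finite differences taken along the symmetric parametrization, i.e., moving $A_{ij}$ and $A_{ji}$ together:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))
A = (A + A.T) / 2

# f(A) = tr(A^2); treating A as unstructured, G = 2 A^T (= 2A here)
f = lambda M: np.trace(M @ M)
G = 2 * A.T
G_sym = G + G.T - np.diag(np.diag(G))   # formula (13-4-10)

h = 1e-6
for i in range(n):
    for j in range(i + 1):              # independent variables: lower triangle
        E = np.zeros((n, n))
        E[i, j] = 1.0
        E[j, i] = 1.0                    # perturb both entries to stay symmetric
        fd = (f(A + h * E) - f(A - h * E)) / (2 * h)
        assert np.isclose(fd, G_sym[i, j], atol=1e-5)
```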

The vec Operator and Related Matrices

We prove properties of the vec operator, which converts a matrix into a column vector, and related matrices.

13.5 Vectorization of a Matrix Product

Formula: $\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$
Conditions: $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, $\boldsymbol{X} \in \mathbb{R}^{n \times p}$, $\boldsymbol{B} \in \mathbb{R}^{p \times q}$
Proof

Consider the matrix product $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$, where $\boldsymbol{C} \in \mathbb{R}^{m \times q}$.

We compute the $(i,j)$ entry of $\boldsymbol{C}$. By the definition of matrix multiplication:

\begin{equation}C_{ij} = \sum_{k=0}^{n-1} \sum_{l=0}^{p-1} A_{ik} X_{kl} B_{lj} \label{eq:13-7-1}\end{equation}

We recall the definition of the vec operator. $\text{vec}(\boldsymbol{X})$ is the $np$-dimensional vector formed by stacking the columns of $\boldsymbol{X}$. The entry $X_{kl}$ occupies position $k + nl$ in $\text{vec}(\boldsymbol{X})$.

\begin{equation}[\text{vec}(\boldsymbol{X})]_{k + nl} = X_{kl} \label{eq:13-7-2}\end{equation}

Similarly, $C_{ij}$ occupies position $i + mj$ in $\text{vec}(\boldsymbol{C})$.

\begin{equation}[\text{vec}(\boldsymbol{C})]_{i + mj} = C_{ij} \label{eq:13-7-3}\end{equation}

We recall the definition of the Kronecker product $\boldsymbol{B}^\top \otimes \boldsymbol{A}$. Since $\boldsymbol{B}^\top \in \mathbb{R}^{q \times p}$ and $\boldsymbol{A} \in \mathbb{R}^{m \times n}$, we have $\boldsymbol{B}^\top \otimes \boldsymbol{A} \in \mathbb{R}^{mq \times np}$.

\begin{equation}\boldsymbol{B}^\top \otimes \boldsymbol{A} = \begin{pmatrix} B_{00} \boldsymbol{A} & B_{10} \boldsymbol{A} & \cdots & B_{p-1,0} \boldsymbol{A} \\ B_{01} \boldsymbol{A} & B_{11} \boldsymbol{A} & \cdots & B_{p-1,1} \boldsymbol{A} \\ \vdots & \vdots & \ddots & \vdots \\ B_{0,q-1} \boldsymbol{A} & B_{1,q-1} \boldsymbol{A} & \cdots & B_{p-1,q-1} \boldsymbol{A} \end{pmatrix} \label{eq:13-7-4}\end{equation}

We express the entries of $\boldsymbol{B}^\top \otimes \boldsymbol{A}$ using indices. The entry in row $i + mj$ ($0 \leq i \leq m-1$, $0 \leq j \leq q-1$) and column $k + nl$ ($0 \leq k \leq n-1$, $0 \leq l \leq p-1$) is

\begin{equation}(\boldsymbol{B}^\top \otimes \boldsymbol{A})_{i + mj,\, k + nl} = (\boldsymbol{B}^\top)_{jl} A_{ik} = B_{lj} A_{ik} \label{eq:13-7-5}\end{equation}

We compute the entries of $(\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$. By the definition of matrix-vector multiplication, the entry at position $i + mj$ is

\begin{equation}[(\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})]_{i + mj} = \sum_{k=0}^{n-1} \sum_{l=0}^{p-1} (\boldsymbol{B}^\top \otimes \boldsymbol{A})_{i + mj,\, k + nl} [\text{vec}(\boldsymbol{X})]_{k + nl} \label{eq:13-7-6}\end{equation}

Substituting $\eqref{eq:13-7-5}$ and $\eqref{eq:13-7-2}$ into $\eqref{eq:13-7-6}$:

\begin{equation}[(\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})]_{i + mj} = \sum_{k=0}^{n-1} \sum_{l=0}^{p-1} B_{lj} A_{ik} X_{kl} \label{eq:13-7-7}\end{equation}

We rearrange the right-hand side of $\eqref{eq:13-7-7}$. Since the factors are scalars, their order can be permuted.

\begin{equation}\sum_{k=0}^{n-1} \sum_{l=0}^{p-1} B_{lj} A_{ik} X_{kl} = \sum_{k=0}^{n-1} \sum_{l=0}^{p-1} A_{ik} X_{kl} B_{lj} \label{eq:13-7-8}\end{equation}

Comparing $\eqref{eq:13-7-1}$ and $\eqref{eq:13-7-8}$, they are identical.

\begin{equation}[(\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})]_{i + mj} = C_{ij} = [\text{vec}(\boldsymbol{C})]_{i + mj} \label{eq:13-7-9}\end{equation}

Since $\eqref{eq:13-7-9}$ holds for all $(i,j)$, the vectors are equal. Therefore, we obtain the final result.

\begin{equation}\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X}) \label{eq:13-7-10}\end{equation}

Remark: Setting $\boldsymbol{B} = \boldsymbol{I}$ gives $\text{vec}(\boldsymbol{A}\boldsymbol{X}) = (\boldsymbol{I} \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$, and setting $\boldsymbol{A} = \boldsymbol{I}$ gives $\text{vec}(\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{I})\,\text{vec}(\boldsymbol{X})$.
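Formula $\eqref{eq:13-7-10}$ is easy to verify numerically; note that the column-stacking vec operator corresponds to NumPy's `reshape(-1, order="F")`. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, p, q = 2, 3, 4, 5
A = rng.standard_normal((m, n))
X = rng.standard_normal((n, p))
B = rng.standard_normal((p, q))

vec = lambda M: M.reshape(-1, order="F")   # column-stacking vec

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ B)
rhs = np.kron(B.T, A) @ vec(X)
assert np.allclose(lhs, rhs)
```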

13.6 The Commutation Matrix

Formula: $\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A}^\top)$ ($\boldsymbol{A} \in \mathbb{R}^{m \times n}$)
Conditions: $\boldsymbol{K}_{mn}$ is an $mn \times mn$ permutation matrix
Proof

Consider a matrix $\boldsymbol{A} \in \mathbb{R}^{m \times n}$. We examine the relationship between the positions of entries in $\text{vec}(\boldsymbol{A})$ and $\text{vec}(\boldsymbol{A}^\top)$.

The $(i,j)$ entry $A_{ij}$ of $\boldsymbol{A}$ occupies position $i + mj$ in $\text{vec}(\boldsymbol{A})$.

\begin{equation}[\text{vec}(\boldsymbol{A})]_{i + mj} = A_{ij} \label{eq:13-8-1}\end{equation}

The $(j,i)$ entry of $\boldsymbol{A}^\top \in \mathbb{R}^{n \times m}$ is $(\boldsymbol{A}^\top)_{ji} = A_{ij}$. This entry occupies position $j + ni$ in $\text{vec}(\boldsymbol{A}^\top)$.

\begin{equation}[\text{vec}(\boldsymbol{A}^\top)]_{j + ni} = A_{ij} \label{eq:13-8-2}\end{equation}

From $\eqref{eq:13-8-1}$ and $\eqref{eq:13-8-2}$, the entry at position $i + mj$ in $\text{vec}(\boldsymbol{A})$ must be moved to position $j + ni$ in $\text{vec}(\boldsymbol{A}^\top)$.

We define the commutation matrix $\boldsymbol{K}_{mn}$. It is an $mn \times mn$ permutation matrix with the property

\begin{equation}(\boldsymbol{K}_{mn})_{j + ni,\, i + mj} = 1 \label{eq:13-8-3}\end{equation}

All other entries are zero.

We compute the entries of $\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A})$. The entry at position $j + ni$ is

\begin{equation}[\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A})]_{j + ni} = \sum_{k=0}^{mn-1} (\boldsymbol{K}_{mn})_{j + ni,\, k} [\text{vec}(\boldsymbol{A})]_k \label{eq:13-8-4}\end{equation}

Since $\boldsymbol{K}_{mn}$ is a permutation matrix with exactly one 1 per row, by $\eqref{eq:13-8-3}$ the only nonzero entry in row $j + ni$ is in column $i + mj$.

\begin{equation}[\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A})]_{j + ni} = [\text{vec}(\boldsymbol{A})]_{i + mj} \label{eq:13-8-5}\end{equation}

Substituting $\eqref{eq:13-8-1}$ into $\eqref{eq:13-8-5}$:

\begin{equation}[\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A})]_{j + ni} = A_{ij} \label{eq:13-8-6}\end{equation}

Comparing $\eqref{eq:13-8-2}$ and $\eqref{eq:13-8-6}$, they coincide.

\begin{equation}[\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A})]_{j + ni} = [\text{vec}(\boldsymbol{A}^\top)]_{j + ni} \label{eq:13-8-7}\end{equation}

Since $\eqref{eq:13-8-7}$ holds for all $(i,j)$ ($0 \leq i \leq m-1$, $0 \leq j \leq n-1$), we obtain the final result.

\begin{equation}\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A}^\top) \label{eq:13-8-8}\end{equation}

Remark: We have $\boldsymbol{K}_{mn}^\top = \boldsymbol{K}_{nm}$ and $\boldsymbol{K}_{mn} \boldsymbol{K}_{nm} = \boldsymbol{I}_{mn}$. The commutation matrix $\boldsymbol{K}_{mn}$ is the matrix that "represents transposition in the vectorized world."
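The defining property $\eqref{eq:13-8-3}$ translates directly into code. A NumPy sketch (the constructor name is ours) that also checks the identities stated in the remark:

```python
import numpy as np

def K(m, n):
    """Commutation matrix: K @ vec(A) = vec(A.T) for A of shape (m, n)."""
    Kmn = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            Kmn[j + n * i, i + m * j] = 1.0   # eq. (13-8-3), 0-based indices
    return Kmn

vec = lambda M: M.reshape(-1, order="F")

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 5))

assert np.allclose(K(3, 5) @ vec(A), vec(A.T))        # eq. (13-8-8)
assert np.allclose(K(3, 5).T, K(5, 3))                # K_mn^T = K_nm
assert np.allclose(K(3, 5) @ K(5, 3), np.eye(15))     # K_mn K_nm = I
```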

13.7 Reordering Kronecker Products

Formula: $\boldsymbol{K}_{nm}(\boldsymbol{A} \otimes \boldsymbol{B}) = (\boldsymbol{B} \otimes \boldsymbol{A})\boldsymbol{K}_{qp}$
Conditions: $\boldsymbol{A} \in \mathbb{R}^{m \times p}$, $\boldsymbol{B} \in \mathbb{R}^{n \times q}$
Proof

For an arbitrary matrix $\boldsymbol{X} \in \mathbb{R}^{p \times q}$, we show equality by applying both sides to $\text{vec}(\boldsymbol{X})$.

[Left-hand side computation]

In 13.5, $\eqref{eq:13-7-10}$, substituting $\boldsymbol{B} \to \boldsymbol{B}^\top$ gives

\begin{equation}\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}^\top) = (\boldsymbol{B} \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X}) \label{eq:13-9-1}\end{equation}

To compute $(\boldsymbol{A} \otimes \boldsymbol{B})\,\text{vec}(\boldsymbol{X})$, we use another form of $\eqref{eq:13-7-10}$. Swapping the roles of the factors:

\begin{equation}\text{vec}(\boldsymbol{B}\boldsymbol{Y}\boldsymbol{A}^\top) = (\boldsymbol{A} \otimes \boldsymbol{B})\,\text{vec}(\boldsymbol{Y}) \label{eq:13-9-2}\end{equation}

where $\boldsymbol{Y} \in \mathbb{R}^{q \times p}$.

Setting $\boldsymbol{Y} = \boldsymbol{X}^\top$, we have $\text{vec}(\boldsymbol{Y}) = \text{vec}(\boldsymbol{X}^\top)$. By 13.6, $\eqref{eq:13-8-8}$:

\begin{equation}\text{vec}(\boldsymbol{X}^\top) = \boldsymbol{K}_{pq}\,\text{vec}(\boldsymbol{X}) \label{eq:13-9-3}\end{equation}

Substituting $\eqref{eq:13-9-3}$ into $\eqref{eq:13-9-2}$:

\begin{equation}\text{vec}(\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{A}^\top) = (\boldsymbol{A} \otimes \boldsymbol{B})\boldsymbol{K}_{pq}\,\text{vec}(\boldsymbol{X}) \label{eq:13-9-4}\end{equation}

We apply the commutation matrix to the left-hand side. Setting $\boldsymbol{C} = \boldsymbol{A}\boldsymbol{X}\boldsymbol{B}^\top \in \mathbb{R}^{m \times n}$, we have $\boldsymbol{C}^\top = \boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{A}^\top \in \mathbb{R}^{n \times m}$.

\begin{equation}\boldsymbol{K}_{mn}\,\text{vec}(\boldsymbol{C}) = \text{vec}(\boldsymbol{C}^\top) = \text{vec}(\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{A}^\top) \label{eq:13-9-5}\end{equation}

Since $\text{vec}(\boldsymbol{C}) = (\boldsymbol{B} \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$ by $\eqref{eq:13-9-1}$, equation $\eqref{eq:13-9-5}$ becomes

\begin{equation}\boldsymbol{K}_{mn}(\boldsymbol{B} \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X}) = \text{vec}(\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{A}^\top) \label{eq:13-9-6}\end{equation}

[Right-hand side computation]

From $\eqref{eq:13-9-4}$, we have $(\boldsymbol{A} \otimes \boldsymbol{B})\boldsymbol{K}_{pq}\,\text{vec}(\boldsymbol{X}) = \text{vec}(\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{A}^\top)$.

[Comparison]

We now interchange the roles of $\boldsymbol{A}$ and $\boldsymbol{B}$ in $\eqref{eq:13-9-6}$. Since $\boldsymbol{A} \in \mathbb{R}^{m \times p}$ and $\boldsymbol{B} \in \mathbb{R}^{n \times q}$, this also interchanges $m \leftrightarrow n$ and $p \leftrightarrow q$, so the commutation-matrix indices swap as well. For $\boldsymbol{X} \in \mathbb{R}^{q \times p}$:

\begin{equation}\boldsymbol{K}_{nm}(\boldsymbol{A} \otimes \boldsymbol{B})\,\text{vec}(\boldsymbol{X}) = \text{vec}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}^\top) \label{eq:13-9-7}\end{equation}

On the other hand, interchanging $\boldsymbol{A}$ and $\boldsymbol{B}$ in $\eqref{eq:13-9-2}$ gives $\text{vec}(\boldsymbol{A}\boldsymbol{Z}\boldsymbol{B}^\top) = (\boldsymbol{B} \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{Z})$ for $\boldsymbol{Z} \in \mathbb{R}^{p \times q}$. Setting $\boldsymbol{Z} = \boldsymbol{X}^\top$:

\begin{equation}(\boldsymbol{B} \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X}^\top) = \text{vec}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}^\top) \label{eq:13-9-8}\end{equation}

Since $\boldsymbol{X} \in \mathbb{R}^{q \times p}$, 13.6 gives $\text{vec}(\boldsymbol{X}^\top) = \boldsymbol{K}_{qp}\,\text{vec}(\boldsymbol{X})$. Substituting into $\eqref{eq:13-9-8}$:

\begin{equation}(\boldsymbol{B} \otimes \boldsymbol{A})\boldsymbol{K}_{qp}\,\text{vec}(\boldsymbol{X}) = \text{vec}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}^\top) \label{eq:13-9-9}\end{equation}

From $\eqref{eq:13-9-7}$ and $\eqref{eq:13-9-9}$, for all $\boldsymbol{X}$:

\begin{equation}\boldsymbol{K}_{nm}(\boldsymbol{A} \otimes \boldsymbol{B})\,\text{vec}(\boldsymbol{X}) = (\boldsymbol{B} \otimes \boldsymbol{A})\boldsymbol{K}_{qp}\,\text{vec}(\boldsymbol{X}) \label{eq:13-9-10}\end{equation}

Since $\text{vec}(\boldsymbol{X})$ is arbitrary, the matrices are equal.

\begin{equation}\boldsymbol{K}_{nm}(\boldsymbol{A} \otimes \boldsymbol{B}) = (\boldsymbol{B} \otimes \boldsymbol{A})\boldsymbol{K}_{qp} \label{eq:13-9-11}\end{equation}

Remark: This formula shows that the order of a Kronecker product can be swapped using commutation matrices. In particular, when $\boldsymbol{A}$ and $\boldsymbol{B}$ are square matrices of the same size, $\boldsymbol{K}_{nn}(\boldsymbol{A} \otimes \boldsymbol{B}) = (\boldsymbol{B} \otimes \boldsymbol{A})\boldsymbol{K}_{nn}$.
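A numerical check of the square special case stated in the remark, reusing the commutation-matrix construction from 13.6 (a NumPy sketch; the constructor name is ours):

```python
import numpy as np

def K(m, n):
    """Commutation matrix, built from its defining entries (cf. 13.6)."""
    Kmn = np.zeros((m * n, m * n))
    for i in range(m):
        for j in range(n):
            Kmn[j + n * i, i + m * j] = 1.0
    return Kmn

rng = np.random.default_rng(5)
n = 3
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n))

# Square special case: K_nn (A kron B) = (B kron A) K_nn
assert np.allclose(K(n, n) @ np.kron(A, B), np.kron(B, A) @ K(n, n))
```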

13.8 The Duplication Matrix

Formula: $\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A})$ ($\boldsymbol{A}$ is an $n \times n$ symmetric matrix)
Conditions: $\boldsymbol{D}_n$ is an $n^2 \times n(n+1)/2$ matrix, $\text{vech}$ vectorizes the lower triangular part
Proof (constructive)

Consider a symmetric matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$. By symmetry $A_{ij} = A_{ji}$, the independent entries are only those in the lower triangular part (including the diagonal).

We define the $\text{vech}$ operator. $\text{vech}(\boldsymbol{A})$ is the $n(n+1)/2$-dimensional vector formed by stacking the lower triangular entries of $\boldsymbol{A}$ column by column.

\begin{equation}\text{vech}(\boldsymbol{A}) = (A_{00}, A_{10}, \ldots, A_{n-1,0}, A_{11}, A_{21}, \ldots, A_{n-1,1}, \ldots, A_{n-1,n-1})^\top \label{eq:13-10-1}\end{equation}

We determine the position $\nu(i,j)$ of $A_{ij}$ ($i \geq j$) in $\text{vech}(\boldsymbol{A})$. The lower triangular part of column $j$ contains $n - j$ entries: $A_{jj}, A_{j+1,j}, \ldots, A_{n-1,j}$.

\begin{equation}\nu(i,j) = \sum_{k=0}^{j-1}(n - k) + (i - j) = \frac{j(2n - j - 1)}{2} + i \label{eq:13-10-2}\end{equation}

The position of $A_{ij}$ in $\text{vec}(\boldsymbol{A})$ is $i + nj$.

\begin{equation}\mu(i,j) = i + nj \label{eq:13-10-3}\end{equation}

We construct the duplication matrix $\boldsymbol{D}_n \in \mathbb{R}^{n^2 \times n(n+1)/2}$, which places each entry of $\text{vech}(\boldsymbol{A})$ into the appropriate positions of $\text{vec}(\boldsymbol{A})$.

For $A_{ij}$ ($i \geq j$), $\boldsymbol{D}_n$ places a 1 at the following positions:

  • Row $\mu(i,j) = i + nj$, column $\nu(i,j)$: placing $A_{ij}$ at position $(i,j)$
  • When $i \neq j$, row $\mu(j,i) = j + ni$, column $\nu(i,j)$: duplicating $A_{ij}$ at position $(j,i)$

We express the entries of $\boldsymbol{D}_n$ as formulas. For the $\nu(i,j)$-th entry ($i \geq j$) of $\text{vech}(\boldsymbol{A})$:

\begin{equation}(\boldsymbol{D}_n)_{\mu(i,j),\, \nu(i,j)} = 1 \label{eq:13-10-4}\end{equation}

\begin{equation}(\boldsymbol{D}_n)_{\mu(j,i),\, \nu(i,j)} = 1 \quad (\text{when } i \neq j) \label{eq:13-10-5}\end{equation}

We compute $\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A})$. The entry at position $\mu(i,j) = i + nj$ is as follows.

[Case 1] When $i \geq j$:

\begin{equation}[\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A})]_{i + nj} = [\text{vech}(\boldsymbol{A})]_{\nu(i,j)} = A_{ij} \label{eq:13-10-6}\end{equation}

[Case 2] When $i < j$ (upper triangular part):

By symmetry $A_{ij} = A_{ji}$, and $A_{ji}$ appears at position $\nu(j,i)$ in $\text{vech}(\boldsymbol{A})$.

\begin{equation}[\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A})]_{i + nj} = [\text{vech}(\boldsymbol{A})]_{\nu(j,i)} = A_{ji} = A_{ij} \label{eq:13-10-7}\end{equation}

From $\eqref{eq:13-10-6}$ and $\eqref{eq:13-10-7}$, for all $(i,j)$:

\begin{equation}[\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A})]_{i + nj} = A_{ij} = [\text{vec}(\boldsymbol{A})]_{i + nj} \label{eq:13-10-8}\end{equation}

Therefore, we obtain the final result.

\begin{equation}\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A}) \label{eq:13-10-9}\end{equation}

Remark: $\boldsymbol{D}_n$ has full column rank, so $\boldsymbol{D}_n^\top \boldsymbol{D}_n$ is nonsingular. This reflects the structure in which diagonal entries appear once while off-diagonal entries appear twice.
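The constructive proof translates directly into code. A NumPy sketch (helper names are ours) that builds $\boldsymbol{D}_n$ from $\eqref{eq:13-10-4}$ and $\eqref{eq:13-10-5}$ and checks both $\eqref{eq:13-10-9}$ and the structure of $\boldsymbol{D}_n^\top \boldsymbol{D}_n$ noted in the remark:

```python
import numpy as np

def nu(i, j, n):
    """Position of A_ij (i >= j) in vech(A), eq. (13-10-2)."""
    return j * (2 * n - j - 1) // 2 + i

def D(n):
    """Duplication matrix D_n: D_n @ vech(A) = vec(A) for symmetric A."""
    Dn = np.zeros((n * n, n * (n + 1) // 2))
    for j in range(n):
        for i in range(j, n):
            Dn[i + n * j, nu(i, j, n)] = 1.0        # place A_ij at (i, j)
            if i != j:
                Dn[j + n * i, nu(i, j, n)] = 1.0    # duplicate at (j, i)
    return Dn

vec = lambda M: M.reshape(-1, order="F")
vech = lambda M: np.concatenate([M[j:, j] for j in range(M.shape[0])])

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2

assert np.allclose(D(4) @ vech(A), vec(A))          # eq. (13-10-9)
# D_n^T D_n is diagonal: 1 for diagonal entries, 2 for off-diagonal ones
assert np.allclose(np.diag(np.diag(D(4).T @ D(4))), D(4).T @ D(4))
```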

13.9 The Elimination Matrix

Formula: $\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A}) = \text{vech}(\boldsymbol{A})$ ($\boldsymbol{A}$ is an $n \times n$ symmetric matrix)
Conditions: $\boldsymbol{L}_n$ is an $n(n+1)/2 \times n^2$ matrix
Proof (constructive)

The elimination matrix $\boldsymbol{L}_n$ extracts only the lower triangular part from $\text{vec}(\boldsymbol{A})$.

We use the position functions $\nu(i,j)$ and $\mu(i,j)$ defined in 13.8.

\begin{equation}\nu(i,j) = \frac{j(2n - j - 1)}{2} + i \quad (i \geq j) \label{eq:13-11-1}\end{equation}

\begin{equation}\mu(i,j) = i + nj \label{eq:13-11-2}\end{equation}

We construct $\boldsymbol{L}_n \in \mathbb{R}^{n(n+1)/2 \times n^2}$. The $\nu(i,j)$-th entry ($i \geq j$) of $\text{vech}(\boldsymbol{A})$ is obtained from the $\mu(i,j)$-th entry of $\text{vec}(\boldsymbol{A})$.

\begin{equation}(\boldsymbol{L}_n)_{\nu(i,j),\, \mu(i,j)} = 1 \quad (i \geq j) \label{eq:13-11-3}\end{equation}

All other entries are zero.

We compute $\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A})$. The entry at position $\nu(i,j)$ ($i \geq j$) is

\begin{equation}[\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A})]_{\nu(i,j)} = \sum_{k=0}^{n^2-1} (\boldsymbol{L}_n)_{\nu(i,j),\, k} [\text{vec}(\boldsymbol{A})]_k \label{eq:13-11-4}\end{equation}

$\boldsymbol{L}_n$ has exactly one 1 per row. By $\eqref{eq:13-11-3}$, row $\nu(i,j)$ has a 1 only in column $\mu(i,j)$.

\begin{equation}[\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A})]_{\nu(i,j)} = [\text{vec}(\boldsymbol{A})]_{\mu(i,j)} \label{eq:13-11-5}\end{equation}

By $\eqref{eq:13-11-2}$, $[\text{vec}(\boldsymbol{A})]_{\mu(i,j)} = [\text{vec}(\boldsymbol{A})]_{i + nj} = A_{ij}$.

\begin{equation}[\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A})]_{\nu(i,j)} = A_{ij} \label{eq:13-11-6}\end{equation}

By the definition in $\eqref{eq:13-10-1}$, $[\text{vech}(\boldsymbol{A})]_{\nu(i,j)} = A_{ij}$. Therefore:

\begin{equation}[\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A})]_{\nu(i,j)} = [\text{vech}(\boldsymbol{A})]_{\nu(i,j)} \label{eq:13-11-7}\end{equation}

Since $\eqref{eq:13-11-7}$ holds for all $(i,j)$ ($i \geq j$), we obtain the final result.

\begin{equation}\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A}) = \text{vech}(\boldsymbol{A}) \label{eq:13-11-8}\end{equation}

Remark: $\boldsymbol{L}_n$ is also called a "selection matrix." It selects $n(n+1)/2$ entries of the lower triangular part from the $n^2$ entries of $\text{vec}(\boldsymbol{A})$. The upper triangular entries are redundant due to symmetry and are therefore discarded.
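The construction $\eqref{eq:13-11-3}$ is likewise direct to implement. A NumPy sketch (helper names are ours) checking $\eqref{eq:13-11-8}$:

```python
import numpy as np

def nu(i, j, n):
    """Position of A_ij (i >= j) in vech(A), eq. (13-11-1)."""
    return j * (2 * n - j - 1) // 2 + i

def L(n):
    """Elimination matrix L_n: L_n @ vec(A) = vech(A)."""
    Ln = np.zeros((n * (n + 1) // 2, n * n))
    for j in range(n):
        for i in range(j, n):
            Ln[nu(i, j, n), i + n * j] = 1.0   # select the lower-triangular entry
    return Ln

vec = lambda M: M.reshape(-1, order="F")
vech = lambda M: np.concatenate([M[j:, j] for j in range(M.shape[0])])

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2

assert np.allclose(L(4) @ vec(A), vech(A))     # eq. (13-11-8)
```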

13.10 Relationship Between Elimination and Duplication Matrices

Formula: $\boldsymbol{L}_n \boldsymbol{D}_n = \boldsymbol{I}_{n(n+1)/2}$
Conditions: $\boldsymbol{L}_n$ and $\boldsymbol{D}_n$ are as defined above
Proof

Consider an arbitrary symmetric matrix $\boldsymbol{A} \in \mathbb{R}^{n \times n}$.

From 13.8, $\eqref{eq:13-10-9}$, the property of the duplication matrix is

\begin{equation}\boldsymbol{D}_n\,\text{vech}(\boldsymbol{A}) = \text{vec}(\boldsymbol{A}) \label{eq:13-12-1}\end{equation}

From 13.9, $\eqref{eq:13-11-8}$, the property of the elimination matrix is

\begin{equation}\boldsymbol{L}_n\,\text{vec}(\boldsymbol{A}) = \text{vech}(\boldsymbol{A}) \label{eq:13-12-2}\end{equation}

Left-multiplying both sides of $\eqref{eq:13-12-1}$ by $\boldsymbol{L}_n$:

\begin{equation}\boldsymbol{L}_n \boldsymbol{D}_n\,\text{vech}(\boldsymbol{A}) = \boldsymbol{L}_n\,\text{vec}(\boldsymbol{A}) \label{eq:13-12-3}\end{equation}

Substituting $\eqref{eq:13-12-2}$ into the right-hand side of $\eqref{eq:13-12-3}$:

\begin{equation}\boldsymbol{L}_n \boldsymbol{D}_n\,\text{vech}(\boldsymbol{A}) = \text{vech}(\boldsymbol{A}) \label{eq:13-12-4}\end{equation}

Rearranging $\eqref{eq:13-12-4}$:

\begin{equation}(\boldsymbol{L}_n \boldsymbol{D}_n - \boldsymbol{I}_{n(n+1)/2})\,\text{vech}(\boldsymbol{A}) = \boldsymbol{0} \label{eq:13-12-5}\end{equation}

Since $\text{vech}(\boldsymbol{A})$ corresponds to an arbitrary symmetric matrix, it can be any vector in $\mathbb{R}^{n(n+1)/2}$.

For $\eqref{eq:13-12-5}$ to hold for all $\text{vech}(\boldsymbol{A})$, the coefficient matrix must be the zero matrix.

\begin{equation}\boldsymbol{L}_n \boldsymbol{D}_n - \boldsymbol{I}_{n(n+1)/2} = \boldsymbol{O} \label{eq:13-12-6}\end{equation}

Rearranging $\eqref{eq:13-12-6}$, we obtain the final result.

\begin{equation}\boldsymbol{L}_n \boldsymbol{D}_n = \boldsymbol{I}_{n(n+1)/2} \label{eq:13-12-7}\end{equation}

Remark: By $\eqref{eq:13-12-7}$, $\boldsymbol{L}_n$ is a left inverse of $\boldsymbol{D}_n$. Note, however, that $\boldsymbol{D}_n \boldsymbol{L}_n \neq \boldsymbol{I}_{n^2}$ in general. The product $\boldsymbol{D}_n \boldsymbol{L}_n$ is the $n^2 \times n^2$ matrix that, acting on $\text{vec}(\boldsymbol{A})$ for a general $\boldsymbol{A}$, copies the lower triangle of $\boldsymbol{A}$ into its upper triangle; it acts as the identity precisely on vecs of symmetric matrices.
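Both directions of the remark can be verified numerically. A minimal NumPy sketch (variable names `L_mat`, `D_mat` are ours) constructing both matrices from the same $\nu(i,j)$ ordering:

```python
import numpy as np

n = 3
m = n * (n + 1) // 2

# Elimination matrix L_n: row nu(i,j) (i >= j) picks entry i + n*j of vec(A)
L_mat = np.zeros((m, n * n))
# Duplication matrix D_n: column nu(i,j) writes into positions (i,j) and (j,i) of vec(A)
D_mat = np.zeros((n * n, m))
row = 0
for j in range(n):
    for i in range(j, n):
        L_mat[row, i + n * j] = 1.0
        D_mat[i + n * j, row] = 1.0
        D_mat[j + n * i, row] = 1.0  # mirror entry; same slot when i == j
        row += 1

assert np.allclose(L_mat @ D_mat, np.eye(m))          # L_n D_n = I_{n(n+1)/2}
assert not np.allclose(D_mat @ L_mat, np.eye(n * n))  # but D_n L_n != I_{n^2}
```

The second assertion reflects the fact that $\boldsymbol{D}_n \boldsymbol{L}_n$ overwrites the upper triangle, so information from a non-symmetric input is lost.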

13.11 Derivative of Vectorization

Formula: $\displaystyle\frac{\partial\,\text{vec}(\boldsymbol{X})}{\partial\,\text{vec}(\boldsymbol{X})^\top} = \boldsymbol{I}_{mn}$ ($\boldsymbol{X} \in \mathbb{R}^{m \times n}$)
Conditions: $\text{vec}(\boldsymbol{X})$ is an $mn$-dimensional vector
Proof

Consider a matrix $\boldsymbol{X} \in \mathbb{R}^{m \times n}$. $\text{vec}(\boldsymbol{X})$ is an $mn$-dimensional vector.

Let the entries of $\text{vec}(\boldsymbol{X})$ be $y_k$ ($k = 0, \ldots, mn-1$). By the definition of the vec operator, $y_k$ corresponds to an entry of $\boldsymbol{X}$ as follows:

\begin{equation}y_k = X_{ij} \quad \text{where } k = i + mj \label{eq:13-13-1}\end{equation}

Similarly, let the entries of $\text{vec}(\boldsymbol{X})$ as independent variables be $x_l$ ($l = 0, \ldots, mn-1$).

\begin{equation}x_l = X_{pq} \quad \text{where } l = p + mq \label{eq:13-13-2}\end{equation}

From $\eqref{eq:13-13-1}$ and $\eqref{eq:13-13-2}$, when $k = l$ the two indexings pick out the same entry of $\boldsymbol{X}$, so $y_k = x_k$. That is, viewed as a function of its own entries, $\text{vec}(\boldsymbol{X})$ is the identity map.

\begin{equation}y_k = x_k \quad (k = 0, \ldots, mn-1) \label{eq:13-13-3}\end{equation}

We differentiate $y_k$ with respect to $x_l$. By $\eqref{eq:13-13-3}$:

\begin{equation}\frac{\partial y_k}{\partial x_l} = \frac{\partial x_k}{\partial x_l} \label{eq:13-13-4}\end{equation}

Since $x_k$ and $x_l$ are independent variables, the partial derivative is the Kronecker delta.

\begin{equation}\frac{\partial x_k}{\partial x_l} = \delta_{kl} = \begin{cases} 1 & (k = l) \\ 0 & (k \neq l) \end{cases} \label{eq:13-13-5}\end{equation}

Substituting $\eqref{eq:13-13-5}$ into $\eqref{eq:13-13-4}$:

\begin{equation}\frac{\partial y_k}{\partial x_l} = \delta_{kl} \label{eq:13-13-6}\end{equation}

The $(k,l)$ entry of the derivative matrix $\displaystyle\frac{\partial\,\text{vec}(\boldsymbol{X})}{\partial\,\text{vec}(\boldsymbol{X})^\top}$ is $\displaystyle\frac{\partial y_k}{\partial x_l}$.

\begin{equation}\left(\frac{\partial\,\text{vec}(\boldsymbol{X})}{\partial\,\text{vec}(\boldsymbol{X})^\top}\right)_{kl} = \frac{\partial y_k}{\partial x_l} = \delta_{kl} \label{eq:13-13-7}\end{equation}

$\delta_{kl}$ is the $(k,l)$ entry of the identity matrix $\boldsymbol{I}_{mn}$.

\begin{equation}(\boldsymbol{I}_{mn})_{kl} = \delta_{kl} \label{eq:13-13-8}\end{equation}

From $\eqref{eq:13-13-7}$ and $\eqref{eq:13-13-8}$, the entries coincide for all $(k,l)$, so the matrices are equal.

\begin{equation}\frac{\partial\,\text{vec}(\boldsymbol{X})}{\partial\,\text{vec}(\boldsymbol{X})^\top} = \boldsymbol{I}_{mn} \label{eq:13-13-9}\end{equation}

Remark: Although $\eqref{eq:13-13-9}$ may seem trivial, it is a fundamental identity used in complex matrix differentiation computations via the chain rule. For example, $\displaystyle\frac{\partial\,\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})}{\partial\,\text{vec}(\boldsymbol{X})^\top} = (\boldsymbol{B}^\top \otimes \boldsymbol{A}) \displaystyle\frac{\partial\,\text{vec}(\boldsymbol{X})}{\partial\,\text{vec}(\boldsymbol{X})^\top} = \boldsymbol{B}^\top \otimes \boldsymbol{A}$.
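The chain-rule example in the remark can be checked numerically. A short NumPy sketch (shapes and names are ours) verifying $\text{vec}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{B}^\top \otimes \boldsymbol{A})\,\text{vec}(\boldsymbol{X})$, plus a finite-difference probe of one Jacobian column:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, n, q = 2, 3, 4, 2
A = rng.standard_normal((p, m))
X = rng.standard_normal((m, n))
B = rng.standard_normal((n, q))

vec = lambda M: M.reshape(-1, order="F")  # column-major stacking, as in the text

# vec(A X B) = (B^T kron A) vec(X), so the Jacobian w.r.t. vec(X)^T is B^T kron A
J = np.kron(B.T, A)
assert np.allclose(vec(A @ X @ B), J @ vec(X))

# Finite-difference check: perturb entry X[1, 2], i.e. vec index 1 + m*2
eps = 1e-6
E = np.zeros((m, n)); E[1, 2] = 1.0
col = (vec(A @ (X + eps * E) @ B) - vec(A @ X @ B)) / eps
assert np.allclose(col, J[:, 1 + m * 2], atol=1e-5)
```

Because the map is linear in $\boldsymbol{X}$, the finite-difference column matches the analytic Jacobian column up to floating-point roundoff.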

13.2 Gradients via Cholesky Decomposition

13.12 Gradient of the Cholesky Decomposition

Formula: For a positive definite matrix $\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$ ($\boldsymbol{L}$: lower triangular), let $\boldsymbol{\Phi} = \boldsymbol{L}^\top \bar{\boldsymbol{L}}$. Then $$\frac{\partial \mathcal{L}}{\partial \boldsymbol{A}} = \boldsymbol{L}^{-\top}\!\left(\text{tril}(\boldsymbol{\Phi}) - \tfrac{1}{2}\text{diag}(\boldsymbol{\Phi})\right)\boldsymbol{L}^{-1}$$ where $\bar{\boldsymbol{L}} = \partial \mathcal{L}/\partial \boldsymbol{L}$ is the upstream gradient with respect to $\boldsymbol{L}$.
Conditions: $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ is symmetric positive definite, $\boldsymbol{L}$ is a lower triangular matrix with positive diagonal entries, $\bar{\boldsymbol{L}} = \partial \mathcal{L}/\partial \boldsymbol{L}$
Proof

We seek the gradient of the loss function $\mathcal{L}$ with respect to $\boldsymbol{A}$ via the Cholesky decomposition. By the chain rule:

\begin{equation}\frac{\partial \mathcal{L}}{\partial A_{ij}} = \sum_{k,l} \frac{\partial \mathcal{L}}{\partial L_{kl}} \frac{\partial L_{kl}}{\partial A_{ij}} \label{eq:13-14-1}\end{equation}

We differentiate $\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$. Taking differentials of both sides:

\begin{equation}d\boldsymbol{A} = (d\boldsymbol{L})\,\boldsymbol{L}^\top + \boldsymbol{L}\,(d\boldsymbol{L})^\top \label{eq:13-14-2}\end{equation}

Setting $\boldsymbol{S} = \boldsymbol{L}^{-1}\,(d\boldsymbol{A})\,\boldsymbol{L}^{-\top}$ and multiplying $\eqref{eq:13-14-2}$ on the left by $\boldsymbol{L}^{-1}$ and on the right by $\boldsymbol{L}^{-\top}$:

\begin{equation}\boldsymbol{S} = \boldsymbol{L}^{-1}(d\boldsymbol{L}) + (\boldsymbol{L}^{-1}(d\boldsymbol{L}))^\top \label{eq:13-14-3}\end{equation}

$\boldsymbol{\Omega} = \boldsymbol{L}^{-1}(d\boldsymbol{L})$ is lower triangular (since both $\boldsymbol{L}^{-1}$ and $d\boldsymbol{L}$ are lower triangular). From $\eqref{eq:13-14-3}$, $\boldsymbol{S} = \boldsymbol{\Omega} + \boldsymbol{\Omega}^\top$, so

\begin{equation}\boldsymbol{\Omega} = \text{tril}(\boldsymbol{S}) - \tfrac{1}{2}\text{diag}(\boldsymbol{S}) \label{eq:13-14-4}\end{equation}

Here $\text{tril}$ extracts the lower triangular part (including the diagonal) and $\text{diag}$ extracts only the diagonal entries. For $i > j$ we have $(\boldsymbol{\Omega} + \boldsymbol{\Omega}^\top)_{ij} = \Omega_{ij} = S_{ij}$, while on the diagonal $(\boldsymbol{\Omega} + \boldsymbol{\Omega}^\top)_{ii} = 2\Omega_{ii} = S_{ii}$, so $\Omega_{ii} = \frac{1}{2}S_{ii}$; together these confirm $\eqref{eq:13-14-4}$.

Using $d\boldsymbol{L} = \boldsymbol{L}\boldsymbol{\Omega}$, the differential of the scalar loss $\mathcal{L}$ is

\begin{equation}d\mathcal{L} = \text{tr}\!\bigl(\bar{\boldsymbol{L}}^\top\, d\boldsymbol{L}\bigr) = \text{tr}\!\bigl(\bar{\boldsymbol{L}}^\top \boldsymbol{L}\boldsymbol{\Omega}\bigr) \label{eq:13-14-5}\end{equation}

Set $\boldsymbol{\Phi} = \boldsymbol{L}^\top \bar{\boldsymbol{L}}$, so that $\bar{\boldsymbol{L}}^\top \boldsymbol{L} = \boldsymbol{\Phi}^\top$ and $d\mathcal{L} = \text{tr}(\boldsymbol{\Phi}^\top \boldsymbol{\Omega})$. Since $\boldsymbol{\Omega}$ is lower triangular, only the lower triangular part of $\boldsymbol{\Phi}$ contributes to the trace: $\text{tr}(\boldsymbol{\Phi}^\top \boldsymbol{\Omega}) = \text{tr}(\text{tril}(\boldsymbol{\Phi})^\top \boldsymbol{\Omega})$. Substituting $\eqref{eq:13-14-4}$:

\begin{equation}d\mathcal{L} = \text{tr}\!\bigl(\text{tril}(\boldsymbol{\Phi})^\top\,(\text{tril}(\boldsymbol{S}) - \tfrac{1}{2}\text{diag}(\boldsymbol{S}))\bigr) \label{eq:13-14-6}\end{equation}

Collecting the entrywise products in $\eqref{eq:13-14-6}$, the strictly lower triangular entries pair as $\Phi_{ij}S_{ij}$ and the diagonal entries as $\tfrac{1}{2}\Phi_{ii}S_{ii}$, which gives $d\mathcal{L} = \text{tr}\bigl((\text{tril}(\boldsymbol{\Phi}) - \tfrac{1}{2}\text{diag}(\boldsymbol{\Phi}))^\top \boldsymbol{S}\bigr)$. Substituting $\boldsymbol{S} = \boldsymbol{L}^{-1}\,(d\boldsymbol{A})\,\boldsymbol{L}^{-\top}$ and applying the cyclic property of the trace:

\begin{equation}d\mathcal{L} = \text{tr}\!\Bigl(\bigl(\boldsymbol{L}^{-\top}(\text{tril}(\boldsymbol{\Phi}) - \tfrac{1}{2}\text{diag}(\boldsymbol{\Phi}))\,\boldsymbol{L}^{-1}\bigr)^{\top}\, d\boldsymbol{A}\Bigr) \label{eq:13-14-7}\end{equation}

Comparing with the form $d\mathcal{L} = \text{tr}\!\bigl(\bar{\boldsymbol{A}}^\top\,d\boldsymbol{A}\bigr)$ for the symmetric differential $d\boldsymbol{A}$, we read off the final formula:

\begin{equation}\bar{\boldsymbol{A}} = \boldsymbol{L}^{-\top}\!\left(\text{tril}(\boldsymbol{\Phi}) - \tfrac{1}{2}\text{diag}(\boldsymbol{\Phi})\right)\boldsymbol{L}^{-1} \label{eq:13-14-7b}\end{equation}

Remark: This formula is frequently used for computing gradients of covariance matrix parameters in Gaussian processes and variational inference. In implementation, $\boldsymbol{L}^{-1}$ is not explicitly computed; instead, triangular solves (forward/backward substitution) are used for efficiency.
References: Murray, I. (2016). Differentiation of the Cholesky decomposition. arXiv:1602.07527.
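Formula $\eqref{eq:13-14-7b}$ can be validated against finite differences. A minimal NumPy sketch, under our own assumptions (the helper name `chol_backward` and the toy loss $\mathcal{L} = \sum_{ij} L_{ij}^3$ are ours; for brevity it forms $\boldsymbol{L}^{-1}$ explicitly rather than using the triangular solves recommended in the remark):

```python
import numpy as np

def chol_backward(L, L_bar):
    """Pull back dLoss/dL to dLoss/dA for A = L @ L.T, per eq. (13-14-7b)."""
    Phi = L.T @ L_bar
    Psi = np.tril(Phi) - 0.5 * np.diag(np.diag(Phi))
    L_inv = np.linalg.inv(L)           # sketch only; prefer triangular solves
    A_bar = L_inv.T @ Psi @ L_inv
    return 0.5 * (A_bar + A_bar.T)     # symmetrize: pairs with symmetric dA

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)            # symmetric positive definite

# Toy loss: Loss = sum(L_ij^3), so the upstream gradient is L_bar = 3 L^2
loss = lambda A: np.sum(np.linalg.cholesky(A) ** 3)
L = np.linalg.cholesky(A)
A_bar = chol_backward(L, 3.0 * L**2)

# Directional finite-difference check along a random symmetric perturbation
V = rng.standard_normal((n, n)); V = V + V.T
eps = 1e-5
fd = (loss(A + eps * V) - loss(A - eps * V)) / (2 * eps)
assert np.allclose(fd, np.sum(A_bar * V), rtol=1e-4, atol=1e-8)
```

The directional check uses $d\mathcal{L} = \text{tr}(\bar{\boldsymbol{A}}^\top\,d\boldsymbol{A})$ with $d\boldsymbol{A} = \epsilon\boldsymbol{V}$ symmetric, which is exactly the regime in which the formula was derived.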

13.13 Derivative of the Log-Determinant via Cholesky Decomposition

Formula: For a positive definite matrix $\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$, $$\frac{\partial \log|\boldsymbol{A}|}{\partial \boldsymbol{A}} = \boldsymbol{A}^{-\top}$$
Conditions: $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ is symmetric positive definite, $\boldsymbol{L}$ is the Cholesky factor
Proof

For a symmetric positive definite matrix $\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top$, we express the determinant in terms of the Cholesky factor.

\begin{equation}|\boldsymbol{A}| = |\boldsymbol{L}\boldsymbol{L}^\top| = |\boldsymbol{L}|^2 \label{eq:13-15-1}\end{equation}

Since $\boldsymbol{L}$ is lower triangular, $|\boldsymbol{L}| = \prod_{i=0}^{n-1} L_{ii}$. Taking the logarithm of $\eqref{eq:13-15-1}$:

\begin{equation}\log|\boldsymbol{A}| = 2\log|\boldsymbol{L}| = 2\sum_{i=0}^{n-1} \log L_{ii} \label{eq:13-15-2}\end{equation}

Differentiating $\eqref{eq:13-15-2}$ with respect to $L_{ii}$:

\begin{equation}\frac{\partial \log|\boldsymbol{A}|}{\partial L_{ii}} = \frac{2}{L_{ii}} \label{eq:13-15-3}\end{equation}

For off-diagonal entries:

\begin{equation}\frac{\partial \log|\boldsymbol{A}|}{\partial L_{ij}} = 0 \quad (i \neq j) \label{eq:13-15-4}\end{equation}

Therefore, the gradient with respect to $\boldsymbol{L}$ is

\begin{equation}\bar{\boldsymbol{L}} = \frac{\partial \log|\boldsymbol{A}|}{\partial \boldsymbol{L}} = 2\,\text{diag}(L_{00}^{-1}, \ldots, L_{n-1,n-1}^{-1}) \label{eq:13-15-5}\end{equation}

Using the result from 13.12, we compute the gradient with respect to $\boldsymbol{A}$. We find $\boldsymbol{\Phi} = \boldsymbol{L}^\top \bar{\boldsymbol{L}}$. Since $\bar{\boldsymbol{L}}$ is diagonal:

\begin{equation}\boldsymbol{\Phi} = \boldsymbol{L}^\top \cdot 2\,\text{diag}(L_{ii}^{-1}) = 2\,\boldsymbol{L}^\top\,\text{diag}(L_{ii}^{-1}) \label{eq:13-15-6}\end{equation}

Since $\boldsymbol{L}^\top$ is upper triangular and $\text{diag}(L_{ii}^{-1})$ is diagonal, $\boldsymbol{\Phi}$ is upper triangular. Hence $\text{tril}(\boldsymbol{\Phi})$ retains only the diagonal entries.

\begin{equation}\text{tril}(\boldsymbol{\Phi})_{ii} = \Phi_{ii} = (L^\top)_{ii} \cdot 2L_{ii}^{-1} = L_{ii} \cdot 2L_{ii}^{-1} = 2 \label{eq:13-15-7}\end{equation}

We form the combination $\text{tril}(\boldsymbol{\Phi}) - \frac{1}{2}\text{diag}(\boldsymbol{\Phi})$ appearing in $\eqref{eq:13-14-7b}$. Every diagonal entry of $\text{tril}(\boldsymbol{\Phi})$ equals $2$, so each diagonal entry of the combination is $2 - \frac{1}{2} \cdot 2 = 1$, and all off-diagonal entries vanish. That is:

\begin{equation}\text{tril}(\boldsymbol{\Phi}) - \tfrac{1}{2}\text{diag}(\boldsymbol{\Phi}) = \boldsymbol{I}_n \label{eq:13-15-8}\end{equation}

Substituting into the formula $\eqref{eq:13-14-7b}$ from 13.12:

\begin{equation}\frac{\partial \log|\boldsymbol{A}|}{\partial \boldsymbol{A}} = \boldsymbol{L}^{-\top}\,\boldsymbol{I}_n\,\boldsymbol{L}^{-1} = \boldsymbol{L}^{-\top}\boldsymbol{L}^{-1} = (\boldsymbol{L}\boldsymbol{L}^\top)^{-1} = \boldsymbol{A}^{-1} \label{eq:13-15-9}\end{equation}

Since $\boldsymbol{A}$ is symmetric, $\boldsymbol{A}^{-\top} = \boldsymbol{A}^{-1}$, so

\begin{equation}\frac{\partial \log|\boldsymbol{A}|}{\partial \boldsymbol{A}} = \boldsymbol{A}^{-\top} \label{eq:13-15-10}\end{equation}

Remark: Equation $\eqref{eq:13-15-10}$ can also be derived directly from Jacobi's formula $\frac{\partial |\boldsymbol{A}|}{\partial \boldsymbol{A}} = |\boldsymbol{A}|\boldsymbol{A}^{-\top}$ (by applying $\frac{d}{dx}\log f(x) = f'(x)/f(x)$). Here we presented the derivation via 13.12 to verify consistency with the Cholesky decomposition gradient. Numerically, the Cholesky-based approach is more stable, and it is particularly efficient when $\boldsymbol{L}$ has already been computed during backpropagation.
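The consistency check described in the remark is easy to reproduce numerically. A short NumPy sketch (variable names are ours) pushing $\bar{\boldsymbol{L}}$ from $\eqref{eq:13-15-5}$ through the backward formula $\eqref{eq:13-14-7b}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)          # symmetric positive definite
L = np.linalg.cholesky(A)

# Upstream gradient of log|A| = 2 * sum(log L_ii) w.r.t. L, eq. (13-15-5)
L_bar = 2.0 * np.diag(1.0 / np.diag(L))

# Backward formula (13-14-7b): grad = L^{-T} (tril(Phi) - diag(Phi)/2) L^{-1}
Phi = L.T @ L_bar
Psi = np.tril(Phi) - 0.5 * np.diag(np.diag(Phi))
L_inv = np.linalg.inv(L)             # sketch only; prefer triangular solves
grad = L_inv.T @ Psi @ L_inv

assert np.allclose(Psi, np.eye(n))               # eq. (13-15-8)
assert np.allclose(grad, np.linalg.inv(A))       # eq. (13-15-10)
```

The first assertion confirms that the bracketed term collapses to the identity, and the second that the gradient of the log-determinant is indeed $\boldsymbol{A}^{-1}$.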

References

  • Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.
  • Magnus, J. R., & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). Wiley.
  • Matrix calculus - Wikipedia