5. Trace Derivatives
Prerequisites for this chapter
Unless otherwise stated, all formulas in this chapter hold under the following conditions:
- All formulas use the denominator layout convention
- The derivative of a scalar $f$ with respect to a matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ yields $\frac{\partial f}{\partial \boldsymbol{X}} \in \mathbb{R}^{M \times N}$
- The trace is defined only for square matrices
When the matrix $\boldsymbol{X}$ is an $N \times N$ square matrix, there exist differentiation formulas involving the trace (sum of diagonal elements).
Here we present the relevant formulas from the denominator layout perspective.
Definition of Trace
\begin{eqnarray}
\text{tr}(\boldsymbol{X}) = \displaystyle\sum_{i=0}^{N-1} X_{ii}
\end{eqnarray}
Relationship between Quadratic Forms and Trace
A quadratic form can be expressed using the trace.
\begin{eqnarray}
\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}
&=&
\text{tr}(\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x})
=
\text{tr}(\boldsymbol{A} \boldsymbol{x} \boldsymbol{x}^\top)
\end{eqnarray}
This follows from the fact that $\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}$ is a scalar and the cyclic property of trace (1.12):
$\text{tr}(\boldsymbol{ABC}) = \text{tr}(\boldsymbol{CAB})$.
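This identity is easy to confirm numerically. The following is a minimal NumPy sketch (the dimensions and random values are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
x = rng.standard_normal(4)

quad = x @ A @ x                          # x^T A x (a scalar)
via_trace = np.trace(A @ np.outer(x, x))  # tr(A x x^T)
assert np.isclose(quad, via_trace)
```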
Significance of rewriting in trace form
By expressing scalar-valued functions using the trace, differentiation can be handled uniformly as matrix operations.
This is a notational device for systematizing multivariable differentiation and does not change the value itself.
Inner Product and Trace
The inner product of vectors can also be expressed using the trace.
\begin{eqnarray}
\boldsymbol{a}^\top \boldsymbol{x}
&=&
\text{tr}(\boldsymbol{a}^\top \boldsymbol{x})
=
\text{tr}(\boldsymbol{x} \boldsymbol{a}^\top)
\end{eqnarray}
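The same cyclic trick can be checked for the inner product (again a minimal NumPy sketch with arbitrary dimensions):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(5)
x = rng.standard_normal(5)

inner = a @ x                         # a^T x (a scalar)
via_trace = np.trace(np.outer(x, a))  # tr(x a^T)
assert np.isclose(inner, via_trace)
```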
See Trace Derivatives for a list of formulas.
We prove each formula below. Throughout, when $\boldsymbol{X}$ is an $N \times M$ matrix, the denominator-layout derivative $\frac{\partial f}{\partial \boldsymbol{X}}$ is also an $N \times M$ matrix.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}) = \boldsymbol{I}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times N}$ is an $N \times N$ square matrix, $\text{tr}(\boldsymbol{X}) \in \mathbb{R}$ is a scalar
Proof
We recall the definition of trace. The trace is the sum of the diagonal elements of a square matrix.
\begin{equation}
\text{tr}(\boldsymbol{X}) = \sum_{i=0}^{N-1} X_{ii}
\label{eq:5-1-1}
\end{equation}
Differentiating this scalar with respect to the $(j, l)$ entry $X_{jl}$ of $\boldsymbol{X}$:
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{X}) = \frac{\partial}{\partial X_{jl}} \sum_{i=0}^{N-1} X_{ii} = \sum_{i=0}^{N-1} \frac{\partial X_{ii}}{\partial X_{jl}}
\label{eq:5-1-2}
\end{equation}
$X_{ii}$ and $X_{jl}$ are the same variable only when $i = j$ and $i = l$, i.e., $j = l$. Using the Kronecker delta:
\begin{equation}
\frac{\partial X_{ii}}{\partial X_{jl}} = \delta_{ij} \delta_{il}
\label{eq:5-1-3}
\end{equation}
Substitute \eqref{eq:5-1-3} into \eqref{eq:5-1-2} and sum over $i$. Since $\delta_{ij} = 1$ only when $i = j$:
\begin{equation}
\sum_{i=0}^{N-1} \delta_{ij} \delta_{il} = \delta_{jl}
\label{eq:5-1-4}
\end{equation}
Since $\delta_{jl}$ is the $(j, l)$ entry of the identity matrix $\boldsymbol{I}$:
\begin{equation}
\delta_{jl} = I_{jl}
\label{eq:5-1-5}
\end{equation}
Since \eqref{eq:5-1-5} holds for all $(j, l)$, we obtain the final result in matrix form.
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}) = \boldsymbol{I}
\label{eq:5-1-6}
\end{equation}
Remark: Since the trace is the sum of diagonal elements, differentiating with respect to a diagonal element $X_{jj}$ yields 1, while differentiating with respect to an off-diagonal element yields 0. This is why the result is the identity matrix $\boldsymbol{I}$ (diagonal entries equal to 1, off-diagonal entries equal to 0).
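A quick numerical sanity check of this formula (a NumPy sketch; `num_grad` is an ad-hoc central-difference helper introduced here, not something defined in the text):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))   # square matrix variable
G = num_grad(np.trace, X)         # d tr(X) / dX
assert np.allclose(G, np.eye(4), atol=1e-6)
```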
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top$
Conditions: $\boldsymbol{A} \in \mathbb{R}^{M \times N}$ is an $M \times N$ constant matrix, $\boldsymbol{X} \in \mathbb{R}^{N \times M}$ is an $N \times M$ matrix variable, $\boldsymbol{A}\boldsymbol{X} \in \mathbb{R}^{M \times M}$ is a square matrix
Proof
Writing out the $(i, i)$ entry (diagonal entry) of the matrix product $\boldsymbol{A}\boldsymbol{X}$:
\begin{equation}
(\boldsymbol{A}\boldsymbol{X})_{ii} = \sum_{k=0}^{N-1} A_{ik} X_{ki}
\label{eq:5-2-1}
\end{equation}
Since the trace is the sum of diagonal elements:
\begin{equation}
\text{tr}(\boldsymbol{A}\boldsymbol{X}) = \sum_{i=0}^{M-1} (\boldsymbol{A}\boldsymbol{X})_{ii} = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} X_{ki}
\label{eq:5-2-2}
\end{equation}
Differentiating this scalar with respect to the $(j, l)$ entry $X_{jl}$ of $\boldsymbol{X}$, and noting that $A_{ik}$ is a constant:
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \frac{\partial X_{ki}}{\partial X_{jl}}
\label{eq:5-2-3}
\end{equation}
$X_{ki}$ and $X_{jl}$ are the same variable only when $(k, i) = (j, l)$. Using the Kronecker delta:
\begin{equation}
\frac{\partial X_{ki}}{\partial X_{jl}} = \delta_{kj} \delta_{il}
\label{eq:5-2-4}
\end{equation}
Substituting \eqref{eq:5-2-4} into \eqref{eq:5-2-3}:
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \delta_{kj} \delta_{il}
\label{eq:5-2-5}
\end{equation}
Since $\delta_{kj} = 1$ only when $k = j$, we get $\sum_{k=0}^{N-1} A_{ik} \delta_{kj} = A_{ij}$. Similarly, since $\delta_{il} = 1$ only when $i = l$:
\begin{equation}
\sum_{i=0}^{M-1} A_{ij} \delta_{il} = A_{lj}
\label{eq:5-2-6}
\end{equation}
$A_{lj}$ is the $(j, l)$ entry of the transpose $\boldsymbol{A}^\top$:
\begin{equation}
A_{lj} = (\boldsymbol{A}^\top)_{jl}
\label{eq:5-2-7}
\end{equation}
Since \eqref{eq:5-2-7} holds for all $(j, l)$, we obtain the final result in matrix form.
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top
\label{eq:5-2-8}
\end{equation}
Remark: The transpose appears because, when summing diagonal elements in the definition of trace, the row index of $\boldsymbol{A}$ coincides with the column index of $\boldsymbol{X}$. In the derivative, the indices swap, yielding the transpose. When $\boldsymbol{A}$ is symmetric ($\boldsymbol{A} = \boldsymbol{A}^\top$), we have $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}$.
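The formula can be verified numerically with a central-difference gradient (a self-contained NumPy sketch; `num_grad` is an ad-hoc helper, and the $3 \times 4$ shapes are arbitrary):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # M x N constant
X = rng.standard_normal((4, 3))   # N x M variable
G = num_grad(lambda X: np.trace(A @ X), X)
assert np.allclose(G, A.T, atol=1e-6)   # d tr(AX)/dX = A^T
```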
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{A}^\top$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times M}$ is an $N \times M$ matrix variable, $\boldsymbol{A} \in \mathbb{R}^{M \times N}$ is an $M \times N$ constant matrix, $\boldsymbol{X}\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is a square matrix
Proof
Method 1: Using the cyclic property of trace
By the cyclic property of trace, $\text{tr}(\boldsymbol{P}\boldsymbol{Q}) = \text{tr}(\boldsymbol{Q}\boldsymbol{P})$ for any matrices $\boldsymbol{P} \in \mathbb{R}^{m \times n}$ and $\boldsymbol{Q} \in \mathbb{R}^{n \times m}$. Applying this to $\boldsymbol{X}\boldsymbol{A}$:
\begin{equation}
\text{tr}(\boldsymbol{X}\boldsymbol{A}) = \text{tr}(\boldsymbol{A}\boldsymbol{X})
\label{eq:5-3-1}
\end{equation}
Applying the result of Formula 5.2:
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top
\label{eq:5-3-2}
\end{equation}
Method 2: Direct computation
Writing out the $(i, i)$ entry of the matrix product $\boldsymbol{X}\boldsymbol{A}$:
\begin{equation}
(\boldsymbol{X}\boldsymbol{A})_{ii} = \sum_{k=0}^{M-1} X_{ik} A_{ki}
\label{eq:5-3-3}
\end{equation}
Since the trace is the sum of diagonal elements:
\begin{equation}
\text{tr}(\boldsymbol{X}\boldsymbol{A}) = \sum_{i=0}^{N-1} \sum_{k=0}^{M-1} X_{ik} A_{ki}
\label{eq:5-3-4}
\end{equation}
Differentiating with respect to $X_{jl}$ and substituting $\displaystyle\frac{\partial X_{ik}}{\partial X_{jl}} = \delta_{ij} \delta_{kl}$, then summing:
\begin{equation}
\sum_{i=0}^{N-1} \sum_{k=0}^{M-1} \delta_{ij} \delta_{kl} A_{ki} = A_{lj} = (\boldsymbol{A}^\top)_{jl}
\label{eq:5-3-5}
\end{equation}
Remark: By the cyclic property of trace, $\text{tr}(\boldsymbol{X}\boldsymbol{A})$ and $\text{tr}(\boldsymbol{A}\boldsymbol{X})$ have the same value. Therefore, their derivatives are also the same.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A}$
Conditions: $\boldsymbol{A} \in \mathbb{R}^{M \times N}$ is an $M \times N$ constant matrix, $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ is an $M \times N$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{N \times M}$, $\boldsymbol{A}\boldsymbol{X}^\top \in \mathbb{R}^{M \times M}$ is a square matrix
Proof
We verify the entries of the transpose. The $(k, i)$ entry of $\boldsymbol{X}^\top$ equals the $(i, k)$ entry of $\boldsymbol{X}$.
\begin{equation}
(\boldsymbol{X}^\top)_{ki} = X_{ik}
\label{eq:5-4-1}
\end{equation}
Writing out the $(i, i)$ entry of the matrix product $\boldsymbol{A}\boldsymbol{X}^\top$:
\begin{equation}
(\boldsymbol{A}\boldsymbol{X}^\top)_{ii} = \sum_{k=0}^{N-1} A_{ik} (\boldsymbol{X}^\top)_{ki} = \sum_{k=0}^{N-1} A_{ik} X_{ik}
\label{eq:5-4-2}
\end{equation}
Since the trace is the sum of diagonal elements:
\begin{equation}
\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} X_{ik}
\label{eq:5-4-3}
\end{equation}
Differentiating this expression with respect to $X_{jl}$, and noting that $A_{ik}$ is a constant:
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \frac{\partial X_{ik}}{\partial X_{jl}}
\label{eq:5-4-4}
\end{equation}
Since $\displaystyle\frac{\partial X_{ik}}{\partial X_{jl}} = \delta_{ij} \delta_{kl}$ (equals 1 only when $(i, k) = (j, l)$), substituting:
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \delta_{ij} \delta_{kl}
\label{eq:5-4-5}
\end{equation}
Summing over $i$ with $\delta_{ij}$, only the $i = j$ term survives; summing over $k$ with $\delta_{kl}$, only the $k = l$ term survives:
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = A_{jl}
\label{eq:5-4-6}
\end{equation}
Since \eqref{eq:5-4-6} holds for all $(j, l)$, we obtain the final result in matrix form.
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A}
\label{eq:5-4-7}
\end{equation}
Remark: Compared with the result of
5.2, replacing $\boldsymbol{X}$ by $\boldsymbol{X}^\top$ removes the transpose from the result. This is related to the fact that $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i,k} A_{ik} X_{ik}$ equals the Frobenius inner product $\langle \boldsymbol{A}, \boldsymbol{X} \rangle_F$ of $\boldsymbol{A}$ and $\boldsymbol{X}$.
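Both the derivative and the Frobenius-inner-product interpretation can be checked numerically (a self-contained NumPy sketch; `num_grad` is an ad-hoc central-difference helper with arbitrary shapes):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # M x N constant
X = rng.standard_normal((3, 4))   # M x N variable
G = num_grad(lambda X: np.trace(A @ X.T), X)
assert np.allclose(G, A, atol=1e-6)                      # d tr(A X^T)/dX = A
assert np.isclose(np.trace(A @ X.T), (A * X).sum())      # Frobenius inner product
```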
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ is an $M \times N$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{N \times M}$, $\boldsymbol{A} \in \mathbb{R}^{N \times M}$ is an $N \times M$ constant matrix, $\boldsymbol{X}^\top\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is a square matrix
Proof
Applying the cyclic property of trace. Setting $\boldsymbol{P} = \boldsymbol{X}^\top$ and $\boldsymbol{Q} = \boldsymbol{A}$ in $\text{tr}(\boldsymbol{P}\boldsymbol{Q}) = \text{tr}(\boldsymbol{Q}\boldsymbol{P})$:
\begin{equation}
\text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top)
\label{eq:5-5-1}
\end{equation}
For $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A})$ to be defined, $\boldsymbol{X}^\top\boldsymbol{A}$ must be a square matrix. Since $\boldsymbol{X}^\top \in \mathbb{R}^{N \times M}$ and $\boldsymbol{A} \in \mathbb{R}^{M \times N}$, we have $\boldsymbol{X}^\top\boldsymbol{A} \in \mathbb{R}^{N \times N}$, so the trace is defined.
In this case, $\boldsymbol{A}\boldsymbol{X}^\top \in \mathbb{R}^{M \times M}$, and by the cyclic property both sides have the same scalar value. Applying the result of Formula 5.4:
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A}
\label{eq:5-5-2}
\end{equation}
Remark: This formula is frequently used in machine learning. For example, when $\boldsymbol{A}$ is a label matrix and $\boldsymbol{X}$ is a prediction matrix, $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A})$ represents the sum of inner products between predictions and labels, and its gradient is $\boldsymbol{A}$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times N}$ is an $N \times N$ square matrix variable
Proof
Writing out the $(i, i)$ entry of $\boldsymbol{X}^2 = \boldsymbol{X} \cdot \boldsymbol{X}$:
\begin{equation}
(\boldsymbol{X}^2)_{ii} = \sum_{k=0}^{N-1} X_{ik} X_{ki}
\label{eq:5-6-1}
\end{equation}
Since the trace is the sum of diagonal elements:
\begin{equation}
\text{tr}(\boldsymbol{X}^2) = \sum_{i=0}^{N-1} \sum_{k=0}^{N-1} X_{ik} X_{ki}
\label{eq:5-6-2}
\end{equation}
We differentiate with respect to $X_{jl}$. Since both $X_{ik}$ and $X_{ki}$ are entries of $\boldsymbol{X}$, we apply the product rule (1.25):
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{X}^2) = \sum_{i=0}^{N-1} \sum_{k=0}^{N-1} \left( \frac{\partial X_{ik}}{\partial X_{jl}} X_{ki} + X_{ik} \frac{\partial X_{ki}}{\partial X_{jl}} \right)
\label{eq:5-6-3}
\end{equation}
Computing the first term. Substituting $\displaystyle\frac{\partial X_{ik}}{\partial X_{jl}} = \delta_{ij} \delta_{kl}$, only the $i = j$ and $k = l$ terms survive:
\begin{equation}
\sum_{i=0}^{N-1} \sum_{k=0}^{N-1} \delta_{ij} \delta_{kl} X_{ki} = X_{lj}
\label{eq:5-6-4}
\end{equation}
Computing the second term. Substituting $\displaystyle\frac{\partial X_{ki}}{\partial X_{jl}} = \delta_{kj} \delta_{il}$, only the $k = j$ and $i = l$ terms survive:
\begin{equation}
\sum_{i=0}^{N-1} \sum_{k=0}^{N-1} X_{ik} \delta_{kj} \delta_{il} = X_{lj}
\label{eq:5-6-5}
\end{equation}
Combining the first and second terms:
\begin{equation}
\frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{X}^2) = X_{lj} + X_{lj} = 2X_{lj}
\label{eq:5-6-6}
\end{equation}
Since $X_{lj}$ is the $(j, l)$ entry of the transpose $\boldsymbol{X}^\top$, we obtain the final result in matrix form.
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top
\label{eq:5-6-7}
\end{equation}
Remark: The factor of 2 appears because in $\text{tr}(\boldsymbol{X}^2) = \sum_{i,k} X_{ik} X_{ki}$, $X_{jl}$ contributes both as the first factor and as the second factor. When $\boldsymbol{X}$ is symmetric, $\boldsymbol{X}^\top = \boldsymbol{X}$, so the result becomes $2\boldsymbol{X}$.
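The factor of 2 and the transpose can both be seen in a numerical check (a NumPy sketch; `num_grad` is an ad-hoc central-difference helper):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))   # square matrix variable
G = num_grad(lambda X: np.trace(X @ X), X)
assert np.allclose(G, 2 * X.T, atol=1e-6)   # d tr(X^2)/dX = 2 X^T
```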
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times N}$ is an $N \times N$ square matrix variable, $\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is an $N \times N$ constant matrix
Proof
Since the $(i, j)$ entry of $\boldsymbol{X}^2$ is $(\boldsymbol{X}^2)_{ij} = \sum_{k=0}^{N-1} X_{ik} X_{kj}$, the trace is:
\begin{equation}
\text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \sum_{k=0}^{N-1} X_{ik} X_{kj} A_{ji}
\label{eq:5-7-1}
\end{equation}
We differentiate with respect to $X_{pq}$. Since both $X_{ik}$ and $X_{kj}$ are entries of $\boldsymbol{X}$, we apply the product rule (1.25):
\begin{equation}
\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = \sum_{i,j,k} \left( \frac{\partial X_{ik}}{\partial X_{pq}} X_{kj} A_{ji} + X_{ik} \frac{\partial X_{kj}}{\partial X_{pq}} A_{ji} \right)
\label{eq:5-7-2}
\end{equation}
Computing the first term. Substituting $\displaystyle\frac{\partial X_{ik}}{\partial X_{pq}} = \delta_{ip} \delta_{kq}$ selects $i = p$ and $k = q$:
\begin{equation}
\sum_{i,j,k} \delta_{ip} \delta_{kq} X_{kj} A_{ji} = \sum_{j} X_{qj} A_{jp} = (\boldsymbol{X}\boldsymbol{A})_{qp}
\label{eq:5-7-3}
\end{equation}
Computing the second term. Substituting $\displaystyle\frac{\partial X_{kj}}{\partial X_{pq}} = \delta_{kp} \delta_{jq}$ selects $k = p$ and $j = q$:
\begin{equation}
\sum_{i,j,k} X_{ik} \delta_{kp} \delta_{jq} A_{ji} = \sum_{i} X_{ip} A_{qi} = (\boldsymbol{A}\boldsymbol{X})_{qp}
\label{eq:5-7-4}
\end{equation}
Combining the first and second terms:
\begin{equation}
\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A})_{qp} + (\boldsymbol{A}\boldsymbol{X})_{qp}
\label{eq:5-7-5}
\end{equation}
The $(q,p)$ entry is the $(p,q)$ entry of the transpose, and by linearity of the transpose:
\begin{equation}
(\boldsymbol{X}\boldsymbol{A})_{qp} + (\boldsymbol{A}\boldsymbol{X})_{qp} = ((\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top)_{pq}
\label{eq:5-7-6}
\end{equation}
Since \eqref{eq:5-7-6} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top
\label{eq:5-7-7}
\end{equation}
Remark: When $\boldsymbol{A}$ is symmetric and $\boldsymbol{X}$ is also symmetric, $\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X}$ is symmetric, and the result simplifies to $\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X}$. When $\boldsymbol{A} = \boldsymbol{I}$, this reduces to
5.6.
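A numerical check of this result (a NumPy sketch; `num_grad` is an ad-hoc central-difference helper, with a generic non-symmetric $\boldsymbol{A}$):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # square constant
X = rng.standard_normal((4, 4))   # square variable
G = num_grad(lambda X: np.trace(X @ X @ A), X)
assert np.allclose(G, (X @ A + A @ X).T, atol=1e-6)   # d tr(X^2 A)/dX
```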
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times M}$ is an $N \times M$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{M \times N}$, $\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is an $N \times N$ constant matrix, $\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X} \in \mathbb{R}^{M \times M}$ is a square matrix
Proof
Writing out the $(i, k)$ entry of $\boldsymbol{A}\boldsymbol{X}$:
\begin{equation}
(\boldsymbol{A}\boldsymbol{X})_{ik} = \sum_{j=0}^{N-1} A_{ij} X_{jk}
\label{eq:5-8-1}
\end{equation}
The $(l, l)$ entry of $\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}$ is:
\begin{equation}
(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})_{ll} = \sum_{i=0}^{N-1} (\boldsymbol{X}^\top)_{li} (\boldsymbol{A}\boldsymbol{X})_{il}
\label{eq:5-8-2}
\end{equation}
Substituting $(\boldsymbol{X}^\top)_{li} = X_{il}$ and using \eqref{eq:5-8-1}:
\begin{equation}
(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})_{ll} = \sum_{i=0}^{N-1} X_{il} \sum_{j=0}^{N-1} A_{ij} X_{jl}
\label{eq:5-8-3}
\end{equation}
Rearranging the sums:
\begin{equation}
(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})_{ll} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{il} A_{ij} X_{jl}
\label{eq:5-8-4}
\end{equation}
Since the trace is the sum of diagonal elements, summing over $l$:
\begin{equation}
\text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \sum_{l=0}^{M-1} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{il} A_{ij} X_{jl}
\label{eq:5-8-5}
\end{equation}
We differentiate with respect to $X_{pq}$. Since both $X_{il}$ and $X_{jl}$ are entries of $\boldsymbol{X}$, we apply the product rule (1.25):
\begin{equation}
\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \sum_{l,i,j} \left( \frac{\partial X_{il}}{\partial X_{pq}} A_{ij} X_{jl} + X_{il} A_{ij} \frac{\partial X_{jl}}{\partial X_{pq}} \right)
\label{eq:5-8-6}
\end{equation}
Computing the first term. Substituting $\displaystyle\frac{\partial X_{il}}{\partial X_{pq}} = \delta_{ip} \delta_{lq}$:
\begin{equation}
\sum_{l,i,j} \delta_{ip} \delta_{lq} A_{ij} X_{jl} = \sum_{j} A_{pj} X_{jq}
\label{eq:5-8-7}
\end{equation}
Here $\delta_{ip}$ selects $i = p$ and $\delta_{lq}$ selects $l = q$.
Rewriting \eqref{eq:5-8-7} as a matrix product:
\begin{equation}
\sum_{j} A_{pj} X_{jq} = (\boldsymbol{A}\boldsymbol{X})_{pq}
\label{eq:5-8-8}
\end{equation}
Computing the second term. Substituting $\displaystyle\frac{\partial X_{jl}}{\partial X_{pq}} = \delta_{jp} \delta_{lq}$:
\begin{equation}
\sum_{l,i,j} X_{il} A_{ij} \delta_{jp} \delta_{lq} = \sum_{i} X_{iq} A_{ip}
\label{eq:5-8-9}
\end{equation}
Here $\delta_{jp}$ selects $j = p$ and $\delta_{lq}$ selects $l = q$.
Transforming \eqref{eq:5-8-9}. Using $A_{ip} = (\boldsymbol{A}^\top)_{pi}$:
\begin{equation}
\sum_{i} X_{iq} A_{ip} = \sum_{i} (\boldsymbol{A}^\top)_{pi} X_{iq}
\label{eq:5-8-10}
\end{equation}
Rewriting as a matrix product:
\begin{equation}
\sum_{i} (\boldsymbol{A}^\top)_{pi} X_{iq} = (\boldsymbol{A}^\top\boldsymbol{X})_{pq}
\label{eq:5-8-11}
\end{equation}
Combining the first term \eqref{eq:5-8-8} and the second term \eqref{eq:5-8-11}:
\begin{equation}
\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = (\boldsymbol{A}\boldsymbol{X})_{pq} + (\boldsymbol{A}^\top\boldsymbol{X})_{pq}
\label{eq:5-8-12}
\end{equation}
Since \eqref{eq:5-8-12} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}
\label{eq:5-8-13}
\end{equation}
Remark: When $\boldsymbol{A}$ is symmetric ($\boldsymbol{A} = \boldsymbol{A}^\top$), the result becomes $2\boldsymbol{A}\boldsymbol{X}$. This formula is a generalization of the quadratic form, and $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \sum_{l} \boldsymbol{x}_l^\top \boldsymbol{A} \boldsymbol{x}_l$ where $\boldsymbol{x}_l$ denotes the $l$-th column of $\boldsymbol{X}$.
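This formula, with a rectangular $\boldsymbol{X}$ and a non-symmetric $\boldsymbol{A}$, can be checked numerically (a NumPy sketch; `num_grad` is an ad-hoc central-difference helper):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # N x N constant
X = rng.standard_normal((4, 3))   # N x M variable
G = num_grad(lambda X: np.trace(X.T @ A @ X), X)
assert np.allclose(G, A @ X + A.T @ X, atol=1e-6)   # d tr(X^T A X)/dX
```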
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{X}^\top) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof
By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{X}^\top) = \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})$, so this gives the same result as 5.8.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof
By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{A}) = \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})$, so this gives the same result as 5.8.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof
Writing the trace in component form:
\begin{eqnarray}
\text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = \displaystyle\sum_{i,j,k} X_{ij} A_{jk} X_{ik}
\end{eqnarray}
Differentiating with respect to $X_{pq}$, we find two types of terms that contain $X_{pq}$.
Case 1: $X_{ij} = X_{pq}$ (i.e., $i = p, j = q$)
\begin{align}
\frac{\partial}{\partial X_{pq}} \sum_{k} X_{pq} A_{qk} X_{pk}
&= \sum_{k} A_{qk} X_{pk} \notag \\
&= \sum_{k} X_{pk} A_{qk} = \sum_{k} X_{pk} (\boldsymbol{A}^\top)_{kq} \notag \\
&= (\boldsymbol{X}\boldsymbol{A}^\top)_{pq} \notag
\end{align}
Case 2: $X_{ik} = X_{pq}$ (i.e., $i = p, k = q$)
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \sum_{j} X_{pj} A_{jq} X_{pq}
&=&
\sum_{j} X_{pj} A_{jq} = (\boldsymbol{X}\boldsymbol{A})_{pq}
\end{eqnarray}
Combining the above:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top)
= (\boldsymbol{X}\boldsymbol{A}^\top)_{pq} + (\boldsymbol{X}\boldsymbol{A})_{pq}
\end{eqnarray}
Remark: When $\boldsymbol{A}$ is symmetric, $\boldsymbol{A} = \boldsymbol{A}^\top$, and the result becomes $2\boldsymbol{X}\boldsymbol{A}$.
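A numerical check for this variant (a NumPy sketch; `num_grad` is an ad-hoc central-difference helper, with arbitrary $M \times N$ shapes):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # N x N constant
X = rng.standard_normal((3, 4))   # M x N variable
G = num_grad(lambda X: np.trace(X @ A @ X.T), X)
assert np.allclose(G, X @ A.T + X @ A, atol=1e-6)   # d tr(X A X^T)/dX
```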
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{X}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof
By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{X}) = \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top)$, so this gives the same result as 5.11.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof
By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{A}) = \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top)$, so this gives the same result as 5.11.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = \boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top + \boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top$
Conditions: $\boldsymbol{A}, \boldsymbol{X}, \boldsymbol{B}$ are $N \times N$ square matrices
Proof
Writing the trace in component form:
\begin{eqnarray}
\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = \displaystyle\sum_{i,j,k,l} A_{ij} X_{jk} B_{kl} X_{li}
\end{eqnarray}
Differentiating with respect to $X_{pq}$, we find two types of terms that contain $X_{pq}$.
Case 1: $X_{jk} = X_{pq}$ (i.e., $j = p, k = q$)
\begin{align}
\frac{\partial}{\partial X_{pq}} \sum_{i,l} A_{ip} X_{pq} B_{ql} X_{li}
&= \sum_{i,l} A_{ip} B_{ql} X_{li} \notag \\
&= \sum_{i} A_{ip} \sum_{l} B_{ql} X_{li}
= \sum_{i} A_{ip} (\boldsymbol{B}\boldsymbol{X})_{qi} \notag \\
&= \sum_{i} (\boldsymbol{A}^\top)_{pi} (\boldsymbol{X}^\top\boldsymbol{B}^\top)_{iq}
= (\boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top)_{pq} \notag
\end{align}
Case 2: $X_{li} = X_{pq}$ (i.e., $l = p, i = q$)
\begin{align}
\frac{\partial}{\partial X_{pq}} \sum_{j,k} A_{qj} X_{jk} B_{kp} X_{pq}
&= \sum_{j,k} A_{qj} X_{jk} B_{kp} \notag \\
&= \sum_{k} B_{kp} \sum_{j} A_{qj} X_{jk}
= \sum_{k} B_{kp} (\boldsymbol{A}\boldsymbol{X})_{qk} \notag \\
&= \sum_{k} (\boldsymbol{B}^\top)_{pk} (\boldsymbol{X}^\top\boldsymbol{A}^\top)_{kq}
= (\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top)_{pq} \notag
\end{align}
Combining the above:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X})
= (\boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top)_{pq} + (\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top)_{pq}
\end{eqnarray}
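The two-transpose structure of this result can be confirmed numerically (a NumPy sketch; `num_grad` is an ad-hoc central-difference helper):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 4))
G = num_grad(lambda X: np.trace(A @ X @ B @ X), X)
expected = A.T @ X.T @ B.T + B.T @ X.T @ A.T
assert np.allclose(G, expected, atol=1e-6)
```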
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = 2\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is a matrix of arbitrary size
Proof
Setting $\boldsymbol{A} = \boldsymbol{I}$ in 5.8 gives $\boldsymbol{I}\boldsymbol{X} + \boldsymbol{I}^\top\boldsymbol{X} = 2\boldsymbol{X}$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = 2\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is a matrix of arbitrary size
Proof
By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X})$, so this gives the same result as 5.15.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top$
Conditions: $\boldsymbol{B}, \boldsymbol{C}$ are constant matrices
Proof
Setting $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{B}$, we get $\text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = \text{tr}(\boldsymbol{Y}^\top\boldsymbol{C}\boldsymbol{Y})$. By 5.8, $\displaystyle\frac{\partial}{\partial \boldsymbol{Y}} \text{tr}(\boldsymbol{Y}^\top\boldsymbol{C}\boldsymbol{Y}) = (\boldsymbol{C} + \boldsymbol{C}^\top)\boldsymbol{Y}$. Applying the chain rule (matrix version of 1.26) with $\displaystyle\frac{\partial Y_{ij}}{\partial X_{pq}} = \delta_{ip} B_{qj}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B})
= \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top
\end{eqnarray}
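The substitution argument above can be cross-checked numerically against the direct gradient (a NumPy sketch; `num_grad` is an ad-hoc central-difference helper, with arbitrary rectangular shapes):

```python
import numpy as np

def num_grad(f, X, eps=1e-6):
    # Central-difference approximation of d f(X) / dX
    # (denominator layout: result has the same shape as X).
    G = np.zeros_like(X)
    for idx in np.ndindex(X.shape):
        E = np.zeros_like(X)
        E[idx] = eps
        G[idx] = (f(X + E) - f(X - E)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
C = rng.standard_normal((3, 3))   # constant
B = rng.standard_normal((4, 2))   # constant
X = rng.standard_normal((3, 4))   # variable
G = num_grad(lambda X: np.trace(B.T @ X.T @ C @ X @ B), X)
expected = (C + C.T) @ X @ B @ B.T
assert np.allclose(G, expected, atol=1e-6)
```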
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C}) = \boldsymbol{B}\boldsymbol{X}\boldsymbol{C} + \boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{C}^\top$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$ is an $N \times N$ constant matrix, $\boldsymbol{C}$ is an $M \times M$ constant matrix
Proof
Writing the trace in component form, $\text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C}) = \displaystyle\sum_{i,j,k,l} X_{ji} B_{jk} X_{kl} C_{li}$. Differentiating with respect to $X_{pq}$, the term from $X_{ji}$ (with $j = p$, $i = q$) gives $\sum_{k,l} B_{pk} X_{kl} C_{lq} = (\boldsymbol{B}\boldsymbol{X}\boldsymbol{C})_{pq}$, and the term from $X_{kl}$ (with $k = p$, $l = q$) gives $\sum_{i,j} X_{ji} B_{jp} C_{qi} = (\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{C}^\top)_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C})
= (\boldsymbol{B}\boldsymbol{X}\boldsymbol{C})_{pq} + (\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{C}^\top)_{pq}
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C}) = \boldsymbol{A}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}, \boldsymbol{C}$ are $M \times M$ constant matrices, $\boldsymbol{B}$ is an $N \times N$ constant matrix
Proof
Writing the trace in component form, $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C}) = \displaystyle\sum_{i,j,k,l,m} A_{ij} X_{jk} B_{kl} X_{ml} C_{mi}$. Differentiating with respect to $X_{pq}$, the term from $X_{jk}$ (with $j = p$, $k = q$) gives $(\boldsymbol{A}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}^\top)_{pq}$, and the term from $X_{ml}$ (with $m = p$, $l = q$) gives $(\boldsymbol{C}\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C})
= (\boldsymbol{A}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}^\top)_{pq} + (\boldsymbol{C}\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})_{pq}
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})^\top] = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})\boldsymbol{B}^\top$
Conditions: $\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}$ are constant matrices
Proof
Setting $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C}$, we get $\text{tr}(\boldsymbol{Y}\boldsymbol{Y}^\top) = \|\boldsymbol{Y}\|_F^2$. By 5.16, $\displaystyle\frac{\partial}{\partial \boldsymbol{Y}} \text{tr}(\boldsymbol{Y}\boldsymbol{Y}^\top) = 2\boldsymbol{Y}$. Applying the chain rule (1.26), since $\displaystyle\frac{\partial Y_{ij}}{\partial X_{pq}} = A_{ip} B_{qj}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{Y}\boldsymbol{Y}^\top)
= 2 (\boldsymbol{A}^\top \boldsymbol{Y} \boldsymbol{B}^\top)_{pq}
\end{eqnarray}
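The Frobenius-norm form $\text{tr}(\boldsymbol{Y}\boldsymbol{Y}^\top)$ with $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C}$ can likewise be checked against finite differences; in this sketch the shapes (chosen only so all products are defined) and seed are arbitrary:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
K, M, N, J = 2, 3, 4, 5
A = rng.standard_normal((K, M))
B = rng.standard_normal((N, J))
C = rng.standard_normal((K, J))
X = rng.standard_normal((M, N))

f = lambda Y: np.trace((A @ Y @ B + C) @ (A @ Y @ B + C).T)
analytic = 2 * A.T @ (A @ X @ B + C) @ B.T
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```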
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X} \otimes \boldsymbol{X}) = 2\text{tr}(\boldsymbol{X})\boldsymbol{I}$
Conditions: $\boldsymbol{X}$ is a square matrix, $\otimes$ denotes the Kronecker product
Proof
By the trace property of the Kronecker product $\text{tr}(\boldsymbol{A} \otimes \boldsymbol{B}) = \text{tr}(\boldsymbol{A})\text{tr}(\boldsymbol{B})$, we have $\text{tr}(\boldsymbol{X} \otimes \boldsymbol{X}) = [\text{tr}(\boldsymbol{X})]^2$. Differentiating with respect to $X_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} [\text{tr}(\boldsymbol{X})]^2
= 2\text{tr}(\boldsymbol{X}) \cdot \delta_{pq}
= (2\text{tr}(\boldsymbol{X})\boldsymbol{I})_{pq}
\end{eqnarray}
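Since $\text{tr}(\boldsymbol{X} \otimes \boldsymbol{X}) = [\text{tr}(\boldsymbol{X})]^2$, the check below evaluates the Kronecker product directly with `np.kron` and compares against $2\text{tr}(\boldsymbol{X})\boldsymbol{I}$ (sizes and seed are arbitrary):

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N = 3
X = rng.standard_normal((N, N))

f = lambda Y: np.trace(np.kron(Y, Y))   # equals trace(Y)**2
analytic = 2 * np.trace(X) * np.eye(N)
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```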
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = (\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof
We have $\text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = \displaystyle\sum_{i,j,k} X_{ik} A_{ij} X_{jk}$. Differentiating with respect to $X_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X})
&=&
(\boldsymbol{A} \boldsymbol{X})_{pq} + (\boldsymbol{A}^\top \boldsymbol{X})_{pq}
=
((\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{X})_{pq}
\end{eqnarray}
Remark: When $\boldsymbol{A}$ is symmetric ($\boldsymbol{A} = \boldsymbol{A}^\top$), the result becomes $2\boldsymbol{A}\boldsymbol{X}$.
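A quick numerical confirmation of $(\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{X}$ for a non-symmetric $\boldsymbol{A}$ (shapes and seed are arbitrary):

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N, M = 4, 3
A = rng.standard_normal((N, N))   # deliberately non-symmetric
X = rng.standard_normal((N, M))

f = lambda Y: np.trace(Y.T @ A @ Y)
analytic = (A + A.T) @ X
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```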
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{A}^\top \boldsymbol{B}^\top$
Conditions: $\boldsymbol{A}$ is an $L \times N$ constant matrix, $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$ is an $M \times L$ constant matrix
Proof
In component form, $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \displaystyle\sum_{l,i,j} A_{li} X_{ij} B_{jl}$. Differentiating with respect to $X_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})
=
\displaystyle\sum_{l} A_{lp} B_{ql}
=
(\boldsymbol{A}^\top \boldsymbol{B}^\top)_{pq}
\end{eqnarray}
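The linear case $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})$ has a constant gradient $\boldsymbol{A}^\top\boldsymbol{B}^\top$, which finite differences reproduce essentially exactly (shapes and seed are arbitrary):

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
L, N, M = 2, 4, 3
A = rng.standard_normal((L, N))
X = rng.standard_normal((N, M))
B = rng.standard_normal((M, L))

f = lambda Y: np.trace(A @ Y @ B)
analytic = A.T @ B.T   # constant in X, since the trace is linear in X
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```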
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}) = \boldsymbol{B}\boldsymbol{A}$
Conditions: $\boldsymbol{A}$ is an $L \times M$ constant matrix, $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$ is an $N \times L$ constant matrix
Proof
Writing the trace in component form, $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}) = \displaystyle\sum_{i,j,k} A_{ij} X_{kj} B_{ki}$. Differentiating with respect to $X_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B})
= \displaystyle\sum_{i} B_{pi} A_{iq}
= (\boldsymbol{B}\boldsymbol{A})_{pq}
\end{eqnarray}
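Checking $\boldsymbol{B}\boldsymbol{A}$ numerically, with shapes chosen so that $\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}$ is square and $\boldsymbol{B}\boldsymbol{A}$ matches the shape of $\boldsymbol{X}$ (the specific sizes and seed are arbitrary):

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
L, M, N = 2, 3, 4
A = rng.standard_normal((L, M))   # A X^T B is then L x L, so the trace exists
X = rng.standard_normal((N, M))
B = rng.standard_normal((N, L))

f = lambda Y: np.trace(A @ Y.T @ B)
analytic = B @ A                  # N x M, same shape as X
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```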
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A} \otimes \boldsymbol{X}) = \text{tr}(\boldsymbol{A})\boldsymbol{I}$
Conditions: $\boldsymbol{A}$ is an $M \times M$ constant matrix, $\boldsymbol{X}$ is an $N \times N$ matrix, $\otimes$ denotes the Kronecker product
Proof
By the trace property of the Kronecker product, $\text{tr}(\boldsymbol{A} \otimes \boldsymbol{X}) = \text{tr}(\boldsymbol{A}) \cdot \text{tr}(\boldsymbol{X})$. Differentiating with respect to $X_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A} \otimes \boldsymbol{X})
= \text{tr}(\boldsymbol{A}) \cdot \delta_{pq}
= (\text{tr}(\boldsymbol{A})\boldsymbol{I})_{pq}
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A}) = -\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{X}^{-\top}$
Conditions: $\boldsymbol{X}$ is an $N \times N$ invertible matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof
Differentiating the identity $\boldsymbol{X} \boldsymbol{X}^{-1} = \boldsymbol{I}$ with respect to $X_{pq}$ gives $\displaystyle\frac{\partial \boldsymbol{X}^{-1}}{\partial X_{pq}} = -\boldsymbol{X}^{-1} \boldsymbol{E}_{pq} \boldsymbol{X}^{-1}$, where $\boldsymbol{E}_{pq}$ is the matrix whose only nonzero entry is a 1 in position $(p, q)$. Using this to differentiate $\text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A})$ with respect to $X_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A})
&=&
-\text{tr}(\boldsymbol{X}^{-1} \boldsymbol{A} \boldsymbol{X}^{-1} \boldsymbol{E}_{pq})
=
-(\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{X}^{-\top})_{pq}
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^k) = k(\boldsymbol{X}^{k-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $k$ is a positive integer
Proof
Applying the product rule to the $k$ factors of the matrix power gives $\displaystyle\frac{\partial \boldsymbol{X}^k}{\partial X_{pq}} = \displaystyle\sum_{r=0}^{k-1} \boldsymbol{X}^r \boldsymbol{E}_{pq} \boldsymbol{X}^{k-r-1}$. Using the cyclic property of trace:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^k)
= k \cdot \text{tr}(\boldsymbol{X}^{k-1} \boldsymbol{E}_{pq})
= k (\boldsymbol{X}^{k-1})_{qp}
= k ((\boldsymbol{X}^{k-1})^\top)_{pq}
\end{eqnarray}
Remark: For $k = 2$, we get $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top$, which is consistent with 5.6.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) = \sum_{r=0}^{k-1} (\boldsymbol{X}^r \boldsymbol{A} \boldsymbol{X}^{k-r-1})^\top$
Conditions: $\boldsymbol{X}$, $\boldsymbol{A}$ are $N \times N$ matrices, $k$ is a positive integer
Proof
Similarly to 5.27, we compute the derivative of the matrix power.
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k)
&=&
\text{tr}\left( \boldsymbol{A} \displaystyle\frac{\partial \boldsymbol{X}^k}{\partial X_{pq}} \right) \\
&=&
\text{tr}\left( \boldsymbol{A} \displaystyle\sum_{r=0}^{k-1} \boldsymbol{X}^r \boldsymbol{E}_{pq} \boldsymbol{X}^{k-r-1} \right) \\
&=&
\displaystyle\sum_{r=0}^{k-1} \text{tr}(\boldsymbol{A} \boldsymbol{X}^r \boldsymbol{E}_{pq} \boldsymbol{X}^{k-r-1}) \\
&=&
\displaystyle\sum_{r=0}^{k-1} \text{tr}(\boldsymbol{X}^{k-r-1} \boldsymbol{A} \boldsymbol{X}^r \boldsymbol{E}_{pq}) \quad (\text{cyclic property of trace})
\end{eqnarray}
Using $\text{tr}(\boldsymbol{M} \boldsymbol{E}_{pq}) = M_{qp}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k)
&=&
\displaystyle\sum_{r=0}^{k-1} (\boldsymbol{X}^{k-r-1} \boldsymbol{A} \boldsymbol{X}^r)_{qp} \\
&=&
\displaystyle\sum_{r=0}^{k-1} ((\boldsymbol{X}^{k-r-1} \boldsymbol{A} \boldsymbol{X}^r)^\top)_{pq} \\
&=&
\displaystyle\sum_{r=0}^{k-1} ((\boldsymbol{X}^r)^\top \boldsymbol{A}^\top (\boldsymbol{X}^{k-r-1})^\top)_{pq}
\end{eqnarray}
Performing the substitution $s = k - r - 1$ ($r = k - s - 1$):
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k)
&=&
\displaystyle\sum_{s=0}^{k-1} ((\boldsymbol{X}^{k-s-1})^\top \boldsymbol{A}^\top (\boldsymbol{X}^s)^\top)_{pq} \\
&=&
\displaystyle\sum_{s=0}^{k-1} ((\boldsymbol{X}^s \boldsymbol{A} \boldsymbol{X}^{k-s-1})^\top)_{pq}
\end{eqnarray}
Therefore $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) = \displaystyle\sum_{r=0}^{k-1} (\boldsymbol{X}^r \boldsymbol{A} \boldsymbol{X}^{k-r-1})^\top$.
Remark: When $\boldsymbol{A}$ commutes with $\boldsymbol{X}$, every term equals $(\boldsymbol{A}\boldsymbol{X}^{k-1})^\top$ and the sum collapses to $k(\boldsymbol{A}\boldsymbol{X}^{k-1})^\top$.
In particular, when $\boldsymbol{A} = \boldsymbol{I}$, this gives $\text{tr}(\boldsymbol{X}^k)$, which is consistent with 5.27.
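The summation form can be verified numerically for a small power, say $k = 3$; the sizes and seed in this sketch are arbitrary:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N, k = 4, 3
A = rng.standard_normal((N, N))
X = rng.standard_normal((N, N))
mp = np.linalg.matrix_power

f = lambda Y: np.trace(A @ mp(Y, k))
# Sum over the k positions where the differentiated X occurs.
analytic = sum((mp(X, r) @ A @ mp(X, k - r - 1)).T for r in range(k))
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```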
Formula:
$$\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}$$
$$\quad + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X} + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top$$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{B}$ is an $N \times K$ matrix, $\boldsymbol{C}$ is an $M \times M$ matrix
Proof
Since $\boldsymbol{X}$ appears in four positions in this compound expression, the derivative is the sum of the results obtained by differentiating at each position.
Setting $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{B}$, the original expression can be written as $\text{tr}(\boldsymbol{Y}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{Y})$.
We differentiate with respect to each occurrence of $\boldsymbol{X}$.
Term 1 (differentiation at the leftmost $\boldsymbol{X}^\top$): replacing that occurrence by $\boldsymbol{E}_{pq}^\top$ yields the contribution
\begin{eqnarray}
\text{tr}(\boldsymbol{B}^\top \boldsymbol{E}_{pq}^\top \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B})
\end{eqnarray}
Using the cyclic property of trace and $\text{tr}(\boldsymbol{E}_{qp}\boldsymbol{M}) = M_{pq}$, this term gives $(\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top)_{pq}$.
Term 2 (differentiation at $\boldsymbol{X}$ in $\boldsymbol{C}\boldsymbol{X}$):
\begin{eqnarray}
\text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{E}_{pq}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B})
\end{eqnarray}
Using the cyclic property of trace, this term gives $(\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X})_{pq}$.
Term 3 (differentiation at $\boldsymbol{X}^\top$ in $\boldsymbol{X}\boldsymbol{X}^\top$):
\begin{eqnarray}
\text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{E}_{pq}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B})
\end{eqnarray}
This term gives $(\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})_{pq}$.
Term 4 (differentiation at the rightmost $\boldsymbol{X}$):
\begin{eqnarray}
\text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{E}_{pq}\boldsymbol{B})
\end{eqnarray}
This term gives $(\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top)_{pq}$.
Combining all four terms:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B})
&= \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top
+ \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X} \notag \\
&\quad + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}
+ \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top \notag
\end{align}
Remark: Although this formula is complex, each term is the result of applying the chain rule at each of the four positions where $\boldsymbol{X}$ appears.
When $\boldsymbol{C}$ is symmetric ($\boldsymbol{C} = \boldsymbol{C}^\top$), the first and fourth terms coincide, as do the second and third, and the derivative simplifies to $2\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + 2\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$.
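Because the four-term result is easy to mistype, a numerical check is worthwhile; the shapes and seed below are arbitrary, and $\boldsymbol{C}$ is deliberately non-symmetric so all four terms are exercised:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
M, N, K = 3, 4, 2
C = rng.standard_normal((M, M))   # non-symmetric on purpose
B = rng.standard_normal((N, K))
X = rng.standard_normal((M, N))

f = lambda Y: np.trace(B.T @ Y.T @ C @ Y @ Y.T @ C @ Y @ B)
analytic = (C @ X @ X.T @ C @ X @ B @ B.T
            + C.T @ X @ B @ B.T @ X.T @ C.T @ X
            + C @ X @ B @ B.T @ X.T @ C @ X
            + C.T @ X @ X.T @ C.T @ X @ B @ B.T)
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```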
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -\boldsymbol{X}^{-\top}\boldsymbol{A}^\top\boldsymbol{B}^\top\boldsymbol{X}^{-\top}$
Conditions: $\boldsymbol{X}$ is an $N \times N$ invertible matrix, $\boldsymbol{A}$ is an $L \times N$ constant matrix, $\boldsymbol{B}$ is an $N \times L$ constant matrix
Proof
We use the inverse matrix derivative formula derived in 8.2:
\begin{eqnarray}
\displaystyle\frac{\partial \boldsymbol{X}^{-1}}{\partial X_{pq}} = -\boldsymbol{X}^{-1} \boldsymbol{E}_{pq} \boldsymbol{X}^{-1}
\end{eqnarray}
where $\boldsymbol{E}_{pq}$ is the matrix whose only nonzero entry is a 1 in position $(p, q)$.
Differentiating $\text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B})$ with respect to $X_{pq}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B})
&=&
\text{tr}\left( \boldsymbol{A} \displaystyle\frac{\partial \boldsymbol{X}^{-1}}{\partial X_{pq}} \boldsymbol{B} \right) \\
&=&
\text{tr}(-\boldsymbol{A} \boldsymbol{X}^{-1} \boldsymbol{E}_{pq} \boldsymbol{X}^{-1} \boldsymbol{B}) \\
&=&
-\text{tr}(\boldsymbol{X}^{-1} \boldsymbol{B} \boldsymbol{A} \boldsymbol{X}^{-1} \boldsymbol{E}_{pq}) \quad (\text{cyclic property of trace})
\end{eqnarray}
Since $\text{tr}(\boldsymbol{M} \boldsymbol{E}_{pq}) = M_{qp}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B})
&=&
-(\boldsymbol{X}^{-1} \boldsymbol{B} \boldsymbol{A} \boldsymbol{X}^{-1})_{qp} \\
&=&
-((\boldsymbol{X}^{-1} \boldsymbol{B} \boldsymbol{A} \boldsymbol{X}^{-1})^\top)_{pq} \\
&=&
-(\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{B}^\top \boldsymbol{X}^{-\top})_{pq}
\end{eqnarray}
Therefore $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -\boldsymbol{X}^{-\top}\boldsymbol{A}^\top\boldsymbol{B}^\top\boldsymbol{X}^{-\top}$.
Remark: This is equivalent to $-(\boldsymbol{X}^{-1}\boldsymbol{B}\boldsymbol{A}\boldsymbol{X}^{-1})^\top$.
When $\boldsymbol{A} = \boldsymbol{I}$, this reduces to the formula in 4.4.
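A numerical check of the inverse formula; the diagonal shift added to $\boldsymbol{X}$ is only a conditioning convenience to keep it safely invertible, and all sizes and the seed are arbitrary:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N, L = 4, 2
A = rng.standard_normal((L, N))
B = rng.standard_normal((N, L))
X = rng.standard_normal((N, N)) + N * np.eye(N)   # well away from singularity

f = lambda Y: np.trace(A @ np.linalg.inv(Y) @ B)
Xinv = np.linalg.inv(X)
analytic = -(Xinv.T @ A.T @ B.T @ Xinv.T)         # equals -(Xinv @ B @ A @ Xinv).T
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```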
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{A}] = -\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{A}+\boldsymbol{A}^\top)(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{C}$ is an $N \times N$ symmetric matrix, $\boldsymbol{A}$ is an $M \times M$ matrix
Proof
Let $\boldsymbol{W} = \boldsymbol{X}^\top \boldsymbol{C} \boldsymbol{X}$ (an $M \times M$ matrix).
First, we compute the derivative of $\boldsymbol{W}$ with respect to $X_{pq}$. Since $W_{ij} = \displaystyle\sum_{k,l} X_{ki} C_{kl} X_{lj}$:
\begin{eqnarray}
\displaystyle\frac{\partial W_{ij}}{\partial X_{pq}}
&=& \displaystyle\sum_l C_{pl} X_{lj} \delta_{iq} + \displaystyle\sum_k X_{ki} C_{kp} \delta_{jq} \\
&=& (\boldsymbol{C}\boldsymbol{X})_{pj} \delta_{iq} + (\boldsymbol{X}^\top\boldsymbol{C})_{ip} \delta_{jq}
\end{eqnarray}
Using the inverse matrix derivative formula $\displaystyle\frac{\partial \boldsymbol{W}^{-1}}{\partial W_{ij}} = -\boldsymbol{W}^{-1} \boldsymbol{E}_{ij} \boldsymbol{W}^{-1}$ and the chain rule:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A})
&=& \displaystyle\sum_{i,j} \displaystyle\frac{\partial \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A})}{\partial W_{ij}} \cdot \displaystyle\frac{\partial W_{ij}}{\partial X_{pq}}
\end{eqnarray}
Substituting $\displaystyle\frac{\partial}{\partial W_{ij}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A}) = -(\boldsymbol{W}^{-1}\boldsymbol{A}\boldsymbol{W}^{-1})_{ji}$
and using $\boldsymbol{C} = \boldsymbol{C}^\top$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A})
&=& -(\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}\boldsymbol{A}\boldsymbol{W}^{-1})_{pq} - (\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}\boldsymbol{A}^\top\boldsymbol{W}^{-1})_{pq}
\end{eqnarray}
Therefore $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{A}] = -\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{A}+\boldsymbol{A}^\top)(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}$.
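For a numerical check, $\boldsymbol{C}$ is built positive definite (stronger than the stated symmetry, but it guarantees $\boldsymbol{W}$ is invertible for a full-column-rank $\boldsymbol{X}$); the sizes and seed are arbitrary:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N, M = 5, 3
Rc = rng.standard_normal((N, N))
C = Rc @ Rc.T + N * np.eye(N)      # symmetric positive definite
A = rng.standard_normal((M, M))    # need not be symmetric
X = rng.standard_normal((N, M))
Winv = np.linalg.inv(X.T @ C @ X)

f = lambda Y: np.trace(np.linalg.inv(Y.T @ C @ Y) @ A)
analytic = -C @ X @ Winv @ (A + A.T) @ Winv
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```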
Formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})]
&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\
&\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag
\end{align}
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$, $\boldsymbol{C}$ are $N \times N$ symmetric matrices
Proof
Let $\boldsymbol{W} = \boldsymbol{X}^\top \boldsymbol{C} \boldsymbol{X}$ and $\boldsymbol{V} = \boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{X}$.
Splitting the derivative into the contribution from $\boldsymbol{W}^{-1}$ (with $\boldsymbol{V}$ held fixed) and the contribution from $\boldsymbol{V}$ (with $\boldsymbol{W}^{-1}$ held fixed):
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{V})
&=& \text{tr}\left(\displaystyle\frac{\partial \boldsymbol{W}^{-1}}{\partial X_{pq}} \boldsymbol{V}\right) + \text{tr}\left(\boldsymbol{W}^{-1} \displaystyle\frac{\partial \boldsymbol{V}}{\partial X_{pq}}\right)
\end{eqnarray}
Term 1 (derivative of $\boldsymbol{W}^{-1}$, with $\boldsymbol{V}$ fixed):
Substituting $\boldsymbol{A} = \boldsymbol{V} = \boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}$ (symmetric) into the result of 5.31:
\begin{eqnarray}
-\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}(\boldsymbol{V}+\boldsymbol{V}^\top)\boldsymbol{W}^{-1}
&=& -2\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}\boldsymbol{V}\boldsymbol{W}^{-1} \\
&=& -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}
\end{eqnarray}
Term 2 (derivative of $\boldsymbol{V}$, with $\boldsymbol{W}^{-1}$ fixed):
By 5.22, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}) = 2\boldsymbol{B}\boldsymbol{X}\boldsymbol{W}^{-1}$ (when $\boldsymbol{B}$ is symmetric).
Therefore:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})]
&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\
&\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag
\end{align}
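The two-term result can be checked numerically; here $\boldsymbol{C}$ is made positive definite (a convenience ensuring $\boldsymbol{W}$ is invertible) while $\boldsymbol{B}$ is merely symmetric, and all sizes and the seed are arbitrary:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N, M = 5, 3
Rb = rng.standard_normal((N, N))
Rc = rng.standard_normal((N, N))
B = Rb + Rb.T                      # symmetric
C = Rc @ Rc.T + N * np.eye(N)      # symmetric positive definite
X = rng.standard_normal((N, M))
Winv = np.linalg.inv(X.T @ C @ X)

f = lambda Y: np.trace(np.linalg.inv(Y.T @ C @ Y) @ (Y.T @ B @ Y))
analytic = -2 * C @ X @ Winv @ X.T @ B @ X @ Winv + 2 * B @ X @ Winv
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```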
Formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})]
&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\
&\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag
\end{align}
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $M \times M$ symmetric constant matrix (so that $\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$ remains symmetric), $\boldsymbol{B}$, $\boldsymbol{C}$ are $N \times N$ symmetric matrices
Proof
Let $\boldsymbol{W} = \boldsymbol{A} + \boldsymbol{X}^\top \boldsymbol{C} \boldsymbol{X}$ and $\boldsymbol{V} = \boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{X}$.
In the derivative of $\boldsymbol{W}$ with respect to $X_{pq}$, the constant term $\boldsymbol{A}$ vanishes:
\begin{eqnarray}
\displaystyle\frac{\partial W_{ij}}{\partial X_{pq}} = (\boldsymbol{C}\boldsymbol{X})_{pj} \delta_{iq} + (\boldsymbol{X}^\top\boldsymbol{C})_{ip} \delta_{jq}
\end{eqnarray}
This has the same form as in 5.32, so the result is obtained simply by replacing $\boldsymbol{W} = \boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$ with $\boldsymbol{W} = \boldsymbol{A} + \boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})]
&= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\
&\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag
\end{align}
Remark: This family of formulas plays an important role in the derivation of least squares and generalized least squares (GLS) estimators.
The form $(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}$ appears in the variance-covariance matrix of the weighted least squares estimator.
Trace Derivatives of Elementary Functions
We discuss the derivative of the trace of a matrix function $f(\boldsymbol{X})$.
Matrix functions are defined by their Taylor series, and for a diagonalizable matrix $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$,
$f(\boldsymbol{X}) = \boldsymbol{P} f(\boldsymbol{\Lambda}) \boldsymbol{P}^{-1}$,
where $f(\boldsymbol{\Lambda})$ is the diagonal matrix obtained by applying $f$ to the eigenvalues.
In general, the derivative of the trace of a matrix function, when the matrix is diagonalizable with distinct eigenvalues, is given by:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(f(\boldsymbol{X})) = f'(\boldsymbol{X})^\top
\end{eqnarray}
where $f'(\boldsymbol{X})$ is the derivative of $f$ applied to the matrix $\boldsymbol{X}$.
We prove this for individual functions below.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\exp(\boldsymbol{X})) = \exp(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof
The matrix exponential is defined by its Taylor series:
\begin{eqnarray}
\exp(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{X}^k}{k!}
= \boldsymbol{I} + \boldsymbol{X} + \displaystyle\frac{\boldsymbol{X}^2}{2!} + \displaystyle\frac{\boldsymbol{X}^3}{3!} + \cdots
\end{eqnarray}
Taking the trace and differentiating term by term:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\exp(\boldsymbol{X}))
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{k}{k!} (\boldsymbol{X}^{k-1})^\top \\
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(k-1)!} (\boldsymbol{X}^{k-1})^\top \\
&=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{m!} (\boldsymbol{X}^m)^\top \quad (m = k-1) \\
&=& \exp(\boldsymbol{X})^\top
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\log(\boldsymbol{X})) = \boldsymbol{X}^{-\top}$
Conditions: $\boldsymbol{X}$ is an $N \times N$ positive definite matrix
Proof
We consider the matrix logarithm when $\boldsymbol{X}$ is positive definite.
Using the trace property $\text{tr}(\log(\boldsymbol{X})) = \log(|\boldsymbol{X}|)$,
which follows from the diagonalization $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$:
$\text{tr}(\log(\boldsymbol{X})) = \displaystyle\sum_i \log(\lambda_i) = \log(\prod_i \lambda_i) = \log(|\boldsymbol{X}|)$.
Using the determinant derivative formula $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}| = |\boldsymbol{X}| \boldsymbol{X}^{-\top}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\log(\boldsymbol{X}))
&=& \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log(|\boldsymbol{X}|) \\
&=& \displaystyle\frac{1}{|\boldsymbol{X}|} \cdot |\boldsymbol{X}| \boldsymbol{X}^{-\top} \\
&=& \boldsymbol{X}^{-\top}
\end{eqnarray}
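Using the identity $\text{tr}(\log(\boldsymbol{X})) = \log(|\boldsymbol{X}|)$ from the proof, the scalar function can be evaluated with `np.linalg.slogdet`, avoiding an explicit matrix logarithm; the construction of the positive definite test matrix and the seed are arbitrary:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
N = 4
R = rng.standard_normal((N, N))
X = R @ R.T + N * np.eye(N)        # symmetric positive definite

# tr(log X) = log|X|; slogdet returns (sign, log|det|).
f = lambda Y: np.linalg.slogdet(Y)[1]
analytic = np.linalg.inv(X).T
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```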
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sqrt{\boldsymbol{X}}) = \displaystyle\frac{1}{2}(\boldsymbol{X}^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ positive definite matrix
Proof
When $\boldsymbol{X}$ is positive definite, a unique positive definite square root $\boldsymbol{X}^{1/2}$ exists.
Using the generalization of 5.27 with $n = 1/2$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^{1/2})
&=& \displaystyle\frac{1}{2} (\boldsymbol{X}^{1/2-1})^\top \\
&=& \displaystyle\frac{1}{2} (\boldsymbol{X}^{-1/2})^\top
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sin(\boldsymbol{X})) = \cos(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof
The matrix sine function is defined by its Taylor series:
\begin{eqnarray}
\sin(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} \boldsymbol{X}^{2k+1}
= \boldsymbol{X} - \displaystyle\frac{\boldsymbol{X}^3}{3!} + \displaystyle\frac{\boldsymbol{X}^5}{5!} - \cdots
\end{eqnarray}
Taking the trace and applying the formula from 5.27, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^n) = n(\boldsymbol{X}^{n-1})^\top$, term by term:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sin(\boldsymbol{X}))
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} (2k+1)(\boldsymbol{X}^{2k})^\top \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} (\boldsymbol{X}^{2k})^\top \\
&=& \left( \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \boldsymbol{X}^{2k} \right)^\top \\
&=& \cos(\boldsymbol{X})^\top
\end{eqnarray}
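The series definitions lend themselves to a direct numerical check: evaluate $\sin(\boldsymbol{X})$ and $\cos(\boldsymbol{X})$ by truncated Taylor series (the helper functions, truncation length, small scaling of $\boldsymbol{X}$, and seed are choices of this sketch) and compare the finite-difference gradient of $\text{tr}(\sin(\boldsymbol{X}))$ with $\cos(\boldsymbol{X})^\top$:

```python
import numpy as np

def num_grad(f, X, h=1e-5):
    # Central-difference gradient in denominator layout: same shape as X.
    G = np.zeros_like(X)
    for p in range(X.shape[0]):
        for q in range(X.shape[1]):
            E = np.zeros_like(X)
            E[p, q] = h
            G[p, q] = (f(X + E) - f(X - E)) / (2 * h)
    return G

def msin(X, terms=15):
    # Truncated Taylor series X - X^3/3! + X^5/5! - ...
    T = X.copy()
    S = X.copy()
    for k in range(1, terms):
        T = T @ (-X @ X) / ((2 * k) * (2 * k + 1))
        S = S + T
    return S

def mcos(X, terms=15):
    # Truncated Taylor series I - X^2/2! + X^4/4! - ...
    T = np.eye(X.shape[0])
    S = np.eye(X.shape[0])
    for k in range(1, terms):
        T = T @ (-X @ X) / ((2 * k - 1) * (2 * k))
        S = S + T
    return S

rng = np.random.default_rng(0)
N = 3
X = 0.3 * rng.standard_normal((N, N))   # small norm: few terms suffice

f = lambda Y: np.trace(msin(Y))
analytic = mcos(X).T
assert np.allclose(num_grad(f, X), analytic, rtol=1e-4, atol=1e-6)
```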
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cos(\boldsymbol{X})) = -\sin(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof
The matrix cosine function is defined by its Taylor series:
\begin{eqnarray}
\cos(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \boldsymbol{X}^{2k}
= \boldsymbol{I} - \displaystyle\frac{\boldsymbol{X}^2}{2!} + \displaystyle\frac{\boldsymbol{X}^4}{4!} - \cdots
\end{eqnarray}
Taking the trace and differentiating term by term; the $k=0$ term is the constant $\boldsymbol{I}$, whose derivative is $\boldsymbol{O}$:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cos(\boldsymbol{X}))
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} (2k)(\boldsymbol{X}^{2k-1})^\top \\
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k-1)!} (\boldsymbol{X}^{2k-1})^\top \\
&=& -\displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{(-1)^m}{(2m+1)!} (\boldsymbol{X}^{2m+1})^\top \quad (m = k-1) \\
&=& -\sin(\boldsymbol{X})^\top
\end{eqnarray}
Formula:
$$\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(f(\boldsymbol{X})) = f'(\boldsymbol{X})^\top$$
More generally, when $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$:
$$\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $f$ is analytic (has a Taylor series expansion). For the version with $\boldsymbol{A}$, $\boldsymbol{A}$ and $\boldsymbol{X}$ must commute, and the argument below additionally assumes both are diagonalizable.
Proof
Since $f$ is analytic, it has a Taylor series expansion $f(x) = \displaystyle\sum_{k=0}^{\infty} c_k x^k$.
The matrix function is defined as $f(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} c_k \boldsymbol{X}^k$, so:
\begin{align}
\text{tr}(f(\boldsymbol{X})) = \sum_{k=0}^{\infty} c_k \,\text{tr}(\boldsymbol{X}^k) \notag
\end{align}
By the method of 5.34 (term-by-term differentiation of power traces), $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^k) = k(\boldsymbol{X}^{k-1})^\top$.
This gives the same result whether one differentiates $\text{tr}(\boldsymbol{X}^k) = \displaystyle\sum_i \lambda_i^k$ as a scalar or differentiates each term of the Taylor series directly.
Applying term-by-term differentiation:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(f(\boldsymbol{X}))
&= \sum_{k=1}^{\infty} c_k \cdot k (\boldsymbol{X}^{k-1})^\top \notag \\
&= \left( \sum_{k=1}^{\infty} k\, c_k \boldsymbol{X}^{k-1} \right)^\top = f'(\boldsymbol{X})^\top \notag
\end{align}
Here $f'(x) = \displaystyle\sum_{k=1}^{\infty} k\, c_k x^{k-1}$ is the scalar derivative of $f$, and $f'(\boldsymbol{X})$ is obtained by substituting $\boldsymbol{X}$ into this series.
For the version with $\boldsymbol{A}$: when $\boldsymbol{A}$ and $\boldsymbol{X}$ commute and both are diagonalizable, they can be simultaneously diagonalized: $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$, $\boldsymbol{A} = \boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{-1}$ ($\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_N)$, $\boldsymbol{D} = \text{diag}(d_1, \ldots, d_N)$). Then:
\begin{align}
\text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = \sum_{i=1}^{N} d_i f(\lambda_i) \notag
\end{align}
Taking the scalar derivative $f'(\lambda_i)$ with respect to each $\lambda_i$ and reconstructing in matrix form, the same argument gives $(\boldsymbol{A}f'(\boldsymbol{X}))^\top$. $\square$
Remark: By this general formula, all formulas 5.39--5.58 below follow by simply substituting the appropriate $f$ and $f'$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tan(\boldsymbol{X})) = \sec^2(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\cos(\boldsymbol{X})$ is invertible
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\tan(x) = \sec^2(x)$.
Substituting $f(x) = \tan(x)$, $f'(x) = \sec^2(x)$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tan(\boldsymbol{X})) = \sec^2(\boldsymbol{X})^\top \qquad \square \notag
\end{align}
Here $\sec(\boldsymbol{X}) = \cos(\boldsymbol{X})^{-1}$, so $\sec^2(\boldsymbol{X}) = \cos(\boldsymbol{X})^{-2}$ is the square of the inverse of the matrix cosine.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arcsin(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\|\boldsymbol{X}\| < 1$
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\arcsin(x) = \displaystyle\frac{1}{\sqrt{1-x^2}}$.
Substituting $f(x) = \arcsin(x)$, $f'(x) = (1-x^2)^{-1/2}$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arcsin(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag
\end{align}
Here $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{I}-\boldsymbol{X}^2$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arccos(\boldsymbol{X})) = -((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\|\boldsymbol{X}\| < 1$
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\arccos(x) = -\displaystyle\frac{1}{\sqrt{1-x^2}}$.
Substituting $f(x) = \arccos(x)$, $f'(x) = -(1-x^2)^{-1/2}$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arccos(\boldsymbol{X})) = -((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag
\end{align}
Here $f'(\boldsymbol{X}) = -(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2}$ involves the same matrix $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2}$ as 5.40; only the sign differs.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arctan(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\arctan(x) = \displaystyle\frac{1}{1+x^2}$.
Substituting $f(x) = \arctan(x)$, $f'(x) = (1+x^2)^{-1}$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arctan(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top \qquad \square \notag
\end{align}
Here $(\boldsymbol{I}+\boldsymbol{X}^2)^{-1}$ is the inverse of the matrix $\boldsymbol{I}+\boldsymbol{X}^2$.
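This formula is convenient to verify numerically because the right-hand side needs only a plain matrix inverse. A minimal sketch assuming NumPy, with `matfunc` and `trace_grad` as ad-hoc helpers:

```python
import numpy as np

def matfunc(X, f):
    # Apply the scalar function f to a diagonalizable matrix via eigendecomposition.
    w, V = np.linalg.eig(X)
    return (V @ np.diag(f(w)) @ np.linalg.inv(V)).real

def trace_grad(fun, X, h=1e-6):
    # Central-difference gradient of the scalar fun(X) w.r.t. each entry of X.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(1)
X = 0.3 * rng.standard_normal((4, 4))   # eigenvalues stay away from the branch points +-i

numerical = trace_grad(lambda M: np.trace(matfunc(M, np.arctan)), X)
analytic = np.linalg.inv(np.eye(4) + X @ X).T   # ((I + X^2)^{-1})^T, no eigendecomposition
assert np.allclose(numerical, analytic, atol=1e-5)
```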
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sinh(\boldsymbol{X})) = \cosh(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof
The matrix hyperbolic sine function is defined by its Taylor series:
\begin{eqnarray}
\sinh(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{X}^{2k+1}}{(2k+1)!}
= \boldsymbol{X} + \displaystyle\frac{\boldsymbol{X}^3}{3!} + \displaystyle\frac{\boldsymbol{X}^5}{5!} + \cdots
\end{eqnarray}
Differentiating term by term as in 5.37:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sinh(\boldsymbol{X}))
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(2k+1)}{(2k+1)!} (\boldsymbol{X}^{2k})^\top \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} (\boldsymbol{X}^{2k})^\top \\
&=& \cosh(\boldsymbol{X})^\top \qquad \square
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cosh(\boldsymbol{X})) = \sinh(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof
The matrix hyperbolic cosine function is defined by its Taylor series:
\begin{eqnarray}
\cosh(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{X}^{2k}}{(2k)!}
= \boldsymbol{I} + \displaystyle\frac{\boldsymbol{X}^2}{2!} + \displaystyle\frac{\boldsymbol{X}^4}{4!} + \cdots
\end{eqnarray}
Differentiating term by term as in 5.38:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cosh(\boldsymbol{X}))
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(2k)}{(2k)!} (\boldsymbol{X}^{2k-1})^\top \\
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(2k-1)!} (\boldsymbol{X}^{2k-1})^\top \\
&=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{(2m+1)!} (\boldsymbol{X}^{2m+1})^\top \quad (m = k-1) \\
&=& \sinh(\boldsymbol{X})^\top \qquad \square
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tanh(\boldsymbol{X})) = \text{sech}^2(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\cosh(\boldsymbol{X})$ is invertible
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\tanh(x) = \text{sech}^2(x)$.
Substituting $f(x) = \tanh(x)$, $f'(x) = \text{sech}^2(x)$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tanh(\boldsymbol{X})) = \text{sech}^2(\boldsymbol{X})^\top \qquad \square \notag
\end{align}
Here $\text{sech}(\boldsymbol{X}) = \cosh(\boldsymbol{X})^{-1}$, so $\text{sech}^2(\boldsymbol{X}) = \cosh(\boldsymbol{X})^{-2}$ is the square of the inverse of the matrix hyperbolic cosine.
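A quick numerical check of this formula, computing $\text{sech}^2(\boldsymbol{X})$ exactly as the square of the inverse hyperbolic cosine. A sketch assuming NumPy; `matfunc` and `trace_grad` are ad-hoc helpers:

```python
import numpy as np

def matfunc(X, f):
    # Apply the scalar function f to a diagonalizable matrix via eigendecomposition.
    w, V = np.linalg.eig(X)
    return (V @ np.diag(f(w)) @ np.linalg.inv(V)).real

def trace_grad(fun, X, h=1e-6):
    # Central-difference gradient of the scalar fun(X) w.r.t. each entry of X.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(3)
X = 0.4 * rng.standard_normal((4, 4))

numerical = trace_grad(lambda M: np.trace(matfunc(M, np.tanh)), X)
inv_cosh = np.linalg.inv(matfunc(X, np.cosh))
analytic = (inv_cosh @ inv_cosh).T           # sech^2(X)^T = (cosh(X)^{-2})^T
assert np.allclose(numerical, analytic, atol=1e-5)
```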
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arcsinh}(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\text{arcsinh}(x) = \displaystyle\frac{1}{\sqrt{1+x^2}}$.
Substituting $f(x) = \text{arcsinh}(x)$, $f'(x) = (1+x^2)^{-1/2}$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arcsinh}(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag
\end{align}
Here $(\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{I}+\boldsymbol{X}^2$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arccosh}(\boldsymbol{X})) = ((\boldsymbol{X}^2-\boldsymbol{I})^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, all eigenvalues are greater than $1$
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\text{arccosh}(x) = \displaystyle\frac{1}{\sqrt{x^2-1}}$ ($x > 1$).
Substituting $f(x) = \text{arccosh}(x)$, $f'(x) = (x^2-1)^{-1/2}$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arccosh}(\boldsymbol{X})) = ((\boldsymbol{X}^2-\boldsymbol{I})^{-1/2})^\top \qquad \square \notag
\end{align}
Here $(\boldsymbol{X}^2-\boldsymbol{I})^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{X}^2-\boldsymbol{I}$, which is defined when all eigenvalues are greater than $1$.
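The eigenvalue condition can be enforced in a numerical check by shifting a small symmetric matrix: $\boldsymbol{X} = 2\boldsymbol{I} + $ (small perturbation) keeps every eigenvalue well above $1$. A sketch assuming NumPy; `matfunc` and `trace_grad` are ad-hoc helpers:

```python
import numpy as np

def matfunc(X, f):
    # Apply the scalar function f to a diagonalizable matrix via eigendecomposition.
    w, V = np.linalg.eig(X)
    return (V @ np.diag(f(w)) @ np.linalg.inv(V)).real

def trace_grad(fun, X, h=1e-6):
    # Central-difference gradient of the scalar fun(X) w.r.t. each entry of X.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 4))
X = 2.0 * np.eye(4) + 0.1 * (B + B.T)   # symmetric, all eigenvalues near 2 (> 1)

numerical = trace_grad(lambda M: np.trace(matfunc(M, np.arccosh)), X)
# ((X^2 - I)^{-1/2})^T, applied on the eigenvalues of X
analytic = matfunc(X, lambda w: 1.0 / np.sqrt(w * w - 1.0)).T
assert np.allclose(numerical, analytic, atol=1e-5)
```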
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arctanh}(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\|\boldsymbol{X}\| < 1$
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\text{arctanh}(x) = \displaystyle\frac{1}{1-x^2}$ ($|x| < 1$).
Substituting $f(x) = \text{arctanh}(x)$, $f'(x) = (1-x^2)^{-1}$ into the general formula:
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arctanh}(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1})^\top \qquad \square \notag
\end{align}
Here $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1}$ is the inverse of the matrix $\boldsymbol{I}-\boldsymbol{X}^2$.
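Here the condition $\|\boldsymbol{X}\| < 1$ can be enforced by rescaling to a fixed spectral norm, which bounds the spectral radius away from the branch points $\pm 1$. A sketch assuming NumPy; `matfunc` and `trace_grad` are ad-hoc helpers:

```python
import numpy as np

def matfunc(X, f):
    # Apply the scalar function f to a diagonalizable matrix via eigendecomposition.
    w, V = np.linalg.eig(X)
    return (V @ np.diag(f(w)) @ np.linalg.inv(V)).real

def trace_grad(fun, X, h=1e-6):
    # Central-difference gradient of the scalar fun(X) w.r.t. each entry of X.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(5)
X = rng.standard_normal((4, 4))
X = 0.5 * X / np.linalg.norm(X, 2)      # spectral norm 0.5, so every eigenvalue has modulus < 1

numerical = trace_grad(lambda M: np.trace(matfunc(M, np.arctanh)), X)
analytic = np.linalg.inv(np.eye(4) - X @ X).T   # ((I - X^2)^{-1})^T
assert np.allclose(numerical, analytic, atol=1e-5)
```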
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sin(\boldsymbol{X})) = (\boldsymbol{A}\cos(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
Using the Taylor series as in 5.37:
\begin{eqnarray}
\text{tr}(\boldsymbol{A}\sin(\boldsymbol{X}))
&=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} \boldsymbol{X}^{2k+1} \right) \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k+1})
\end{eqnarray}
Using the formula from 5.28, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^n) = n(\boldsymbol{A}\boldsymbol{X}^{n-1})^\top$ (when $\boldsymbol{A}$ and $\boldsymbol{X}$ commute):
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sin(\boldsymbol{X}))
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} (2k+1)(\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} (\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\
&=& (\boldsymbol{A}\cos(\boldsymbol{X}))^\top \qquad \square
\end{eqnarray}
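The commutativity condition is easy to satisfy in a numerical check by building $\boldsymbol{A}$ as a polynomial in $\boldsymbol{X}$, which always commutes with $\boldsymbol{X}$. A sketch assuming NumPy; `matfunc` and `trace_grad` are ad-hoc helpers:

```python
import numpy as np

def matfunc(X, f):
    # Apply the scalar function f to a diagonalizable matrix via eigendecomposition.
    w, V = np.linalg.eig(X)
    return (V @ np.diag(f(w)) @ np.linalg.inv(V)).real

def trace_grad(fun, X, h=1e-6):
    # Central-difference gradient of the scalar fun(X) w.r.t. each entry of X.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(6)
X = 0.4 * rng.standard_normal((4, 4))
A = 2.0 * np.eye(4) + X + 0.5 * X @ X   # a polynomial in X, hence A X = X A
assert np.allclose(A @ X, X @ A)

numerical = trace_grad(lambda M: np.trace(A @ matfunc(M, np.sin)), X)
analytic = (A @ matfunc(X, np.cos)).T   # (A cos(X))^T
assert np.allclose(numerical, analytic, atol=1e-5)
```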
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\exp(\boldsymbol{X})) = (\boldsymbol{A}\exp(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
Using the Taylor series as in 5.37:
\begin{eqnarray}
\text{tr}(\boldsymbol{A}\exp(\boldsymbol{X}))
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{k!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k)
\end{eqnarray}
When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\exp(\boldsymbol{X}))
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{k}{k!} (\boldsymbol{A}\boldsymbol{X}^{k-1})^\top \\
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(k-1)!} (\boldsymbol{A}\boldsymbol{X}^{k-1})^\top \\
&=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{m!} (\boldsymbol{A}\boldsymbol{X}^m)^\top \quad (m = k-1) \\
&=& (\boldsymbol{A}\exp(\boldsymbol{X}))^\top \qquad \square
\end{eqnarray}
Remark: This formula holds when the matrices $\boldsymbol{A}$ and $\boldsymbol{X}$ commute ($\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$).
In the non-commuting case, the differentiation becomes more complex, and one needs to use the Fréchet derivative formalism.
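Both halves of this remark can be seen numerically: a commuting $\boldsymbol{A}$ (built as a polynomial in $\boldsymbol{X}$) reproduces $(\boldsymbol{A}\exp(\boldsymbol{X}))^\top$, while a generic non-commuting matrix does not. A sketch assuming NumPy; `matfunc` and `trace_grad` are ad-hoc helpers:

```python
import numpy as np

def matfunc(X, f):
    # Apply the scalar function f to a diagonalizable matrix via eigendecomposition.
    w, V = np.linalg.eig(X)
    return (V @ np.diag(f(w)) @ np.linalg.inv(V)).real

def trace_grad(fun, X, h=1e-6):
    # Central-difference gradient of the scalar fun(X) w.r.t. each entry of X.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(7)
X = 0.4 * rng.standard_normal((4, 4))
expX = matfunc(X, np.exp)

# Commuting case: A is a polynomial in X, so A X = X A and the formula holds.
A = np.eye(4) + 0.5 * X
grad_commuting = trace_grad(lambda M: np.trace(A @ matfunc(M, np.exp)), X)
err_commuting = np.max(np.abs(grad_commuting - (A @ expX).T))

# Non-commuting case: for a generic B, (B exp(X))^T is NOT the gradient.
B = rng.standard_normal((4, 4))
grad_generic = trace_grad(lambda M: np.trace(B @ matfunc(M, np.exp)), X)
err_generic = np.max(np.abs(grad_generic - (B @ expX).T))
print(err_commuting, err_generic)   # the first is tiny, the second is not
```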
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cos(\boldsymbol{X})) = -(\boldsymbol{A}\sin(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
Using the Taylor series:
\begin{eqnarray}
\text{tr}(\boldsymbol{A}\cos(\boldsymbol{X}))
&=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \boldsymbol{X}^{2k} \right) \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k})
\end{eqnarray}
When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cos(\boldsymbol{X}))
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \cdot 2k \cdot (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k-1)!} (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\
&=& -\displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{(-1)^m}{(2m+1)!} (\boldsymbol{A}\boldsymbol{X}^{2m+1})^\top \quad (m = k-1) \\
&=& -(\boldsymbol{A}\sin(\boldsymbol{X}))^\top \qquad \square
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tan(\boldsymbol{X})) = (\boldsymbol{A}\sec^2(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting), $\cos(\boldsymbol{X})$ is invertible
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\tan(x) = \sec^2(x)$.
Substituting $f(x) = \tan(x)$, $f'(x) = \sec^2(x)$ into the general formula 5.34, $\frac{\partial}{\partial \boldsymbol{X}}\text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$ (which requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tan(\boldsymbol{X}))
= (\boldsymbol{A}\sec^2(\boldsymbol{X}))^\top \qquad \square \notag
\end{align}
Here $\sec^2(\boldsymbol{X}) = \cos(\boldsymbol{X})^{-2}$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arcsin(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\|\boldsymbol{X}\| < 1$, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\arcsin(x) = \displaystyle\frac{1}{\sqrt{1-x^2}}$.
Substituting $f(x) = \arcsin(x)$, $f'(x) = (1-x^2)^{-1/2}$ into the general formula 5.34, $\frac{\partial}{\partial \boldsymbol{X}}\text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$ (which requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arcsin(\boldsymbol{X}))
= (\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag
\end{align}
Here $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{I}-\boldsymbol{X}^2$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arccos(\boldsymbol{X})) = -(\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\|\boldsymbol{X}\| < 1$, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\arccos(x) = -\displaystyle\frac{1}{\sqrt{1-x^2}}$.
Substituting $f(x) = \arccos(x)$, $f'(x) = -(1-x^2)^{-1/2}$ into the general formula 5.34, $\frac{\partial}{\partial \boldsymbol{X}}\text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$ (which requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arccos(\boldsymbol{X}))
= -(\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag
\end{align}
Here the matrix $f'(\boldsymbol{X})$ differs from 5.53 only in sign.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arctan(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
In the scalar case, $\displaystyle\frac{d}{dx}\arctan(x) = \displaystyle\frac{1}{1+x^2}$.
Substituting $f(x) = \arctan(x)$, $f'(x) = (1+x^2)^{-1}$ into the general formula 5.34, $\frac{\partial}{\partial \boldsymbol{X}}\text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$ (which requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):
\begin{align}
\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arctan(\boldsymbol{X}))
= (\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top \qquad \square \notag
\end{align}
Here $(\boldsymbol{I}+\boldsymbol{X}^2)^{-1}$ is the inverse of the matrix $\boldsymbol{I}+\boldsymbol{X}^2$.
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X})) = (\boldsymbol{A}\cosh(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
Using the Taylor series:
\begin{eqnarray}
\text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X}))
&=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k+1)!} \boldsymbol{X}^{2k+1} \right) \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k+1)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k+1})
\end{eqnarray}
When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X}))
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k+1)!} \cdot (2k+1) \cdot (\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} (\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\
&=& (\boldsymbol{A}\cosh(\boldsymbol{X}))^\top \qquad \square
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X})) = (\boldsymbol{A}\sinh(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof
Using the Taylor series:
\begin{eqnarray}
\text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X}))
&=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} \boldsymbol{X}^{2k} \right) \\
&=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k})
\end{eqnarray}
When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:
\begin{eqnarray}
\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X}))
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(2k)!} \cdot 2k \cdot (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\
&=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(2k-1)!} (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\
&=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{(2m+1)!} (\boldsymbol{A}\boldsymbol{X}^{2m+1})^\top \quad (m = k-1) \\
&=& (\boldsymbol{A}\sinh(\boldsymbol{X}))^\top \qquad \square
\end{eqnarray}
Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = (\boldsymbol{A}\text{sech}^2(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting), $\cosh(\boldsymbol{X})$ is invertible
Proof
The matrix hyperbolic tangent is defined as $\tanh(\boldsymbol{X}) = \sinh(\boldsymbol{X})\cosh(\boldsymbol{X})^{-1}$, which exists whenever $\cosh(\boldsymbol{X})$ is invertible. For small $\|\boldsymbol{X}\|$ it agrees with the power series
$$\tanh(\boldsymbol{X}) = \boldsymbol{X} - \frac{1}{3}\boldsymbol{X}^3 + \frac{2}{15}\boldsymbol{X}^5 - \cdots$$
Assume $\boldsymbol{X}$ is diagonalizable. Setting $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$ ($\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_N)$), the matrix function acts on the eigenvalues:
$$\tanh(\boldsymbol{X}) = \boldsymbol{P}\,\text{diag}(\tanh(\lambda_1), \ldots, \tanh(\lambda_N))\,\boldsymbol{P}^{-1}$$
When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute ($\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$) and $\boldsymbol{A}$ is itself diagonalizable, the two matrices can be diagonalized by the same eigenvectors (simultaneous diagonalization); the cyclic trace property $\text{tr}(\boldsymbol{A}\boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{-1}) = \text{tr}(\boldsymbol{P}^{-1}\boldsymbol{A}\boldsymbol{P}\boldsymbol{D})$ then lets us work entirely in that eigenbasis.
Setting $\boldsymbol{P}^{-1}\boldsymbol{A}\boldsymbol{P} = \text{diag}(a_1, \ldots, a_N)$:
$$\text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = \sum_{i=1}^{N} a_i \tanh(\lambda_i)$$
Consider the derivative with respect to the $(p,q)$ entry $X_{pq}$ of $\boldsymbol{X}$. For simple eigenvalues $\lambda_i$, first-order eigenvalue perturbation theory gives $\displaystyle\frac{\partial \lambda_i}{\partial X_{pq}} = (\boldsymbol{P}^{-1})_{ip}\,P_{qi}$, the product of the left- and right-eigenvector components of $\lambda_i$, so the problem reduces to scalar differentiation.
In the scalar case, $\displaystyle\frac{d}{d\lambda}\tanh(\lambda) = \text{sech}^2(\lambda)$, so by the chain rule:
$$\frac{\partial}{\partial X_{pq}} \sum_{i} a_i \tanh(\lambda_i) = \sum_{i} a_i\,\text{sech}^2(\lambda_i) \cdot \frac{\partial \lambda_i}{\partial X_{pq}} = \sum_{i} a_i\,\text{sech}^2(\lambda_i)\,(\boldsymbol{P}^{-1})_{ip}\,P_{qi}$$
Since $\text{sech}^2(\boldsymbol{X}) = \boldsymbol{P}\,\text{diag}(\text{sech}^2(\lambda_1), \ldots, \text{sech}^2(\lambda_N))\,\boldsymbol{P}^{-1}$ and $\boldsymbol{P}^{-1}\boldsymbol{A}\boldsymbol{P} = \text{diag}(a_1, \ldots, a_N)$, this sum is exactly $(\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))_{qp}$, i.e. the $(p,q)$ entry of $(\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))^\top$.
Applying the general formula from 5.34, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}}\text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$ (when $\boldsymbol{A}$ and $\boldsymbol{X}$ commute), with $f = \tanh$ and $f' = \text{sech}^2$:
$$\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = (\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))^\top \qquad \square$$
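The simultaneous-diagonalization argument above can be demonstrated numerically: building $\boldsymbol{A}$ as a polynomial in $\boldsymbol{X}$ makes $\boldsymbol{P}^{-1}\boldsymbol{A}\boldsymbol{P}$ diagonal, $\text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X}))$ collapses to $\sum_i a_i \tanh(\lambda_i)$, and the finite-difference gradient matches $(\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))^\top$. A sketch assuming NumPy; `matfunc` and `trace_grad` are ad-hoc helpers:

```python
import numpy as np

def matfunc(X, f):
    # Apply the scalar function f to a diagonalizable matrix via eigendecomposition.
    w, V = np.linalg.eig(X)
    return (V @ np.diag(f(w)) @ np.linalg.inv(V)).real

def trace_grad(fun, X, h=1e-6):
    # Central-difference gradient of the scalar fun(X) w.r.t. each entry of X.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (fun(X + E) - fun(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(8)
X = 0.5 * rng.standard_normal((4, 4))
A = np.eye(4) + 0.3 * X @ X             # polynomial in X: commutes, shares eigenvectors

lam, P = np.linalg.eig(X)
Pinv = np.linalg.inv(P)
a = np.diag(Pinv @ A @ P)               # A is diagonal in X's eigenbasis
assert np.allclose(Pinv @ A @ P, np.diag(a), atol=1e-8)

# tr(A tanh(X)) collapses to a scalar sum over eigenvalues
lhs = np.trace(A @ matfunc(X, np.tanh))
assert np.isclose(lhs, np.sum(a * np.tanh(lam)).real)

# and the gradient matches (A sech^2(X))^T
numerical = trace_grad(lambda M: np.trace(A @ matfunc(M, np.tanh)), X)
analytic = (A @ matfunc(X, lambda w: 1.0 / np.cosh(w) ** 2)).T
assert np.allclose(numerical, analytic, atol=1e-5)
```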