Proofs Chapter 5: Trace Derivatives

In this chapter, we prove formulas for derivatives of functions involving the trace (the sum of the diagonal elements of a matrix). Since the trace returns a scalar, it is frequently used to construct scalar objective functions of matrix variables, as in machine learning loss design, the formulation of principal component analysis (PCA), and covariance matrix estimation. We derive the derivatives of tr(AX), tr(X²), and tr(Xᵏ) by component-wise calculation and obtain closed-form expressions in matrix notation.

Prerequisites: Chapter 4 (Basic Formulas of Matrix Calculus). Chapters using results from this chapter: Chapter 7 (Determinant), Chapter 11 (Matrix Powers), Chapter 13 (Structured Matrices).

5. Trace Derivatives

Prerequisites for this chapter
Unless otherwise stated, all formulas in this chapter hold under the following conditions:
  • All formulas use the denominator layout convention
  • The derivative of a scalar $f$ with respect to a matrix $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ yields $\frac{\partial f}{\partial \boldsymbol{X}} \in \mathbb{R}^{M \times N}$
  • The trace is defined only for square matrices

When the matrix $\boldsymbol{X}$ is an $N \times N$ square matrix, there exist differentiation formulas involving the trace (sum of diagonal elements). Here we present the relevant formulas from the denominator layout perspective.

Definition of Trace

\begin{eqnarray} \text{tr}(\boldsymbol{X}) = \displaystyle\sum_{i=0}^{N-1} X_{ii} \end{eqnarray}

Relationship between Quadratic Forms and Trace

A quadratic form can be expressed using the trace.

\begin{eqnarray} \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} &=& \text{tr}(\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = \text{tr}(\boldsymbol{A} \boldsymbol{x} \boldsymbol{x}^\top) \end{eqnarray}

This follows from the fact that $\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}$ is a scalar and the cyclic property of trace (1.12): $\text{tr}(\boldsymbol{ABC}) = \text{tr}(\boldsymbol{CAB})$.

Significance of rewriting in trace form
By expressing scalar-valued functions using the trace, differentiation can be handled uniformly as matrix operations. This is a notational device for systematizing multivariable differentiation and does not change the value itself.
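
The identity above is easy to confirm numerically. The following is an illustrative NumPy sketch (ours, not part of the original text) with random data:

```python
import numpy as np

# Check x^T A x = tr(A x x^T), which follows from the cyclic property of trace.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
x = rng.standard_normal(5)

quad = x @ A @ x                          # scalar quadratic form x^T A x
via_trace = np.trace(A @ np.outer(x, x))  # tr(A x x^T)
assert np.isclose(quad, via_trace)
```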

Inner Product and Trace

The inner product of vectors can also be expressed using the trace.

\begin{eqnarray} \boldsymbol{a}^\top \boldsymbol{x} &=& \text{tr}(\boldsymbol{a}^\top \boldsymbol{x}) = \text{tr}(\boldsymbol{x} \boldsymbol{a}^\top) \end{eqnarray}

See Trace Derivatives for a list of formulas. We prove each formula below. The sizes of $\boldsymbol{X}$ and the constant matrices vary from formula to formula and are stated in each formula's conditions; in the denominator layout, the derivative $\frac{\partial f}{\partial \boldsymbol{X}}$ always has the same shape as $\boldsymbol{X}$.

5.1 Derivative of Trace

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}) = \boldsymbol{I}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times N}$ is an $N \times N$ square matrix, $\text{tr}(\boldsymbol{X}) \in \mathbb{R}$ is a scalar
Proof

We recall the definition of trace. The trace is the sum of the diagonal elements of a square matrix.

\begin{equation} \text{tr}(\boldsymbol{X}) = \sum_{i=0}^{N-1} X_{ii} \label{eq:5-1-1} \end{equation}

Differentiating this scalar with respect to the $(j, l)$ entry $X_{jl}$ of $\boldsymbol{X}$:

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{X}) = \frac{\partial}{\partial X_{jl}} \sum_{i=0}^{N-1} X_{ii} = \sum_{i=0}^{N-1} \frac{\partial X_{ii}}{\partial X_{jl}} \label{eq:5-1-2} \end{equation}

$X_{ii}$ and $X_{jl}$ are the same variable only when $i = j$ and $i = l$, i.e., $j = l$. Using the Kronecker delta:

\begin{equation} \frac{\partial X_{ii}}{\partial X_{jl}} = \delta_{ij} \delta_{il} \label{eq:5-1-3} \end{equation}

Substituting \eqref{eq:5-1-3} into \eqref{eq:5-1-2} and summing over $i$. Since $\delta_{ij} = 1$ only when $i = j$:

\begin{equation} \sum_{i=0}^{N-1} \delta_{ij} \delta_{il} = \delta_{jl} \label{eq:5-1-4} \end{equation}

Since $\delta_{jl}$ is the $(j, l)$ entry of the identity matrix $\boldsymbol{I}$:

\begin{equation} \delta_{jl} = I_{jl} \label{eq:5-1-5} \end{equation}

Since \eqref{eq:5-1-5} holds for all $(j, l)$, we obtain the final result in matrix form.

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}) = \boldsymbol{I} \label{eq:5-1-6} \end{equation}

Remark: Since the trace is the sum of diagonal elements, differentiating with respect to a diagonal element $X_{jj}$ yields 1, while differentiating with respect to an off-diagonal element yields 0. This is why the result is the identity matrix $\boldsymbol{I}$ (diagonal entries equal to 1, off-diagonal entries equal to 0).
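
The formula can be checked against a numerical gradient. The helper `num_grad` below is our own finite-difference sketch, not part of the original text:

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout:
    result has the same shape as X)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
G = num_grad(np.trace, X)   # d tr(X) / dX
expected = np.eye(4)        # formula 5.1
assert np.allclose(G, expected, atol=1e-6)
```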

5.2 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top$
Conditions: $\boldsymbol{A} \in \mathbb{R}^{M \times N}$ is an $M \times N$ constant matrix, $\boldsymbol{X} \in \mathbb{R}^{N \times M}$ is an $N \times M$ matrix variable, $\boldsymbol{A}\boldsymbol{X} \in \mathbb{R}^{M \times M}$ is a square matrix
Proof

Writing out the $(i, i)$ entry (diagonal entry) of the matrix product $\boldsymbol{A}\boldsymbol{X}$:

\begin{equation} (\boldsymbol{A}\boldsymbol{X})_{ii} = \sum_{k=0}^{N-1} A_{ik} X_{ki} \label{eq:5-2-1} \end{equation}

Since the trace is the sum of diagonal elements:

\begin{equation} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \sum_{i=0}^{M-1} (\boldsymbol{A}\boldsymbol{X})_{ii} = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} X_{ki} \label{eq:5-2-2} \end{equation}

Differentiating this scalar with respect to the $(j, l)$ entry $X_{jl}$ of $\boldsymbol{X}$. Since $A_{ik}$ is a constant:

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \frac{\partial X_{ki}}{\partial X_{jl}} \label{eq:5-2-3} \end{equation}

$X_{ki}$ and $X_{jl}$ are the same variable only when $(k, i) = (j, l)$. Using the Kronecker delta:

\begin{equation} \frac{\partial X_{ki}}{\partial X_{jl}} = \delta_{kj} \delta_{il} \label{eq:5-2-4} \end{equation}

Substituting \eqref{eq:5-2-4} into \eqref{eq:5-2-3}:

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \delta_{kj} \delta_{il} \label{eq:5-2-5} \end{equation}

Since $\delta_{kj} = 1$ only when $k = j$, we get $\sum_{k=0}^{N-1} A_{ik} \delta_{kj} = A_{ij}$. Similarly, since $\delta_{il} = 1$ only when $i = l$:

\begin{equation} \sum_{i=0}^{M-1} A_{ij} \delta_{il} = A_{lj} \label{eq:5-2-6} \end{equation}

$A_{lj}$ is the $(j, l)$ entry of the transpose $\boldsymbol{A}^\top$:

\begin{equation} A_{lj} = (\boldsymbol{A}^\top)_{jl} \label{eq:5-2-7} \end{equation}

Since \eqref{eq:5-2-7} holds for all $(j, l)$, we obtain the final result in matrix form.

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top \label{eq:5-2-8} \end{equation}

Remark: The transpose appears because, when summing diagonal elements in the definition of trace, the row index of $\boldsymbol{A}$ coincides with the column index of $\boldsymbol{X}$. In the derivative, the indices swap, yielding the transpose. When $\boldsymbol{A}$ is symmetric ($\boldsymbol{A} = \boldsymbol{A}^\top$), we have $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}$.
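
A quick numerical check with a non-square $\boldsymbol{X}$, so that the transpose in the result is visible in the shapes (the finite-difference helper `num_grad` is our own sketch, not from the text):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # M x N constant
X = rng.standard_normal((4, 3))   # N x M variable, so AX is 3 x 3
G = num_grad(lambda Y: np.trace(A @ Y), X)
expected = A.T                    # formula 5.2, shape 4 x 3 like X
assert np.allclose(G, expected, atol=1e-6)
```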

5.3 Derivative of $\text{tr}(\boldsymbol{X}\boldsymbol{A})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{A}^\top$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times M}$ is an $N \times M$ matrix variable, $\boldsymbol{A} \in \mathbb{R}^{M \times N}$ is an $M \times N$ constant matrix, $\boldsymbol{X}\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is a square matrix
Proof

Method 1: Using the cyclic property of trace

By the cyclic property of trace, for any matrices $\boldsymbol{P}, \boldsymbol{Q}$ we have $\text{tr}(\boldsymbol{P}\boldsymbol{Q}) = \text{tr}(\boldsymbol{Q}\boldsymbol{P})$. Applying this to $\boldsymbol{X}\boldsymbol{A}$:

\begin{equation} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \text{tr}(\boldsymbol{A}\boldsymbol{X}) \label{eq:5-3-1} \end{equation}

Applying the result of Formula 5.2:

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}^\top \label{eq:5-3-2} \end{equation}

Method 2: Direct computation

Writing out the $(i, i)$ entry of the matrix product $\boldsymbol{X}\boldsymbol{A}$:

\begin{equation} (\boldsymbol{X}\boldsymbol{A})_{ii} = \sum_{k=0}^{M-1} X_{ik} A_{ki} \label{eq:5-3-3} \end{equation}

Since the trace is the sum of diagonal elements:

\begin{equation} \text{tr}(\boldsymbol{X}\boldsymbol{A}) = \sum_{i=0}^{N-1} \sum_{k=0}^{M-1} X_{ik} A_{ki} \label{eq:5-3-4} \end{equation}

Differentiating with respect to $X_{jl}$ and substituting $\displaystyle\frac{\partial X_{ik}}{\partial X_{jl}} = \delta_{ij} \delta_{kl}$, then summing:

\begin{equation} \sum_{i=0}^{N-1} \sum_{k=0}^{M-1} \delta_{ij} \delta_{kl} A_{ki} = A_{lj} = (\boldsymbol{A}^\top)_{jl} \label{eq:5-3-5} \end{equation}

Remark: By the cyclic property of trace, $\text{tr}(\boldsymbol{X}\boldsymbol{A})$ and $\text{tr}(\boldsymbol{A}\boldsymbol{X})$ have the same value. Therefore, their derivatives are also the same.

5.4 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top)$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A}$
Conditions: $\boldsymbol{A} \in \mathbb{R}^{M \times N}$ is an $M \times N$ constant matrix, $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ is an $M \times N$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{N \times M}$, $\boldsymbol{A}\boldsymbol{X}^\top \in \mathbb{R}^{M \times M}$ is a square matrix
Proof

We first recall the entries of the transpose: the $(k, i)$ entry of $\boldsymbol{X}^\top$ equals the $(i, k)$ entry of $\boldsymbol{X}$.

\begin{equation} (\boldsymbol{X}^\top)_{ki} = X_{ik} \label{eq:5-4-1} \end{equation}

Writing out the $(i, i)$ entry of the matrix product $\boldsymbol{A}\boldsymbol{X}^\top$:

\begin{equation} (\boldsymbol{A}\boldsymbol{X}^\top)_{ii} = \sum_{k=0}^{N-1} A_{ik} (\boldsymbol{X}^\top)_{ki} = \sum_{k=0}^{N-1} A_{ik} X_{ik} \label{eq:5-4-2} \end{equation}

Since the trace is the sum of diagonal elements:

\begin{equation} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} X_{ik} \label{eq:5-4-3} \end{equation}

Differentiating this expression with respect to $X_{jl}$. Since $A_{ik}$ is a constant:

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \frac{\partial X_{ik}}{\partial X_{jl}} \label{eq:5-4-4} \end{equation}

Since $\displaystyle\frac{\partial X_{ik}}{\partial X_{jl}} = \delta_{ij} \delta_{kl}$ (equals 1 only when $(i, k) = (j, l)$), substituting:

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i=0}^{M-1} \sum_{k=0}^{N-1} A_{ik} \delta_{ij} \delta_{kl} \label{eq:5-4-5} \end{equation}

Summing over $i$ with $\delta_{ij}$, only the $i = j$ term survives; summing over $k$ with $\delta_{kl}$, only the $k = l$ term survives:

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = A_{jl} \label{eq:5-4-6} \end{equation}

Since \eqref{eq:5-4-6} holds for all $(j, l)$, we obtain the final result in matrix form.

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A} \label{eq:5-4-7} \end{equation}

Remark: Compared with the result of 5.2, replacing $\boldsymbol{X}$ by $\boldsymbol{X}^\top$ removes the transpose from the result. This is related to the fact that $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \sum_{i,k} A_{ik} X_{ik}$ equals the Frobenius inner product $\langle \boldsymbol{A}, \boldsymbol{X} \rangle_F$ of $\boldsymbol{A}$ and $\boldsymbol{X}$.
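
Both the gradient formula and the Frobenius-inner-product identity in the remark can be confirmed numerically (the helper `num_grad` is our own finite-difference sketch, not from the text):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))   # same shape as X
X = rng.standard_normal((3, 4))

# tr(A X^T) equals the Frobenius inner product <A, X>_F = sum_ik A_ik X_ik
assert np.isclose(np.trace(A @ X.T), np.sum(A * X))

G = num_grad(lambda Y: np.trace(A @ Y.T), X)
expected = A                      # formula 5.4: no transpose in the result
assert np.allclose(G, expected, atol=1e-6)
```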

5.5 Derivative of $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{M \times N}$ is an $M \times N$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{N \times M}$, $\boldsymbol{A} \in \mathbb{R}^{N \times M}$ is an $N \times M$ constant matrix, $\boldsymbol{X}^\top\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is a square matrix
Proof

Applying the cyclic property of trace. Setting $\boldsymbol{P} = \boldsymbol{X}^\top$ and $\boldsymbol{Q} = \boldsymbol{A}$ in $\text{tr}(\boldsymbol{P}\boldsymbol{Q}) = \text{tr}(\boldsymbol{Q}\boldsymbol{P})$:

\begin{equation} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) \label{eq:5-5-1} \end{equation}

For $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A})$ to be defined, $\boldsymbol{X}^\top\boldsymbol{A}$ must be a square matrix. Since $\boldsymbol{X}^\top \in \mathbb{R}^{N \times M}$ and $\boldsymbol{A} \in \mathbb{R}^{M \times N}$, we have $\boldsymbol{X}^\top\boldsymbol{A} \in \mathbb{R}^{N \times N}$, so the trace is defined.

In this case, $\boldsymbol{A}\boldsymbol{X}^\top \in \mathbb{R}^{M \times M}$, and by the cyclic property both sides have the same scalar value. Applying the result of Formula 5.4:

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}) = \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{A} \label{eq:5-5-2} \end{equation}

Remark: This formula is frequently used in machine learning. For example, when $\boldsymbol{A}$ is a label matrix and $\boldsymbol{X}$ is a prediction matrix, $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A})$ represents the sum of inner products between predictions and labels, and its gradient is $\boldsymbol{A}$.

5.6 Derivative of $\text{tr}(\boldsymbol{X}^2)$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times N}$ is an $N \times N$ square matrix variable
Proof

Writing out the $(i, i)$ entry of $\boldsymbol{X}^2 = \boldsymbol{X} \cdot \boldsymbol{X}$:

\begin{equation} (\boldsymbol{X}^2)_{ii} = \sum_{k=0}^{N-1} X_{ik} X_{ki} \label{eq:5-6-1} \end{equation}

Since the trace is the sum of diagonal elements:

\begin{equation} \text{tr}(\boldsymbol{X}^2) = \sum_{i=0}^{N-1} \sum_{k=0}^{N-1} X_{ik} X_{ki} \label{eq:5-6-2} \end{equation}

Differentiating with respect to $X_{jl}$. Since both $X_{ik}$ and $X_{ki}$ are entries of $\boldsymbol{X}$, we apply the product rule (1.25).

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{X}^2) = \sum_{i=0}^{N-1} \sum_{k=0}^{N-1} \left( \frac{\partial X_{ik}}{\partial X_{jl}} X_{ki} + X_{ik} \frac{\partial X_{ki}}{\partial X_{jl}} \right) \label{eq:5-6-3} \end{equation}

Computing the first term. Substituting $\displaystyle\frac{\partial X_{ik}}{\partial X_{jl}} = \delta_{ij} \delta_{kl}$, only the $i = j$ and $k = l$ terms survive:

\begin{equation} \sum_{i=0}^{N-1} \sum_{k=0}^{N-1} \delta_{ij} \delta_{kl} X_{ki} = X_{lj} \label{eq:5-6-4} \end{equation}

Computing the second term. Substituting $\displaystyle\frac{\partial X_{ki}}{\partial X_{jl}} = \delta_{kj} \delta_{il}$, only the $k = j$ and $i = l$ terms survive:

\begin{equation} \sum_{i=0}^{N-1} \sum_{k=0}^{N-1} X_{ik} \delta_{kj} \delta_{il} = X_{lj} \label{eq:5-6-5} \end{equation}

Combining the first and second terms:

\begin{equation} \frac{\partial}{\partial X_{jl}} \text{tr}(\boldsymbol{X}^2) = X_{lj} + X_{lj} = 2X_{lj} \label{eq:5-6-6} \end{equation}

Since $X_{lj}$ is the $(j, l)$ entry of the transpose $\boldsymbol{X}^\top$, we obtain the final result in matrix form.

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top \label{eq:5-6-7} \end{equation}

Remark: The factor of 2 appears because in $\text{tr}(\boldsymbol{X}^2) = \sum_{i,k} X_{ik} X_{ki}$, $X_{jl}$ contributes both as the first factor and as the second factor. When $\boldsymbol{X}$ is symmetric, $\boldsymbol{X}^\top = \boldsymbol{X}$, so the result becomes $2\boldsymbol{X}$.
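
A numerical check of formula 5.6; since $\text{tr}(\boldsymbol{X}^2)$ is quadratic in the entries of $\boldsymbol{X}$, the central difference used in our `num_grad` sketch is essentially exact:

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))           # square, generally non-symmetric
G = num_grad(lambda Y: np.trace(Y @ Y), X)
expected = 2 * X.T                        # formula 5.6
assert np.allclose(G, expected, atol=1e-6)
```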

5.7 Derivative of $\text{tr}(\boldsymbol{X}^2\boldsymbol{A})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times N}$ is an $N \times N$ square matrix variable, $\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is an $N \times N$ constant matrix
Proof

Since the $(i, j)$ entry of $\boldsymbol{X}^2$ is $(\boldsymbol{X}^2)_{ij} = \sum_{k=0}^{N-1} X_{ik} X_{kj}$, the trace is:

\begin{equation} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \sum_{k=0}^{N-1} X_{ik} X_{kj} A_{ji} \label{eq:5-7-1} \end{equation}

Differentiating with respect to $X_{pq}$. Since both $X_{ik}$ and $X_{kj}$ are entries of $\boldsymbol{X}$, we apply the product rule (1.25).

\begin{equation} \frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = \sum_{i,j,k} \left( \frac{\partial X_{ik}}{\partial X_{pq}} X_{kj} A_{ji} + X_{ik} \frac{\partial X_{kj}}{\partial X_{pq}} A_{ji} \right) \label{eq:5-7-2} \end{equation}

Computing the first term. Substituting $\displaystyle\frac{\partial X_{ik}}{\partial X_{pq}} = \delta_{ip} \delta_{kq}$ selects $i = p$ and $k = q$:

\begin{equation} \sum_{i,j,k} \delta_{ip} \delta_{kq} X_{kj} A_{ji} = \sum_{j} X_{qj} A_{jp} = (\boldsymbol{X}\boldsymbol{A})_{qp} \label{eq:5-7-3} \end{equation}

Computing the second term. Substituting $\displaystyle\frac{\partial X_{kj}}{\partial X_{pq}} = \delta_{kp} \delta_{jq}$ selects $k = p$ and $j = q$:

\begin{equation} \sum_{i,j,k} X_{ik} \delta_{kp} \delta_{jq} A_{ji} = \sum_{i} X_{ip} A_{qi} = (\boldsymbol{A}\boldsymbol{X})_{qp} \label{eq:5-7-4} \end{equation}

Combining the first and second terms:

\begin{equation} \frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A})_{qp} + (\boldsymbol{A}\boldsymbol{X})_{qp} \label{eq:5-7-5} \end{equation}

The $(q,p)$ entry is the $(p,q)$ entry of the transpose, and by linearity of the transpose:

\begin{equation} (\boldsymbol{X}\boldsymbol{A})_{qp} + (\boldsymbol{A}\boldsymbol{X})_{qp} = ((\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top)_{pq} \label{eq:5-7-6} \end{equation}

Since \eqref{eq:5-7-6} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2\boldsymbol{A}) = (\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top \label{eq:5-7-7} \end{equation}

Remark: When $\boldsymbol{A}$ is symmetric and $\boldsymbol{X}$ is also symmetric, $\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X}$ is symmetric, and the result simplifies to $\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X}$. When $\boldsymbol{A} = \boldsymbol{I}$, this reduces to 5.6.
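
The two-term structure $(\boldsymbol{X}\boldsymbol{A} + \boldsymbol{A}\boldsymbol{X})^\top$ can be verified numerically (the `num_grad` helper is our own finite-difference sketch, not from the text):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 4))
G = num_grad(lambda Y: np.trace(Y @ Y @ A), X)
expected = (X @ A + A @ X).T              # formula 5.7
assert np.allclose(G, expected, atol=1e-6)
```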

5.8 Derivative of $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{N \times M}$ is an $N \times M$ matrix variable, $\boldsymbol{X}^\top \in \mathbb{R}^{M \times N}$, $\boldsymbol{A} \in \mathbb{R}^{N \times N}$ is an $N \times N$ constant matrix, $\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X} \in \mathbb{R}^{M \times M}$ is a square matrix
Proof

Writing out the $(i, k)$ entry of $\boldsymbol{A}\boldsymbol{X}$:

\begin{equation} (\boldsymbol{A}\boldsymbol{X})_{ik} = \sum_{j=0}^{N-1} A_{ij} X_{jk} \label{eq:5-8-1} \end{equation}

The $(l, l)$ entry of $\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}$ is:

\begin{equation} (\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})_{ll} = \sum_{i=0}^{N-1} (\boldsymbol{X}^\top)_{li} (\boldsymbol{A}\boldsymbol{X})_{il} \label{eq:5-8-2} \end{equation}

Substituting $(\boldsymbol{X}^\top)_{li} = X_{il}$ and using \eqref{eq:5-8-1}:

\begin{equation} (\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})_{ll} = \sum_{i=0}^{N-1} X_{il} \sum_{j=0}^{N-1} A_{ij} X_{jl} \label{eq:5-8-3} \end{equation}

Rearranging the sums:

\begin{equation} (\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})_{ll} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{il} A_{ij} X_{jl} \label{eq:5-8-4} \end{equation}

Since the trace is the sum of diagonal elements, summing over $l$:

\begin{equation} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \sum_{l=0}^{M-1} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} X_{il} A_{ij} X_{jl} \label{eq:5-8-5} \end{equation}

Differentiating with respect to $X_{pq}$. Since both $X_{il}$ and $X_{jl}$ are entries of $\boldsymbol{X}$, we apply the product rule (1.25):

\begin{equation} \frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \sum_{l,i,j} \left( \frac{\partial X_{il}}{\partial X_{pq}} A_{ij} X_{jl} + X_{il} A_{ij} \frac{\partial X_{jl}}{\partial X_{pq}} \right) \label{eq:5-8-6} \end{equation}

Computing the first term. Substituting $\displaystyle\frac{\partial X_{il}}{\partial X_{pq}} = \delta_{ip} \delta_{lq}$:

\begin{equation} \sum_{l,i,j} \delta_{ip} \delta_{lq} A_{ij} X_{jl} = \sum_{j} A_{pj} X_{jq} \label{eq:5-8-7} \end{equation}

Here $\delta_{ip}$ selects $i = p$ and $\delta_{lq}$ selects $l = q$.

Rewriting \eqref{eq:5-8-7} as a matrix product:

\begin{equation} \sum_{j} A_{pj} X_{jq} = (\boldsymbol{A}\boldsymbol{X})_{pq} \label{eq:5-8-8} \end{equation}

Computing the second term. Substituting $\displaystyle\frac{\partial X_{jl}}{\partial X_{pq}} = \delta_{jp} \delta_{lq}$:

\begin{equation} \sum_{l,i,j} X_{il} A_{ij} \delta_{jp} \delta_{lq} = \sum_{i} X_{iq} A_{ip} \label{eq:5-8-9} \end{equation}

Here $\delta_{jp}$ selects $j = p$ and $\delta_{lq}$ selects $l = q$.

Transforming \eqref{eq:5-8-9}. Using $A_{ip} = (\boldsymbol{A}^\top)_{pi}$:

\begin{equation} \sum_{i} X_{iq} A_{ip} = \sum_{i} (\boldsymbol{A}^\top)_{pi} X_{iq} \label{eq:5-8-10} \end{equation}

Rewriting as a matrix product:

\begin{equation} \sum_{i} (\boldsymbol{A}^\top)_{pi} X_{iq} = (\boldsymbol{A}^\top\boldsymbol{X})_{pq} \label{eq:5-8-11} \end{equation}

Combining the first term \eqref{eq:5-8-8} and the second term \eqref{eq:5-8-11}:

\begin{equation} \frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = (\boldsymbol{A}\boldsymbol{X})_{pq} + (\boldsymbol{A}^\top\boldsymbol{X})_{pq} \label{eq:5-8-12} \end{equation}

Since \eqref{eq:5-8-12} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X} \label{eq:5-8-13} \end{equation}

Remark: When $\boldsymbol{A}$ is symmetric ($\boldsymbol{A} = \boldsymbol{A}^\top$), the result becomes $2\boldsymbol{A}\boldsymbol{X}$. This formula is a generalization of the quadratic form, and $\text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X}) = \sum_{l} \boldsymbol{x}_l^\top \boldsymbol{A} \boldsymbol{x}_l$ where $\boldsymbol{x}_l$ denotes the $l$-th column of $\boldsymbol{X}$.
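
A numerical check of formula 5.8 with a rectangular $\boldsymbol{X}$ and a non-symmetric $\boldsymbol{A}$, so that both terms $\boldsymbol{A}\boldsymbol{X}$ and $\boldsymbol{A}^\top\boldsymbol{X}$ matter (the `num_grad` helper is our own finite-difference sketch, not from the text):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # N x N, non-symmetric in general
X = rng.standard_normal((4, 3))   # N x M
G = num_grad(lambda Y: np.trace(Y.T @ A @ Y), X)
expected = A @ X + A.T @ X        # formula 5.8
assert np.allclose(G, expected, atol=1e-6)
```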

5.9 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{X}^\top)$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{X}^\top) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof

By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{X}^\top) = \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})$, so this gives the same result as 5.8.

5.10 Derivative of $\text{tr}(\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{A})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{A}) = \boldsymbol{A}\boldsymbol{X} + \boldsymbol{A}^\top\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof

By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{A}) = \text{tr}(\boldsymbol{X}^\top\boldsymbol{A}\boldsymbol{X})$, so this gives the same result as 5.8.

5.11 Derivative of $\text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top)$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof

Writing the trace in component form:

\begin{eqnarray} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = \displaystyle\sum_{i,j,k} X_{ij} A_{jk} X_{ik} \end{eqnarray}

Differentiating with respect to $X_{pq}$, there are two types of terms containing $X_{pq}$.

Case 1: $X_{ij} = X_{pq}$ (i.e., $i = p, j = q$)

\begin{align} \frac{\partial}{\partial X_{pq}} \sum_{k} X_{pq} A_{qk} X_{pk} &= \sum_{k} A_{qk} X_{pk} \notag \\ &= \sum_{k} X_{pk} A_{qk} = \sum_{k} X_{pk} (\boldsymbol{A}^\top)_{kq} \notag \\ &= (\boldsymbol{X}\boldsymbol{A}^\top)_{pq} \notag \end{align}

Case 2: $X_{ik} = X_{pq}$ (i.e., $i = p, k = q$)

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \sum_{j} X_{pj} A_{jq} X_{pq} &=& \sum_{j} X_{pj} A_{jq} = (\boldsymbol{X}\boldsymbol{A})_{pq} \end{eqnarray}

Combining the above:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = (\boldsymbol{X}\boldsymbol{A}^\top)_{pq} + (\boldsymbol{X}\boldsymbol{A})_{pq} \end{eqnarray}

Since this holds for all $(p, q)$, we obtain the final result in matrix form: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$.

Remark: When $\boldsymbol{A}$ is symmetric, $\boldsymbol{A} = \boldsymbol{A}^\top$, and the result becomes $2\boldsymbol{X}\boldsymbol{A}$.
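
A numerical check of formula 5.11 (the `num_grad` helper is our own finite-difference sketch, not from the text):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))   # N x N constant
X = rng.standard_normal((3, 4))   # M x N variable
G = num_grad(lambda Y: np.trace(Y @ A @ Y.T), X)
expected = X @ A.T + X @ A        # formula 5.11
assert np.allclose(G, expected, atol=1e-6)
```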

5.12 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{X})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{X}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof

By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{X}) = \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top)$, so this gives the same result as 5.11.

5.13 Derivative of $\text{tr}(\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{A})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{A}) = \boldsymbol{X}\boldsymbol{A}^\top + \boldsymbol{X}\boldsymbol{A}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof

By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{A}) = \text{tr}(\boldsymbol{X}\boldsymbol{A}\boldsymbol{X}^\top)$, so this gives the same result as 5.11.

5.14 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = \boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top + \boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top$
Conditions: $\boldsymbol{A}, \boldsymbol{X}, \boldsymbol{B}$ are $N \times N$ square matrices
Proof

Writing the trace in component form:

\begin{eqnarray} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = \displaystyle\sum_{i,j,k,l} A_{ij} X_{jk} B_{kl} X_{li} \end{eqnarray}

Differentiating with respect to $X_{pq}$, there are two types of terms containing $X_{pq}$.

Case 1: $X_{jk} = X_{pq}$ (i.e., $j = p, k = q$)

\begin{align} \frac{\partial}{\partial X_{pq}} \sum_{i,l} A_{ip} X_{pq} B_{ql} X_{li} &= \sum_{i,l} A_{ip} B_{ql} X_{li} \notag \\ &= \sum_{i} A_{ip} \sum_{l} B_{ql} X_{li} = \sum_{i} A_{ip} (\boldsymbol{B}\boldsymbol{X})_{qi} \notag \\ &= \sum_{i} (\boldsymbol{A}^\top)_{pi} (\boldsymbol{X}^\top\boldsymbol{B}^\top)_{iq} = (\boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top)_{pq} \notag \end{align}

Case 2: $X_{li} = X_{pq}$ (i.e., $l = p, i = q$)

\begin{align} \frac{\partial}{\partial X_{pq}} \sum_{j,k} A_{qj} X_{jk} B_{kp} X_{pq} &= \sum_{j,k} A_{qj} X_{jk} B_{kp} \notag \\ &= \sum_{k} B_{kp} \sum_{j} A_{qj} X_{jk} = \sum_{k} B_{kp} (\boldsymbol{A}\boldsymbol{X})_{qk} \notag \\ &= \sum_{k} (\boldsymbol{B}^\top)_{pk} (\boldsymbol{X}^\top\boldsymbol{A}^\top)_{kq} = (\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top)_{pq} \notag \end{align}

Combining the above:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = (\boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top)_{pq} + (\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top)_{pq} \end{eqnarray}

Since this holds for all $(p, q)$, we obtain the final result in matrix form: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}) = \boldsymbol{A}^\top\boldsymbol{X}^\top\boldsymbol{B}^\top + \boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{A}^\top$.
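
A numerical check of formula 5.14 (the `num_grad` helper is our own finite-difference sketch, not from the text):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    """Central-difference gradient of the scalar f at X (denominator layout)."""
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 4))
G = num_grad(lambda Y: np.trace(A @ Y @ B @ Y), X)
expected = A.T @ X.T @ B.T + B.T @ X.T @ A.T   # formula 5.14
assert np.allclose(G, expected, atol=1e-6)
```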

5.15 Derivative of $\text{tr}(\boldsymbol{X}^\top\boldsymbol{X})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = 2\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is a matrix of arbitrary size
Proof

Setting $\boldsymbol{A} = \boldsymbol{I}$ in 5.8 gives $\boldsymbol{I}\boldsymbol{X} + \boldsymbol{I}^\top\boldsymbol{X} = 2\boldsymbol{X}$.

5.16 Derivative of $\text{tr}(\boldsymbol{X}\boldsymbol{X}^\top)$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = 2\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is a matrix of arbitrary size
Proof

By the cyclic property of trace (1.12), $\text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X})$, so this gives the same result as 5.15.

5.17 Derivative of $\text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top$
Conditions: $\boldsymbol{B}, \boldsymbol{C}$ are constant matrices
Proof

Setting $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{B}$, we get $\text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = \text{tr}(\boldsymbol{Y}^\top\boldsymbol{C}\boldsymbol{Y})$. By 5.8, $\displaystyle\frac{\partial}{\partial \boldsymbol{Y}} \text{tr}(\boldsymbol{Y}^\top\boldsymbol{C}\boldsymbol{Y}) = (\boldsymbol{C} + \boldsymbol{C}^\top)\boldsymbol{Y}$. Since $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{B}$ is linear in $\boldsymbol{X}$, a perturbation $d\boldsymbol{X}$ induces $d\boldsymbol{Y} = d\boldsymbol{X}\,\boldsymbol{B}$, so by the chain rule (matrix version of 1.26) the gradient with respect to $\boldsymbol{X}$ is obtained by right-multiplying by $\boldsymbol{B}^\top$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) = (\boldsymbol{C} + \boldsymbol{C}^\top)\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top = \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top \end{eqnarray}

5.18 Derivative of $\text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C}) = \boldsymbol{B}\boldsymbol{X}\boldsymbol{C} + \boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{C}^\top$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$ is an $N \times N$ constant matrix, $\boldsymbol{C}$ is an $M \times M$ constant matrix
Proof

Writing the trace in component form, $\text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C}) = \displaystyle\sum_{i,j,k,l} X_{ji} B_{jk} X_{kl} C_{li}$. Differentiating with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}\boldsymbol{C}) &=& \displaystyle\sum_{k,l} B_{pk} X_{kl} C_{lq} + \displaystyle\sum_{i,j} X_{ji} B_{jp} C_{qi} \\ &=& (\boldsymbol{B}\boldsymbol{X}\boldsymbol{C})_{pq} + (\boldsymbol{B}^\top\boldsymbol{X}\boldsymbol{C}^\top)_{pq} \end{eqnarray}
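A finite-difference check of 5.18 (a sketch with arbitrary sizes; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 3))            # N = 4, M = 3
B = rng.standard_normal((4, 4))
C = rng.standard_normal((3, 3))

G = num_grad(lambda M: np.trace(M.T @ B @ M @ C), X)
assert np.allclose(G, B @ X @ C + B.T @ X @ C.T, rtol=1e-4, atol=1e-5)
```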

5.19 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C}) = \boldsymbol{A}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}^\top + \boldsymbol{C}\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}, \boldsymbol{C}$ are $M \times M$ constant matrices, $\boldsymbol{B}$ is an $N \times N$ constant matrix
Proof

Writing the trace in component form, $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C}) = \displaystyle\sum_{i,j,k,l,m} A_{ij} X_{jk} B_{kl} X_{ml} C_{mi}$. Differentiating with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}\boldsymbol{X}^\top\boldsymbol{C}) &=& \displaystyle\sum_{i,l,m} A_{ip} B_{ql} X_{ml} C_{mi} + \displaystyle\sum_{i,j,k} A_{ij} X_{jk} B_{kq} C_{pi} \\ &=& (\boldsymbol{A}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}^\top)_{pq} + (\boldsymbol{C}\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})_{pq} \end{eqnarray}
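A finite-difference check of 5.19 (a sketch with arbitrary sizes; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 4))            # M = 3, N = 4
A = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 4))

G = num_grad(lambda M: np.trace(A @ M @ B @ M.T @ C), X)
assert np.allclose(G, A.T @ C.T @ X @ B.T + C @ A @ X @ B, rtol=1e-4, atol=1e-5)
```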

5.20 Derivative of the Frobenius Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})^\top] = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C})\boldsymbol{B}^\top$
Conditions: $\boldsymbol{A}, \boldsymbol{B}, \boldsymbol{C}$ are constant matrices
Proof

Setting $\boldsymbol{Y} = \boldsymbol{A}\boldsymbol{X}\boldsymbol{B}+\boldsymbol{C}$, we get $\text{tr}(\boldsymbol{Y}\boldsymbol{Y}^\top) = \|\boldsymbol{Y}\|_F^2$. By 5.16, $\displaystyle\frac{\partial}{\partial \boldsymbol{Y}} \text{tr}(\boldsymbol{Y}\boldsymbol{Y}^\top) = 2\boldsymbol{Y}$. Applying the chain rule (1.26), since $\displaystyle\frac{\partial Y_{ij}}{\partial X_{pq}} = A_{ip} B_{qj}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{Y}\boldsymbol{Y}^\top) = 2 (\boldsymbol{A}^\top \boldsymbol{Y} \boldsymbol{B}^\top)_{pq} \end{eqnarray}
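A finite-difference check of 5.20 (a sketch; sizes are arbitrary consistent choices and `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))
C = rng.standard_normal((2, 5))            # same shape as A @ X @ B

f = lambda M: np.trace((A @ M @ B + C) @ (A @ M @ B + C).T)
G = num_grad(f, X)
assert np.allclose(G, 2 * A.T @ (A @ X @ B + C) @ B.T, rtol=1e-4, atol=1e-5)
```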

5.21 Derivative of the Trace of a Kronecker Product

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X} \otimes \boldsymbol{X}) = 2\text{tr}(\boldsymbol{X})\boldsymbol{I}$
Conditions: $\boldsymbol{X}$ is a square matrix, $\otimes$ denotes the Kronecker product
Proof

By the trace property of the Kronecker product $\text{tr}(\boldsymbol{A} \otimes \boldsymbol{B}) = \text{tr}(\boldsymbol{A})\text{tr}(\boldsymbol{B})$, we have $\text{tr}(\boldsymbol{X} \otimes \boldsymbol{X}) = [\text{tr}(\boldsymbol{X})]^2$. Differentiating with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} [\text{tr}(\boldsymbol{X})]^2 = 2\text{tr}(\boldsymbol{X}) \cdot \delta_{pq} = (2\text{tr}(\boldsymbol{X})\boldsymbol{I})_{pq} \end{eqnarray}
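A finite-difference check of 5.21, differentiating $\text{tr}(\boldsymbol{X} \otimes \boldsymbol{X})$ directly through `np.kron` (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(5)
X = rng.standard_normal((3, 3))            # square, as required

G = num_grad(lambda M: np.trace(np.kron(M, M)), X)
assert np.allclose(G, 2 * np.trace(X) * np.eye(3), rtol=1e-4, atol=1e-5)
```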

5.22 Derivative of $\text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X})$ (Alternative Proof by Component Computation)

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = (\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof

We have $\text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) = \displaystyle\sum_{i,j,k} X_{ik} A_{ij} X_{jk}$. Differentiating with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^\top \boldsymbol{A} \boldsymbol{X}) &=& (\boldsymbol{A} \boldsymbol{X})_{pq} + (\boldsymbol{A}^\top \boldsymbol{X})_{pq} = ((\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{X})_{pq} \end{eqnarray}
Remark: When $\boldsymbol{A}$ is symmetric ($\boldsymbol{A} = \boldsymbol{A}^\top$), the result becomes $2\boldsymbol{A}\boldsymbol{X}$.
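A finite-difference check of 5.22, including the symmetric special case from the remark (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(6)
X = rng.standard_normal((4, 3))            # N = 4, M = 3
A = rng.standard_normal((4, 4))

G = num_grad(lambda M: np.trace(M.T @ A @ M), X)
assert np.allclose(G, (A + A.T) @ X, rtol=1e-4, atol=1e-5)

S = A + A.T                                # symmetric case of the remark
Gs = num_grad(lambda M: np.trace(M.T @ S @ M), X)
assert np.allclose(Gs, 2 * S @ X, rtol=1e-4, atol=1e-5)
```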

5.23 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \boldsymbol{A}^\top \boldsymbol{B}^\top$
Conditions: $\boldsymbol{A}$ is an $L \times N$ constant matrix, $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$ is an $M \times L$ constant matrix
Proof

In component form, $\text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \displaystyle\sum_{l,i,j} A_{li} X_{ij} B_{jl}$. Differentiating with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}\boldsymbol{B}) = \displaystyle\sum_{l} A_{lp} B_{ql} = (\boldsymbol{A}^\top \boldsymbol{B}^\top)_{pq} \end{eqnarray}

5.24 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}) = \boldsymbol{B}\boldsymbol{A}$
Conditions: $\boldsymbol{A}$ is an $L \times M$ constant matrix, $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$ is an $N \times L$ constant matrix
Proof

Writing the trace in component form, $\text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}) = \displaystyle\sum_{i,j,k} A_{ij} X_{kj} B_{ki}$. Differentiating with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}) = \displaystyle\sum_{i} B_{pi} A_{iq} = (\boldsymbol{B}\boldsymbol{A})_{pq} \end{eqnarray}
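A finite-difference check of 5.24 (a sketch; the sizes below are one consistent choice making $\boldsymbol{A}\boldsymbol{X}^\top\boldsymbol{B}$ square, and `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(7)
A = rng.standard_normal((2, 3))
X = rng.standard_normal((4, 3))
B = rng.standard_normal((4, 2))            # A @ X.T @ B is 2x2

G = num_grad(lambda M: np.trace(A @ M.T @ B), X)
assert np.allclose(G, B @ A, rtol=1e-4, atol=1e-5)   # B @ A has the shape of X
```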

5.25 Derivative of $\text{tr}(\boldsymbol{A} \otimes \boldsymbol{X})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A} \otimes \boldsymbol{X}) = \text{tr}(\boldsymbol{A})\boldsymbol{I}$
Conditions: $\boldsymbol{A}$ is an $M \times M$ constant matrix, $\boldsymbol{X}$ is an $N \times N$ matrix, $\otimes$ denotes the Kronecker product
Proof

By the trace property of the Kronecker product, $\text{tr}(\boldsymbol{A} \otimes \boldsymbol{X}) = \text{tr}(\boldsymbol{A}) \cdot \text{tr}(\boldsymbol{X})$. Differentiating with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A} \otimes \boldsymbol{X}) = \text{tr}(\boldsymbol{A}) \cdot \delta_{pq} = (\text{tr}(\boldsymbol{A})\boldsymbol{I})_{pq} \end{eqnarray}

5.26 Derivative of $\text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A}) = -\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{X}^{-\top}$
Conditions: $\boldsymbol{X}$ is an $N \times N$ invertible matrix, $\boldsymbol{A}$ is an $N \times N$ constant matrix
Proof

Differentiating the identity $\boldsymbol{X} \boldsymbol{X}^{-1} = \boldsymbol{I}$ with respect to $X_{pq}$ gives $\displaystyle\frac{\partial \boldsymbol{X}^{-1}}{\partial X_{pq}} = -\boldsymbol{X}^{-1} \boldsymbol{E}_{pq} \boldsymbol{X}^{-1}$. Using this to differentiate $\text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A})$ with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^{-1}\boldsymbol{A}) &=& -\text{tr}(\boldsymbol{X}^{-1} \boldsymbol{A} \boldsymbol{X}^{-1} \boldsymbol{E}_{pq}) = -(\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{X}^{-\top})_{pq} \end{eqnarray}
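A finite-difference check of 5.26 on a well-conditioned invertible matrix (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(8)
X = 2 * np.eye(4) + 0.2 * rng.standard_normal((4, 4))   # comfortably invertible
A = rng.standard_normal((4, 4))

Xi = np.linalg.inv(X)
G = num_grad(lambda M: np.trace(np.linalg.inv(M) @ A), X)
assert np.allclose(G, -Xi.T @ A.T @ Xi.T, rtol=1e-4, atol=1e-5)
```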

5.27 Derivative of $\text{tr}(\boldsymbol{X}^k)$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^k) = k(\boldsymbol{X}^{k-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $k$ is a positive integer
Proof

By the product rule, differentiating each of the $k$ factors of $\boldsymbol{X}^k$ in turn gives $\displaystyle\frac{\partial \boldsymbol{X}^k}{\partial X_{pq}} = \displaystyle\sum_{r=0}^{k-1} \boldsymbol{X}^r \boldsymbol{E}_{pq} \boldsymbol{X}^{k-r-1}$, where $\boldsymbol{E}_{pq}$ denotes the matrix with 1 at position $(p, q)$ and 0 elsewhere. By the cyclic property of trace, each of the $k$ terms contributes $\text{tr}(\boldsymbol{X}^{k-1}\boldsymbol{E}_{pq})$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{X}^k) = k \cdot \text{tr}(\boldsymbol{X}^{k-1} \boldsymbol{E}_{pq}) = k (\boldsymbol{X}^{k-1})_{qp} = k ((\boldsymbol{X}^{k-1})^\top)_{pq} \end{eqnarray}
Remark: For $k = 2$, we get $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^2) = 2\boldsymbol{X}^\top$, which is consistent with 5.6.
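A finite-difference check of 5.27 (a sketch; `num_grad` is an ad-hoc helper, and the looser tolerance reflects the larger magnitudes of a quartic):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(9)
X = rng.standard_normal((4, 4))
k = 4

G = num_grad(lambda M: np.trace(np.linalg.matrix_power(M, k)), X)
assert np.allclose(G, k * np.linalg.matrix_power(X, k - 1).T, rtol=1e-4, atol=1e-4)
```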

5.28 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}^k)$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) = \sum_{r=0}^{k-1} (\boldsymbol{X}^r \boldsymbol{A} \boldsymbol{X}^{k-r-1})^\top$
Conditions: $\boldsymbol{X}$, $\boldsymbol{A}$ are $N \times N$ matrices, $k$ is a positive integer
Proof

Similarly to 5.27, we compute the derivative of the matrix power.

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) &=& \text{tr}\left( \boldsymbol{A} \displaystyle\frac{\partial \boldsymbol{X}^k}{\partial X_{pq}} \right) \\ &=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{r=0}^{k-1} \boldsymbol{X}^r \boldsymbol{E}_{pq} \boldsymbol{X}^{k-r-1} \right) \\ &=& \displaystyle\sum_{r=0}^{k-1} \text{tr}(\boldsymbol{A} \boldsymbol{X}^r \boldsymbol{E}_{pq} \boldsymbol{X}^{k-r-1}) \\ &=& \displaystyle\sum_{r=0}^{k-1} \text{tr}(\boldsymbol{X}^{k-r-1} \boldsymbol{A} \boldsymbol{X}^r \boldsymbol{E}_{pq}) \quad (\text{cyclic property of trace}) \end{eqnarray}

Using $\text{tr}(\boldsymbol{M} \boldsymbol{E}_{pq}) = M_{qp}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) &=& \displaystyle\sum_{r=0}^{k-1} (\boldsymbol{X}^{k-r-1} \boldsymbol{A} \boldsymbol{X}^r)_{qp} \\ &=& \displaystyle\sum_{r=0}^{k-1} ((\boldsymbol{X}^{k-r-1} \boldsymbol{A} \boldsymbol{X}^r)^\top)_{pq} \\ &=& \displaystyle\sum_{r=0}^{k-1} ((\boldsymbol{X}^r)^\top \boldsymbol{A}^\top (\boldsymbol{X}^{k-r-1})^\top)_{pq} \end{eqnarray}

Performing the substitution $s = k - r - 1$ ($r = k - s - 1$):

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) &=& \displaystyle\sum_{s=0}^{k-1} ((\boldsymbol{X}^{k-s-1})^\top \boldsymbol{A}^\top (\boldsymbol{X}^s)^\top)_{pq} \\ &=& \displaystyle\sum_{s=0}^{k-1} ((\boldsymbol{X}^s \boldsymbol{A} \boldsymbol{X}^{k-s-1})^\top)_{pq} \end{eqnarray}

Therefore $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) = \displaystyle\sum_{r=0}^{k-1} (\boldsymbol{X}^r \boldsymbol{A} \boldsymbol{X}^{k-r-1})^\top$.

Remark: When $\boldsymbol{A} = \boldsymbol{I}$, every term of the sum equals $(\boldsymbol{X}^{k-1})^\top$, so the formula reduces to $k(\boldsymbol{X}^{k-1})^\top$, consistent with 5.27.
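A finite-difference check of 5.28, summing the $k$ transposed terms explicitly (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(11)
A = rng.standard_normal((4, 4))
X = rng.standard_normal((4, 4))
k = 3
mp = np.linalg.matrix_power

expected = sum((mp(X, r) @ A @ mp(X, k - r - 1)).T for r in range(k))
G = num_grad(lambda M: np.trace(A @ mp(M, k)), X)
assert np.allclose(G, expected, rtol=1e-4, atol=1e-4)
```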

5.29 Derivative of $\text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B})$

Formula: \begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) &= \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X} \notag \\ &\quad + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X} + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top \notag \end{align}
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{B}$ is an $N \times K$ matrix, $\boldsymbol{C}$ is an $M \times M$ matrix
Proof

Since $\boldsymbol{X}$ appears in four positions in this compound expression, by the product rule the derivative is the sum of four contributions, one for each occurrence. Each contribution is obtained by replacing that occurrence of $\boldsymbol{X}$ (or $\boldsymbol{X}^\top$) with $\boldsymbol{E}_{pq}$ (or $\boldsymbol{E}_{pq}^\top$) inside the trace, where $\boldsymbol{E}_{pq}$ is the matrix with 1 at position $(p, q)$ and 0 elsewhere.

Term 1 (differentiation at the leftmost $\boldsymbol{X}^\top$):

\begin{eqnarray} \text{tr}(\boldsymbol{B}^\top \boldsymbol{E}_{pq}^\top \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) \end{eqnarray}

Using the cyclic property of trace and $\text{tr}(\boldsymbol{E}_{qp}\boldsymbol{M}) = M_{pq}$, this term gives $(\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top)_{pq}$.

Term 2 (differentiation at $\boldsymbol{X}$ in $\boldsymbol{C}\boldsymbol{X}$):

\begin{eqnarray} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{E}_{pq}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) \end{eqnarray}

Using the cyclic property of trace, this term gives $(\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X})_{pq}$.

Term 3 (differentiation at $\boldsymbol{X}^\top$ in $\boldsymbol{X}\boldsymbol{X}^\top$):

\begin{eqnarray} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{E}_{pq}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) \end{eqnarray}

This term gives $(\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})_{pq}$.

Term 4 (differentiation at the rightmost $\boldsymbol{X}$):

\begin{eqnarray} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{E}_{pq}\boldsymbol{B}) \end{eqnarray}

This term gives $(\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top)_{pq}$.

Combining all four terms:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}) &= \boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X} \notag \\ &\quad + \boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X} + \boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}^\top\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top \notag \end{align}
Remark: Although this formula is complex, each term is the result of applying the product rule at one of the four positions where $\boldsymbol{X}$ appears. When $\boldsymbol{C}$ is symmetric ($\boldsymbol{C} = \boldsymbol{C}^\top$), the first and fourth terms coincide, as do the second and third, and the derivative simplifies to $2\boldsymbol{C}\boldsymbol{X}\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top + 2\boldsymbol{C}\boldsymbol{X}\boldsymbol{B}\boldsymbol{B}^\top\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$.
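A finite-difference check of the four-term formula in 5.29 (a sketch with arbitrary sizes; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(12)
X = 0.5 * rng.standard_normal((3, 4))      # M = 3, N = 4
C = rng.standard_normal((3, 3))
B = rng.standard_normal((4, 2))            # K = 2

f = lambda M: np.trace(B.T @ M.T @ C @ M @ M.T @ C @ M @ B)
expected = (C @ X @ X.T @ C @ X @ B @ B.T
            + C.T @ X @ B @ B.T @ X.T @ C.T @ X
            + C @ X @ B @ B.T @ X.T @ C @ X
            + C.T @ X @ X.T @ C.T @ X @ B @ B.T)
G = num_grad(f, X)
assert np.allclose(G, expected, rtol=1e-4, atol=1e-4)
```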

5.30 Derivative of $\text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B})$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -\boldsymbol{X}^{-\top}\boldsymbol{A}^\top\boldsymbol{B}^\top\boldsymbol{X}^{-\top}$
Conditions: $\boldsymbol{X}$ is an $N \times N$ invertible matrix, $\boldsymbol{A}$, $\boldsymbol{B}$ are constant matrices of appropriate size
Proof

We use the inverse matrix derivative formula (derived in 8.2 and already used in the proof of 5.26):

\begin{eqnarray} \displaystyle\frac{\partial \boldsymbol{X}^{-1}}{\partial X_{pq}} = -\boldsymbol{X}^{-1} \boldsymbol{E}_{pq} \boldsymbol{X}^{-1} \end{eqnarray}

where $\boldsymbol{E}_{pq}$ is the matrix with 1 only at position $(p, q)$. Differentiating $\text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B})$ with respect to $X_{pq}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) &=& \text{tr}\left( \boldsymbol{A} \displaystyle\frac{\partial \boldsymbol{X}^{-1}}{\partial X_{pq}} \boldsymbol{B} \right) \\ &=& \text{tr}(-\boldsymbol{A} \boldsymbol{X}^{-1} \boldsymbol{E}_{pq} \boldsymbol{X}^{-1} \boldsymbol{B}) \\ &=& -\text{tr}(\boldsymbol{X}^{-1} \boldsymbol{B} \boldsymbol{A} \boldsymbol{X}^{-1} \boldsymbol{E}_{pq}) \quad (\text{cyclic property of trace}) \end{eqnarray}

Since $\text{tr}(\boldsymbol{M} \boldsymbol{E}_{pq}) = M_{qp}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) &=& -(\boldsymbol{X}^{-1} \boldsymbol{B} \boldsymbol{A} \boldsymbol{X}^{-1})_{qp} \\ &=& -((\boldsymbol{X}^{-1} \boldsymbol{B} \boldsymbol{A} \boldsymbol{X}^{-1})^\top)_{pq} \\ &=& -(\boldsymbol{X}^{-\top} \boldsymbol{A}^\top \boldsymbol{B}^\top \boldsymbol{X}^{-\top})_{pq} \end{eqnarray}

Therefore $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{-1}\boldsymbol{B}) = -\boldsymbol{X}^{-\top}\boldsymbol{A}^\top\boldsymbol{B}^\top\boldsymbol{X}^{-\top}$.

Remark: This is equivalent to $-(\boldsymbol{X}^{-1}\boldsymbol{B}\boldsymbol{A}\boldsymbol{X}^{-1})^\top$. When $\boldsymbol{A} = \boldsymbol{I}$, this reduces to the formula in 4.4.
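A finite-difference check of 5.30 on a well-conditioned invertible matrix (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(13)
X = 2 * np.eye(4) + 0.2 * rng.standard_normal((4, 4))   # comfortably invertible
A = rng.standard_normal((2, 4))
B = rng.standard_normal((4, 2))

Xi = np.linalg.inv(X)
G = num_grad(lambda M: np.trace(A @ np.linalg.inv(M) @ B), X)
assert np.allclose(G, -Xi.T @ A.T @ B.T @ Xi.T, rtol=1e-4, atol=1e-5)
```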

5.31 Derivative of $\text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{A}]$ ($\boldsymbol{C}$: symmetric)

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{A}] = -\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{A}+\boldsymbol{A}^\top)(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}$
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{C}$ is an $N \times N$ symmetric matrix, $\boldsymbol{A}$ is an $M \times M$ matrix
Proof

Let $\boldsymbol{W} = \boldsymbol{X}^\top \boldsymbol{C} \boldsymbol{X}$ (an $M \times M$ matrix). First, we compute the derivative of $\boldsymbol{W}$ with respect to $X_{pq}$. Since $W_{ij} = \displaystyle\sum_{k,l} X_{ki} C_{kl} X_{lj}$:

\begin{eqnarray} \displaystyle\frac{\partial W_{ij}}{\partial X_{pq}} &=& \displaystyle\sum_l C_{pl} X_{lj} \delta_{iq} + \displaystyle\sum_k X_{ki} C_{kp} \delta_{jq} \\ &=& (\boldsymbol{C}\boldsymbol{X})_{pj} \delta_{iq} + (\boldsymbol{X}^\top\boldsymbol{C})_{ip} \delta_{jq} \end{eqnarray}

Using the inverse matrix derivative formula $\displaystyle\frac{\partial \boldsymbol{W}^{-1}}{\partial W_{ij}} = -\boldsymbol{W}^{-1} \boldsymbol{E}_{ij} \boldsymbol{W}^{-1}$ and the chain rule:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A}) &=& \displaystyle\sum_{i,j} \displaystyle\frac{\partial \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A})}{\partial W_{ij}} \cdot \displaystyle\frac{\partial W_{ij}}{\partial X_{pq}} \end{eqnarray}

Substituting $\displaystyle\frac{\partial}{\partial W_{ij}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A}) = -(\boldsymbol{W}^{-1}\boldsymbol{A}\boldsymbol{W}^{-1})_{ji}$ and using $\boldsymbol{C} = \boldsymbol{C}^\top$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial X_{pq}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{A}) &=& -(\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}\boldsymbol{A}\boldsymbol{W}^{-1})_{pq} - (\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}\boldsymbol{A}^\top\boldsymbol{W}^{-1})_{pq} \end{eqnarray}

Therefore $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{A}] = -\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{A}+\boldsymbol{A}^\top)(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}$.
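A finite-difference check of 5.31, with $\boldsymbol{C}$ chosen symmetric positive definite so that $\boldsymbol{W} = \boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$ is invertible (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(14)
X = rng.standard_normal((5, 3))            # N = 5, M = 3, full column rank a.s.
R = rng.standard_normal((5, 5))
C = R @ R.T + np.eye(5)                    # symmetric positive definite
A = rng.standard_normal((3, 3))            # general, not symmetric

Wi = np.linalg.inv(X.T @ C @ X)
G = num_grad(lambda M: np.trace(np.linalg.inv(M.T @ C @ M) @ A), X)
assert np.allclose(G, -C @ X @ Wi @ (A + A.T) @ Wi, rtol=1e-4, atol=1e-5)
```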

5.32 Derivative of $\text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})]$ ($\boldsymbol{B}, \boldsymbol{C}$: symmetric)

Formula: \begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] &= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\ &\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \end{align}
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{B}$, $\boldsymbol{C}$ are $N \times N$ symmetric matrices
Proof

Let $\boldsymbol{W} = \boldsymbol{X}^\top \boldsymbol{C} \boldsymbol{X}$ and $\boldsymbol{V} = \boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{X}$. By the product rule, the derivative splits into the contribution from varying $\boldsymbol{W}^{-1}$ with $\boldsymbol{V}$ held fixed, plus the contribution from varying $\boldsymbol{V}$ with $\boldsymbol{W}^{-1}$ held fixed:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{V}) &=& \left.\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{V})\right|_{\boldsymbol{V}\ \text{fixed}} + \left.\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{V})\right|_{\boldsymbol{W}\ \text{fixed}} \end{eqnarray}

Term 1 (derivative of $\boldsymbol{W}^{-1}$, with $\boldsymbol{V}$ fixed): Substituting $\boldsymbol{A} = \boldsymbol{V} = \boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}$ (symmetric) into the result of 5.31:

\begin{eqnarray} -\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}(\boldsymbol{V}+\boldsymbol{V}^\top)\boldsymbol{W}^{-1} &=& -2\boldsymbol{C}\boldsymbol{X}\boldsymbol{W}^{-1}\boldsymbol{V}\boldsymbol{W}^{-1} \\ &=& -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \end{eqnarray}

Term 2 (derivative of $\boldsymbol{V}$, with $\boldsymbol{W}^{-1}$ fixed): By 5.22, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{W}^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}) = 2\boldsymbol{B}\boldsymbol{X}\boldsymbol{W}^{-1}$ (when $\boldsymbol{B}$ is symmetric).

Therefore:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] &= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\ &\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \end{align}
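A finite-difference check of 5.32, with $\boldsymbol{B}$, $\boldsymbol{C}$ symmetric positive definite (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(15)
X = rng.standard_normal((5, 3))
Rb = rng.standard_normal((5, 5))
Rc = rng.standard_normal((5, 5))
B = Rb @ Rb.T + np.eye(5)                  # symmetric positive definite
C = Rc @ Rc.T + np.eye(5)                  # symmetric positive definite

Wi = np.linalg.inv(X.T @ C @ X)
f = lambda M: np.trace(np.linalg.inv(M.T @ C @ M) @ (M.T @ B @ M))
expected = -2 * C @ X @ Wi @ (X.T @ B @ X) @ Wi + 2 * B @ X @ Wi
G = num_grad(f, X)
assert np.allclose(G, expected, rtol=1e-4, atol=1e-5)
```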

5.33 Derivative of $\text{tr}[(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})]$ ($\boldsymbol{B}, \boldsymbol{C}$: symmetric)

Formula: \begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] &= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\ &\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \end{align}
Conditions: $\boldsymbol{X}$ is an $N \times M$ matrix, $\boldsymbol{A}$ is an $M \times M$ symmetric constant matrix, $\boldsymbol{B}$, $\boldsymbol{C}$ are $N \times N$ symmetric matrices
Proof

Let $\boldsymbol{W} = \boldsymbol{A} + \boldsymbol{X}^\top \boldsymbol{C} \boldsymbol{X}$ and $\boldsymbol{V} = \boldsymbol{X}^\top \boldsymbol{B} \boldsymbol{X}$. In the derivative of $\boldsymbol{W}$ with respect to $X_{pq}$, the constant term $\boldsymbol{A}$ vanishes:

\begin{eqnarray} \displaystyle\frac{\partial W_{ij}}{\partial X_{pq}} = (\boldsymbol{C}\boldsymbol{X})_{pj} \delta_{iq} + (\boldsymbol{X}^\top\boldsymbol{C})_{ip} \delta_{jq} \end{eqnarray}

This has the same form as in 5.32. Moreover, since $\boldsymbol{A}$ is symmetric, $\boldsymbol{W} = \boldsymbol{A} + \boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$ remains symmetric, so the argument of 5.32 carries over verbatim; the result is obtained simply by replacing $\boldsymbol{W} = \boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$ with $\boldsymbol{W} = \boldsymbol{A} + \boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X}$:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}[(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}(\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X})] &= -2\boldsymbol{C}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}\boldsymbol{X}^\top\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \\ &\quad + 2\boldsymbol{B}\boldsymbol{X}(\boldsymbol{A}+\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1} \notag \end{align}
Remark: This family of formulas plays an important role in the derivation of least squares and generalized least squares (GLS) estimators. The form $(\boldsymbol{X}^\top\boldsymbol{C}\boldsymbol{X})^{-1}$ appears in the variance-covariance matrix of the weighted least squares estimator.

Trace Derivatives of Elementary Functions

We discuss the derivative of the trace of a matrix function $f(\boldsymbol{X})$. Matrix functions are defined by their Taylor series, and for a diagonalizable matrix $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$, $f(\boldsymbol{X}) = \boldsymbol{P} f(\boldsymbol{\Lambda}) \boldsymbol{P}^{-1}$, where $f(\boldsymbol{\Lambda})$ is the diagonal matrix obtained by applying $f$ to the eigenvalues.

In general, for a matrix function $f$ defined by a convergent Taylor series, the derivative of its trace is given by:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(f(\boldsymbol{X})) = f'(\boldsymbol{X})^\top \end{eqnarray}

where $f'(\boldsymbol{X})$ is the derivative of $f$ applied to the matrix $\boldsymbol{X}$. We prove this for individual functions below.

5.34 Derivative of $\text{tr}(\exp(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\exp(\boldsymbol{X})) = \exp(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof

The matrix exponential is defined by its Taylor series:

\begin{eqnarray} \exp(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{X}^k}{k!} = \boldsymbol{I} + \boldsymbol{X} + \displaystyle\frac{\boldsymbol{X}^2}{2!} + \displaystyle\frac{\boldsymbol{X}^3}{3!} + \cdots \end{eqnarray}

Taking the trace and differentiating term by term:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\exp(\boldsymbol{X})) &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{k}{k!} (\boldsymbol{X}^{k-1})^\top \\ &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(k-1)!} (\boldsymbol{X}^{k-1})^\top \\ &=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{m!} (\boldsymbol{X}^m)^\top \quad (m = k-1) \\ &=& \exp(\boldsymbol{X})^\top \end{eqnarray}
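A finite-difference check of 5.34. To keep the sketch self-contained, the matrix exponential is approximated by a truncated Taylor series (`expm_series` is an ad-hoc helper, adequate for small $\|\boldsymbol{X}\|$; a library routine such as `scipy.linalg.expm` would serve equally well):

```python
import numpy as np

def expm_series(M, terms=30):
    # truncated Taylor series sum_k M^k / k!
    S = np.eye(M.shape[0])
    T = np.eye(M.shape[0])
    for k in range(1, terms):
        T = T @ M / k
        S = S + T
    return S

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(16)
X = 0.3 * rng.standard_normal((3, 3))      # small norm: fast series convergence

G = num_grad(lambda M: np.trace(expm_series(M)), X)
assert np.allclose(G, expm_series(X).T, rtol=1e-4, atol=1e-5)
```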

5.35 Derivative of $\text{tr}(\log(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\log(\boldsymbol{X})) = \boldsymbol{X}^{-\top}$
Conditions: $\boldsymbol{X}$ is an $N \times N$ positive definite matrix
Proof

We consider the matrix logarithm when $\boldsymbol{X}$ is positive definite. Using the trace property $\text{tr}(\log(\boldsymbol{X})) = \log(|\boldsymbol{X}|)$, which follows from the diagonalization $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$: $\text{tr}(\log(\boldsymbol{X})) = \displaystyle\sum_i \log(\lambda_i) = \log(\prod_i \lambda_i) = \log(|\boldsymbol{X}|)$.

Using the determinant derivative formula $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} |\boldsymbol{X}| = |\boldsymbol{X}| \boldsymbol{X}^{-\top}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\log(\boldsymbol{X})) &=& \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \log(|\boldsymbol{X}|) \\ &=& \displaystyle\frac{1}{|\boldsymbol{X}|} \cdot |\boldsymbol{X}| \boldsymbol{X}^{-\top} \\ &=& \boldsymbol{X}^{-\top} \end{eqnarray}
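A finite-difference check of 5.35, exploiting the identity $\text{tr}(\log \boldsymbol{X}) = \log|\boldsymbol{X}|$ used in the proof so that `slogdet` can serve as the objective (a sketch; `num_grad` is an ad-hoc helper):

```python
import numpy as np

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(17)
R = rng.standard_normal((4, 4))
X = R @ R.T + 4 * np.eye(4)                # symmetric positive definite

# tr(log X) = log|X|, so the log-determinant is the scalar objective
G = num_grad(lambda M: np.linalg.slogdet(M)[1], X)
assert np.allclose(G, np.linalg.inv(X).T, rtol=1e-4, atol=1e-5)
```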

5.36 Derivative of $\text{tr}(\sqrt{\boldsymbol{X}})$ ($\boldsymbol{X}$: positive definite)

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sqrt{\boldsymbol{X}}) = \displaystyle\frac{1}{2}(\boldsymbol{X}^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ positive definite matrix
Proof

When $\boldsymbol{X}$ is positive definite, a unique positive definite square root $\boldsymbol{X}^{1/2}$ exists. Applying the extension of 5.27 to the fractional power $k = 1/2$ (valid for positive definite $\boldsymbol{X}$ via the eigendecomposition):

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^{1/2}) &=& \displaystyle\frac{1}{2} (\boldsymbol{X}^{1/2-1})^\top \\ &=& \displaystyle\frac{1}{2} (\boldsymbol{X}^{-1/2})^\top \end{eqnarray}

5.37 Derivative of $\text{tr}(\sin(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sin(\boldsymbol{X})) = \cos(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof

The matrix sine function is defined by its Taylor series:

\begin{eqnarray} \sin(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} \boldsymbol{X}^{2k+1} = \boldsymbol{X} - \displaystyle\frac{\boldsymbol{X}^3}{3!} + \displaystyle\frac{\boldsymbol{X}^5}{5!} - \cdots \end{eqnarray}

Taking the trace and applying the formula from 5.27, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^k) = k(\boldsymbol{X}^{k-1})^\top$, term by term:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sin(\boldsymbol{X})) &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} (2k+1)(\boldsymbol{X}^{2k})^\top \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} (\boldsymbol{X}^{2k})^\top \\ &=& \left( \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \boldsymbol{X}^{2k} \right)^\top \\ &=& \cos(\boldsymbol{X})^\top \end{eqnarray}
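A finite-difference check of 5.37, with the matrix sine and cosine built from their truncated Taylor series (`sin_series` and `cos_series` are ad-hoc helpers for small $\|\boldsymbol{X}\|$; library routines such as `scipy.linalg.sinm`/`cosm` would serve equally well):

```python
import numpy as np

def sin_series(M, terms=10):
    # sum_k (-1)^k M^(2k+1) / (2k+1)!
    S = np.zeros_like(M)
    T = M.copy()
    for k in range(terms):
        S = S + T
        T = -T @ M @ M / ((2 * k + 2) * (2 * k + 3))
    return S

def cos_series(M, terms=10):
    # sum_k (-1)^k M^(2k) / (2k)!
    S = np.eye(M.shape[0])
    T = np.eye(M.shape[0])
    for k in range(terms):
        T = -T @ M @ M / ((2 * k + 1) * (2 * k + 2))
        S = S + T
    return S

def num_grad(f, X, h=1e-6):
    # entry-wise central differences
    G = np.zeros_like(X)
    for idx in np.ndindex(*X.shape):
        E = np.zeros_like(X)
        E[idx] = h
        G[idx] = (f(X + E) - f(X - E)) / (2 * h)
    return G

rng = np.random.default_rng(18)
X = 0.3 * rng.standard_normal((3, 3))      # small norm: fast series convergence

G = num_grad(lambda M: np.trace(sin_series(M)), X)
assert np.allclose(G, cos_series(X).T, rtol=1e-4, atol=1e-5)
```

The companion identity 5.38 follows from the same two helpers by swapping the roles of sine and cosine (with a sign change).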

5.38 Derivative of $\text{tr}(\cos(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cos(\boldsymbol{X})) = -\sin(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof

The matrix cosine function is defined by its Taylor series:

\begin{eqnarray} \cos(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \boldsymbol{X}^{2k} = \boldsymbol{I} - \displaystyle\frac{\boldsymbol{X}^2}{2!} + \displaystyle\frac{\boldsymbol{X}^4}{4!} - \cdots \end{eqnarray}

We take the trace and differentiate term by term; the $k = 0$ term is the constant $\boldsymbol{I}$, so its derivative is $\boldsymbol{O}$:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cos(\boldsymbol{X})) &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} (2k)(\boldsymbol{X}^{2k-1})^\top \\ &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k-1)!} (\boldsymbol{X}^{2k-1})^\top \\ &=& -\displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{(-1)^m}{(2m+1)!} (\boldsymbol{X}^{2m+1})^\top \quad (m = k-1) \\ &=& -\sin(\boldsymbol{X})^\top \end{eqnarray}

General Formula: Trace Derivative of a Matrix Function

Formula: $$\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(f(\boldsymbol{X})) = f'(\boldsymbol{X})^\top$$ More generally, when $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$: $$\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $f$ is analytic (has a Taylor series expansion). For the version with $\boldsymbol{A}$, $\boldsymbol{A}$ and $\boldsymbol{X}$ must commute.
Proof

Since $f$ is analytic, it has a Taylor series expansion $f(x) = \displaystyle\sum_{k=0}^{\infty} c_k x^k$. The matrix function is defined as $f(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} c_k \boldsymbol{X}^k$, so:

\begin{align} \text{tr}(f(\boldsymbol{X})) = \sum_{k=0}^{\infty} c_k \,\text{tr}(\boldsymbol{X}^k) \notag \end{align}

By the method of 5.34 (term-by-term differentiation of power traces), $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}^k) = k(\boldsymbol{X}^{k-1})^\top$. This gives the same result whether one differentiates $\text{tr}(\boldsymbol{X}^k) = \displaystyle\sum_i \lambda_i^k$ as a scalar or differentiates each term of the Taylor series directly.

Applying term-by-term differentiation:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(f(\boldsymbol{X})) &= \sum_{k=1}^{\infty} c_k \cdot k (\boldsymbol{X}^{k-1})^\top \notag \\ &= \left( \sum_{k=1}^{\infty} k\, c_k \boldsymbol{X}^{k-1} \right)^\top = f'(\boldsymbol{X})^\top \notag \end{align}

Here $f'(x) = \displaystyle\sum_{k=1}^{\infty} k\, c_k x^{k-1}$ is the scalar derivative of $f$, and $f'(\boldsymbol{X})$ is obtained by substituting $\boldsymbol{X}$ into this series.

For the version with $\boldsymbol{A}$: when $\boldsymbol{A}$ and $\boldsymbol{X}$ commute and are both diagonalizable, they can be diagonalized simultaneously: $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$, $\boldsymbol{A} = \boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{-1}$ ($\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_N)$, $\boldsymbol{D} = \text{diag}(d_1, \ldots, d_N)$). Then:

\begin{align} \text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = \sum_{i=1}^{N} d_i f(\lambda_i) \notag \end{align}

Taking the scalar derivative $f'(\lambda_i)$ with respect to each $\lambda_i$ and reconstructing in matrix form, the same argument gives $(\boldsymbol{A}f'(\boldsymbol{X}))^\top$. $\square$

Remark: By this general formula, all formulas 5.39--5.58 below follow by simply substituting the appropriate $f$ and $f'$.
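The general formula itself can be probed numerically, using the fact from the proof that a matrix function acts on the eigenvalues: build $f(\boldsymbol{X})$ by eigendecomposition, then compare a finite-difference gradient of $\text{tr}(f(\boldsymbol{X}))$ with $f'(\boldsymbol{X})^\top$. A sketch assuming NumPy; the test function $f(x) = x e^x$, the helper functions, and the random matrix are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X = 0.3 * rng.standard_normal((4, 4))

def matfunc(f, M):
    # f(M) via eigendecomposition: f acts on the eigenvalues
    # (valid for diagonalizable M).
    w, V = np.linalg.eig(M.astype(complex))
    return ((V * f(w)) @ np.linalg.inv(V)).real

f  = lambda x: x * np.exp(x)        # an arbitrary analytic test function
df = lambda x: (1 + x) * np.exp(x)  # its scalar derivative

def num_grad(g, X, h=1e-6):
    # Central-difference gradient in denominator layout.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (g(X + E) - g(X - E)) / (2 * h)
    return G

# General formula: d/dX tr(f(X)) = f'(X)^T
G_num = num_grad(lambda M: np.trace(matfunc(f, M)), X)
assert np.allclose(G_num, matfunc(df, X).T, atol=1e-5)
```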

5.39 Derivative of $\text{tr}(\tan(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tan(\boldsymbol{X})) = \sec^2(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\cos(\boldsymbol{X})$ is invertible
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\tan(x) = \sec^2(x)$. Substituting $f(x) = \tan(x)$, $f'(x) = \sec^2(x)$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tan(\boldsymbol{X})) = \sec^2(\boldsymbol{X})^\top \qquad \square \notag \end{align}

Here $\sec^2(\boldsymbol{X}) = \cos(\boldsymbol{X})^{-2}$ is defined as the square of the inverse of the matrix cosine.

Remark: Here $\sec(\boldsymbol{X}) = \cos(\boldsymbol{X})^{-1}$.
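Since $\sec^2(\boldsymbol{X})$ is built from the inverse of the matrix cosine, a numerical check is a useful guard against convention mistakes. A sketch assuming NumPy/SciPy; the helper and the random test matrix are illustrative:

```python
import numpy as np
from scipy.linalg import tanm, cosm, inv

rng = np.random.default_rng(2)
X = 0.2 * rng.standard_normal((4, 4))  # small, so cos(X) is invertible

def num_grad(g, X, h=1e-6):
    # Central-difference gradient in denominator layout.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (g(X + E) - g(X - E)) / (2 * h)
    return G

Ci = inv(cosm(X))   # sec(X) = cos(X)^{-1}
sec2 = Ci @ Ci      # sec^2(X) = cos(X)^{-2}

# 5.39: d/dX tr(tan(X)) = sec^2(X)^T
G_num = num_grad(lambda M: np.trace(tanm(M)), X)
assert np.allclose(G_num, sec2.T, atol=1e-6)
```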

5.40 Derivative of $\text{tr}(\arcsin(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arcsin(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\|\boldsymbol{X}\| < 1$
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\arcsin(x) = \displaystyle\frac{1}{\sqrt{1-x^2}}$. Substituting $f(x) = \arcsin(x)$, $f'(x) = (1-x^2)^{-1/2}$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arcsin(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag \end{align}

Here $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{I}-\boldsymbol{X}^2$.
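SciPy has no built-in matrix arcsine, but $\text{tr}(\arcsin(\boldsymbol{X}))$ can be evaluated through the eigenvalues (the trace of a matrix function is the sum of $f$ over the eigenvalues), which is enough for a numerical check against $((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$. A sketch assuming NumPy/SciPy and a small random $\boldsymbol{X}$ with $\|\boldsymbol{X}\| < 1$; the helpers are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm, inv

rng = np.random.default_rng(3)
X = 0.2 * rng.standard_normal((4, 4))  # spectral radius well below 1

def tr_arcsin(M):
    # tr(f(M)) = sum_i f(lambda_i); complex conjugate pairs sum to a real value
    return np.sum(np.arcsin(np.linalg.eigvals(M.astype(complex)))).real

def num_grad(g, X, h=1e-6):
    # Central-difference gradient in denominator layout.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (g(X + E) - g(X - E)) / (2 * h)
    return G

# 5.40: d/dX tr(arcsin(X)) = ((I - X^2)^{-1/2})^T
G_ana = np.real(inv(sqrtm(np.eye(4) - X @ X))).T
assert np.allclose(num_grad(tr_arcsin, X), G_ana, atol=1e-5)
```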

5.41 Derivative of $\text{tr}(\arccos(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arccos(\boldsymbol{X})) = -((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\|\boldsymbol{X}\| < 1$
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\arccos(x) = -\displaystyle\frac{1}{\sqrt{1-x^2}}$. Substituting $f(x) = \arccos(x)$, $f'(x) = -(1-x^2)^{-1/2}$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arccos(\boldsymbol{X})) = -((\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag \end{align}

Here $f'(\boldsymbol{X})$ involves the same matrix $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2}$ as in 5.40; only the sign differs.

5.42 Derivative of $\text{tr}(\arctan(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arctan(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\arctan(x) = \displaystyle\frac{1}{1+x^2}$. Substituting $f(x) = \arctan(x)$, $f'(x) = (1+x^2)^{-1}$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\arctan(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top \qquad \square \notag \end{align}

Here $(\boldsymbol{I}+\boldsymbol{X}^2)^{-1}$ is the inverse of the matrix $\boldsymbol{I}+\boldsymbol{X}^2$.

5.43 Derivative of $\text{tr}(\sinh(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sinh(\boldsymbol{X})) = \cosh(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof

The matrix hyperbolic sine function is defined by its Taylor series:

\begin{eqnarray} \sinh(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{X}^{2k+1}}{(2k+1)!} = \boldsymbol{X} + \displaystyle\frac{\boldsymbol{X}^3}{3!} + \displaystyle\frac{\boldsymbol{X}^5}{5!} + \cdots \end{eqnarray}

Differentiating term by term as in 5.37:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\sinh(\boldsymbol{X})) &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(2k+1)}{(2k+1)!} (\boldsymbol{X}^{2k})^\top \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} (\boldsymbol{X}^{2k})^\top \\ &=& \cosh(\boldsymbol{X})^\top \end{eqnarray}

5.44 Derivative of $\text{tr}(\cosh(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cosh(\boldsymbol{X})) = \sinh(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof

The matrix hyperbolic cosine function is defined by its Taylor series:

\begin{eqnarray} \cosh(\boldsymbol{X}) = \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{\boldsymbol{X}^{2k}}{(2k)!} = \boldsymbol{I} + \displaystyle\frac{\boldsymbol{X}^2}{2!} + \displaystyle\frac{\boldsymbol{X}^4}{4!} + \cdots \end{eqnarray}

Differentiating term by term as in 5.38:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\cosh(\boldsymbol{X})) &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(2k)}{(2k)!} (\boldsymbol{X}^{2k-1})^\top \\ &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(2k-1)!} (\boldsymbol{X}^{2k-1})^\top \\ &=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{(2m+1)!} (\boldsymbol{X}^{2m+1})^\top \quad (m = k-1) \\ &=& \sinh(\boldsymbol{X})^\top \end{eqnarray}
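The two hyperbolic formulas above (5.43 and 5.44) can be checked together with the same finite-difference approach. A sketch assuming NumPy/SciPy; the helper and the random test matrix are illustrative:

```python
import numpy as np
from scipy.linalg import sinhm, coshm  # matrix hyperbolic sine and cosine

rng = np.random.default_rng(4)
X = 0.2 * rng.standard_normal((4, 4))

def num_grad(g, X, h=1e-6):
    # Central-difference gradient in denominator layout.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (g(X + E) - g(X - E)) / (2 * h)
    return G

# 5.43: d/dX tr(sinh(X)) = cosh(X)^T
G_sinh = num_grad(lambda M: np.trace(sinhm(M)), X)
assert np.allclose(G_sinh, coshm(X).T, atol=1e-6)

# 5.44: d/dX tr(cosh(X)) = sinh(X)^T
G_cosh = num_grad(lambda M: np.trace(coshm(M)), X)
assert np.allclose(G_cosh, sinhm(X).T, atol=1e-6)
```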

5.45 Derivative of $\text{tr}(\tanh(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tanh(\boldsymbol{X})) = \text{sech}^2(\boldsymbol{X})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\cosh(\boldsymbol{X})$ is invertible
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\tanh(x) = \text{sech}^2(x)$. Substituting $f(x) = \tanh(x)$, $f'(x) = \text{sech}^2(x)$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\tanh(\boldsymbol{X})) = \text{sech}^2(\boldsymbol{X})^\top \qquad \square \notag \end{align}

Here $\text{sech}^2(\boldsymbol{X}) = \cosh(\boldsymbol{X})^{-2}$ is defined as the square of the inverse of the matrix hyperbolic cosine.

Remark: Here $\text{sech}(\boldsymbol{X}) = \cosh(\boldsymbol{X})^{-1}$.

5.46 Derivative of $\text{tr}(\text{arcsinh}(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arcsinh}(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\text{arcsinh}(x) = \displaystyle\frac{1}{\sqrt{1+x^2}}$. Substituting $f(x) = \text{arcsinh}(x)$, $f'(x) = (1+x^2)^{-1/2}$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arcsinh}(\boldsymbol{X})) = ((\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag \end{align}

Here $(\boldsymbol{I}+\boldsymbol{X}^2)^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{I}+\boldsymbol{X}^2$.

5.47 Derivative of $\text{tr}(\text{arccosh}(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arccosh}(\boldsymbol{X})) = ((\boldsymbol{X}^2-\boldsymbol{I})^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, all eigenvalues are greater than $1$
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\text{arccosh}(x) = \displaystyle\frac{1}{\sqrt{x^2-1}}$ ($x > 1$). Substituting $f(x) = \text{arccosh}(x)$, $f'(x) = (x^2-1)^{-1/2}$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arccosh}(\boldsymbol{X})) = ((\boldsymbol{X}^2-\boldsymbol{I})^{-1/2})^\top \qquad \square \notag \end{align}

Here $(\boldsymbol{X}^2-\boldsymbol{I})^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{X}^2-\boldsymbol{I}$, which is defined when all eigenvalues are greater than $1$.

5.48 Derivative of $\text{tr}(\text{arctanh}(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arctanh}(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\|\boldsymbol{X}\| < 1$
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\text{arctanh}(x) = \displaystyle\frac{1}{1-x^2}$ ($|x| < 1$). Substituting $f(x) = \text{arctanh}(x)$, $f'(x) = (1-x^2)^{-1}$ into the general formula:

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\text{arctanh}(\boldsymbol{X})) = ((\boldsymbol{I}-\boldsymbol{X}^2)^{-1})^\top \qquad \square \notag \end{align}

Here $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1}$ is the inverse of the matrix $\boldsymbol{I}-\boldsymbol{X}^2$.

5.49 Derivative of $\text{tr}(\boldsymbol{A}\sin(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sin(\boldsymbol{X})) = (\boldsymbol{A}\cos(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

Using the Taylor series as in 5.37:

\begin{eqnarray} \text{tr}(\boldsymbol{A}\sin(\boldsymbol{X})) &=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} \boldsymbol{X}^{2k+1} \right) \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k+1}) \end{eqnarray}

Using the formula from 5.28, $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\boldsymbol{X}^n) = n(\boldsymbol{A}\boldsymbol{X}^{n-1})^\top$ (when $\boldsymbol{A}$ and $\boldsymbol{X}$ commute):

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sin(\boldsymbol{X})) &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k+1)!} (2k+1)(\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} (\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\ &=& (\boldsymbol{A}\cos(\boldsymbol{X}))^\top \end{eqnarray}
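For the $\boldsymbol{A}$-versions, a commuting test pair is easy to manufacture: any polynomial in $\boldsymbol{X}$ commutes with $\boldsymbol{X}$. A numerical sketch of 5.49 assuming NumPy/SciPy; the particular polynomial and the helper are illustrative:

```python
import numpy as np
from scipy.linalg import sinm, cosm

rng = np.random.default_rng(5)
X = 0.2 * rng.standard_normal((4, 4))
A = np.eye(4) + 0.5 * X + 0.25 * X @ X  # polynomial in X, so AX = XA

def num_grad(g, X, h=1e-6):
    # Central-difference gradient in denominator layout.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (g(X + E) - g(X - E)) / (2 * h)
    return G

# 5.49: d/dX tr(A sin(X)) = (A cos(X))^T  (A held constant while X varies)
G_num = num_grad(lambda M: np.trace(A @ sinm(M)), X)
assert np.allclose(G_num, (A @ cosm(X)).T, atol=1e-6)
```

Note that $\boldsymbol{A}$ is held fixed while $\boldsymbol{X}$ is perturbed; the formula concerns the gradient at a point where the two commute.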

5.50 Derivative of $\text{tr}(\boldsymbol{A}\exp(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\exp(\boldsymbol{X})) = (\boldsymbol{A}\exp(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

Expanding $\exp(\boldsymbol{X})$ in its Taylor series and taking the trace:

\begin{eqnarray} \text{tr}(\boldsymbol{A}\exp(\boldsymbol{X})) &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{k!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^k) \end{eqnarray}

When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\exp(\boldsymbol{X})) &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{k}{k!} (\boldsymbol{A}\boldsymbol{X}^{k-1})^\top \\ &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(k-1)!} (\boldsymbol{A}\boldsymbol{X}^{k-1})^\top \\ &=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{m!} (\boldsymbol{A}\boldsymbol{X}^m)^\top \\ &=& (\boldsymbol{A}\exp(\boldsymbol{X}))^\top \end{eqnarray}

Remark: This formula holds when the matrices $\boldsymbol{A}$ and $\boldsymbol{X}$ commute ($\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$). In the non-commuting case, the differentiation becomes more complex, and one needs to use the Fréchet derivative formalism.
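The commutativity caveat in the remark can be seen numerically: with a commuting $\boldsymbol{A}$ (a polynomial in $\boldsymbol{X}$) the formula matches a finite-difference gradient, while a generic non-commuting $\boldsymbol{B}$ fails the same test. A sketch assuming NumPy/SciPy; all matrices and helpers are illustrative random choices:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(6)
X = 0.2 * rng.standard_normal((4, 4))

def num_grad(g, X, h=1e-6):
    # Central-difference gradient in denominator layout.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (g(X + E) - g(X - E)) / (2 * h)
    return G

# Commuting case: A is a polynomial in X, so AX = XA and 5.50 applies.
A = np.eye(4) + 0.5 * X
G_num = num_grad(lambda M: np.trace(A @ expm(M)), X)
assert np.allclose(G_num, (A @ expm(X)).T, atol=1e-6)

# Non-commuting case: a generic random B, the simple formula breaks down.
B = rng.standard_normal((4, 4))
G_num_B = num_grad(lambda M: np.trace(B @ expm(M)), X)
assert not np.allclose(G_num_B, (B @ expm(X)).T, atol=1e-6)
```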

5.51 Derivative of $\text{tr}(\boldsymbol{A}\cos(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cos(\boldsymbol{X})) = -(\boldsymbol{A}\sin(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

Using the Taylor series:

\begin{eqnarray} \text{tr}(\boldsymbol{A}\cos(\boldsymbol{X})) &=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \boldsymbol{X}^{2k} \right) \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k}) \end{eqnarray}

When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cos(\boldsymbol{X})) &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k)!} \cdot 2k \cdot (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\ &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{(-1)^k}{(2k-1)!} (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\ &=& -\displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{(-1)^m}{(2m+1)!} (\boldsymbol{A}\boldsymbol{X}^{2m+1})^\top \\ &=& -(\boldsymbol{A}\sin(\boldsymbol{X}))^\top \end{eqnarray}

5.52 Derivative of $\text{tr}(\boldsymbol{A}\tan(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tan(\boldsymbol{X})) = (\boldsymbol{A}\sec^2(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting), $\cos(\boldsymbol{X})$ is invertible
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\tan(x) = \sec^2(x)$. Substituting $f(x) = \tan(x)$, $f'(x) = \sec^2(x)$ into the version with $\boldsymbol{A}$ of the general formula (requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tan(\boldsymbol{X})) = (\boldsymbol{A}\sec^2(\boldsymbol{X}))^\top \qquad \square \notag \end{align}

Here $\sec^2(\boldsymbol{X}) = \cos(\boldsymbol{X})^{-2}$.

5.53 Derivative of $\text{tr}(\boldsymbol{A}\arcsin(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arcsin(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\|\boldsymbol{X}\| < 1$, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\arcsin(x) = \displaystyle\frac{1}{\sqrt{1-x^2}}$. Substituting $f(x) = \arcsin(x)$, $f'(x) = (1-x^2)^{-1/2}$ into the version with $\boldsymbol{A}$ of the general formula (requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arcsin(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag \end{align}

Here $(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2}$ is the inverse square root of the matrix $\boldsymbol{I}-\boldsymbol{X}^2$.

5.54 Derivative of $\text{tr}(\boldsymbol{A}\arccos(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arccos(\boldsymbol{X})) = -(\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\|\boldsymbol{X}\| < 1$, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\arccos(x) = -\displaystyle\frac{1}{\sqrt{1-x^2}}$. Substituting $f(x) = \arccos(x)$, $f'(x) = -(1-x^2)^{-1/2}$ into the version with $\boldsymbol{A}$ of the general formula (requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arccos(\boldsymbol{X})) = -(\boldsymbol{A}(\boldsymbol{I}-\boldsymbol{X}^2)^{-1/2})^\top \qquad \square \notag \end{align}

Here the matrix $f'(\boldsymbol{X})$ differs from 5.53 only in sign.

5.55 Derivative of $\text{tr}(\boldsymbol{A}\arctan(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arctan(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

In the scalar case, $\displaystyle\frac{d}{dx}\arctan(x) = \displaystyle\frac{1}{1+x^2}$. Substituting $f(x) = \arctan(x)$, $f'(x) = (1+x^2)^{-1}$ into the version with $\boldsymbol{A}$ of the general formula (requires $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$):

\begin{align} \frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\arctan(\boldsymbol{X})) = (\boldsymbol{A}(\boldsymbol{I}+\boldsymbol{X}^2)^{-1})^\top \qquad \square \notag \end{align}

Here $(\boldsymbol{I}+\boldsymbol{X}^2)^{-1}$ is the inverse of the matrix $\boldsymbol{I}+\boldsymbol{X}^2$.

5.56 Derivative of $\text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X})) = (\boldsymbol{A}\cosh(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

Using the Taylor series:

\begin{eqnarray} \text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X})) &=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k+1)!} \boldsymbol{X}^{2k+1} \right) \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k+1)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k+1}) \end{eqnarray}

When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\sinh(\boldsymbol{X})) &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k+1)!} \cdot (2k+1) \cdot (\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} (\boldsymbol{A}\boldsymbol{X}^{2k})^\top \\ &=& (\boldsymbol{A}\cosh(\boldsymbol{X}))^\top \end{eqnarray}

5.57 Derivative of $\text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X})) = (\boldsymbol{A}\sinh(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting)
Proof

Using the Taylor series:

\begin{eqnarray} \text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X})) &=& \text{tr}\left( \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} \boldsymbol{X}^{2k} \right) \\ &=& \displaystyle\sum_{k=0}^{\infty} \displaystyle\frac{1}{(2k)!} \text{tr}(\boldsymbol{A}\boldsymbol{X}^{2k}) \end{eqnarray}

When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute, differentiating term by term:

\begin{eqnarray} \displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\cosh(\boldsymbol{X})) &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(2k)!} \cdot 2k \cdot (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\ &=& \displaystyle\sum_{k=1}^{\infty} \displaystyle\frac{1}{(2k-1)!} (\boldsymbol{A}\boldsymbol{X}^{2k-1})^\top \\ &=& \displaystyle\sum_{m=0}^{\infty} \displaystyle\frac{1}{(2m+1)!} (\boldsymbol{A}\boldsymbol{X}^{2m+1})^\top \\ &=& (\boldsymbol{A}\sinh(\boldsymbol{X}))^\top \end{eqnarray}

5.58 Derivative of $\text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X}))$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = (\boldsymbol{A}\text{sech}^2(\boldsymbol{X}))^\top$
Conditions: $\boldsymbol{X}$ is an $N \times N$ square matrix, $\boldsymbol{A}$ is a constant matrix, $\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$ (commuting), $\cosh(\boldsymbol{X})$ is invertible
Proof

The matrix hyperbolic tangent has the power series expansion (convergent for $\|\boldsymbol{X}\| < \pi/2$):

$$\tanh(\boldsymbol{X}) = \boldsymbol{X} - \frac{1}{3}\boldsymbol{X}^3 + \frac{2}{15}\boldsymbol{X}^5 - \cdots$$

More precisely, $\tanh(\boldsymbol{X}) = \sinh(\boldsymbol{X})\cosh(\boldsymbol{X})^{-1}$, which is defined when $\cosh(\boldsymbol{X})$ is invertible.

Assume $\boldsymbol{X}$ is diagonalizable. Setting $\boldsymbol{X} = \boldsymbol{P}\boldsymbol{\Lambda}\boldsymbol{P}^{-1}$ ($\boldsymbol{\Lambda} = \text{diag}(\lambda_1, \ldots, \lambda_N)$), the matrix function acts on the eigenvalues:

$$\tanh(\boldsymbol{X}) = \boldsymbol{P}\,\text{diag}(\tanh(\lambda_1), \ldots, \tanh(\lambda_N))\,\boldsymbol{P}^{-1}$$

When $\boldsymbol{A}$ and $\boldsymbol{X}$ commute ($\boldsymbol{A}\boldsymbol{X} = \boldsymbol{X}\boldsymbol{A}$) and $\boldsymbol{A}$ is also diagonalizable, $\boldsymbol{A}$ is diagonalized by the same eigenvector matrix $\boldsymbol{P}$ (simultaneous diagonalization). Setting $\boldsymbol{P}^{-1}\boldsymbol{A}\boldsymbol{P} = \text{diag}(a_1, \ldots, a_N)$ and using the cyclic property of the trace, $\text{tr}(\boldsymbol{A}\boldsymbol{P}\boldsymbol{D}\boldsymbol{P}^{-1}) = \text{tr}(\boldsymbol{P}^{-1}\boldsymbol{A}\boldsymbol{P}\boldsymbol{D})$:

$$\text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = \sum_{i=1}^{N} a_i \tanh(\lambda_i)$$

Consider the derivative with respect to the $(p,q)$ entry $X_{pq}$ of $\boldsymbol{X}$. By first-order perturbation theory, a simple eigenvalue $\lambda_i$ with right eigenvector $\boldsymbol{p}_i$ (the $i$-th column of $\boldsymbol{P}$) and left eigenvector $\boldsymbol{q}_i^\top$ (the $i$-th row of $\boldsymbol{P}^{-1}$, normalized so that $\boldsymbol{q}_i^\top \boldsymbol{p}_i = 1$) satisfies $\displaystyle\frac{\partial \lambda_i}{\partial X_{pq}} = \boldsymbol{q}_i^\top \boldsymbol{e}_p \boldsymbol{e}_q^\top \boldsymbol{p}_i = (\boldsymbol{P}^{-1})_{ip}\, P_{qi}$. In the scalar case, $\displaystyle\frac{d}{d\lambda}\tanh(\lambda) = \text{sech}^2(\lambda)$, so by the chain rule:

$$\frac{\partial}{\partial X_{pq}} \sum_{i} a_i \tanh(\lambda_i) = \sum_{i} a_i\,\text{sech}^2(\lambda_i) \cdot \frac{\partial \lambda_i}{\partial X_{pq}}$$

Substituting $\displaystyle\frac{\partial \lambda_i}{\partial X_{pq}} = (\boldsymbol{P}^{-1})_{ip}\, P_{qi}$ (the first-order perturbation of a simple eigenvalue), the sum becomes

$$\sum_{i} P_{qi}\, a_i\, \text{sech}^2(\lambda_i)\, (\boldsymbol{P}^{-1})_{ip} = \left( \boldsymbol{P}\,\text{diag}(a_1 \text{sech}^2(\lambda_1), \ldots, a_N \text{sech}^2(\lambda_N))\,\boldsymbol{P}^{-1} \right)_{qp} = (\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))_{qp},$$

which is the $(p,q)$ entry of $(\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))^\top$. This agrees with the general formula $\displaystyle\frac{\partial}{\partial \boldsymbol{X}}\text{tr}(\boldsymbol{A}f(\boldsymbol{X})) = (\boldsymbol{A}f'(\boldsymbol{X}))^\top$ (when $\boldsymbol{A}$ and $\boldsymbol{X}$ commute) with $f = \tanh$ and $f' = \text{sech}^2$:

$$\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X})) = (\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))^\top \qquad \square$$
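As a final numerical check of 5.58, compare a finite-difference gradient of $\text{tr}(\boldsymbol{A}\tanh(\boldsymbol{X}))$ with $(\boldsymbol{A}\,\text{sech}^2(\boldsymbol{X}))^\top$, taking $\boldsymbol{A}$ as a polynomial in $\boldsymbol{X}$ so the two commute. A sketch assuming NumPy/SciPy; the helper and matrices are illustrative:

```python
import numpy as np
from scipy.linalg import tanhm, coshm, inv

rng = np.random.default_rng(7)
X = 0.2 * rng.standard_normal((4, 4))
A = np.eye(4) + 0.5 * X  # polynomial in X, so AX = XA

def num_grad(g, X, h=1e-6):
    # Central-difference gradient in denominator layout.
    G = np.zeros_like(X)
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            E = np.zeros_like(X)
            E[i, j] = h
            G[i, j] = (g(X + E) - g(X - E)) / (2 * h)
    return G

Ci = inv(coshm(X))  # sech(X) = cosh(X)^{-1}
sech2 = Ci @ Ci     # sech^2(X) = cosh(X)^{-2}

# 5.58: d/dX tr(A tanh(X)) = (A sech^2(X))^T
G_num = num_grad(lambda M: np.trace(A @ tanhm(M)), X)
assert np.allclose(G_num, (A @ sech2).T, atol=1e-6)
```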

References

  • Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.
  • Magnus, J. R., & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). Wiley.
  • Matrix calculus - Wikipedia