Proofs Chapter 11: Matrix Powers and Composite Function Derivatives
This chapter proves differentiation formulas for matrix powers and composite functions. The derivative of the matrix exponential exp(A) plays a central role in optimization on Lie groups (pose estimation in robotics, SLAM), sensitivity analysis of continuous-time Markov chains, and matrix methods for ordinary differential equations. The derivative of the Rayleigh quotient is needed for convergence analysis of iterative eigenvalue solvers (inverse iteration, Lanczos method).
Prerequisites: Chapter 5 (Trace Derivatives), Chapter 9 (Eigenvalue Derivatives). Related chapters: Chapter 14 (Matrix Chain Rule).
11. Derivatives of Matrix Powers and Composite Functions
Unless otherwise stated, all formulas in this chapter hold under the following conditions:
- All formulas use the denominator layout
- The matrix $\boldsymbol{X}$ is assumed to be square
- $\boldsymbol{X}^0 = \boldsymbol{I}$ (identity matrix) by definition
We derive differentiation formulas for functions involving the $n$-th power of a matrix $\boldsymbol{X}^n$. The product rule is repeatedly applied to $\boldsymbol{X}^n = \underbrace{\boldsymbol{X} \cdot \boldsymbol{X} \cdots \boldsymbol{X}}_{n \text{ factors}}$. We also derive differentiation formulas for the composite function $\boldsymbol{s}(\boldsymbol{x})^\top \boldsymbol{A} \boldsymbol{r}(\boldsymbol{x})$ and the Rayleigh quotient.
11.1 Component-wise Derivative of Matrix Powers
Proof
Write $\boldsymbol{X}^n$ as a product of $n$ matrices.
\begin{equation}\boldsymbol{X}^n = \underbrace{\boldsymbol{X} \cdot \boldsymbol{X} \cdots \boldsymbol{X}}_{n \text{ factors}} \label{eq:11-1-1}\end{equation}
Differentiate the product of $n$ matrices with respect to $X_{ij}$. Repeatedly apply the product rule (1.25: Leibniz rule).
First consider the case $n = 2$. Apply the product rule (1.25) to $\boldsymbol{X}^2 = \boldsymbol{X} \cdot \boldsymbol{X}$.
\begin{equation}\dfrac{\partial \boldsymbol{X}^2}{\partial X_{ij}} = \dfrac{\partial \boldsymbol{X}}{\partial X_{ij}} \cdot \boldsymbol{X} + \boldsymbol{X} \cdot \dfrac{\partial \boldsymbol{X}}{\partial X_{ij}} \label{eq:11-1-2}\end{equation}
For $n = 3$, applying the rule to $\boldsymbol{X}^3 = \boldsymbol{X} \cdot \boldsymbol{X} \cdot \boldsymbol{X}$ gives
\begin{equation}\dfrac{\partial \boldsymbol{X}^3}{\partial X_{ij}} = \dfrac{\partial \boldsymbol{X}}{\partial X_{ij}} \cdot \boldsymbol{X}^2 + \boldsymbol{X} \cdot \dfrac{\partial \boldsymbol{X}}{\partial X_{ij}} \cdot \boldsymbol{X} + \boldsymbol{X}^2 \cdot \dfrac{\partial \boldsymbol{X}}{\partial X_{ij}} \label{eq:11-1-3}\end{equation}
For general $n$, repeatedly applying the product rule (1.25) produces $n$ terms. The term with index $r$ ($r = 0, 1, \ldots, n-1$) arises from differentiating the $(r+1)$-th factor of $\boldsymbol{X}$ from the left.
\begin{equation}\dfrac{\partial \boldsymbol{X}^n}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{X}^r \cdot \dfrac{\partial \boldsymbol{X}}{\partial X_{ij}} \cdot \boldsymbol{X}^{n-1-r} \label{eq:11-1-4}\end{equation}
Here we define $\boldsymbol{X}^0 = \boldsymbol{I}$ (identity matrix).
By 4.3, differentiating the matrix $\boldsymbol{X}$ with respect to its component $X_{ij}$ yields a single-entry matrix.
\begin{equation}\dfrac{\partial \boldsymbol{X}}{\partial X_{ij}} = \boldsymbol{J}^{ij} \label{eq:11-1-5}\end{equation}
Here $\boldsymbol{J}^{ij}$ is the matrix with 1 at position $(i, j)$ and 0 elsewhere.
Substitute $\eqref{eq:11-1-5}$ into $\eqref{eq:11-1-4}$.
\begin{equation}\dfrac{\partial \boldsymbol{X}^n}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \label{eq:11-1-6}\end{equation}
Taking the $(k, l)$ component of $\eqref{eq:11-1-6}$ yields the final result.
\begin{equation}\dfrac{\partial (\boldsymbol{X}^n)_{kl}}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r})_{kl} \label{eq:11-1-7}\end{equation}
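As a numerical sanity check of $\eqref{eq:11-1-7}$, the closed form can be compared against a central finite difference. The following is a minimal sketch assuming NumPy is available; the matrix size, power, and index choice are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 3                      # matrix size and power (test values)
X = rng.standard_normal((m, m))

def dXn_dXij(X, n, i, j):
    """Closed form (11-1-7): sum_r X^r J^{ij} X^{n-1-r}."""
    m = X.shape[0]
    J = np.zeros((m, m)); J[i, j] = 1.0
    P = lambda k: np.linalg.matrix_power(X, k)
    return sum(P(r) @ J @ P(n - 1 - r) for r in range(n))

i, j, eps = 1, 2, 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[i, j] += eps; Xm[i, j] -= eps
fd = (np.linalg.matrix_power(Xp, n) - np.linalg.matrix_power(Xm, n)) / (2 * eps)
print(np.max(np.abs(fd - dXn_dXij(X, n, i, j))))  # small, ~1e-9 or below
```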
11.2 Bilinear Form of Matrix Powers
Proof
Consider the scalar function $f = \boldsymbol{a}^\top \boldsymbol{X}^n \boldsymbol{b}$, where $\boldsymbol{X} \in \mathbb{R}^{m \times m}$ and $\boldsymbol{a}, \boldsymbol{b} \in \mathbb{R}^m$ are constant vectors.
Differentiate $f$ with respect to $X_{ij}$. Since $\boldsymbol{a}$ and $\boldsymbol{b}$ are constants, the differentiation acts only on $\boldsymbol{X}^n$.
\begin{equation}\dfrac{\partial f}{\partial X_{ij}} = \boldsymbol{a}^\top \dfrac{\partial \boldsymbol{X}^n}{\partial X_{ij}} \boldsymbol{b} \label{eq:11-2-1}\end{equation}
From $\eqref{eq:11-1-6}$ in 11.1, the component-wise derivative of $\boldsymbol{X}^n$ is
\begin{equation}\dfrac{\partial \boldsymbol{X}^n}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \label{eq:11-2-2}\end{equation}
Substitute $\eqref{eq:11-2-2}$ into $\eqref{eq:11-2-1}$.
\begin{equation}\dfrac{\partial f}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{a}^\top \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \boldsymbol{b} \label{eq:11-2-3}\end{equation}
Examine each term of $\eqref{eq:11-2-3}$ in detail. Define auxiliary vectors.
\begin{equation}\boldsymbol{u} = (\boldsymbol{X}^r)^\top \boldsymbol{a} \in \mathbb{R}^m \label{eq:11-2-4}\end{equation}
\begin{equation}\boldsymbol{v} = \boldsymbol{X}^{n-1-r} \boldsymbol{b} \in \mathbb{R}^m \label{eq:11-2-5}\end{equation}
Note that $\boldsymbol{a}^\top \boldsymbol{X}^r = ((\boldsymbol{X}^r)^\top \boldsymbol{a})^\top = \boldsymbol{u}^\top$.
\begin{equation}\boldsymbol{a}^\top \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \boldsymbol{b} = \boldsymbol{u}^\top \boldsymbol{J}^{ij} \boldsymbol{v} \label{eq:11-2-6}\end{equation}
Compute $\boldsymbol{u}^\top \boldsymbol{J}^{ij} \boldsymbol{v}$ component-wise. Since $\boldsymbol{J}^{ij}$ has 1 only at position $(i, j)$,
\begin{equation}\boldsymbol{u}^\top \boldsymbol{J}^{ij} \boldsymbol{v} = \displaystyle\sum_{k,l} u_k (\boldsymbol{J}^{ij})_{kl} v_l = \displaystyle\sum_{k,l} u_k \delta_{ki} \delta_{lj} v_l = u_i v_j \label{eq:11-2-7}\end{equation}
Substitute $\eqref{eq:11-2-7}$ into $\eqref{eq:11-2-3}$.
\begin{equation}\dfrac{\partial f}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} u_i v_j = \displaystyle\sum_{r=0}^{n-1} ((\boldsymbol{X}^r)^\top \boldsymbol{a})_i (\boldsymbol{X}^{n-1-r} \boldsymbol{b})_j \label{eq:11-2-8}\end{equation}
Write $\eqref{eq:11-2-8}$ in matrix form. In denominator layout, $\left(\displaystyle\dfrac{\partial f}{\partial \boldsymbol{X}}\right)_{ij} = \displaystyle\dfrac{\partial f}{\partial X_{ij}}$.
The $(i, j)$ component of the outer product $\boldsymbol{p} \boldsymbol{q}^\top$ is $p_i q_j$. From $\eqref{eq:11-2-8}$,
\begin{equation}((\boldsymbol{X}^r)^\top \boldsymbol{a})_i (\boldsymbol{X}^{n-1-r} \boldsymbol{b})_j = ((\boldsymbol{X}^r)^\top \boldsymbol{a} (\boldsymbol{X}^{n-1-r} \boldsymbol{b})^\top)_{ij} \label{eq:11-2-9}\end{equation}
Substitute $(\boldsymbol{X}^{n-1-r} \boldsymbol{b})^\top = \boldsymbol{b}^\top (\boldsymbol{X}^{n-1-r})^\top$.
\begin{equation}(\boldsymbol{X}^r)^\top \boldsymbol{a} (\boldsymbol{X}^{n-1-r} \boldsymbol{b})^\top = (\boldsymbol{X}^r)^\top \boldsymbol{a} \boldsymbol{b}^\top (\boldsymbol{X}^{n-1-r})^\top \label{eq:11-2-10}\end{equation}
We obtain the final result in matrix form.
\begin{equation}\dfrac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top \boldsymbol{X}^n \boldsymbol{b} = \displaystyle\sum_{r=0}^{n-1} (\boldsymbol{X}^r)^\top \boldsymbol{a} \boldsymbol{b}^\top (\boldsymbol{X}^{n-1-r})^\top \label{eq:11-2-11}\end{equation}
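The gradient formula $\eqref{eq:11-2-11}$ can be verified the same way. A minimal sketch, again assuming NumPy and random test data:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 4, 3
X = rng.standard_normal((m, m))
a, b = rng.standard_normal(m), rng.standard_normal(m)
P = lambda k: np.linalg.matrix_power(X, k)

f = lambda Y: a @ np.linalg.matrix_power(Y, n) @ b

# Closed form (11-2-11): sum_r (X^r)^T a b^T (X^{n-1-r})^T
grad = sum(P(r).T @ np.outer(a, b) @ P(n - 1 - r).T for r in range(n))

eps, fd = 1e-6, np.zeros_like(X)
for i in range(m):
    for j in range(m):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps; Xm[i, j] -= eps
        fd[i, j] = (f(Xp) - f(Xm)) / (2 * eps)
print(np.max(np.abs(fd - grad)))  # small
```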
11.3 Gram Form of Matrix Powers
Proof
Consider the scalar function $f = \boldsymbol{a}^\top (\boldsymbol{X}^n)^\top \boldsymbol{X}^n \boldsymbol{b}$. Define auxiliary vectors.
\begin{equation}\boldsymbol{u} = \boldsymbol{X}^n \boldsymbol{a} \in \mathbb{R}^m \label{eq:11-3-1}\end{equation}
\begin{equation}\boldsymbol{v} = \boldsymbol{X}^n \boldsymbol{b} \in \mathbb{R}^m \label{eq:11-3-2}\end{equation}
Then $f$ can be written as an inner product. Since $\boldsymbol{a}^\top (\boldsymbol{X}^n)^\top = (\boldsymbol{X}^n \boldsymbol{a})^\top = \boldsymbol{u}^\top$,
\begin{equation}f = \boldsymbol{u}^\top \boldsymbol{v} \label{eq:11-3-3}\end{equation}
Differentiate $f$ with respect to $X_{ij}$. Since both $\boldsymbol{u}$ and $\boldsymbol{v}$ depend on $\boldsymbol{X}$, apply the product rule (1.25).
\begin{equation}\dfrac{\partial f}{\partial X_{ij}} = \dfrac{\partial \boldsymbol{u}^\top}{\partial X_{ij}} \boldsymbol{v} + \boldsymbol{u}^\top \dfrac{\partial \boldsymbol{v}}{\partial X_{ij}} \label{eq:11-3-4}\end{equation}
Since $\displaystyle\dfrac{\partial \boldsymbol{u}^\top}{\partial X_{ij}} = \left(\displaystyle\dfrac{\partial \boldsymbol{u}}{\partial X_{ij}}\right)^\top$, $\eqref{eq:11-3-4}$ becomes
\begin{equation}\dfrac{\partial f}{\partial X_{ij}} = \left(\dfrac{\partial \boldsymbol{u}}{\partial X_{ij}}\right)^\top \boldsymbol{v} + \boldsymbol{u}^\top \dfrac{\partial \boldsymbol{v}}{\partial X_{ij}} \label{eq:11-3-5}\end{equation}
Compute the derivative of $\boldsymbol{u} = \boldsymbol{X}^n \boldsymbol{a}$. From $\eqref{eq:11-1-6}$ in 11.1,
\begin{equation}\dfrac{\partial \boldsymbol{u}}{\partial X_{ij}} = \dfrac{\partial (\boldsymbol{X}^n \boldsymbol{a})}{\partial X_{ij}} = \dfrac{\partial \boldsymbol{X}^n}{\partial X_{ij}} \boldsymbol{a} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \boldsymbol{a} \label{eq:11-3-6}\end{equation}
Similarly, the derivative of $\boldsymbol{v} = \boldsymbol{X}^n \boldsymbol{b}$ is
\begin{equation}\dfrac{\partial \boldsymbol{v}}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \boldsymbol{b} \label{eq:11-3-7}\end{equation}
Compute the first term of $\eqref{eq:11-3-5}$. The transpose of $\eqref{eq:11-3-6}$ is
\begin{equation}\left(\dfrac{\partial \boldsymbol{u}}{\partial X_{ij}}\right)^\top = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{a}^\top (\boldsymbol{X}^{n-1-r})^\top (\boldsymbol{J}^{ij})^\top (\boldsymbol{X}^r)^\top \label{eq:11-3-8}\end{equation}
Note that $(\boldsymbol{J}^{ij})^\top = \boldsymbol{J}^{ji}$: transposition moves the single 1 from position $(i, j)$ to position $(j, i)$.
Compute $\left(\displaystyle\dfrac{\partial \boldsymbol{u}}{\partial X_{ij}}\right)^\top \boldsymbol{v}$.
\begin{equation}\left(\dfrac{\partial \boldsymbol{u}}{\partial X_{ij}}\right)^\top \boldsymbol{v} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{a}^\top (\boldsymbol{X}^{n-1-r})^\top \boldsymbol{J}^{ji} (\boldsymbol{X}^r)^\top \boldsymbol{v} \label{eq:11-3-9}\end{equation}
Substitute $\boldsymbol{v} = \boldsymbol{X}^n \boldsymbol{b}$ and evaluate the action of $\boldsymbol{J}^{ji}$. We have $\boldsymbol{p}^\top \boldsymbol{J}^{ji} \boldsymbol{q} = p_j q_i$.
\begin{equation}\boldsymbol{a}^\top (\boldsymbol{X}^{n-1-r})^\top \boldsymbol{J}^{ji} (\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{b} = ((\boldsymbol{X}^{n-1-r}) \boldsymbol{a})_j ((\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{b})_i \label{eq:11-3-10}\end{equation}
Compute the second term $\boldsymbol{u}^\top \displaystyle\dfrac{\partial \boldsymbol{v}}{\partial X_{ij}}$ similarly.
\begin{equation}\boldsymbol{u}^\top \dfrac{\partial \boldsymbol{v}}{\partial X_{ij}} = \displaystyle\sum_{r=0}^{n-1} \boldsymbol{a}^\top (\boldsymbol{X}^n)^\top \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \boldsymbol{b} \label{eq:11-3-11}\end{equation}
Evaluate the action of $\boldsymbol{J}^{ij}$. We have $\boldsymbol{p}^\top \boldsymbol{J}^{ij} \boldsymbol{q} = p_i q_j$.
\begin{equation}\boldsymbol{a}^\top (\boldsymbol{X}^n)^\top \boldsymbol{X}^r \boldsymbol{J}^{ij} \boldsymbol{X}^{n-1-r} \boldsymbol{b} = ((\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{a})_i ((\boldsymbol{X}^{n-1-r}) \boldsymbol{b})_j \label{eq:11-3-12}\end{equation}
Write $\eqref{eq:11-3-10}$ and $\eqref{eq:11-3-12}$ in matrix form. The $(i, j)$ component of the outer product $\boldsymbol{p} \boldsymbol{q}^\top$ is $p_i q_j$.
From $\eqref{eq:11-3-10}$, the contribution of the first term is
\begin{equation}((\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{b})_i ((\boldsymbol{X}^{n-1-r}) \boldsymbol{a})_j = ((\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{b} \boldsymbol{a}^\top (\boldsymbol{X}^{n-1-r})^\top)_{ij} \label{eq:11-3-13}\end{equation}
From $\eqref{eq:11-3-12}$, the contribution of the second term is
\begin{equation}((\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{a})_i ((\boldsymbol{X}^{n-1-r}) \boldsymbol{b})_j = ((\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{a} \boldsymbol{b}^\top (\boldsymbol{X}^{n-1-r})^\top)_{ij} \label{eq:11-3-14}\end{equation}
We obtain the final result in matrix form.
\begin{equation}\dfrac{\partial}{\partial \boldsymbol{X}} \boldsymbol{a}^\top (\boldsymbol{X}^n)^\top \boldsymbol{X}^n \boldsymbol{b} = \displaystyle\sum_{r=0}^{n-1} \left[ (\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{b} \boldsymbol{a}^\top (\boldsymbol{X}^{n-1-r})^\top + (\boldsymbol{X}^r)^\top \boldsymbol{X}^n \boldsymbol{a} \boldsymbol{b}^\top (\boldsymbol{X}^{n-1-r})^\top \right] \label{eq:11-3-15}\end{equation}
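A minimal numerical check of $\eqref{eq:11-3-15}$, assuming NumPy and random test data; the factor $\boldsymbol{b}\boldsymbol{a}^\top + \boldsymbol{a}\boldsymbol{b}^\top$ in the code simply collects the two terms of the sum:

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 4, 3
X = rng.standard_normal((m, m))
a, b = rng.standard_normal(m), rng.standard_normal(m)
P = lambda k: np.linalg.matrix_power(X, k)

f = lambda Y: a @ np.linalg.matrix_power(Y, n).T @ np.linalg.matrix_power(Y, n) @ b

# Closed form (11-3-15), with b a^T + a b^T factored out of the bracket
Xn = P(n)
grad = sum(P(r).T @ Xn @ (np.outer(b, a) + np.outer(a, b)) @ P(n - 1 - r).T
           for r in range(n))

eps, fd = 1e-6, np.zeros_like(X)
for i in range(m):
    for j in range(m):
        Xp, Xm = X.copy(), X.copy()
        Xp[i, j] += eps; Xm[i, j] -= eps
        fd[i, j] = (f(Xp) - f(Xm)) / (2 * eps)
print(np.max(np.abs(fd - grad)))  # small
```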
11.4 Derivative of Composite Bilinear Form
Proof
Consider the scalar function $f = \boldsymbol{s}^\top \boldsymbol{A} \boldsymbol{r}$, where $\boldsymbol{s}(\boldsymbol{x})$ and $\boldsymbol{r}(\boldsymbol{x})$ are vector functions mapping $\boldsymbol{x} \in \mathbb{R}^n$ to $\mathbb{R}^m$, and $\boldsymbol{A} \in \mathbb{R}^{m \times m}$ is a constant matrix.
Expand $f$ in components.
\begin{equation}f = \boldsymbol{s}^\top \boldsymbol{A} \boldsymbol{r} = \displaystyle\sum_{i=0}^{m-1} \displaystyle\sum_{j=0}^{m-1} s_i A_{ij} r_j \label{eq:11-4-1}\end{equation}
Differentiate $f$ with respect to $x_k$. Since $A_{ij}$ is constant, the product rule (1.25) gives the derivatives of $s_i$ and $r_j$.
\begin{equation}\dfrac{\partial f}{\partial x_k} = \displaystyle\sum_{i,j} \dfrac{\partial s_i}{\partial x_k} A_{ij} r_j + \displaystyle\sum_{i,j} s_i A_{ij} \dfrac{\partial r_j}{\partial x_k} \label{eq:11-4-2}\end{equation}
Compute the first term. First evaluate the sum over $j$.
\begin{equation}\displaystyle\sum_{i,j} \dfrac{\partial s_i}{\partial x_k} A_{ij} r_j = \displaystyle\sum_{i} \dfrac{\partial s_i}{\partial x_k} \displaystyle\sum_{j} A_{ij} r_j = \displaystyle\sum_{i} \dfrac{\partial s_i}{\partial x_k} (\boldsymbol{A} \boldsymbol{r})_i \label{eq:11-4-3}\end{equation}
Write $\eqref{eq:11-4-3}$ as a matrix product. $\displaystyle\dfrac{\partial s_i}{\partial x_k}$ is the $(i, k)$ component of the Jacobian matrix $\displaystyle\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}} \in \mathbb{R}^{m \times n}$.
\begin{equation}\displaystyle\sum_{i} \dfrac{\partial s_i}{\partial x_k} (\boldsymbol{A} \boldsymbol{r})_i = \displaystyle\sum_{i} \left(\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right)_{ik} (\boldsymbol{A} \boldsymbol{r})_i \label{eq:11-4-4}\end{equation}
Since $\left(\displaystyle\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right)_{ik} = \left(\left[\displaystyle\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right]^\top\right)_{ki}$,
\begin{equation}\displaystyle\sum_{i} \left(\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right)_{ik} (\boldsymbol{A} \boldsymbol{r})_i = \displaystyle\sum_{i} \left(\left[\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right]^\top\right)_{ki} (\boldsymbol{A} \boldsymbol{r})_i = \left(\left[\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right]^\top \boldsymbol{A} \boldsymbol{r}\right)_k \label{eq:11-4-5}\end{equation}
Compute the second term. First evaluate the sum over $i$.
\begin{equation}\displaystyle\sum_{i,j} s_i A_{ij} \dfrac{\partial r_j}{\partial x_k} = \displaystyle\sum_{j} \dfrac{\partial r_j}{\partial x_k} \displaystyle\sum_{i} s_i A_{ij} \label{eq:11-4-6}\end{equation}
Substitute $\displaystyle\sum_i s_i A_{ij} = (\boldsymbol{s}^\top \boldsymbol{A})_j = (\boldsymbol{A}^\top \boldsymbol{s})_j$.
\begin{equation}\displaystyle\sum_{j} \dfrac{\partial r_j}{\partial x_k} (\boldsymbol{A}^\top \boldsymbol{s})_j \label{eq:11-4-7}\end{equation}
Following the same procedure as for the first term, write $\eqref{eq:11-4-7}$ as a matrix product.
\begin{equation}\displaystyle\sum_{j} \dfrac{\partial r_j}{\partial x_k} (\boldsymbol{A}^\top \boldsymbol{s})_j = \displaystyle\sum_{j} \left(\left[\dfrac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}}\right]^\top\right)_{kj} (\boldsymbol{A}^\top \boldsymbol{s})_j = \left(\left[\dfrac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}}\right]^\top \boldsymbol{A}^\top \boldsymbol{s}\right)_k \label{eq:11-4-8}\end{equation}
Substitute $\eqref{eq:11-4-5}$ and $\eqref{eq:11-4-8}$ into $\eqref{eq:11-4-2}$.
\begin{equation}\dfrac{\partial f}{\partial x_k} = \left(\left[\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right]^\top \boldsymbol{A} \boldsymbol{r}\right)_k + \left(\left[\dfrac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}}\right]^\top \boldsymbol{A}^\top \boldsymbol{s}\right)_k \label{eq:11-4-9}\end{equation}
Since $\eqref{eq:11-4-9}$ holds for all $k = 0, \ldots, n-1$, we obtain the final result in vector form.
\begin{equation}\dfrac{\partial}{\partial \boldsymbol{x}} \boldsymbol{s}^\top \boldsymbol{A} \boldsymbol{r} = \left[\dfrac{\partial \boldsymbol{s}}{\partial \boldsymbol{x}}\right]^\top \boldsymbol{A} \boldsymbol{r} + \left[\dfrac{\partial \boldsymbol{r}}{\partial \boldsymbol{x}}\right]^\top \boldsymbol{A}^\top \boldsymbol{s} \label{eq:11-4-10}\end{equation}
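A minimal numerical check of $\eqref{eq:11-4-10}$, assuming NumPy. The concrete choices $\boldsymbol{s}(\boldsymbol{x}) = \boldsymbol{B}\boldsymbol{x}$ and $\boldsymbol{r}(\boldsymbol{x}) = \tanh(\boldsymbol{C}\boldsymbol{x})$, whose Jacobians are $\boldsymbol{B}$ and $\mathrm{diag}(1 - \boldsymbol{r}^2)\boldsymbol{C}$, are illustrative and not part of the statement:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 3, 5
A = rng.standard_normal((m, m))
B = rng.standard_normal((m, n))
C = rng.standard_normal((m, n))
x = rng.standard_normal(n)

s = lambda x: B @ x
r = lambda x: np.tanh(C @ x)
f = lambda x: s(x) @ A @ r(x)

Js = B                                      # Jacobian of s, (i,k) entry ds_i/dx_k
Jr = np.diag(1 - np.tanh(C @ x) ** 2) @ C   # Jacobian of r
grad = Js.T @ A @ r(x) + Jr.T @ A.T @ s(x)  # eq. (11-4-10)

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.max(np.abs(fd - grad)))  # small
```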
11.5 Derivative of the Rayleigh Quotient
Proof
Consider the Rayleigh quotient $f = \displaystyle\dfrac{\boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}}{\boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}}$. Define the numerator and denominator separately.
\begin{equation}u = \boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x} \label{eq:11-5-1}\end{equation}
\begin{equation}v = \boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x} \label{eq:11-5-2}\end{equation}
Then $f = u / v$. Since both $u$ and $v$ are functions of $\boldsymbol{x}$, apply the quotient rule (1.28).
\begin{equation}\dfrac{\partial f}{\partial \boldsymbol{x}} = \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{u}{v} \right) = \dfrac{1}{v} \dfrac{\partial u}{\partial \boldsymbol{x}} - \dfrac{u}{v^2} \dfrac{\partial v}{\partial \boldsymbol{x}} \label{eq:11-5-3}\end{equation}
Compute the derivative of $u = \boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}$. Let $\boldsymbol{C} = \boldsymbol{A}^\top \boldsymbol{A}$; then $\boldsymbol{C}$ is symmetric ($\boldsymbol{C}^\top = (\boldsymbol{A}^\top \boldsymbol{A})^\top = \boldsymbol{A}^\top \boldsymbol{A} = \boldsymbol{C}$).
By 10.6, $\displaystyle\dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{C} \boldsymbol{x}) = (\boldsymbol{C} + \boldsymbol{C}^\top) \boldsymbol{x}$. Since $\boldsymbol{C}$ is symmetric,
\begin{equation}\dfrac{\partial u}{\partial \boldsymbol{x}} = (\boldsymbol{A}^\top \boldsymbol{A} + \boldsymbol{A}^\top \boldsymbol{A}) \boldsymbol{x} = 2 \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x} \label{eq:11-5-4}\end{equation}
Similarly, compute the derivative of $v = \boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}$. $\boldsymbol{D} = \boldsymbol{B}^\top \boldsymbol{B}$ is also symmetric.
\begin{equation}\dfrac{\partial v}{\partial \boldsymbol{x}} = 2 \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x} \label{eq:11-5-5}\end{equation}
Substitute $\eqref{eq:11-5-4}$ and $\eqref{eq:11-5-5}$ into $\eqref{eq:11-5-3}$.
\begin{equation}\dfrac{\partial f}{\partial \boldsymbol{x}} = \dfrac{1}{v} \cdot 2 \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x} - \dfrac{u}{v^2} \cdot 2 \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x} \label{eq:11-5-6}\end{equation}
Simplify $\eqref{eq:11-5-6}$.
\begin{equation}\dfrac{\partial f}{\partial \boldsymbol{x}} = \dfrac{2 \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}}{v} - \dfrac{2u \cdot \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}}{v^2} \label{eq:11-5-7}\end{equation}
Substitute the definitions of $u$ and $v$ to obtain the final result.
\begin{equation}\dfrac{\partial}{\partial \boldsymbol{x}} \dfrac{\boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}}{\boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}} = \dfrac{2 \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}}{\boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}} - \dfrac{2 (\boldsymbol{x}^\top \boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{x}) \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x}}{(\boldsymbol{x}^\top \boldsymbol{B}^\top \boldsymbol{B} \boldsymbol{x})^2} \label{eq:11-5-8}\end{equation}
Source: J.W. Strutt (Lord Rayleigh) (1877) "The Theory of Sound", Vol.1. The variational characterization of eigenvalues is known as the Rayleigh-Ritz method.
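A minimal numerical check of $\eqref{eq:11-5-8}$, assuming NumPy and random test data (the shift $3\boldsymbol{I}$ added to $\boldsymbol{B}$ merely keeps the denominator well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, n)) + 3 * np.eye(n)
x = rng.standard_normal(n)

f = lambda x: (x @ A.T @ A @ x) / (x @ B.T @ B @ x)
u, v = x @ A.T @ A @ x, x @ B.T @ B @ x
grad = 2 * A.T @ A @ x / v - 2 * u * (B.T @ B @ x) / v ** 2   # eq. (11-5-8)

eps = 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(n)])
print(np.max(np.abs(fd - grad)))  # small
```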
11.6a/b Gradient and Hessian of Quadratic Form
Gradient: $\nabla_{\boldsymbol{x}} f = (\boldsymbol{A} + \boldsymbol{A}^\top)\boldsymbol{x} + \boldsymbol{b}$
Hessian: $\boldsymbol{H} = \boldsymbol{A} + \boldsymbol{A}^\top$
Proof
Consider the scalar function $f(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} + \boldsymbol{b}^\top \boldsymbol{x}$.
[Derivation of the Gradient]
Decompose $f$ into two terms.
\begin{equation}f = f_1 + f_2, \quad f_1 = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}, \quad f_2 = \boldsymbol{b}^\top \boldsymbol{x} \label{eq:11-6-1}\end{equation}
Compute the gradient of $f_1 = \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}$. By 10.6,
\begin{equation}\nabla_{\boldsymbol{x}} f_1 = (\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x} \label{eq:11-6-2}\end{equation}
Compute the gradient of $f_2 = \boldsymbol{b}^\top \boldsymbol{x}$. Write $f_2$ in components:
\begin{equation}f_2 = \displaystyle\sum_{i=0}^{n-1} b_i x_i \label{eq:11-6-3}\end{equation}
Differentiate $f_2$ with respect to $x_k$. The $b_i$ are constants.
\begin{equation}\dfrac{\partial f_2}{\partial x_k} = \displaystyle\sum_{i=0}^{n-1} b_i \dfrac{\partial x_i}{\partial x_k} = \displaystyle\sum_{i=0}^{n-1} b_i \delta_{ik} = b_k \label{eq:11-6-4}\end{equation}
Since $\eqref{eq:11-6-4}$ holds for all $k$, in vector form:
\begin{equation}\nabla_{\boldsymbol{x}} f_2 = \boldsymbol{b} \label{eq:11-6-5}\end{equation}
Combining $\eqref{eq:11-6-2}$ and $\eqref{eq:11-6-5}$, we obtain the gradient of $f$.
\begin{equation}\nabla_{\boldsymbol{x}} f = \nabla_{\boldsymbol{x}} f_1 + \nabla_{\boldsymbol{x}} f_2 = (\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x} + \boldsymbol{b} \label{eq:11-6-6}\end{equation}
[Derivation of the Hessian]
The Hessian matrix $\boldsymbol{H}$ is the matrix obtained by differentiating the gradient once more with respect to $\boldsymbol{x}$.
\begin{equation}H_{kl} = \dfrac{\partial^2 f}{\partial x_l \partial x_k} = \dfrac{\partial}{\partial x_l} \left( \dfrac{\partial f}{\partial x_k} \right) \label{eq:11-6-7}\end{equation}
From $\eqref{eq:11-6-6}$, the $k$-th component of the gradient is
\begin{equation}\left( \nabla_{\boldsymbol{x}} f \right)_k = \displaystyle\sum_{j=0}^{n-1} (A_{kj} + A_{jk}) x_j + b_k \label{eq:11-6-8}\end{equation}
Differentiate $\eqref{eq:11-6-8}$ with respect to $x_l$. The quantities $A_{kj}$, $A_{jk}$, and $b_k$ are all constants.
\begin{equation}\dfrac{\partial}{\partial x_l} \left( \nabla_{\boldsymbol{x}} f \right)_k = \displaystyle\sum_{j=0}^{n-1} (A_{kj} + A_{jk}) \dfrac{\partial x_j}{\partial x_l} = \displaystyle\sum_{j=0}^{n-1} (A_{kj} + A_{jk}) \delta_{jl} \label{eq:11-6-9}\end{equation}
The Kronecker delta $\delta_{jl}$ selects only the $j = l$ term.
\begin{equation}H_{kl} = A_{kl} + A_{lk} \label{eq:11-6-10}\end{equation}
Write $\eqref{eq:11-6-10}$ in matrix form.
\begin{equation}\boldsymbol{H} = \boldsymbol{A} + \boldsymbol{A}^\top \label{eq:11-6-11}\end{equation}
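Both results can be checked numerically at once. A minimal sketch assuming NumPy; for a quadratic $f$, second-order central differences reproduce the Hessian up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)
x = rng.standard_normal(n)
I = np.eye(n)

f = lambda x: x @ A @ x + b @ x
grad = (A + A.T) @ x + b   # eq. (11-6-6)
H = A + A.T                # eq. (11-6-11)

eps = 1e-6
fd_grad = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in I])
print(np.max(np.abs(fd_grad - grad)))  # small

h = 1e-4  # second differences need a larger step for numerical stability
fd_H = np.array([[(f(x + h * I[k] + h * I[l]) - f(x + h * I[k] - h * I[l])
                   - f(x - h * I[k] + h * I[l]) + f(x - h * I[k] - h * I[l]))
                  / (4 * h * h) for l in range(n)] for k in range(n)])
print(np.max(np.abs(fd_H - H)))  # small
```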
Derivatives of the Matrix Exponential
We derive the Frechet derivative of the matrix exponential $e^{\boldsymbol{A}}$.
11.7 Frechet Derivative of the Matrix Exponential
Proof
The Frechet derivative $D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}]$ is the directional derivative in the direction $\boldsymbol{E}$; since the matrix exponential is Frechet differentiable, it is given by the limit
\begin{equation}D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] = \lim_{t \to 0} \dfrac{e^{\boldsymbol{A} + t\boldsymbol{E}} - e^{\boldsymbol{A}}}{t} \label{eq:11-7-1}\end{equation}
Define an auxiliary matrix function $\boldsymbol{F}(t)$.
\begin{equation}\boldsymbol{F}(t) = e^{-t\boldsymbol{A}} e^{t(\boldsymbol{A} + \boldsymbol{E})} \label{eq:11-7-2}\end{equation}
Verify the value of $\boldsymbol{F}$ at $t = 0$.
\begin{equation}\boldsymbol{F}(0) = e^{\boldsymbol{O}} e^{\boldsymbol{O}} = \boldsymbol{I} \cdot \boldsymbol{I} = \boldsymbol{I} \label{eq:11-7-3}\end{equation}
Differentiate $\boldsymbol{F}(t)$ with respect to $t$. Apply the product rule (1.25).
\begin{equation}\dfrac{d\boldsymbol{F}}{dt} = \dfrac{d}{dt}\left(e^{-t\boldsymbol{A}}\right) e^{t(\boldsymbol{A} + \boldsymbol{E})} + e^{-t\boldsymbol{A}} \dfrac{d}{dt}\left(e^{t(\boldsymbol{A} + \boldsymbol{E})}\right) \label{eq:11-7-4}\end{equation}
By the scalar derivative of the matrix exponential (proved in 11.8), $\displaystyle\dfrac{d}{dt} e^{t\boldsymbol{M}} = \boldsymbol{M} e^{t\boldsymbol{M}}$.
\begin{equation}\dfrac{d}{dt}\left(e^{-t\boldsymbol{A}}\right) = -\boldsymbol{A} e^{-t\boldsymbol{A}} \label{eq:11-7-5}\end{equation}
\begin{equation}\dfrac{d}{dt}\left(e^{t(\boldsymbol{A} + \boldsymbol{E})}\right) = (\boldsymbol{A} + \boldsymbol{E}) e^{t(\boldsymbol{A} + \boldsymbol{E})} \label{eq:11-7-6}\end{equation}
Substitute $\eqref{eq:11-7-5}$ and $\eqref{eq:11-7-6}$ into $\eqref{eq:11-7-4}$.
\begin{equation}\dfrac{d\boldsymbol{F}}{dt} = -\boldsymbol{A} e^{-t\boldsymbol{A}} e^{t(\boldsymbol{A} + \boldsymbol{E})} + e^{-t\boldsymbol{A}} (\boldsymbol{A} + \boldsymbol{E}) e^{t(\boldsymbol{A} + \boldsymbol{E})} \label{eq:11-7-7}\end{equation}
Rearrange the first and second terms of $\eqref{eq:11-7-7}$. Using $\boldsymbol{F}(t) = e^{-t\boldsymbol{A}} e^{t(\boldsymbol{A} + \boldsymbol{E})}$,
\begin{equation}\dfrac{d\boldsymbol{F}}{dt} = -\boldsymbol{A} \boldsymbol{F}(t) + e^{-t\boldsymbol{A}} (\boldsymbol{A} + \boldsymbol{E}) e^{t(\boldsymbol{A} + \boldsymbol{E})} \label{eq:11-7-8}\end{equation}
Expand the second term into $e^{-t\boldsymbol{A}} \boldsymbol{A} e^{t(\boldsymbol{A} + \boldsymbol{E})}$ and $e^{-t\boldsymbol{A}} \boldsymbol{E} e^{t(\boldsymbol{A} + \boldsymbol{E})}$.
\begin{equation}\dfrac{d\boldsymbol{F}}{dt} = -\boldsymbol{A} \boldsymbol{F}(t) + e^{-t\boldsymbol{A}} \boldsymbol{A} e^{t(\boldsymbol{A} + \boldsymbol{E})} + e^{-t\boldsymbol{A}} \boldsymbol{E} e^{t(\boldsymbol{A} + \boldsymbol{E})} \label{eq:11-7-9}\end{equation}
Simplify the sum of the first and second terms of $\eqref{eq:11-7-9}$. Substituting $\boldsymbol{F}(t) = e^{-t\boldsymbol{A}} e^{t(\boldsymbol{A}+\boldsymbol{E})}$ into $-\boldsymbol{A}\boldsymbol{F}(t) + e^{-t\boldsymbol{A}}\boldsymbol{A}e^{t(\boldsymbol{A}+\boldsymbol{E})}$ gives $-\boldsymbol{A}e^{-t\boldsymbol{A}}e^{t(\boldsymbol{A}+\boldsymbol{E})} + e^{-t\boldsymbol{A}}\boldsymbol{A}e^{t(\boldsymbol{A}+\boldsymbol{E})}$. Since $\boldsymbol{A}$ and $e^{-t\boldsymbol{A}}$ commute (proved in 11.8, $\eqref{eq:11-8-15}$), $\boldsymbol{A}e^{-t\boldsymbol{A}} = e^{-t\boldsymbol{A}}\boldsymbol{A}$, so these two terms cancel exactly. Therefore,
\begin{equation}\dfrac{d\boldsymbol{F}}{dt} = e^{-t\boldsymbol{A}} \boldsymbol{E} e^{t(\boldsymbol{A} + \boldsymbol{E})} \label{eq:11-7-10}\end{equation}
Compute $\boldsymbol{F}(1) - \boldsymbol{F}(0)$ using the fundamental theorem of calculus.
\begin{equation}\boldsymbol{F}(1) - \boldsymbol{I} = \displaystyle\int_0^1 \dfrac{d\boldsymbol{F}}{dt} dt \label{eq:11-7-11}\end{equation}
From $\eqref{eq:11-7-2}$, $\boldsymbol{F}(1) = e^{-\boldsymbol{A}} e^{\boldsymbol{A} + \boldsymbol{E}}$. Multiplying both sides on the left by $e^{\boldsymbol{A}}$,
\begin{equation}e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}} = e^{\boldsymbol{A}} \displaystyle\int_0^1 \dfrac{d\boldsymbol{F}}{dt} dt \label{eq:11-7-12}\end{equation}
Extract the first-order term in $\boldsymbol{E}$. Substitute $\eqref{eq:11-7-10}$ (with integration variable $s$) into $\eqref{eq:11-7-12}$; using the first-order approximation $e^{s(\boldsymbol{A} + \boldsymbol{E})} \approx e^{s\boldsymbol{A}}$, the integrand becomes $e^{\boldsymbol{A}} e^{-s\boldsymbol{A}} \boldsymbol{E}\, e^{s\boldsymbol{A}} = e^{(1-s)\boldsymbol{A}} \boldsymbol{E}\, e^{s\boldsymbol{A}}$. The change of variables $s \to 1 - s$ then gives
\begin{equation}D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] = \displaystyle\int_0^1 e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}} ds \label{eq:11-7-13}\end{equation}
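A numerical sanity check of $\eqref{eq:11-7-13}$, assuming NumPy and SciPy's `expm`. The integral is approximated by a midpoint rule (the node count $K = 200$ is an arbitrary choice), so agreement holds only up to quadrature and finite-difference error:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(6)
n = 4
A = rng.standard_normal((n, n)) / n   # modest norm keeps the quadrature accurate
E = rng.standard_normal((n, n))

# Midpoint rule for the integral in eq. (11-7-13)
K = 200
ss = (np.arange(K) + 0.5) / K
integral = sum(expm(s * A) @ E @ expm((1 - s) * A) for s in ss) / K

h = 1e-6
fd = (expm(A + h * E) - expm(A - h * E)) / (2 * h)
print(np.max(np.abs(fd - integral)))  # small (quadrature-limited)
```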
11.8 Scalar Parameter Derivative of the Matrix Exponential
Proof
Recall the definition of the matrix exponential $e^{t\boldsymbol{A}}$. The matrix exponential is defined by a power series.
\begin{equation}e^{t\boldsymbol{A}} = \displaystyle\sum_{k=0}^{\infty} \dfrac{(t\boldsymbol{A})^k}{k!} \label{eq:11-8-1}\end{equation}
Expand $(t\boldsymbol{A})^k$ in $\eqref{eq:11-8-1}$. Since $t$ is a scalar, it can be factored out of the matrix power.
\begin{equation}(t\boldsymbol{A})^k = t^k \boldsymbol{A}^k \label{eq:11-8-2}\end{equation}
Substitute $\eqref{eq:11-8-2}$ into $\eqref{eq:11-8-1}$.
\begin{equation}e^{t\boldsymbol{A}} = \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A}^k}{k!} \label{eq:11-8-3}\end{equation}
Differentiate $\eqref{eq:11-8-3}$ with respect to $t$. Term-by-term differentiation is permitted because the power series converges uniformly on compact intervals of $t$.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \dfrac{\partial}{\partial t} \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A}^k}{k!} = \displaystyle\sum_{k=0}^{\infty} \dfrac{\boldsymbol{A}^k}{k!} \dfrac{\partial}{\partial t} t^k \label{eq:11-8-4}\end{equation}
Differentiate $t^k$ with respect to $t$. When $k = 0$, $t^0 = 1$ so the derivative is 0.
\begin{equation}\dfrac{\partial}{\partial t} t^k = \begin{cases} 0 & (k = 0) \\ k t^{k-1} & (k \geq 1) \end{cases} \label{eq:11-8-5}\end{equation}
Substitute $\eqref{eq:11-8-5}$ into $\eqref{eq:11-8-4}$. The $k = 0$ term vanishes, so the sum starts from $k = 1$.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \displaystyle\sum_{k=1}^{\infty} \dfrac{\boldsymbol{A}^k}{k!} \cdot k t^{k-1} \label{eq:11-8-6}\end{equation}
Simplify $\displaystyle\dfrac{k}{k!}$ in $\eqref{eq:11-8-6}$. Since $k! = k \cdot (k-1)!$,
\begin{equation}\dfrac{k}{k!} = \dfrac{k}{k \cdot (k-1)!} = \dfrac{1}{(k-1)!} \label{eq:11-8-7}\end{equation}
Substitute $\eqref{eq:11-8-7}$ into $\eqref{eq:11-8-6}$.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \displaystyle\sum_{k=1}^{\infty} \dfrac{t^{k-1} \boldsymbol{A}^k}{(k-1)!} \label{eq:11-8-8}\end{equation}
Factor $\boldsymbol{A}^k = \boldsymbol{A} \cdot \boldsymbol{A}^{k-1}$.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \displaystyle\sum_{k=1}^{\infty} \dfrac{t^{k-1} \boldsymbol{A} \boldsymbol{A}^{k-1}}{(k-1)!} \label{eq:11-8-9}\end{equation}
Since $\boldsymbol{A}$ does not depend on $k$, factor it out of the sum.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \boldsymbol{A} \displaystyle\sum_{k=1}^{\infty} \dfrac{t^{k-1} \boldsymbol{A}^{k-1}}{(k-1)!} \label{eq:11-8-10}\end{equation}
Re-index with $j = k - 1$. When $k = 1$, $j = 0$; as $k \to \infty$, $j \to \infty$.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \boldsymbol{A} \displaystyle\sum_{j=0}^{\infty} \dfrac{t^{j} \boldsymbol{A}^{j}}{j!} \label{eq:11-8-11}\end{equation}
By $\eqref{eq:11-8-3}$, $\displaystyle\sum_{j=0}^{\infty} \displaystyle\dfrac{t^{j} \boldsymbol{A}^{j}}{j!} = e^{t\boldsymbol{A}}$.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \boldsymbol{A} e^{t\boldsymbol{A}} \label{eq:11-8-12}\end{equation}
[Proof that $\boldsymbol{A}$ and $e^{t\boldsymbol{A}}$ commute]
We show that $\boldsymbol{A}$ and $e^{t\boldsymbol{A}}$ commute. Using the power series expansion of $e^{t\boldsymbol{A}}$,
\begin{equation}\boldsymbol{A} e^{t\boldsymbol{A}} = \boldsymbol{A} \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A}^k}{k!} = \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A} \boldsymbol{A}^k}{k!} = \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A}^{k+1}}{k!} \label{eq:11-8-13}\end{equation}
Similarly, compute $e^{t\boldsymbol{A}} \boldsymbol{A}$.
\begin{equation}e^{t\boldsymbol{A}} \boldsymbol{A} = \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A}^k}{k!} \boldsymbol{A} = \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A}^k \boldsymbol{A}}{k!} = \displaystyle\sum_{k=0}^{\infty} \dfrac{t^k \boldsymbol{A}^{k+1}}{k!} \label{eq:11-8-14}\end{equation}
Since $\eqref{eq:11-8-13}$ and $\eqref{eq:11-8-14}$ are equal, $\boldsymbol{A}$ and $e^{t\boldsymbol{A}}$ commute.
\begin{equation}\boldsymbol{A} e^{t\boldsymbol{A}} = e^{t\boldsymbol{A}} \boldsymbol{A} \label{eq:11-8-15}\end{equation}
Combining $\eqref{eq:11-8-12}$ and $\eqref{eq:11-8-15}$, we obtain the final result.
\begin{equation}\dfrac{\partial}{\partial t} e^{t\boldsymbol{A}} = \boldsymbol{A} e^{t\boldsymbol{A}} = e^{t\boldsymbol{A}} \boldsymbol{A} \label{eq:11-8-16}\end{equation}
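A minimal numerical check of $\eqref{eq:11-8-16}$, assuming NumPy and SciPy's `expm`; the value $t = 0.7$ and the test matrix are arbitrary:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(7)
A = rng.standard_normal((4, 4))
t, h = 0.7, 1e-6

fd = (expm((t + h) * A) - expm((t - h) * A)) / (2 * h)
print(np.max(np.abs(fd - A @ expm(t * A))))               # eq. (11-8-12), small
print(np.max(np.abs(A @ expm(t * A) - expm(t * A) @ A)))  # eq. (11-8-15), ~0
```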
11.9 Trace Derivative of the Matrix Exponential
Proof
Consider the scalar function $f = \text{tr}(e^{\boldsymbol{A}})$. We differentiate $f$ with respect to the matrix $\boldsymbol{A}$.
By 11.7, the Frechet derivative of the matrix exponential is given by
\begin{equation}D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] = \displaystyle\int_0^1 e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}} ds \label{eq:11-9-1}\end{equation}
where $\boldsymbol{E}$ is the matrix representing the direction of perturbation.
Compute the directional derivative of $f = \text{tr}(e^{\boldsymbol{A}})$. Since the trace is a linear operation, it commutes with the Frechet derivative.
\begin{equation}D_{\boldsymbol{A}} f[\boldsymbol{E}] = D_{\boldsymbol{A}} \text{tr}(e^{\boldsymbol{A}})[\boldsymbol{E}] = \text{tr}(D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}]) \label{eq:11-9-2}\end{equation}
Substitute $\eqref{eq:11-9-1}$ into $\eqref{eq:11-9-2}$.
\begin{equation}D_{\boldsymbol{A}} f[\boldsymbol{E}] = \text{tr}\left(\displaystyle\int_0^1 e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}} ds\right) \label{eq:11-9-3}\end{equation}
The trace and integral can be exchanged (by linearity).
\begin{equation}D_{\boldsymbol{A}} f[\boldsymbol{E}] = \displaystyle\int_0^1 \text{tr}\left(e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}}\right) ds \label{eq:11-9-4}\end{equation}
Apply the cyclic property of the trace (1.12).
\begin{equation}\text{tr}\left(e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}}\right) = \text{tr}\left(e^{(1-s)\boldsymbol{A}} e^{s\boldsymbol{A}} \boldsymbol{E}\right) \label{eq:11-9-5}\end{equation}
Compute $e^{(1-s)\boldsymbol{A}} e^{s\boldsymbol{A}}$. By the additive property of the matrix exponential (valid since $\boldsymbol{A}$ commutes with itself),
\begin{equation}e^{(1-s)\boldsymbol{A}} e^{s\boldsymbol{A}} = e^{(1-s)\boldsymbol{A} + s\boldsymbol{A}} = e^{\boldsymbol{A}} \label{eq:11-9-6}\end{equation}
Substitute $\eqref{eq:11-9-6}$ into $\eqref{eq:11-9-5}$.
\begin{equation}\text{tr}\left(e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}}\right) = \text{tr}\left(e^{\boldsymbol{A}} \boldsymbol{E}\right) \label{eq:11-9-7}\end{equation}
Note that the right-hand side of $\eqref{eq:11-9-7}$ does not depend on $s$. Substitute into $\eqref{eq:11-9-4}$.
\begin{equation}D_{\boldsymbol{A}} f[\boldsymbol{E}] = \displaystyle\int_0^1 \text{tr}\left(e^{\boldsymbol{A}} \boldsymbol{E}\right) ds \label{eq:11-9-8}\end{equation}
Since the integrand does not depend on $s$, evaluate the integral.
\begin{equation}D_{\boldsymbol{A}} f[\boldsymbol{E}] = \text{tr}\left(e^{\boldsymbol{A}} \boldsymbol{E}\right) \cdot \displaystyle\int_0^1 ds = \text{tr}\left(e^{\boldsymbol{A}} \boldsymbol{E}\right) \cdot 1 \label{eq:11-9-9}\end{equation}
Therefore,
\begin{equation}D_{\boldsymbol{A}} f[\boldsymbol{E}] = \text{tr}\left(e^{\boldsymbol{A}} \boldsymbol{E}\right) \label{eq:11-9-10}\end{equation}
Use the relationship between directional derivative and gradient to find the gradient matrix $\boldsymbol{G} = \displaystyle\dfrac{\partial f}{\partial \boldsymbol{A}}$. In general, the directional derivative of a scalar function $f$ is expressed as an inner product with the gradient.
\begin{equation}D_{\boldsymbol{A}} f[\boldsymbol{E}] = \text{tr}\left(\boldsymbol{G}^\top \boldsymbol{E}\right) \label{eq:11-9-11}\end{equation}
Compare $\eqref{eq:11-9-10}$ and $\eqref{eq:11-9-11}$.
\begin{equation}\text{tr}\left(e^{\boldsymbol{A}} \boldsymbol{E}\right) = \text{tr}\left(\boldsymbol{G}^\top \boldsymbol{E}\right) \label{eq:11-9-12}\end{equation}
For $\eqref{eq:11-9-12}$ to hold for all $\boldsymbol{E}$, the following equality is necessary.
\begin{equation}\boldsymbol{G}^\top = e^{\boldsymbol{A}} \label{eq:11-9-13}\end{equation}
Transpose both sides of $\eqref{eq:11-9-13}$.
\begin{equation}\boldsymbol{G} = (e^{\boldsymbol{A}})^\top \label{eq:11-9-14}\end{equation}
Since $\boldsymbol{G} = \displaystyle\dfrac{\partial f}{\partial \boldsymbol{A}}$, we obtain the final result.
\begin{equation}\dfrac{\partial}{\partial \boldsymbol{A}} \text{tr}(e^{\boldsymbol{A}}) = (e^{\boldsymbol{A}})^\top \label{eq:11-9-15}\end{equation}
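A minimal numerical check of $\eqref{eq:11-9-15}$, assuming NumPy and SciPy's `expm`, comparing the gradient entrywise against central differences:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(8)
n = 4
A = rng.standard_normal((n, n))

f = lambda A: np.trace(expm(A))
grad = expm(A).T   # eq. (11-9-15)

eps, fd = 1e-6, np.zeros((n, n))
for i in range(n):
    for j in range(n):
        P = np.zeros((n, n)); P[i, j] = eps
        fd[i, j] = (f(A + P) - f(A - P)) / (2 * eps)
print(np.max(np.abs(fd - grad)))  # small
```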
11.10 Condition Number of the Matrix Exponential
Proof
We define the condition number of the matrix exponential $e^{\boldsymbol{A}}$ and clarify its meaning.
[Definition of the Condition Number]
In numerical analysis, the condition number of a function $f$ measures how much relative error in the input is amplified in the relative error of the output.
\begin{equation}\kappa(f) = \lim_{\epsilon \to 0} \sup_{\|\delta \boldsymbol{x}\| \leq \epsilon \|\boldsymbol{x}\|} \dfrac{\|f(\boldsymbol{x} + \delta \boldsymbol{x}) - f(\boldsymbol{x})\| / \|f(\boldsymbol{x})\|}{\|\delta \boldsymbol{x}\| / \|\boldsymbol{x}\|} \label{eq:11-10-1}\end{equation}
For $f(\boldsymbol{A}) = e^{\boldsymbol{A}}$, let the input perturbation be $\boldsymbol{E}$.
\begin{equation}\boldsymbol{A} \to \boldsymbol{A} + \boldsymbol{E} \label{eq:11-10-2}\end{equation}
[Evaluating the Variation via First-Order Approximation]
Expand $e^{\boldsymbol{A} + \boldsymbol{E}}$ to first order in $\boldsymbol{E}$. Use the Frechet derivative from 11.7.
\begin{equation}e^{\boldsymbol{A} + \boldsymbol{E}} = e^{\boldsymbol{A}} + D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] + O(\|\boldsymbol{E}\|^2) \label{eq:11-10-3}\end{equation}
Rearrange $\eqref{eq:11-10-3}$ to express the output variation.
\begin{equation}e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}} = D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] + O(\|\boldsymbol{E}\|^2) \label{eq:11-10-4}\end{equation}
When $\|\boldsymbol{E}\|$ is sufficiently small, the higher-order terms can be neglected.
\begin{equation}e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}} \approx D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] \label{eq:11-10-5}\end{equation}
Write the Frechet derivative as a linear operator $L(\boldsymbol{A}, \cdot)$.
\begin{equation}L(\boldsymbol{A}, \boldsymbol{E}) = D_{\boldsymbol{A}} e^{\boldsymbol{A}}[\boldsymbol{E}] = \displaystyle\int_0^1 e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}} ds \label{eq:11-10-6}\end{equation}
[Norm Estimate of the Output Variation]
Take the norm of both sides of $\eqref{eq:11-10-5}$.
\begin{equation}\|e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}}\| \approx \|L(\boldsymbol{A}, \boldsymbol{E})\| \label{eq:11-10-7}\end{equation}
Define the operator norm of the linear operator $L(\boldsymbol{A}, \cdot)$.
\begin{equation}\|L(\boldsymbol{A}, \cdot)\| = \sup_{\boldsymbol{E} \neq \boldsymbol{O}} \dfrac{\|L(\boldsymbol{A}, \boldsymbol{E})\|}{\|\boldsymbol{E}\|} = \sup_{\|\boldsymbol{E}\| = 1} \|L(\boldsymbol{A}, \boldsymbol{E})\| \label{eq:11-10-8}\end{equation}
Using $\eqref{eq:11-10-8}$, the following inequality holds for any $\boldsymbol{E}$.
\begin{equation}\|L(\boldsymbol{A}, \boldsymbol{E})\| \leq \|L(\boldsymbol{A}, \cdot)\| \cdot \|\boldsymbol{E}\| \label{eq:11-10-9}\end{equation}
[Relative Error Estimate]
Combine $\eqref{eq:11-10-7}$ and $\eqref{eq:11-10-9}$.
\begin{equation}\|e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}}\| \lesssim \|L(\boldsymbol{A}, \cdot)\| \cdot \|\boldsymbol{E}\| \label{eq:11-10-10}\end{equation}
Divide both sides by $\|e^{\boldsymbol{A}}\|$ to obtain the relative error of the output.
\begin{equation}\dfrac{\|e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}}\|}{\|e^{\boldsymbol{A}}\|} \lesssim \dfrac{\|L(\boldsymbol{A}, \cdot)\|}{\|e^{\boldsymbol{A}}\|} \cdot \|\boldsymbol{E}\| \label{eq:11-10-11}\end{equation}
Introduce the relative error of the input $\displaystyle\dfrac{\|\boldsymbol{E}\|}{\|\boldsymbol{A}\|}$. Substitute $\|\boldsymbol{E}\| = \|\boldsymbol{A}\| \cdot \displaystyle\dfrac{\|\boldsymbol{E}\|}{\|\boldsymbol{A}\|}$.
\begin{equation}\dfrac{\|e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}}\|}{\|e^{\boldsymbol{A}}\|} \lesssim \dfrac{\|\boldsymbol{A}\| \cdot \|L(\boldsymbol{A}, \cdot)\|}{\|e^{\boldsymbol{A}}\|} \cdot \dfrac{\|\boldsymbol{E}\|}{\|\boldsymbol{A}\|} \label{eq:11-10-12}\end{equation}
[Relative Condition Number of the Matrix Exponential]
From $\eqref{eq:11-10-12}$, define the relative condition number $\kappa(e^{\boldsymbol{A}})$ as follows.
\begin{equation}\kappa(e^{\boldsymbol{A}}) = \dfrac{\|\boldsymbol{A}\| \cdot \|L(\boldsymbol{A}, \cdot)\|}{\|e^{\boldsymbol{A}}\|} \label{eq:11-10-13}\end{equation}
Using $\eqref{eq:11-10-13}$, $\eqref{eq:11-10-12}$ can be written as
\begin{equation}\dfrac{\|e^{\boldsymbol{A} + \boldsymbol{E}} - e^{\boldsymbol{A}}\|}{\|e^{\boldsymbol{A}}\|} \lesssim \kappa(e^{\boldsymbol{A}}) \cdot \dfrac{\|\boldsymbol{E}\|}{\|\boldsymbol{A}\|} \label{eq:11-10-14}\end{equation}
$\eqref{eq:11-10-14}$ shows that the relative error of the input is amplified by at most $\kappa(e^{\boldsymbol{A}})$ in the relative error of the output.
[Upper Bound of the Operator Norm]
Estimate an upper bound for $\|L(\boldsymbol{A}, \cdot)\|$. From $\eqref{eq:11-10-6}$,
\begin{equation}\|L(\boldsymbol{A}, \boldsymbol{E})\| = \left\|\displaystyle\int_0^1 e^{s\boldsymbol{A}} \boldsymbol{E}\, e^{(1-s)\boldsymbol{A}} ds\right\| \leq \displaystyle\int_0^1 \|e^{s\boldsymbol{A}}\| \|\boldsymbol{E}\| \|e^{(1-s)\boldsymbol{A}}\| ds \label{eq:11-10-15}\end{equation}
The matrix exponential norm satisfies the upper bound $\|e^{t\boldsymbol{A}}\| \leq e^{t\|\boldsymbol{A}\|}$ for $t \geq 0$ (apply the triangle inequality and submultiplicativity to the power series). Applying this,
\begin{equation}\|L(\boldsymbol{A}, \boldsymbol{E})\| \leq \|\boldsymbol{E}\| \displaystyle\int_0^1 e^{s\|\boldsymbol{A}\|} e^{(1-s)\|\boldsymbol{A}\|} ds = \|\boldsymbol{E}\| \displaystyle\int_0^1 e^{\|\boldsymbol{A}\|} ds = e^{\|\boldsymbol{A}\|} \|\boldsymbol{E}\| \label{eq:11-10-16}\end{equation}
From $\eqref{eq:11-10-16}$, we obtain $\|L(\boldsymbol{A}, \cdot)\| \leq e^{\|\boldsymbol{A}\|}$.
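The bound can be probed empirically. A minimal sketch assuming NumPy and SciPy, using the spectral norm (which is submultiplicative) and approximating $L(\boldsymbol{A}, \boldsymbol{E})$ by a central finite difference:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(9)
n = 4
A = rng.standard_normal((n, n))
h = 1e-6

for _ in range(5):
    E = rng.standard_normal((n, n))
    # First-order (finite-difference) approximation of L(A, E)
    L_AE = (expm(A + h * E) - expm(A - h * E)) / (2 * h)
    lhs = np.linalg.norm(L_AE, 2)
    rhs = np.exp(np.linalg.norm(A, 2)) * np.linalg.norm(E, 2)
    print(lhs <= rhs, lhs, rhs)  # the bound (11-10-16) should hold
```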
11.11 Gradient of the Matrix Square Root
The gradient $\boldsymbol{X} = \dfrac{\partial L}{\partial \boldsymbol{A}}$ is the solution of the Sylvester equation $\boldsymbol{S}\boldsymbol{X} + \boldsymbol{X}\boldsymbol{S} = \bar{\boldsymbol{S}}$, where $\boldsymbol{S} = \boldsymbol{A}^{1/2}$ and $\bar{\boldsymbol{S}} = \dfrac{\partial L}{\partial \boldsymbol{S}}$.
Proof
[Definition of the Matrix Square Root]
When $\boldsymbol{A}$ is symmetric positive definite, there exists a unique symmetric positive definite matrix $\boldsymbol{S}$ satisfying $\boldsymbol{S}^2 = \boldsymbol{A}$. This is called the principal square root.
\begin{equation}\boldsymbol{S} = \boldsymbol{A}^{1/2}, \quad \boldsymbol{S}^2 = \boldsymbol{A} \label{eq:11-11-1}\end{equation}
[Differential Relation of the Forward Pass]
Take the total differential of $\boldsymbol{S}^2 = \boldsymbol{A}$ from $\eqref{eq:11-11-1}$. $\boldsymbol{S}$ also varies as a function of $\boldsymbol{A}$.
\begin{equation}d(\boldsymbol{S}^2) = d\boldsymbol{A} \label{eq:11-11-2}\end{equation}
Apply the product rule (1.25) to $\boldsymbol{S}^2 = \boldsymbol{S} \cdot \boldsymbol{S}$.
\begin{equation}d(\boldsymbol{S}^2) = d\boldsymbol{S} \cdot \boldsymbol{S} + \boldsymbol{S} \cdot d\boldsymbol{S} \label{eq:11-11-3}\end{equation}
Combine $\eqref{eq:11-11-2}$ and $\eqref{eq:11-11-3}$.
\begin{equation}d\boldsymbol{S} \cdot \boldsymbol{S} + \boldsymbol{S} \cdot d\boldsymbol{S} = d\boldsymbol{A} \label{eq:11-11-4}\end{equation}
[Setting Up Backpropagation]
In neural network backpropagation, assume that the gradient of the loss function $L$ with respect to $\boldsymbol{S}$ (the upstream gradient) is known.
\begin{equation}\bar{\boldsymbol{S}} = \dfrac{\partial L}{\partial \boldsymbol{S}} \label{eq:11-11-5}\end{equation}
We want to find the gradient of $L$ with respect to $\boldsymbol{A}$.
\begin{equation}\boldsymbol{X} = \dfrac{\partial L}{\partial \boldsymbol{A}} \label{eq:11-11-6}\end{equation}
[Applying the Chain Rule]
Write the chain rule in differential form. The total differential of $L$ can be expressed as
\begin{equation}dL = \text{tr}\left(\bar{\boldsymbol{S}}^\top d\boldsymbol{S}\right) = \text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{A}\right) \label{eq:11-11-7}\end{equation}
Here, the matrix inner product is expressed in the form $\text{tr}(\boldsymbol{G}^\top d\boldsymbol{M})$.
[Eliminating $d\boldsymbol{S}$]
From $\eqref{eq:11-11-4}$, the relationship between $d\boldsymbol{A}$ and $d\boldsymbol{S}$ is known. Substitute $\eqref{eq:11-11-4}$ into the right-hand side of $\eqref{eq:11-11-7}$.
\begin{equation}\text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{A}\right) = \text{tr}\left(\boldsymbol{X}^\top (d\boldsymbol{S} \cdot \boldsymbol{S} + \boldsymbol{S} \cdot d\boldsymbol{S})\right) \label{eq:11-11-8}\end{equation}
Expand using the linearity of the trace.
\begin{equation}\text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{A}\right) = \text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{S} \cdot \boldsymbol{S}\right) + \text{tr}\left(\boldsymbol{X}^\top \boldsymbol{S} \cdot d\boldsymbol{S}\right) \label{eq:11-11-9}\end{equation}
Apply the cyclic property of the trace (1.12). For the first term,
\begin{equation}\text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{S} \cdot \boldsymbol{S}\right) = \text{tr}\left(\boldsymbol{S} \boldsymbol{X}^\top d\boldsymbol{S}\right) = \text{tr}\left((\boldsymbol{X} \boldsymbol{S}^\top)^\top d\boldsymbol{S}\right) \label{eq:11-11-10}\end{equation}
Since $\boldsymbol{S}$ is symmetric ($\boldsymbol{S}^\top = \boldsymbol{S}$),
\begin{equation}\text{tr}\left((\boldsymbol{X} \boldsymbol{S}^\top)^\top d\boldsymbol{S}\right) = \text{tr}\left((\boldsymbol{X} \boldsymbol{S})^\top d\boldsymbol{S}\right) = \text{tr}\left(\boldsymbol{S}^\top \boldsymbol{X}^\top d\boldsymbol{S}\right) = \text{tr}\left(\boldsymbol{S} \boldsymbol{X}^\top d\boldsymbol{S}\right) \label{eq:11-11-11}\end{equation}
For the second term, similarly,
\begin{equation}\text{tr}\left(\boldsymbol{X}^\top \boldsymbol{S} \cdot d\boldsymbol{S}\right) = \text{tr}\left((\boldsymbol{S}^\top \boldsymbol{X})^\top d\boldsymbol{S}\right) = \text{tr}\left((\boldsymbol{S} \boldsymbol{X})^\top d\boldsymbol{S}\right) \label{eq:11-11-12}\end{equation}
Substitute $\eqref{eq:11-11-11}$ and $\eqref{eq:11-11-12}$ into $\eqref{eq:11-11-9}$.
\begin{equation}\text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{A}\right) = \text{tr}\left(\boldsymbol{S} \boldsymbol{X}^\top d\boldsymbol{S}\right) + \text{tr}\left((\boldsymbol{S} \boldsymbol{X})^\top d\boldsymbol{S}\right) \label{eq:11-11-13}\end{equation}
Assume that $\boldsymbol{X}$ is also symmetric: since $\boldsymbol{A}$ is constrained to be symmetric and the upstream gradient $\bar{\boldsymbol{S}}$ is symmetric, the gradient with respect to $\boldsymbol{A}$ can be taken symmetric, so $\boldsymbol{X}^\top = \boldsymbol{X}$.
\begin{equation}\text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{A}\right) = \text{tr}\left((\boldsymbol{S} \boldsymbol{X})^\top d\boldsymbol{S}\right) + \text{tr}\left((\boldsymbol{X} \boldsymbol{S})^\top d\boldsymbol{S}\right) \label{eq:11-11-14}\end{equation}
Combine using the linearity of the trace.
\begin{equation}\text{tr}\left(\boldsymbol{X}^\top d\boldsymbol{A}\right) = \text{tr}\left((\boldsymbol{S} \boldsymbol{X} + \boldsymbol{X} \boldsymbol{S})^\top d\boldsymbol{S}\right) \label{eq:11-11-15}\end{equation}
[Identifying the Gradient]
Compare the left-hand side $\text{tr}(\bar{\boldsymbol{S}}^\top d\boldsymbol{S})$ of $\eqref{eq:11-11-7}$ with $\eqref{eq:11-11-15}$.
\begin{equation}\text{tr}\left(\bar{\boldsymbol{S}}^\top d\boldsymbol{S}\right) = \text{tr}\left((\boldsymbol{S} \boldsymbol{X} + \boldsymbol{X} \boldsymbol{S})^\top d\boldsymbol{S}\right) \label{eq:11-11-16}\end{equation}
For $\eqref{eq:11-11-16}$ to hold for all $d\boldsymbol{S}$, the following equality is necessary.
\begin{equation}\boldsymbol{S} \boldsymbol{X} + \boldsymbol{X} \boldsymbol{S} = \bar{\boldsymbol{S}} \label{eq:11-11-17}\end{equation}
[Sylvester Equation]
$\eqref{eq:11-11-17}$ is a Sylvester equation (a special form of the continuous Lyapunov equation) in $\boldsymbol{X}$.
When $\boldsymbol{S}$ is positive definite, all eigenvalues of $\boldsymbol{S}$ are positive, so the sum of any pair of eigenvalues of $\boldsymbol{S}$ is positive. Therefore, this equation has a unique solution.
[Solution via Eigendecomposition]
Let the eigendecomposition of $\boldsymbol{S}$ be $\boldsymbol{S} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top$, where $\boldsymbol{Q}$ is an orthogonal matrix and $\boldsymbol{\Lambda} = \text{diag}(\lambda_0, \ldots, \lambda_{n-1})$ is the diagonal matrix of eigenvalues.
\begin{equation}\boldsymbol{S} = \boldsymbol{Q} \boldsymbol{\Lambda} \boldsymbol{Q}^\top \label{eq:11-11-19}\end{equation}
Substitute $\eqref{eq:11-11-19}$ into $\eqref{eq:11-11-17}$ and transform with $\tilde{\boldsymbol{X}} = \boldsymbol{Q}^\top \boldsymbol{X} \boldsymbol{Q}$, $\tilde{\bar{\boldsymbol{S}}} = \boldsymbol{Q}^\top \bar{\boldsymbol{S}} \boldsymbol{Q}$ to obtain
\begin{equation}\boldsymbol{\Lambda} \tilde{\boldsymbol{X}} + \tilde{\boldsymbol{X}} \boldsymbol{\Lambda} = \tilde{\bar{\boldsymbol{S}}} \label{eq:11-11-20}\end{equation}
Write $\eqref{eq:11-11-20}$ component-wise.
\begin{equation}\lambda_i \tilde{X}_{ij} + \tilde{X}_{ij} \lambda_j = \tilde{\bar{S}}_{ij} \label{eq:11-11-21}\end{equation}
Solve $\eqref{eq:11-11-21}$ for $\tilde{X}_{ij}$.
\begin{equation}\tilde{X}_{ij} = \dfrac{\tilde{\bar{S}}_{ij}}{\lambda_i + \lambda_j} \label{eq:11-11-22}\end{equation}
Finally, apply the inverse transformation to obtain $\boldsymbol{X}$.
\begin{equation}\boldsymbol{X} = \boldsymbol{Q} \tilde{\boldsymbol{X}} \boldsymbol{Q}^\top = \dfrac{\partial L}{\partial \boldsymbol{A}} \label{eq:11-11-23}\end{equation}
Source: The Sylvester equation originates from J.J. Sylvester (1884) "Sur l'équation en matrices $px = xq$", Comptes Rendus de l'Académie des Sciences 99, 67-71, 115-116.
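A minimal numerical check of the backpropagation rule, assuming NumPy and SciPy (`sqrtm`, `solve_sylvester`). The loss $L(\boldsymbol{A}) = \langle \boldsymbol{W}, \boldsymbol{A}^{1/2} \rangle$ with a fixed symmetric $\boldsymbol{W}$ playing the role of the upstream gradient $\bar{\boldsymbol{S}}$ is an illustrative choice:

```python
import numpy as np
from scipy.linalg import sqrtm, solve_sylvester

rng = np.random.default_rng(10)
n = 4
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)        # symmetric positive definite
W = rng.standard_normal((n, n))
W = 0.5 * (W + W.T)                # symmetric upstream gradient, plays \bar{S}

S = np.real(sqrtm(A))              # principal square root
X = solve_sylvester(S, S, W)       # solves S X + X S = W, eq. (11-11-17)

loss = lambda A: np.sum(W * np.real(sqrtm(A)))   # L(A) = <W, A^{1/2}>
eps, fd = 1e-6, np.zeros((n, n))
for i in range(n):
    for j in range(n):
        P = np.zeros((n, n)); P[i, j] = eps
        fd[i, j] = (loss(A + P) - loss(A - P)) / (2 * eps)
print(np.max(np.abs(fd - X)))  # small
```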