SVD, Fisher Information, Reinforcement Learning, NLP
Advanced applications of vector/matrix calculus to machine learning.
All formulas on this page use the denominator layout. See Layout Conventions for details.
18.1 Singular Value Decomposition (SVD) and Matrix Calculus
Matrix calculus is needed to optimize loss functions involving the singular value decomposition. It is used in low-rank approximation, matrix completion, principal component analysis, and more.
18.1 Gradient of Low-Rank Approximation
Proof
Let $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{V}^\top$. The result then follows directly from 12.6.
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A}) = 2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top) \label{eq:18-1-1}\end{equation}
18.2 Gradient with Respect to Factor Matrix $\boldsymbol{U}$
Proof
Let $\boldsymbol{E} = \boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top$. The loss function is $L = \|\boldsymbol{E}\|_F^2 = \text{tr}(\boldsymbol{E}^\top\boldsymbol{E})$.
\begin{equation}L = \text{tr}((\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)^\top(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)) \label{eq:18-2-1}\end{equation}
Expanding $\eqref{eq:18-2-1}$:
\begin{equation}L = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) - 2\text{tr}(\boldsymbol{X}^\top\boldsymbol{U}\boldsymbol{V}^\top) + \text{tr}(\boldsymbol{V}\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{V}^\top) \label{eq:18-2-2}\end{equation}
Consider only the terms involving $\boldsymbol{U}$. The second term is $\text{tr}(\boldsymbol{V}^\top\boldsymbol{X}^\top\boldsymbol{U})$ (by cyclic property of trace).
By 3.1, $\displaystyle\frac{\partial}{\partial \boldsymbol{U}}\text{tr}(\boldsymbol{V}^\top\boldsymbol{X}^\top\boldsymbol{U}) = \boldsymbol{X}\boldsymbol{V}$.
The third term has the form of 3.5, giving $\displaystyle\frac{\partial}{\partial \boldsymbol{U}}\text{tr}(\boldsymbol{V}\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{V}^\top) = 2\boldsymbol{U}\boldsymbol{V}^\top\boldsymbol{V}$.
\begin{equation}\frac{\partial L}{\partial \boldsymbol{U}} = -2\boldsymbol{X}\boldsymbol{V} + 2\boldsymbol{U}\boldsymbol{V}^\top\boldsymbol{V} = -2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)\boldsymbol{V} \label{eq:18-2-3}\end{equation}
18.3 Gradient with Respect to Factor Matrix $\boldsymbol{V}$
Proof
Following the same procedure as 18.2, we differentiate with respect to $\boldsymbol{V}$.
\begin{equation}\frac{\partial L}{\partial \boldsymbol{V}} = -2\boldsymbol{X}^\top\boldsymbol{U} + 2\boldsymbol{V}\boldsymbol{U}^\top\boldsymbol{U} = -2(\boldsymbol{X}^\top - \boldsymbol{V}\boldsymbol{U}^\top)\boldsymbol{U} \label{eq:18-3-1}\end{equation}
Since $(\boldsymbol{X}^\top - \boldsymbol{V}\boldsymbol{U}^\top) = (\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)^\top$, we obtain $\displaystyle\frac{\partial L}{\partial \boldsymbol{V}} = -2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)^\top\boldsymbol{U}$.
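As a numerical illustration, the following is a minimal NumPy sketch (the matrix sizes, initialization scale, and learning rate are assumptions for the example) of gradient descent on $L$ using $\eqref{eq:18-2-3}$ and $\eqref{eq:18-3-1}$:
```python
import numpy as np

# Minimal sketch: gradient descent on L = ||X - U V^T||_F^2 using the
# gradients (18-2-3) and (18-3-1); sizes and learning rate are assumptions.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 15))
r, lr = 5, 0.01
U = 0.1 * rng.standard_normal((20, r))
V = 0.1 * rng.standard_normal((15, r))

for _ in range(2000):
    E = X - U @ V.T                                    # residual
    U, V = U + lr * 2 * E @ V, V + lr * 2 * E.T @ U    # descend both factors

print(np.linalg.norm(X - U @ V.T) ** 2)  # approaches the best rank-r error
```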
18.4 Subgradient of the Nuclear Norm
Proof
The nuclear norm is defined as $\|\boldsymbol{X}\|_* = \sum_{i} \sigma_i = \text{tr}(\boldsymbol{\Sigma})$, where $\sigma_i$ are the singular values of $\boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$.
When $\boldsymbol{X}$ has full rank, the subdifferential is the singleton $\{\boldsymbol{U}\boldsymbol{V}^\top\}$, so the nuclear norm is differentiable there.
In the rank-deficient case, the subdifferential is the set
\begin{equation}\partial\|\boldsymbol{X}\|_* = \left\{\boldsymbol{U}\boldsymbol{V}^\top + \boldsymbol{W} \;:\; \boldsymbol{U}^\top\boldsymbol{W} = \boldsymbol{0},\ \boldsymbol{W}\boldsymbol{V} = \boldsymbol{0},\ \|\boldsymbol{W}\|_2 \le 1\right\} \label{eq:18-4-1}\end{equation}
so there is freedom in the null-space projection component $\boldsymbol{W}$.
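A minimal NumPy sketch of the subgradient $\boldsymbol{U}\boldsymbol{V}^\top$ computed from a compact SVD (the low-rank test matrix is an arbitrary example):
```python
import numpy as np

# Minimal sketch: the subgradient U V^T of the nuclear norm via compact SVD.
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4)) @ rng.standard_normal((4, 5))  # rank <= 4

U, s, Vt = np.linalg.svd(X, full_matrices=False)
r = np.sum(s > 1e-10)                    # numerical rank
G = U[:, :r] @ Vt[:r, :]                 # U V^T over nonzero singular values

# Subgradient checks: <G, X> = ||X||_* and spectral norm of G is 1.
print(np.sum(G * X), np.sum(s), np.linalg.norm(G, 2))
```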
18.5 Fisher Information Matrix
The Fisher information matrix, which quantifies the precision attainable in parameter estimation, can be computed from the second derivative of the log-likelihood. It is used in natural gradient methods and trust-region methods.
18.5 Definition of the Fisher Information Matrix
Proof
Define the score function of the log-likelihood as $\boldsymbol{s}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{x}|\boldsymbol{\theta})$.
\begin{equation}\boldsymbol{F}(\boldsymbol{\theta}) = \mathbb{E}[\boldsymbol{s}(\boldsymbol{\theta})\boldsymbol{s}(\boldsymbol{\theta})^\top] = \text{Cov}(\boldsymbol{s}(\boldsymbol{\theta})) \label{eq:18-5-1}\end{equation}
We show that $\mathbb{E}[\boldsymbol{s}(\boldsymbol{\theta})] = \boldsymbol{0}$, assuming regularity conditions that permit interchanging differentiation and integration.
\begin{equation}\mathbb{E}[\boldsymbol{s}(\boldsymbol{\theta})] = \int \frac{\nabla_{\boldsymbol{\theta}} p(\boldsymbol{x}|\boldsymbol{\theta})}{p(\boldsymbol{x}|\boldsymbol{\theta})} p(\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x} = \nabla_{\boldsymbol{\theta}} \int p(\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x} = \nabla_{\boldsymbol{\theta}} 1 = \boldsymbol{0} \label{eq:18-5-2}\end{equation}
18.6 Relationship Between Fisher Information and the Hessian
Proof
Differentiate $\nabla_{\boldsymbol{\theta}} \log p = \displaystyle\frac{\nabla_{\boldsymbol{\theta}} p}{p}$ once more.
\begin{equation}\nabla_{\boldsymbol{\theta}}^2 \log p = \frac{\nabla_{\boldsymbol{\theta}}^2 p}{p} - \frac{\nabla_{\boldsymbol{\theta}} p (\nabla_{\boldsymbol{\theta}} p)^\top}{p^2} \label{eq:18-6-1}\end{equation}
Taking the expectation, the first term becomes $\nabla_{\boldsymbol{\theta}}^2 \int p \, d\boldsymbol{x} = \boldsymbol{0}$.
\begin{equation}\mathbb{E}[\nabla_{\boldsymbol{\theta}}^2 \log p] = -\mathbb{E}\left[\frac{\nabla_{\boldsymbol{\theta}} p (\nabla_{\boldsymbol{\theta}} p)^\top}{p^2}\right] = -\boldsymbol{F}(\boldsymbol{\theta}) \label{eq:18-6-2}\end{equation}
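A minimal Monte Carlo sketch for $x \sim \mathcal{N}(\theta, 1)$, where the score is $s = x - \theta$ and both $\eqref{eq:18-5-1}$ and $\eqref{eq:18-6-2}$ give $F(\theta) = 1$ (sample size and seed are arbitrary):
```python
import numpy as np

# Minimal sketch: for x ~ N(theta, 1) the score is s = x - theta, so
# E[s s^T] = 1, and the Hessian of log p is the constant -1, so
# -E[Hessian] = 1 = F(theta), matching eq. (18-6-2).
rng = np.random.default_rng(0)
theta = 2.0
x = rng.normal(theta, 1.0, size=1_000_000)

score = x - theta           # d/dtheta log N(x | theta, 1)
print(np.mean(score))       # ~ 0, eq. (18-5-2)
print(np.mean(score**2))    # ~ 1 = F(theta), eq. (18-5-1)
```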
18.7 Natural Gradient
Proof
Consider the parameter space as a Riemannian manifold. The Fisher information matrix serves as the metric tensor.
Transform the Euclidean gradient $\nabla_{\boldsymbol{\theta}} L$ using the metric $\boldsymbol{F}$ to obtain the natural gradient.
\begin{equation}\tilde{\nabla}_{\boldsymbol{\theta}} L = \boldsymbol{F}(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} L \label{eq:18-7-1}\end{equation}
This corresponds to the steepest descent direction subject to a KL divergence constraint.
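A minimal sketch of one natural-gradient step per $\eqref{eq:18-7-1}$, estimating $\boldsymbol{F}$ from score samples as in $\eqref{eq:18-5-1}$; the damping term and step size are assumptions commonly added for numerical stability, not part of the formula:
```python
import numpy as np

# Minimal sketch (assumed setup): one natural-gradient step, eq. (18-7-1),
# with F estimated from per-sample score vectors as in eq. (18-5-1).
def natural_gradient_step(theta, grad_L, scores, lr=0.1, damping=1e-3):
    # scores: (n, d) array of score vectors evaluated at theta
    F = scores.T @ scores / len(scores)     # empirical Fisher matrix
    F += damping * np.eye(len(theta))       # damping for invertibility
    nat_grad = np.linalg.solve(F, grad_L)   # F^{-1} grad L
    return theta - lr * nat_grad
```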
18.8 Cramér-Rao Lower Bound
Proof
By the definition of an unbiased estimator, $\mathbb{E}[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$. Differentiating both sides with respect to $\boldsymbol{\theta}$:
\begin{equation}\nabla_{\boldsymbol{\theta}} \mathbb{E}[\hat{\boldsymbol{\theta}}] = \boldsymbol{I} \label{eq:18-8-1}\end{equation}
Since $\eqref{eq:18-8-1}$ implies $\text{Cov}(\hat{\boldsymbol{\theta}}, \boldsymbol{s}(\boldsymbol{\theta})) = \boldsymbol{I}$, applying the matrix Cauchy-Schwarz inequality to this cross-covariance yields the lower bound.
\begin{equation}\text{Cov}(\hat{\boldsymbol{\theta}}) \succeq \boldsymbol{F}(\boldsymbol{\theta})^{-1} \label{eq:18-8-2}\end{equation}
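A minimal Monte Carlo sketch: for $n$ i.i.d. draws from $\mathcal{N}(\theta, \sigma^2)$ the Fisher information is $n/\sigma^2$, and the sample mean attains the bound $\sigma^2/n$ (the constants are arbitrary):
```python
import numpy as np

# Minimal sketch: the sample mean of n draws from N(theta, sigma^2) is
# unbiased, and its variance attains the CRLB sigma^2 / n.
rng = np.random.default_rng(0)
n, sigma, trials = 50, 2.0, 20_000
estimates = rng.normal(0.0, sigma, size=(trials, n)).mean(axis=1)
print(estimates.var())   # ~ sigma^2 / n = 0.08: the bound is attained
```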
18.9 Reinforcement Learning and Policy Gradients
Policy gradient methods in reinforcement learning estimate the gradient of expected cumulative reward. Matrix calculus is used for gradient computation in policy networks.
18.9 Policy Gradient Theorem
Proof
The expected cumulative reward is defined as follows.
\begin{equation}J(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \label{eq:18-9-1}\end{equation}
Rewriting using the state distribution $d^{\pi}(s)$:
\begin{equation}J(\boldsymbol{\theta}) = \sum_s d^{\pi}(s) \sum_a \pi_{\boldsymbol{\theta}}(a|s) Q^{\pi}(s, a) \label{eq:18-9-2}\end{equation}
Differentiating with respect to $\boldsymbol{\theta}$, using $\nabla_{\boldsymbol{\theta}} \pi = \pi \nabla_{\boldsymbol{\theta}} \log \pi$:
\begin{equation}\nabla_{\boldsymbol{\theta}} J = \sum_s d^{\pi}(s) \sum_a \pi_{\boldsymbol{\theta}}(a|s) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) Q^{\pi}(s, a) \label{eq:18-9-3}\end{equation}
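A minimal sketch of the score-function (REINFORCE) estimator implied by $\eqref{eq:18-9-3}$ for a softmax policy in a single state; the action values are assumed for the example:
```python
import numpy as np

# Minimal sketch (assumed setup): Monte Carlo score-function estimate of
# eq. (18-9-3) for a softmax policy over 4 actions in a single state.
rng = np.random.default_rng(0)
theta = rng.standard_normal(4)                     # one logit per action
Q = np.array([1.0, 0.5, -0.2, 0.3])                # assumed action values

pi = np.exp(theta - theta.max()); pi /= pi.sum()   # softmax policy
actions = rng.choice(4, size=200_000, p=pi)
# grad log pi(a) for a softmax policy is e_a - pi
grad_est = ((np.eye(4)[actions] - pi) * Q[actions][:, None]).mean(axis=0)
exact = sum(pi[a] * (np.eye(4)[a] - pi) * Q[a] for a in range(4))
print(grad_est, exact)                             # estimates agree
```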
18.10 Variance Reduction via Baseline
Proof
We show that the expected value of the baseline term is zero.
\begin{equation}\mathbb{E}_{a \sim \pi}\left[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) \cdot b(s)\right] = b(s) \sum_a \nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a|s) \label{eq:18-10-1}\end{equation}
Since $\sum_a \pi_{\boldsymbol{\theta}}(a|s) = 1$, we have $\sum_a \nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a|s) = 0$.
Therefore $\eqref{eq:18-10-1}$ is zero, and the baseline does not introduce bias into the gradient.
18.11 Actor-Critic Gradient
Proof
Using the baseline $b(s) = V^{\pi}(s)$ in 18.10 yields the result.
\begin{equation}\nabla_{\boldsymbol{\theta}} J = \mathbb{E}\left[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) \cdot (Q^{\pi}(s, a) - V^{\pi}(s))\right] \label{eq:18-11-1}\end{equation}
$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ is called the advantage function.
18.12 PPO Clipped Objective
Proof
Through importance sampling, the gradient of the new policy is estimated from samples drawn under the old policy, where $r_t(\boldsymbol{\theta}) = \pi_{\boldsymbol{\theta}}(a_t|s_t)/\pi_{\text{old}}(a_t|s_t)$ is the probability ratio.
\begin{equation}\nabla_{\boldsymbol{\theta}} J = \mathbb{E}_{\pi_{\text{old}}}\left[r_t(\boldsymbol{\theta}) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t|s_t) A_t\right] \label{eq:18-12-1}\end{equation}
The clip function zeroes the gradient whenever $r_t$ moves outside $[1-\epsilon, 1+\epsilon]$ in the direction that would further increase the objective, preventing drastic policy changes.
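A minimal NumPy sketch of the clipped surrogate; `ratio` and `adv` stand for assumed arrays of $r_t$ and $A_t$, and the mask shows where gradient flows through $r_t$:
```python
import numpy as np

# Minimal sketch: the PPO clipped surrogate and the gradient-gating mask.
def ppo_clip_objective(ratio, adv, eps=0.2):
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    obj = np.minimum(ratio * adv, clipped * adv)
    # Gradient flows through r_t only where the unclipped term is the min:
    active = (ratio * adv <= clipped * adv)
    return obj.mean(), active
```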
18.13 Natural Language Processing and Attention
Matrix calculus in the Attention mechanism of the Transformer. Gradient computation for softmax and multi-head attention.
18.13 Gradient of Attention Weights
Proof
Recall the definition of Scaled Dot-Product Attention.
\begin{equation}\boldsymbol{O} = \text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^\top}{\sqrt{d_k}}\right)\boldsymbol{V} = \boldsymbol{A}\boldsymbol{V} \label{eq:18-13-1}\end{equation}
Apply the chain rule, using the softmax Jacobian (6.2).
\begin{equation}\frac{\partial L}{\partial \boldsymbol{Q}} = \frac{\partial L}{\partial \boldsymbol{O}} \cdot \frac{\partial \boldsymbol{O}}{\partial \boldsymbol{A}} \cdot \frac{\partial \boldsymbol{A}}{\partial \boldsymbol{S}} \cdot \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{Q}} \label{eq:18-13-2}\end{equation}
Here $\boldsymbol{S} = \boldsymbol{Q}\boldsymbol{K}^\top / \sqrt{d_k}$ is the score matrix.
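A minimal NumPy sketch of the backward pass through $\eqref{eq:18-13-1}$, applying the row-wise softmax Jacobian in the form $d\boldsymbol{S} = \boldsymbol{A} \odot (d\boldsymbol{A} - \text{rowsum}(d\boldsymbol{A} \odot \boldsymbol{A}))$:
```python
import numpy as np

# Minimal sketch: manual backward pass for eq. (18-13-1). dO is the
# upstream gradient dL/dO; the softmax Jacobian is applied row-wise.
def attention_backward(Q, K, V, dO):
    dk = Q.shape[-1]
    S = Q @ K.T / np.sqrt(dk)                        # score matrix
    A = np.exp(S - S.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)               # attention weights

    dV = A.T @ dO
    dA = dO @ V.T
    # Row-wise softmax Jacobian: dS = A * (dA - rowsum(dA * A))
    dS = A * (dA - (dA * A).sum(axis=-1, keepdims=True))
    dQ = dS @ K / np.sqrt(dk)
    dK = dS.T @ Q / np.sqrt(dk)
    return dQ, dK, dV
```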
18.14 Multi-Head Attention Gradient
Proof
The output of multi-head attention consists of concatenated heads and an output projection.
\begin{equation}\text{MultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\boldsymbol{W}^O \label{eq:18-14-1}\end{equation}
Each head is $\text{head}_i = \text{Attention}(\boldsymbol{X}\boldsymbol{W}_i^Q, \boldsymbol{X}\boldsymbol{W}_i^K, \boldsymbol{X}\boldsymbol{W}_i^V)$.
The gradient with respect to $\boldsymbol{W}_i^Q$ takes the form of 3.1.
18.15 Layer Normalization Gradient
Proof
Layer Normalization normalizes along the feature dimension.
\begin{equation}\text{LayerNorm}(\boldsymbol{x}) = \gamma \odot \hat{\boldsymbol{x}} + \beta, \quad \hat{\boldsymbol{x}} = \frac{\boldsymbol{x} - \mu\boldsymbol{1}}{\sqrt{\sigma^2 + \epsilon}} \label{eq:18-15-1}\end{equation}
Since the mean $\mu$ and variance $\sigma^2$ depend on all components of $\boldsymbol{x}$, every input component affects every output, and the gradient computation is involved. It is derived by a procedure similar to 17.5.
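A minimal NumPy sketch of the resulting backward pass for a single feature vector (the $\epsilon$ inside the square root follows the usual implementation convention):
```python
import numpy as np

# Minimal sketch: backward pass for eq. (18-15-1) on one feature vector,
# with x_hat = (x - mu) / sqrt(var + eps).
def layernorm_backward(x, gamma, dy, eps=1e-5):
    mu, var = x.mean(), x.var()
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std

    dgamma = dy * x_hat                # per-component gradient for gamma
    dbeta = dy                         # gradient for beta
    g = dy * gamma                     # upstream through the elementwise scale
    # Subtract the mean paths through mu and sigma:
    dx = inv_std * (g - g.mean() - x_hat * (g * x_hat).mean())
    return dx, dgamma, dbeta
```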
18.16 Embedding Layer Gradient
Proof
The embedding layer is a matrix product with a one-hot vector $\boldsymbol{o}_i$.
\begin{equation}\boldsymbol{e} = \boldsymbol{E}^\top \boldsymbol{o}_i = \boldsymbol{E}_{[i,:]} \label{eq:18-16-1}\end{equation}
The gradient propagates only to the selected row.
\begin{equation}\frac{\partial L}{\partial \boldsymbol{E}_{[j,:]}} = \begin{cases} \frac{\partial L}{\partial \boldsymbol{e}} & \text{if } j = i \\ \boldsymbol{0} & \text{otherwise} \end{cases} \label{eq:18-16-2}\end{equation}
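A minimal NumPy sketch of $\eqref{eq:18-16-2}$ as a scatter-add over a batch of token indices (the batch interface is an assumption for the example):
```python
import numpy as np

# Minimal sketch: eq. (18-16-2) as a scatter-add. ids are assumed token
# indices in a batch; dL_de their upstream gradients, shape (batch, dim).
def embedding_backward(E_shape, ids, dL_de):
    dE = np.zeros(E_shape)
    np.add.at(dE, ids, dL_de)   # repeated ids accumulate their gradients
    return dE
```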
18.17 Cross-Entropy Loss Gradient (Logits)
Proof
Let $\boldsymbol{p} = \text{softmax}(\boldsymbol{z})$. The cross-entropy is $L = -\sum_i y_i \log p_i$.
Apply the chain rule.
\begin{equation}\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial z_j} \label{eq:18-17-1}\end{equation}
Substituting $\displaystyle\frac{\partial L}{\partial p_i} = -\displaystyle\frac{y_i}{p_i}$ and the softmax Jacobian $\displaystyle\frac{\partial p_i}{\partial z_j} = p_i(\delta_{ij} - p_j)$:
\begin{equation}\frac{\partial L}{\partial z_j} = -\sum_i \frac{y_i}{p_i} p_i(\delta_{ij} - p_j) = -y_j + p_j \sum_i y_i = p_j - y_j \label{eq:18-17-2}\end{equation}
where the last step uses $\sum_i y_i = 1$.
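A minimal finite-difference check of the result (sizes and seed are arbitrary):
```python
import numpy as np

# Minimal sketch: numerically verify dL/dz = p - y for softmax cross-entropy.
rng = np.random.default_rng(0)
z = rng.standard_normal(5)
y = np.eye(5)[2]                                   # one-hot target

def loss(z):
    p = np.exp(z - z.max()); p /= p.sum()
    return -np.sum(y * np.log(p))

p = np.exp(z - z.max()); p /= p.sum()
num = np.array([(loss(z + 1e-6 * np.eye(5)[j]) - loss(z)) / 1e-6
                for j in range(5)])
print(np.allclose(num, p - y, atol=1e-4))          # True
```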
18.18 Additional Advanced Topics
Further applications of matrix calculus.
18.18 Gaussian Process Log Marginal Likelihood Gradient
Proof
The log marginal likelihood of a Gaussian process is:
\begin{equation}\log p(\boldsymbol{y}|\boldsymbol{X}, \theta) = -\frac{1}{2}\boldsymbol{y}^\top\boldsymbol{K}^{-1}\boldsymbol{y} - \frac{1}{2}\log|\boldsymbol{K}| - \frac{n}{2}\log(2\pi) \label{eq:18-18-1}\end{equation}
Differentiating with respect to a kernel hyperparameter $\theta_j$, using $\frac{\partial \boldsymbol{K}^{-1}}{\partial \theta_j} = -\boldsymbol{K}^{-1}\frac{\partial \boldsymbol{K}}{\partial \theta_j}\boldsymbol{K}^{-1}$ and $\frac{\partial}{\partial \theta_j}\log|\boldsymbol{K}| = \text{tr}\left(\boldsymbol{K}^{-1}\frac{\partial \boldsymbol{K}}{\partial \theta_j}\right)$:
\begin{equation}\frac{\partial}{\partial \theta_j}\log p(\boldsymbol{y}|\boldsymbol{X}, \theta) = \frac{1}{2}\boldsymbol{y}^\top\boldsymbol{K}^{-1}\frac{\partial \boldsymbol{K}}{\partial \theta_j}\boldsymbol{K}^{-1}\boldsymbol{y} - \frac{1}{2}\text{tr}\left(\boldsymbol{K}^{-1}\frac{\partial \boldsymbol{K}}{\partial \theta_j}\right) \label{eq:18-18-2}\end{equation}
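A minimal NumPy sketch of $\eqref{eq:18-18-2}$, assuming an RBF kernel $k(\boldsymbol{x}, \boldsymbol{x}') = \exp(-\|\boldsymbol{x}-\boldsymbol{x}'\|^2/2\ell^2)$ with length-scale $\ell$ and an added noise term:
```python
import numpy as np

# Minimal sketch (assumed RBF kernel with length-scale ell): gradient of
# eq. (18-18-1) with respect to ell via eq. (18-18-2).
def gp_loglik_grad(X, y, ell, noise=1e-2):
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared distances
    K = np.exp(-0.5 * d2 / ell**2) + noise * np.eye(len(X))
    dK = np.exp(-0.5 * d2 / ell**2) * d2 / ell**3          # dK / d ell
    Kinv_y = np.linalg.solve(K, y)
    return (0.5 * Kinv_y @ dK @ Kinv_y
            - 0.5 * np.trace(np.linalg.solve(K, dK)))
```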
18.19 Independent Component Analysis (ICA) Gradient
Proof
The ICA objective maximizes non-Gaussianity. The FastICA fixed-point iteration for a single unit $\boldsymbol{w}$ (on whitened data $\boldsymbol{x}$) is
\begin{equation}\boldsymbol{w} \leftarrow \mathbb{E}[\boldsymbol{x}\, g(\boldsymbol{w}^\top\boldsymbol{x})] - \mathbb{E}[g'(\boldsymbol{w}^\top\boldsymbol{x})]\,\boldsymbol{w} \label{eq:18-19-1}\end{equation}
followed by normalization $\boldsymbol{w} \leftarrow \boldsymbol{w}/\|\boldsymbol{w}\|$.
Common choices for the nonlinearity are $g(y) = \tanh(y)$ or $g(y) = y \exp(-y^2/2)$.
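A minimal NumPy sketch of the one-unit iteration $\eqref{eq:18-19-1}$ with $g = \tanh$, assuming whitened data of shape $(d, n)$:
```python
import numpy as np

# Minimal sketch: one-unit FastICA fixed-point iteration with g = tanh,
# assuming X is whitened data of shape (d, n).
def fastica_unit(X, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(iters):
        y = w @ X
        # E[x g(w.x)] - E[g'(w.x)] w, with g' = 1 - tanh^2
        w_new = X @ np.tanh(y) / X.shape[1] - np.mean(1 - np.tanh(y)**2) * w
        w = w_new / np.linalg.norm(w_new)
    return w
```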
18.20 NLP Gradient Formulas
Gradients of objective functions used in word embeddings (Word2Vec, GloVe) and contrastive learning (InfoNCE).
18.20 Skip-gram (Negative Sampling) Gradient
Proof
The Skip-gram loss with negative sampling is defined as:
\begin{equation}L = -\log\sigma(\boldsymbol{w}_c^\top \boldsymbol{w}_o) - \sum_{k=1}^{K}\log\sigma(-\boldsymbol{w}_c^\top \boldsymbol{w}_k) \label{eq:18-20-1}\end{equation}
Using $\sigma'(x) = \sigma(x)(1-\sigma(x))$, $\frac{d}{dx}\log\sigma(x) = 1 - \sigma(x)$, and $1 - \sigma(-x) = \sigma(x)$, differentiating with respect to $\boldsymbol{w}_c$ gives:
\begin{equation}\frac{\partial L}{\partial \boldsymbol{w}_c} = (\sigma(\boldsymbol{w}_c^\top \boldsymbol{w}_o) - 1)\,\boldsymbol{w}_o + \sum_{k=1}^{K}\sigma(\boldsymbol{w}_c^\top \boldsymbol{w}_k)\,\boldsymbol{w}_k \label{eq:18-20-2}\end{equation}
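A minimal NumPy sketch of $\eqref{eq:18-20-2}$; the context vector and the $K \times d$ matrix of negative samples are assumed inputs:
```python
import numpy as np

# Minimal sketch: gradient of eq. (18-20-1) w.r.t. the center vector w_c.
# w_o (context) and W_neg (K x d negative samples) are assumed inputs.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def skipgram_grad_wc(w_c, w_o, W_neg):
    grad = (sigmoid(w_c @ w_o) - 1.0) * w_o   # positive-pair term
    grad += W_neg.T @ sigmoid(W_neg @ w_c)    # negative-sample terms
    return grad
```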
18.21 GloVe Gradient
Proof
The GloVe loss function is designed to predict the logarithm of the co-occurrence matrix.
\begin{equation}J = \sum_{i,j} f(X_{ij})(\boldsymbol{w}_i^\top \tilde{\boldsymbol{w}}_j + b_i + \tilde{b}_j - \log X_{ij})^2 \label{eq:18-21-1}\end{equation}
Differentiating the quadratic form with respect to $\boldsymbol{w}_i$ gives the gradient:
\begin{equation}\frac{\partial J}{\partial \boldsymbol{w}_i} = 2\sum_{j} f(X_{ij})(\boldsymbol{w}_i^\top \tilde{\boldsymbol{w}}_j + b_i + \tilde{b}_j - \log X_{ij})\,\tilde{\boldsymbol{w}}_j \label{eq:18-21-2}\end{equation}
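A minimal NumPy sketch of the per-pair gradients, assuming the standard GloVe weighting $f(x) = \min((x/x_{\max})^{3/4}, 1)$:
```python
import numpy as np

# Minimal sketch: gradients of eq. (18-21-1) for one (i, j) pair, assuming
# the standard weighting f(x) = min((x / x_max) ** 0.75, 1).
def glove_pair_grads(w_i, w_j_tilde, b_i, b_j, X_ij, x_max=100.0):
    f = min((X_ij / x_max) ** 0.75, 1.0)
    err = w_i @ w_j_tilde + b_i + b_j - np.log(X_ij)   # inner residual
    return (2 * f * err * w_j_tilde,    # dJ/dw_i, eq. (18-21-2)
            2 * f * err * w_i,          # dJ/dw_j_tilde (by symmetry)
            2 * f * err)                # dJ/db_i and dJ/db_j
```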
18.22 InfoNCE Loss Gradient
Proof
The InfoNCE loss is $\mathcal{L}_{\text{NCE}} = -\log\displaystyle\frac{\exp(\boldsymbol{q}^\top \boldsymbol{k}^+/\tau)}{\sum_i \exp(\boldsymbol{q}^\top \boldsymbol{k}_i/\tau)}$ with temperature $\tau$. The gradient with respect to the query $\boldsymbol{q}$:
\begin{equation}\frac{\partial \mathcal{L}_{\text{NCE}}}{\partial \boldsymbol{q}} = \frac{1}{\tau}\left(-\boldsymbol{k}^+ + \sum_{i} p_i \boldsymbol{k}_i\right) \label{eq:18-22-1}\end{equation}
where $p_i = \displaystyle\frac{\exp(\boldsymbol{q}^\top \boldsymbol{k}_i/\tau)}{\sum_j \exp(\boldsymbol{q}^\top \boldsymbol{k}_j/\tau)}$ are the softmax weights over the similarities.
The gradient with respect to the positive key $\boldsymbol{k}^+$:
\begin{equation}\frac{\partial \mathcal{L}_{\text{NCE}}}{\partial \boldsymbol{k}^+} = \frac{1}{\tau}(p_+ - 1)\boldsymbol{q} \label{eq:18-22-2}\end{equation}
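A minimal finite-difference check of $\eqref{eq:18-22-1}$, assuming $\boldsymbol{k}_0$ is the positive key among $K$ candidates:
```python
import numpy as np

# Minimal sketch: numerical check of eq. (18-22-1); k_0 is taken to be the
# positive key among 5 candidates.
rng = np.random.default_rng(0)
q, Ks, tau = rng.standard_normal(8), rng.standard_normal((5, 8)), 0.1

def loss(q):
    logits = Ks @ q / tau
    m = logits.max()
    return -logits[0] + m + np.log(np.sum(np.exp(logits - m)))

p = np.exp(Ks @ q / tau - (Ks @ q / tau).max())
p /= p.sum()
analytic = (-Ks[0] + p @ Ks) / tau                       # eq. (18-22-1)
num = np.array([(loss(q + 1e-6 * np.eye(8)[j]) - loss(q)) / 1e-6
                for j in range(8)])
print(np.allclose(num, analytic, atol=1e-3))             # True
```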