SVD, Fisher Information, Reinforcement Learning, NLP

Advanced applications of vector/matrix calculus to machine learning.

Notation Convention
All formulas on this page use the denominator layout. See Layout Conventions for details.

18.1 Singular Value Decomposition (SVD) and Matrix Calculus

Matrix calculus is needed to optimize loss functions involving the singular value decomposition. These results are used in low-rank approximation, matrix completion, principal component analysis, and more.

18.1 Gradient of Low-Rank Approximation

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)$
Conditions: $\boldsymbol{X} \in \mathbb{R}^{m \times n}$, $\boldsymbol{U} \in \mathbb{R}^{m \times k}$, $\boldsymbol{V} \in \mathbb{R}^{n \times k}$ ($k$ is the rank)
Proof

Let $\boldsymbol{A} = \boldsymbol{U}\boldsymbol{V}^\top$. The result then follows directly from 12.6.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A}) = 2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top) \label{eq:18-1-1}\end{equation}

Note: In matrix completion problems, this gradient is computed only over the observed entries. Used in recommender systems, famously in the Netflix Prize competition.
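As a quick sanity check of 18.1, here is a minimal NumPy sketch comparing the analytic gradient against a central finite difference (sizes and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 6, 5, 2
X = rng.normal(size=(m, n))
U = rng.normal(size=(m, k))
V = rng.normal(size=(n, k))

grad = 2 * (X - U @ V.T)                   # formula 18.1

def loss(X):
    return np.sum((X - U @ V.T)**2)        # ||X - UV^T||_F^2

# central-difference check of one entry
eps = 1e-6
E00 = np.zeros_like(X); E00[0, 0] = eps
num = (loss(X + E00) - loss(X - E00)) / (2 * eps)
print(np.allclose(grad[0, 0], num))        # True
```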

18.2 Gradient with Respect to Factor Matrix $\boldsymbol{U}$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{U}} \|\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top\|_F^2 = -2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)\boldsymbol{V}$
Conditions: $\boldsymbol{X}$ is a constant matrix
Proof

Let $\boldsymbol{E} = \boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top$. The loss function is $L = \|\boldsymbol{E}\|_F^2 = \text{tr}(\boldsymbol{E}^\top\boldsymbol{E})$.

\begin{equation}L = \text{tr}((\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)^\top(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)) \label{eq:18-2-1}\end{equation}

Expanding $\eqref{eq:18-2-1}$:

\begin{equation}L = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) - 2\text{tr}(\boldsymbol{X}^\top\boldsymbol{U}\boldsymbol{V}^\top) + \text{tr}(\boldsymbol{V}\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{V}^\top) \label{eq:18-2-2}\end{equation}

Consider only the terms involving $\boldsymbol{U}$. By the cyclic property of the trace, the second term can be written as $-2\text{tr}(\boldsymbol{V}^\top\boldsymbol{X}^\top\boldsymbol{U})$.

By 3.1, $\displaystyle\frac{\partial}{\partial \boldsymbol{U}}\text{tr}(\boldsymbol{V}^\top\boldsymbol{X}^\top\boldsymbol{U}) = \boldsymbol{X}\boldsymbol{V}$.

The third term has the form of 3.5, giving $\displaystyle\frac{\partial}{\partial \boldsymbol{U}}\text{tr}(\boldsymbol{V}\boldsymbol{U}^\top\boldsymbol{U}\boldsymbol{V}^\top) = 2\boldsymbol{U}\boldsymbol{V}^\top\boldsymbol{V}$.

\begin{equation}\frac{\partial L}{\partial \boldsymbol{U}} = -2\boldsymbol{X}\boldsymbol{V} + 2\boldsymbol{U}\boldsymbol{V}^\top\boldsymbol{V} = -2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)\boldsymbol{V} \label{eq:18-2-3}\end{equation}

Note: From the optimality condition $\nabla_{\boldsymbol{U}} L = 0$, we get $\boldsymbol{U} = \boldsymbol{X}\boldsymbol{V}(\boldsymbol{V}^\top\boldsymbol{V})^{-1}$. If $\boldsymbol{V}$ has orthonormal columns ($\boldsymbol{V}^\top\boldsymbol{V} = \boldsymbol{I}$), this simplifies to $\boldsymbol{U} = \boldsymbol{X}\boldsymbol{V}$.

18.3 Gradient with Respect to Factor Matrix $\boldsymbol{V}$

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{V}} \|\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top\|_F^2 = -2(\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)^\top\boldsymbol{U}$
Conditions: $\boldsymbol{X}$ is a constant matrix
Proof

Following the same procedure as 18.2, we differentiate with respect to $\boldsymbol{V}$.

\begin{equation}\frac{\partial L}{\partial \boldsymbol{V}} = -2\boldsymbol{X}^\top\boldsymbol{U} + 2\boldsymbol{V}\boldsymbol{U}^\top\boldsymbol{U} = -2(\boldsymbol{X}^\top - \boldsymbol{V}\boldsymbol{U}^\top)\boldsymbol{U} \label{eq:18-3-1}\end{equation}

Since $(\boldsymbol{X}^\top - \boldsymbol{V}\boldsymbol{U}^\top) = (\boldsymbol{X} - \boldsymbol{U}\boldsymbol{V}^\top)^\top$, we obtain the formula.

Note: In Alternating Least Squares (ALS), $\boldsymbol{U}$ and $\boldsymbol{V}$ are updated alternately. Each subproblem is a convex least-squares problem, so the objective decreases monotonically; since the joint problem is non-convex, convergence to a global optimum is not guaranteed.
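A minimal ALS sketch using the closed-form updates from the notes of 18.2 and 18.3 (sizes, seed, and iteration count are illustrative; the explicit inverse is acceptable here because the $k \times k$ Gram matrices are small):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 30, 20, 4
X = rng.normal(size=(m, k)) @ rng.normal(size=(k, n))   # exactly rank-k target

U = rng.normal(size=(m, k))
V = rng.normal(size=(n, k))
for _ in range(20):
    # closed-form minimizers from the optimality conditions in 18.2 / 18.3
    U = X @ V @ np.linalg.inv(V.T @ V)
    V = X.T @ U @ np.linalg.inv(U.T @ U)

print(np.linalg.norm(X - U @ V.T))   # ~0: the rank-k factorization is recovered
```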

18.4 Subgradient of the Nuclear Norm

Formula: $\partial \|\boldsymbol{X}\|_* = \{\boldsymbol{U}\boldsymbol{V}^\top + \boldsymbol{W} : \boldsymbol{U}^\top\boldsymbol{W} = \boldsymbol{0}, \boldsymbol{W}\boldsymbol{V} = \boldsymbol{0}, \|\boldsymbol{W}\|_2 \leq 1\}$
Conditions: $\boldsymbol{X} = \boldsymbol{U}\boldsymbol{\Sigma}\boldsymbol{V}^\top$ is the compact SVD (only the nonzero singular values), $\|\cdot\|_*$ is the nuclear norm (sum of singular values), $\|\cdot\|_2$ is the spectral norm
Proof

The nuclear norm is defined as $\|\boldsymbol{X}\|_* = \sum_{i} \sigma_i = \text{tr}(\boldsymbol{\Sigma})$.

When $\boldsymbol{X}$ has full rank, the constraints force $\boldsymbol{W} = \boldsymbol{0}$, so the subgradient is unique: the nuclear norm is differentiable there with gradient $\boldsymbol{U}\boldsymbol{V}^\top$.

In the rank-deficient case, there is freedom in the null-space projection component $\boldsymbol{W}$.

Note: Nuclear norm minimization is used as a convex relaxation for matrix completion. The subgradient is used when solving via proximal gradient methods.
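For the proximal-gradient use mentioned above, here is a minimal sketch of singular value thresholding, the proximal operator of $\tau\|\cdot\|_*$ (the matrix and threshold are illustrative assumptions):

```python
import numpy as np

def svt(X, tau):
    """Singular value thresholding: prox of tau * nuclear norm."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 6))
print(np.linalg.svd(svt(X, tau=1.0), compute_uv=False))  # singular values shrunk toward 0
```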

18.5 Fisher Information Matrix

The Fisher information matrix, which quantifies the precision attainable in parameter estimation, is defined via the outer product of the score and can equivalently be computed from the second derivative of the log-likelihood (18.6). Used in natural gradient methods and trust-region methods.

18.5 Definition of the Fisher Information Matrix

Formula: $\boldsymbol{F}(\boldsymbol{\theta}) = \mathbb{E}\left[\nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{x}|\boldsymbol{\theta}) \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{x}|\boldsymbol{\theta})^\top\right]$
Conditions: $p(\boldsymbol{x}|\boldsymbol{\theta})$ is the probability density function, $\boldsymbol{\theta}$ is the parameter vector
Proof

Define the score function of the log-likelihood as $\boldsymbol{s}(\boldsymbol{\theta}) = \nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{x}|\boldsymbol{\theta})$.

\begin{equation}\boldsymbol{F}(\boldsymbol{\theta}) = \mathbb{E}[\boldsymbol{s}(\boldsymbol{\theta})\boldsymbol{s}(\boldsymbol{\theta})^\top] = \text{Cov}(\boldsymbol{s}(\boldsymbol{\theta})) \label{eq:18-5-1}\end{equation}

The second equality holds because the score has zero mean, which we now verify: $\mathbb{E}[\boldsymbol{s}(\boldsymbol{\theta})] = \boldsymbol{0}$.

\begin{equation}\mathbb{E}[\boldsymbol{s}(\boldsymbol{\theta})] = \int \frac{\nabla_{\boldsymbol{\theta}} p(\boldsymbol{x}|\boldsymbol{\theta})}{p(\boldsymbol{x}|\boldsymbol{\theta})} p(\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x} = \nabla_{\boldsymbol{\theta}} \int p(\boldsymbol{x}|\boldsymbol{\theta}) d\boldsymbol{x} = \nabla_{\boldsymbol{\theta}} 1 = \boldsymbol{0} \label{eq:18-5-2}\end{equation}

Note: The Fisher information matrix is positive semidefinite. The diagonal entry $F_{ii}$ represents the information about parameter $\theta_i$.
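A minimal Monte Carlo sketch of the definition, using a Gaussian $\mathcal{N}(\mu, \sigma^2)$ with parameters $\boldsymbol{\theta} = (\mu, \log\sigma)$, whose exact Fisher matrix is $\text{diag}(1/\sigma^2,\, 2)$ (sample size and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
x = rng.normal(mu, sigma, size=200_000)

# Score of N(mu, sigma^2) with respect to theta = (mu, log sigma)
z = (x - mu) / sigma
scores = np.stack([z / sigma, z**2 - 1.0], axis=1)

F_mc = scores.T @ scores / len(x)       # Monte Carlo estimate of E[s s^T]
print(F_mc.round(3))                    # ~ diag(1/sigma^2, 2) = diag(0.25, 2)
```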

18.6 Relationship Between Fisher Information and the Hessian

Formula: $\boldsymbol{F}(\boldsymbol{\theta}) = -\mathbb{E}\left[\nabla_{\boldsymbol{\theta}}^2 \log p(\boldsymbol{x}|\boldsymbol{\theta})\right]$
Conditions: Regularity conditions hold (interchange of differentiation and integration is valid)
Proof

Differentiate $\nabla_{\boldsymbol{\theta}} \log p = \displaystyle\frac{\nabla_{\boldsymbol{\theta}} p}{p}$ once more.

\begin{equation}\nabla_{\boldsymbol{\theta}}^2 \log p = \frac{\nabla_{\boldsymbol{\theta}}^2 p}{p} - \frac{\nabla_{\boldsymbol{\theta}} p (\nabla_{\boldsymbol{\theta}} p)^\top}{p^2} \label{eq:18-6-1}\end{equation}

Taking the expectation, the first term becomes $\nabla_{\boldsymbol{\theta}}^2 \int p \, d\boldsymbol{x} = \boldsymbol{0}$.

\begin{equation}\mathbb{E}[\nabla_{\boldsymbol{\theta}}^2 \log p] = -\mathbb{E}\left[\frac{\nabla_{\boldsymbol{\theta}} p (\nabla_{\boldsymbol{\theta}} p)^\top}{p^2}\right] = -\boldsymbol{F}(\boldsymbol{\theta}) \label{eq:18-6-2}\end{equation}

Note: This relationship allows the Fisher information matrix to be computed as the negated expected Hessian of the log-likelihood. It provides a connection to Newton's method.
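A numerical check of this identity for a Bernoulli($\theta$) model, where both sides equal $1/(\theta(1-\theta))$; for this model they even agree sample by sample (sample size and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
x = (rng.random(200_000) < theta).astype(float)       # Bernoulli(theta) samples

score = x / theta - (1 - x) / (1 - theta)             # d/dtheta log p(x|theta)
hess = -x / theta**2 - (1 - x) / (1 - theta)**2       # d^2/dtheta^2 log p(x|theta)

print(np.mean(score**2))   # E[s^2]      ~ 1/(theta(1-theta)) = 4.76...
print(-np.mean(hess))      # -E[Hessian] ~ the same value
```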

18.7 Natural Gradient

Formula: $\tilde{\nabla}_{\boldsymbol{\theta}} L = \boldsymbol{F}(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} L$
Conditions: $L$ is the loss function, $\boldsymbol{F}(\boldsymbol{\theta})$ is nonsingular
Proof

Consider the parameter space as a Riemannian manifold. The Fisher information matrix serves as the metric tensor.

Transform the Euclidean gradient $\nabla_{\boldsymbol{\theta}} L$ using the metric $\boldsymbol{F}$ to obtain the natural gradient.

\begin{equation}\tilde{\nabla}_{\boldsymbol{\theta}} L = \boldsymbol{F}(\boldsymbol{\theta})^{-1} \nabla_{\boldsymbol{\theta}} L \label{eq:18-7-1}\end{equation}

This corresponds to the steepest descent direction subject to a KL divergence constraint.

Note: Natural gradient descent is invariant under reparametrization. It forms the theoretical foundation for reinforcement learning algorithms such as TRPO and PPO.
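A minimal sketch of one natural-gradient step; the damping term and the toy Fisher matrix are illustrative assumptions, not part of the formula:

```python
import numpy as np

def natural_gradient_step(theta, grad, fisher, lr=0.1, damping=1e-4):
    """One natural-gradient update; damping keeps F invertible (assumption)."""
    F = fisher + damping * np.eye(len(theta))
    return theta - lr * np.linalg.solve(F, grad)   # F^{-1} grad via a linear solve

theta = np.array([0.5, -0.2])
grad = np.array([1.0, 2.0])
F = np.diag([4.0, 1.0])                # toy Fisher: the first coordinate is "stiffer"
print(natural_gradient_step(theta, grad, F))   # the step shrinks along the stiff direction
```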

18.8 Cramér-Rao Lower Bound

Formula: $\text{Var}(\hat{\theta}_i) \geq [\boldsymbol{F}(\boldsymbol{\theta})^{-1}]_{ii}$
Conditions: $\hat{\boldsymbol{\theta}}$ is an unbiased estimator
Proof

By the definition of an unbiased estimator, $\mathbb{E}[\hat{\boldsymbol{\theta}}] = \boldsymbol{\theta}$. Differentiating both sides with respect to $\boldsymbol{\theta}$:

\begin{equation}\nabla_{\boldsymbol{\theta}} \mathbb{E}[\hat{\boldsymbol{\theta}}] = \boldsymbol{I} \label{eq:18-8-1}\end{equation}

Differentiating under the integral sign shows that $\text{Cov}(\hat{\boldsymbol{\theta}}, \boldsymbol{s}(\boldsymbol{\theta})) = \boldsymbol{I}$. Applying the matrix Cauchy-Schwarz inequality to $\hat{\boldsymbol{\theta}}$ and the score then yields the lower bound.

\begin{equation}\text{Cov}(\hat{\boldsymbol{\theta}}) \succeq \boldsymbol{F}(\boldsymbol{\theta})^{-1} \label{eq:18-8-2}\end{equation}

Note: An estimator achieving this bound is called efficient. The maximum likelihood estimator is asymptotically efficient.
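A small simulation illustrating the bound: the sample mean (the MLE of a Gaussian mean with known variance) attains the Cramér-Rao bound $\sigma^2/n$ (sample sizes and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 0.0, 2.0, 50, 20_000

# MLE of the mean of N(mu, sigma^2) from n samples, repeated over many trials
estimates = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)

crlb = sigma**2 / n                      # [F^{-1}] for n i.i.d. samples
print(estimates.var(), crlb)             # the variance attains the bound (efficient)
```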

18.9 Reinforcement Learning and Policy Gradients

Policy gradient methods in reinforcement learning estimate the gradient of expected cumulative reward. Matrix calculus is used for gradient computation in policy networks.

18.9 Policy Gradient Theorem

Formula: $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}\left[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) \cdot Q^{\pi_{\boldsymbol{\theta}}}(s, a)\right]$
Conditions: $J(\boldsymbol{\theta})$ is the expected cumulative reward, $\pi_{\boldsymbol{\theta}}(a|s)$ is the policy, $Q^{\pi}(s, a)$ is the action-value function
Proof

The expected cumulative reward is defined as follows.

\begin{equation}J(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}\left[\sum_{t=0}^{\infty} \gamma^t r_t\right] \label{eq:18-9-1}\end{equation}

Rewriting using the state distribution $d^{\pi}(s)$:

\begin{equation}J(\boldsymbol{\theta}) = \sum_s d^{\pi}(s) \sum_a \pi_{\boldsymbol{\theta}}(a|s) Q^{\pi}(s, a) \label{eq:18-9-2}\end{equation}

Differentiating with respect to $\boldsymbol{\theta}$ and using $\nabla_{\boldsymbol{\theta}} \pi = \pi \nabla_{\boldsymbol{\theta}} \log \pi$ (the policy gradient theorem guarantees that the term arising from $\nabla_{\boldsymbol{\theta}} d^{\pi}(s)$ does not need to be computed explicitly):

\begin{equation}\nabla_{\boldsymbol{\theta}} J = \sum_s d^{\pi}(s) \sum_a \pi_{\boldsymbol{\theta}}(a|s) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) Q^{\pi}(s, a) \label{eq:18-9-3}\end{equation}

Note: The REINFORCE algorithm estimates this gradient via Monte Carlo sampling. A baseline is used for variance reduction.
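A minimal REINFORCE sketch on a 3-armed bandit with a softmax policy, for which $\nabla_{\boldsymbol{\theta}} \log \pi(a) = \boldsymbol{e}_a - \boldsymbol{\pi}$ (rewards, learning rate, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([1.0, 2.0, 0.5])         # 3-armed bandit, arm 1 is best
theta = np.zeros(3)                              # policy logits
lr = 0.05

for _ in range(5000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(3, p=pi)
    r = true_rewards[a] + rng.normal()           # noisy reward sample
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0                        # grad log pi(a) = e_a - pi (softmax)
    theta += lr * r * grad_log_pi                # REINFORCE update
print(pi.round(3))                               # mass concentrates on arm 1
```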

18.10 Variance Reduction via Baseline

Formula: $\nabla_{\boldsymbol{\theta}} J(\boldsymbol{\theta}) = \mathbb{E}_{\pi_{\boldsymbol{\theta}}}\left[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) \cdot (Q^{\pi}(s, a) - b(s))\right]$
Conditions: $b(s)$ is a baseline function that depends only on state $s$
Proof

We show that the expected value of the baseline term is zero.

\begin{equation}\mathbb{E}_{a \sim \pi}\left[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) \cdot b(s)\right] = b(s) \sum_a \nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a|s) \label{eq:18-10-1}\end{equation}

Since $\sum_a \pi_{\boldsymbol{\theta}}(a|s) = 1$, we have $\sum_a \nabla_{\boldsymbol{\theta}} \pi_{\boldsymbol{\theta}}(a|s) = 0$.

Therefore $\eqref{eq:18-10-1}$ is zero, and the baseline does not introduce bias into the gradient.

Note: The optimal baseline is $b^*(s) = \displaystyle\frac{\mathbb{E}[\|\nabla \log \pi\|^2 Q]}{\mathbb{E}[\|\nabla \log \pi\|^2]}$. In practice, the value function $V(s)$ is commonly used.
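A Monte Carlo sketch at a single state showing that the baseline leaves the expected gradient unchanged while reducing its variance (the policy, $Q$ values, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.2, 0.5, 0.3])                   # fixed policy at one state
Q = np.array([1.0, 3.0, 2.0])                    # action values (illustrative)
b = pi @ Q                                       # baseline b(s) = V(s) = E[Q]

def grad_samples(baseline, n=100_000):
    a = rng.choice(3, p=pi, size=n)
    g = -np.tile(pi, (n, 1))                     # grad log pi(a) = e_a - pi
    g[np.arange(n), a] += 1.0
    return g * (Q[a] - baseline)[:, None]

for baseline in (0.0, b):
    g = grad_samples(baseline)
    print(g.mean(axis=0).round(3), g.var(axis=0).sum().round(3))
# same mean gradient in both rows, but smaller variance with the baseline
```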

18.11 Actor-Critic Gradient

Formula: $\nabla_{\boldsymbol{\theta}} J = \mathbb{E}\left[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) \cdot A^{\pi}(s, a)\right]$
Conditions: $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ is the advantage function
Proof

Using the baseline $b(s) = V^{\pi}(s)$ in 18.10 yields the result.

\begin{equation}\nabla_{\boldsymbol{\theta}} J = \mathbb{E}\left[\nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a|s) \cdot (Q^{\pi}(s, a) - V^{\pi}(s))\right] \label{eq:18-11-1}\end{equation}

$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ is called the advantage function.

Note: The Actor learns the policy $\pi_{\boldsymbol{\theta}}$ and the Critic learns the value function $V_{\boldsymbol{w}}$. Used in A2C, A3C, PPO, and others.

18.12 PPO Clipped Objective

Formula: $L^{\text{CLIP}}(\boldsymbol{\theta}) = \mathbb{E}\left[\min\left(r_t(\boldsymbol{\theta}) A_t, \text{clip}(r_t(\boldsymbol{\theta}), 1-\epsilon, 1+\epsilon) A_t\right)\right]$
Conditions: $r_t(\boldsymbol{\theta}) = \displaystyle\frac{\pi_{\boldsymbol{\theta}}(a_t|s_t)}{\pi_{\boldsymbol{\theta}_{\text{old}}}(a_t|s_t)}$ is the probability ratio, $\epsilon \approx 0.2$
Proof

Through importance sampling, the gradient of the new policy is estimated from samples of the old policy.

\begin{equation}\nabla_{\boldsymbol{\theta}} J = \mathbb{E}_{\pi_{\text{old}}}\left[r_t(\boldsymbol{\theta}) \nabla_{\boldsymbol{\theta}} \log \pi_{\boldsymbol{\theta}}(a_t|s_t) A_t\right] \label{eq:18-12-1}\end{equation}

The clip removes the incentive to move $r_t$ outside $[1-\epsilon, 1+\epsilon]$: when $r_t$ lies outside the interval and the clipped term attains the minimum, the gradient with respect to $\boldsymbol{\theta}$ vanishes, preventing drastic policy changes.

Note: PPO (Proximal Policy Optimization) is simple to implement and provides stable training. Successfully used in OpenAI Five and others.
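A minimal sketch of the clipped surrogate (the `ppo_clip_objective` helper and its inputs are illustrative, not a library API):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate of 18.12 (to be maximized)."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()

ratio = np.array([0.5, 1.0, 1.5])      # probability ratios r_t (illustrative)
adv = np.array([1.0, -1.0, 2.0])       # advantage estimates A_t (illustrative)
print(ppo_clip_objective(ratio, adv))  # clipping caps the contribution of r_t = 1.5
```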

18.13 Natural Language Processing and Attention

Matrix calculus in the Attention mechanism of the Transformer. Gradient computation for softmax and multi-head attention.

18.13 Gradient of Attention Weights

Formula: $\displaystyle\frac{\partial L}{\partial \boldsymbol{Q}} = \frac{1}{\sqrt{d_k}}\left[\boldsymbol{A} \odot \left(\boldsymbol{G} - \left((\boldsymbol{G} \odot \boldsymbol{A})\mathbf{1}\right)\mathbf{1}^\top\right)\right]\boldsymbol{K}$, where $\boldsymbol{G} = \displaystyle\frac{\partial L}{\partial \boldsymbol{O}}\boldsymbol{V}^\top$
Conditions: $\boldsymbol{A} = \text{softmax}(\boldsymbol{Q}\boldsymbol{K}^\top / \sqrt{d_k})$ (softmax applied row-wise), $\boldsymbol{O} = \boldsymbol{A}\boldsymbol{V}$ is the output, $\mathbf{1}$ is the all-ones vector
Proof

Recall the definition of Scaled Dot-Product Attention.

\begin{equation}\boldsymbol{O} = \text{softmax}\left(\frac{\boldsymbol{Q}\boldsymbol{K}^\top}{\sqrt{d_k}}\right)\boldsymbol{V} = \boldsymbol{A}\boldsymbol{V} \label{eq:18-13-1}\end{equation}

Apply the chain rule, using the softmax Jacobian (6.2).

\begin{equation}\frac{\partial L}{\partial \boldsymbol{Q}} = \frac{\partial L}{\partial \boldsymbol{O}} \cdot \frac{\partial \boldsymbol{O}}{\partial \boldsymbol{A}} \cdot \frac{\partial \boldsymbol{A}}{\partial \boldsymbol{S}} \cdot \frac{\partial \boldsymbol{S}}{\partial \boldsymbol{Q}} \label{eq:18-13-2}\end{equation}

Here $\boldsymbol{S} = \boldsymbol{Q}\boldsymbol{K}^\top / \sqrt{d_k}$ is the score matrix. Concretely, $\displaystyle\frac{\partial L}{\partial \boldsymbol{A}} = \frac{\partial L}{\partial \boldsymbol{O}}\boldsymbol{V}^\top = \boldsymbol{G}$; the row-wise softmax Jacobian maps $\boldsymbol{G}$ to $\boldsymbol{A} \odot (\boldsymbol{G} - ((\boldsymbol{G} \odot \boldsymbol{A})\mathbf{1})\mathbf{1}^\top)$; and differentiating $\boldsymbol{S}$ with respect to $\boldsymbol{Q}$ contributes the final factor $\boldsymbol{K}/\sqrt{d_k}$.

Note: In practice, memory-efficient methods such as Flash Attention compute gradients without explicitly storing $\boldsymbol{A}$.
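A NumPy sketch of the forward pass and the manual backward pass of 18.13, verified against a central finite difference; the upstream gradient `dO` is arbitrary and the sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, dk, dv = 4, 5, 3, 2
Q = rng.normal(size=(n, dk))
K = rng.normal(size=(m, dk))
V = rng.normal(size=(m, dv))

def attention(Q):
    S = Q @ K.T / np.sqrt(dk)
    A = np.exp(S - S.max(axis=1, keepdims=True))
    A /= A.sum(axis=1, keepdims=True)
    return A @ V, A

O, A = attention(Q)
dO = rng.normal(size=O.shape)            # arbitrary upstream gradient dL/dO

G = dO @ V.T                             # dL/dA
dS = A * (G - (G * A).sum(axis=1, keepdims=True))   # row-wise softmax Jacobian
dQ = dS @ K / np.sqrt(dk)                # dL/dQ, matching the formula above

# central-difference check of one entry, with L = sum(O * dO)
eps = 1e-6
Qp, Qm = Q.copy(), Q.copy()
Qp[0, 0] += eps; Qm[0, 0] -= eps
num = ((attention(Qp)[0] - attention(Qm)[0]) * dO).sum() / (2 * eps)
print(np.allclose(dQ[0, 0], num))        # True
```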

18.14 Multi-Head Attention Gradient

Formula: $\displaystyle\frac{\partial L}{\partial \boldsymbol{W}_i^Q} = \boldsymbol{X}^\top \displaystyle\frac{\partial L}{\partial \boldsymbol{Q}_i}$
Conditions: $\boldsymbol{Q}_i = \boldsymbol{X}\boldsymbol{W}_i^Q$ is the Query of the $i$-th head
Proof

The output of multi-head attention consists of concatenated heads and an output projection.

\begin{equation}\text{MultiHead}(\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\boldsymbol{W}^O \label{eq:18-14-1}\end{equation}

Each head is $\text{head}_i = \text{Attention}(\boldsymbol{X}\boldsymbol{W}_i^Q, \boldsymbol{X}\boldsymbol{W}_i^K, \boldsymbol{X}\boldsymbol{W}_i^V)$.

The gradient with respect to $\boldsymbol{W}_i^Q$ takes the form of 3.1.

Note: A typical Transformer uses $h = 8$ or $h = 12$ heads, with $d_k = d_v = d_{\text{model}} / h$.

18.15 Layer Normalization Gradient

Formula: $\displaystyle\frac{\partial L}{\partial \boldsymbol{x}} = \displaystyle\frac{\gamma}{\sigma}\left(\displaystyle\frac{\partial L}{\partial \hat{\boldsymbol{x}}} - \displaystyle\frac{1}{d}\mathbf{1}\mathbf{1}^\top\displaystyle\frac{\partial L}{\partial \hat{\boldsymbol{x}}} - \displaystyle\frac{\hat{\boldsymbol{x}}}{d}\hat{\boldsymbol{x}}^\top\displaystyle\frac{\partial L}{\partial \hat{\boldsymbol{x}}}\right)$
Conditions: $\hat{\boldsymbol{x}} = (\boldsymbol{x} - \mu\mathbf{1})/\sigma$, $\mu = \displaystyle\frac{1}{d}\sum_i x_i$, $\sigma = \sqrt{\displaystyle\frac{1}{d}\sum_i (x_i - \mu)^2 + \epsilon}$, $\gamma, \beta$ are the learnable scale and shift ($\gamma$ treated as a scalar here)
Proof

Layer Normalization normalizes along the feature dimension.

\begin{equation}\text{LayerNorm}(\boldsymbol{x}) = \gamma \odot \hat{\boldsymbol{x}} + \beta \label{eq:18-15-1}\end{equation}

Since $\mu$ and $\sigma$ depend on all components of $\boldsymbol{x}$, the gradient computation is involved.

Derived by a procedure similar to 17.5.

Note: Transformers use Layer Normalization instead of Batch Normalization. It does not depend on batch statistics, making it suitable for variable-length sequences.
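A NumPy sketch of the forward and backward passes (scalar $\gamma$, matching the formula above), verified by a finite difference; sizes and seed are illustrative assumptions:

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    mu = x.mean()
    sigma = np.sqrt(((x - mu)**2).mean() + eps)
    xhat = (x - mu) / sigma
    return gamma * xhat + beta, xhat, sigma

def layernorm_backward(dy, gamma, xhat, sigma):
    g = gamma * dy                                  # dL/dxhat (scalar gamma)
    return (g - g.mean() - xhat * (g * xhat).mean()) / sigma

rng = np.random.default_rng(0)
x, gamma, beta = rng.normal(size=6), 1.5, 0.3
y, xhat, sigma = layernorm(x, gamma, beta)
dy = rng.normal(size=6)
dx = layernorm_backward(dy, gamma, xhat, sigma)

# central-difference check of one coordinate, with L = sum(y * dy)
eps = 1e-6
e0 = np.zeros(6); e0[0] = eps
num = ((layernorm(x + e0, gamma, beta)[0] -
        layernorm(x - e0, gamma, beta)[0]) * dy).sum() / (2 * eps)
print(np.allclose(dx[0], num, atol=1e-6))           # True
```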

18.16 Embedding Layer Gradient

Formula: $\displaystyle\frac{\partial L}{\partial \boldsymbol{E}_{[i,:]}} = \displaystyle\frac{\partial L}{\partial \boldsymbol{e}}$ (when the $i$-th token is selected)
Conditions: $\boldsymbol{E} \in \mathbb{R}^{V \times d}$ is the embedding matrix, $V$ is the vocabulary size, $d$ is the embedding dimension
Proof

The embedding layer is a matrix product with a one-hot vector $\boldsymbol{o}_i$.

\begin{equation}\boldsymbol{e} = \boldsymbol{E}^\top \boldsymbol{o}_i = (\boldsymbol{E}_{[i,:]})^\top \label{eq:18-16-1}\end{equation}

The gradient propagates only to the selected row.

\begin{equation}\frac{\partial L}{\partial \boldsymbol{E}_{[j,:]}} = \begin{cases} \frac{\partial L}{\partial \boldsymbol{e}} & \text{if } j = i \\ \boldsymbol{0} & \text{otherwise} \end{cases} \label{eq:18-16-2}\end{equation}

Note: When the same token appears multiple times in a batch, the gradients are accumulated. Sparse gradient updates can improve efficiency.
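A minimal sketch of the accumulation behavior using `np.add.at`, which sums gradients at repeated indices instead of overwriting them (vocabulary size and gradients are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Vsize, d = 10, 4
E = rng.normal(size=(Vsize, d))               # embedding matrix
tokens = np.array([3, 7, 3])                  # token 3 appears twice in the batch
dL_de = rng.normal(size=(len(tokens), d))     # upstream gradient at each position

dE = np.zeros_like(E)
np.add.at(dE, tokens, dL_de)                  # repeated rows are accumulated
print(np.allclose(dE[3], dL_de[0] + dL_de[2]))  # True
```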

18.17 Cross-Entropy Loss Gradient (Logits)

Formula: $\displaystyle\frac{\partial L}{\partial \boldsymbol{z}} = \text{softmax}(\boldsymbol{z}) - \boldsymbol{y}$
Conditions: $L = -\sum_i y_i \log(\text{softmax}(\boldsymbol{z})_i)$, $\boldsymbol{y}$ is a one-hot label
Proof

Let $\boldsymbol{p} = \text{softmax}(\boldsymbol{z})$. The cross-entropy is $L = -\sum_i y_i \log p_i$.

Apply the chain rule.

\begin{equation}\frac{\partial L}{\partial z_j} = \sum_i \frac{\partial L}{\partial p_i} \frac{\partial p_i}{\partial z_j} \label{eq:18-17-1}\end{equation}

Substituting $\displaystyle\frac{\partial L}{\partial p_i} = -\displaystyle\frac{y_i}{p_i}$ and the softmax Jacobian, and using $\sum_i y_i = 1$, yields $p_j - y_j$.

Note: This gradient is remarkably simple: the difference between the predicted probability and the target. Used for next-token prediction in language models.
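A short numerical confirmation of the $\boldsymbol{p} - \boldsymbol{y}$ gradient against finite differences (logits and label are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([2.0, 1.0, 0.1])
y = np.array([0.0, 1.0, 0.0])                # one-hot label

grad = softmax(z) - y                        # formula 18.17

def ce(z):
    return -np.log(softmax(z) @ y)           # L = -sum_i y_i log p_i for one-hot y

eps = 1e-6
num = np.array([(ce(z + eps * np.eye(3)[j]) - ce(z - eps * np.eye(3)[j])) / (2 * eps)
                for j in range(3)])
print(np.allclose(grad, num))                # True
```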

18.18 Additional Advanced Topics

Further applications of matrix calculus.

18.18 Gaussian Process Log Marginal Likelihood Gradient

Formula: $\displaystyle\frac{\partial}{\partial \theta} \log p(\boldsymbol{y}|\boldsymbol{X}, \theta) = \displaystyle\frac{1}{2}\text{tr}\left((\boldsymbol{\alpha}\boldsymbol{\alpha}^\top - \boldsymbol{K}^{-1})\displaystyle\frac{\partial \boldsymbol{K}}{\partial \theta}\right)$
Conditions: $\boldsymbol{\alpha} = \boldsymbol{K}^{-1}\boldsymbol{y}$, $\boldsymbol{K}$ is the kernel matrix, $\theta$ is a hyperparameter
Proof

The log marginal likelihood of a Gaussian process is:

\begin{equation}\log p(\boldsymbol{y}|\boldsymbol{X}, \theta) = -\frac{1}{2}\boldsymbol{y}^\top\boldsymbol{K}^{-1}\boldsymbol{y} - \frac{1}{2}\log|\boldsymbol{K}| - \frac{n}{2}\log(2\pi) \label{eq:18-18-1}\end{equation}

Differentiate using 7.2 and 8.1.

Note: Used for hyperparameter optimization in Gaussian process regression. The computational cost is $O(n^3)$ but can be reduced via low-rank approximations.
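A minimal sketch for an RBF kernel with lengthscale $\ell$, where $\partial \boldsymbol{K}/\partial \ell$ is computed analytically (data, kernel, and noise level are illustrative assumptions; a Cholesky factorization would replace the explicit inverse in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(15, 1))                       # 1-D inputs
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=15)

def rbf(ell, noise=0.1):
    D2 = (X - X.T)**2                              # pairwise squared distances
    return np.exp(-D2 / (2 * ell**2)) + noise**2 * np.eye(len(X))

ell = 1.0
K = rbf(ell)
dK = np.exp(-(X - X.T)**2 / (2 * ell**2)) * (X - X.T)**2 / ell**3   # dK/d(ell)

alpha = np.linalg.solve(K, y)
Kinv = np.linalg.inv(K)                            # small n; use Cholesky in practice
grad = 0.5 * np.trace((np.outer(alpha, alpha) - Kinv) @ dK)

# finite-difference check of the log marginal likelihood (18.18.1)
def lml(ell):
    K = rbf(ell)
    sign, logdet = np.linalg.slogdet(K)
    return -0.5 * y @ np.linalg.solve(K, y) - 0.5 * logdet - len(y) / 2 * np.log(2 * np.pi)

eps = 1e-6
print(np.isclose(grad, (lml(ell + eps) - lml(ell - eps)) / (2 * eps)))   # True
```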

18.19 Independent Component Analysis (ICA) Gradient

Formula: $\nabla_{\boldsymbol{W}} J = \left(\boldsymbol{I} - \mathbb{E}[\boldsymbol{g}(\boldsymbol{y})\boldsymbol{y}^\top]\right)\boldsymbol{W}$
Conditions: $\boldsymbol{y} = \boldsymbol{W}\boldsymbol{x}$ is the separated signal, $\boldsymbol{g}$ is a nonlinear function
Proof

The ICA objective maximizes the non-Gaussianity (equivalently, the Infomax likelihood) of the separated signals. The formula above is the natural-gradient (relative-gradient) form of the update; FastICA uses a related fixed-point iteration instead.

Common choices are $\boldsymbol{g}(y) = \tanh(y)$ or $\boldsymbol{g}(y) = y \exp(-y^2/2)$.

Note: Used in blind source separation, EEG analysis, and more. Whitening is often applied as a preprocessing step.
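A sketch of the natural-gradient ICA update with $g = \tanh$ on a toy two-source mixture (sources, mixing matrix, learning rate, and iteration count are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
S = rng.laplace(size=(2, n))              # two independent super-Gaussian sources
A = np.array([[1.0, 0.5], [0.3, 1.0]])
X = A @ S                                 # observed mixtures

# whiten the mixtures first (common preprocessing)
X -= X.mean(axis=1, keepdims=True)
d, E = np.linalg.eigh(np.cov(X))
X = (E / np.sqrt(d)) @ E.T @ X

W = np.eye(2)
for _ in range(500):
    Y = W @ X
    W += 0.01 * (np.eye(2) - np.tanh(Y) @ Y.T / n) @ W   # update of 18.19, g = tanh
print(W @ (E / np.sqrt(d)) @ E.T @ A)     # ~ scaled permutation if separation worked
```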

18.20 NLP Gradient Formulas

Gradients of objective functions used in word embeddings (Word2Vec, GloVe) and contrastive learning (InfoNCE).

18.20 Skip-gram (Negative Sampling) Gradient

Formula: $\dfrac{\partial L}{\partial \boldsymbol{w}_c} = (\sigma(\boldsymbol{w}_c^\top \boldsymbol{w}_o) - 1)\boldsymbol{w}_o + \sum_{k} \sigma(\boldsymbol{w}_c^\top \boldsymbol{w}_k)\boldsymbol{w}_k$
Conditions: $\boldsymbol{w}_c$: center word vector, $\boldsymbol{w}_o$: context word vector, $\boldsymbol{w}_k$: negative sample vector, $\sigma$: sigmoid function
Proof

The Skip-gram loss with negative sampling is defined as:

\begin{equation}L = -\log\sigma(\boldsymbol{w}_c^\top \boldsymbol{w}_o) - \sum_{k=1}^{K}\log\sigma(-\boldsymbol{w}_c^\top \boldsymbol{w}_k) \label{eq:18-20-1}\end{equation}

Using $\sigma'(x) = \sigma(x)(1-\sigma(x))$ and $\frac{d}{dx}\log\sigma(x) = 1 - \sigma(x)$, differentiate with respect to $\boldsymbol{w}_c$.

Note: An efficient training algorithm for Word2Vec. Negative sampling reduces computational cost compared to the full softmax.
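A NumPy check of the gradient against a finite difference of the loss $\eqref{eq:18-20-1}$ (dimensions, negative-sample count, and seed are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

rng = np.random.default_rng(0)
d, K = 8, 5
w_c = rng.normal(size=d)                 # center word vector
w_o = rng.normal(size=d)                 # positive context vector
w_neg = rng.normal(size=(K, d))          # K negative-sample vectors

grad_wc = (sigmoid(w_c @ w_o) - 1) * w_o + sigmoid(w_neg @ w_c) @ w_neg

def loss(w_c):                           # the loss in (18.20.1)
    return -np.log(sigmoid(w_c @ w_o)) - np.log(sigmoid(-w_neg @ w_c)).sum()

eps = 1e-6
e0 = np.zeros(d); e0[0] = eps
print(np.allclose(grad_wc[0], (loss(w_c + e0) - loss(w_c - e0)) / (2 * eps)))  # True
```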

18.21 GloVe Gradient

Formula: $\dfrac{\partial J}{\partial \boldsymbol{w}_i} = 2\sum_{j} f(X_{ij})(\boldsymbol{w}_i^\top \tilde{\boldsymbol{w}}_j + b_i + \tilde{b}_j - \log X_{ij})\tilde{\boldsymbol{w}}_j$
Conditions: $X_{ij}$: co-occurrence matrix, $f$: weighting function, $\boldsymbol{w}_i, \tilde{\boldsymbol{w}}_j$: word vectors, $b_i, \tilde{b}_j$: biases
Proof

The GloVe loss function is designed to predict the logarithm of the co-occurrence matrix.

\begin{equation}J = \sum_{i,j} f(X_{ij})(\boldsymbol{w}_i^\top \tilde{\boldsymbol{w}}_j + b_i + \tilde{b}_j - \log X_{ij})^2 \label{eq:18-21-1}\end{equation}

Each term of $\eqref{eq:18-21-1}$ is quadratic in $\boldsymbol{w}_i$; differentiating the squared residual yields the gradient, including the factor 2.

Note: GloVe is a word embedding method that leverages global co-occurrence statistics. The weighting function $f(x) = \min(1, (x/x_{\max})^{0.75})$ is commonly used.
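A NumPy check of the 18.21 gradient (including the factor 2) against a finite difference on a toy co-occurrence matrix (sizes and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
Vn, d = 20, 6
Xco = rng.poisson(3.0, size=(Vn, Vn)) + 1      # toy co-occurrence counts (>= 1)
f = np.minimum(1.0, (Xco / Xco.max())**0.75)   # weighting function from the note
W = 0.1 * rng.normal(size=(Vn, d))
Wt = 0.1 * rng.normal(size=(Vn, d))
b, bt = rng.normal(size=Vn), rng.normal(size=Vn)

def J():
    err = W @ Wt.T + b[:, None] + bt[None, :] - np.log(Xco)
    return (f * err**2).sum()

i = 0
err = W @ Wt.T + b[:, None] + bt[None, :] - np.log(Xco)
grad_wi = 2 * (f[i] * err[i]) @ Wt             # gradient of 18.21 for w_i

eps = 1e-6
W[i, 0] += eps; Jp = J()
W[i, 0] -= 2 * eps; Jm = J()
W[i, 0] += eps
print(np.allclose(grad_wi[0], (Jp - Jm) / (2 * eps)))  # True
```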

18.22 InfoNCE Loss Gradient

Formula: $\mathcal{L}_{\text{NCE}} = -\log \dfrac{\exp(\boldsymbol{q}^\top \boldsymbol{k}^+ / \tau)}{\exp(\boldsymbol{q}^\top \boldsymbol{k}^+ / \tau) + \sum_{j=1}^{K} \exp(\boldsymbol{q}^\top \boldsymbol{k}^-_j / \tau)}$
Conditions: $\boldsymbol{q}$: query, $\boldsymbol{k}^+$: positive key, $\boldsymbol{k}^-_j$: negative keys, $\tau$: temperature parameter
Proof

The gradient with respect to the query $\boldsymbol{q}$:

\begin{equation}\frac{\partial \mathcal{L}_{\text{NCE}}}{\partial \boldsymbol{q}} = \frac{1}{\tau}\left(-\boldsymbol{k}^+ + \sum_{i} p_i \boldsymbol{k}_i\right) \label{eq:18-22-1}\end{equation}

where $p_i = \dfrac{\exp(\boldsymbol{q}^\top \boldsymbol{k}_i / \tau)}{\sum_j \exp(\boldsymbol{q}^\top \boldsymbol{k}_j / \tau)}$ is the softmax probability of key $\boldsymbol{k}_i$ among all $K+1$ keys.

The gradient with respect to the positive key $\boldsymbol{k}^+$:

\begin{equation}\frac{\partial \mathcal{L}_{\text{NCE}}}{\partial \boldsymbol{k}^+} = \frac{1}{\tau}(p_+ - 1)\boldsymbol{q} \label{eq:18-22-2}\end{equation}

Note: The foundational loss function for contrastive learning (SimCLR, CLIP, MoCo, etc.). The temperature parameter $\tau$ controls the sharpness of the distribution.
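A NumPy check of $\eqref{eq:18-22-1}$ against a finite difference of the InfoNCE loss (dimensions, temperature, and seed are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, tau = 8, 4, 0.1
q = rng.normal(size=d)
keys = rng.normal(size=(K + 1, d))       # row 0: positive key, rows 1..K: negatives

def loss(q):
    logits = keys @ q / tau
    m = logits.max()
    return -logits[0] + np.log(np.exp(logits - m).sum()) + m   # -log softmax_0

logits = keys @ q / tau
p = np.exp(logits - logits.max())
p /= p.sum()
grad_q = (p @ keys - keys[0]) / tau      # formula (18.22.1)

eps = 1e-6
e0 = np.zeros(d); e0[0] = eps
print(np.allclose(grad_q[0], (loss(q + e0) - loss(q - e0)) / (2 * eps)))  # True
```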