Neural Networks

Backpropagation formulas for each layer in deep learning.

Notation convention
All formulas on this page use the denominator layout. See Layout conventions for details.

Derivatives for Neural Networks

We derive the backpropagation formulas for each layer in deep learning. The gradient of the loss $L$ with respect to each parameter is computed, and the parameters are updated via gradient descent.

Activation and Loss Functions

Derivative formulas for activation functions and loss functions in neural networks. The proof numbering from Chapter 6 is preserved.

6.2 Jacobian of softmax

Formula: $\displaystyle\dfrac{\partial\, \mathrm{softmax}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top$
($\boldsymbol{p} = \mathrm{softmax}(\boldsymbol{x})$)
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$, $p_i = e^{x_i} / \displaystyle\sum_{k} e^{x_k} > 0$, $\displaystyle\sum_i p_i = 1$
Proof

The $i$-th output of the softmax function is defined as follows.

\begin{equation} p_i = \dfrac{e^{x_i}}{S}, \qquad S = \displaystyle\sum_{k=0}^{N-1} e^{x_k} \label{eq:6-2-1} \end{equation}

Case 1: $i = j$. Applying the quotient rule:

\begin{equation} \dfrac{\partial p_i}{\partial x_i} = \dfrac{e^{x_i} \cdot S - e^{x_i} \cdot e^{x_i}}{S^2} = \dfrac{e^{x_i}}{S} - \dfrac{e^{x_i}}{S} \cdot \dfrac{e^{x_i}}{S} = p_i - p_i^2 = p_i(1 - p_i) \label{eq:6-2-2} \end{equation}

Case 2: $i \neq j$. Since the numerator $e^{x_i}$ does not depend on $x_j$, only the denominator is differentiated.

\begin{equation} \dfrac{\partial p_i}{\partial x_j} = \dfrac{0 \cdot S - e^{x_i} \cdot e^{x_j}}{S^2} = -\dfrac{e^{x_i}}{S} \cdot \dfrac{e^{x_j}}{S} = -p_i \, p_j \label{eq:6-2-3} \end{equation}

Combining both cases using the Kronecker delta $\delta_{ij}$:

\begin{equation} \dfrac{\partial p_i}{\partial x_j} = p_i(\delta_{ij} - p_j) \label{eq:6-2-4} \end{equation}

Writing this in matrix form, since the $(i, j)$ entry of the Jacobian matrix is $p_i \delta_{ij} - p_i p_j$:

\begin{equation} \dfrac{\partial \boldsymbol{p}}{\partial \boldsymbol{x}} = \mathrm{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top \label{eq:6-2-5} \end{equation}

Note: Since the softmax output forms a probability distribution with $\displaystyle\sum_i p_i = 1$, the column sums of the Jacobian matrix are all 0. This is the differential expression of the constraint that the total of the softmax outputs is always 1.
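
The following is a small NumPy sketch, not part of the original text, that builds this Jacobian and checks the zero column sums numerically; the function names are ours.

    import numpy as np

    def softmax(x):
        # Shift by the maximum for numerical stability; the output is unchanged.
        e = np.exp(x - np.max(x))
        return e / e.sum()

    def softmax_jacobian(x):
        # (i, j) entry: p_i (delta_ij - p_j), i.e. diag(p) - p p^T.
        p = softmax(x)
        return np.diag(p) - np.outer(p, p)

    x = np.array([0.5, -1.2, 2.0, 0.3])
    J = softmax_jacobian(x)
    print(np.allclose(J.sum(axis=0), 0.0))  # column sums are 0, as noted above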

6.3 Derivative of the Sigmoid Function

Formula: $\displaystyle\dfrac{\partial \sigma(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x})))$
($\sigma(x) = 1 / (1 + e^{-x})$)
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$, $\sigma$ is applied element-wise
Proof

Since $\sigma$ is applied element-wise, the Jacobian matrix is diagonal. It suffices to compute the derivative of the component $\sigma(x_i)$.

\begin{equation} \sigma(x) = \dfrac{1}{1 + e^{-x}} = (1 + e^{-x})^{-1} \label{eq:6-3-1} \end{equation}

Applying the chain rule: let $u = 1 + e^{-x}$, so $\sigma = u^{-1}$:

\begin{equation} \dfrac{d\sigma}{dx} = -u^{-2} \cdot \dfrac{du}{dx} = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \dfrac{e^{-x}}{(1 + e^{-x})^2} \label{eq:6-3-2} \end{equation}

Rewriting in terms of $\sigma(x)$: from $\sigma(x) = 1/(1+e^{-x})$ we obtain the following relation.

\begin{equation} 1 - \sigma(x) = 1 - \dfrac{1}{1+e^{-x}} = \dfrac{e^{-x}}{1+e^{-x}} \label{eq:6-3-3} \end{equation}

Therefore:

\begin{equation} \dfrac{d\sigma}{dx} = \dfrac{1}{1+e^{-x}} \cdot \dfrac{e^{-x}}{1+e^{-x}} = \sigma(x)(1 - \sigma(x)) \label{eq:6-3-4} \end{equation}

Applying element-wise to a vector, the Jacobian matrix becomes diagonal.

\begin{equation} \dfrac{\partial \sigma(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x}))) \label{eq:6-3-5} \end{equation}

Note: $\sigma'(x) = \sigma(x)(1-\sigma(x))$ attains its maximum value of $1/4$ at $x = 0$ and decays exponentially to 0 as $|x| \to \infty$. This is one of the causes of the vanishing gradient problem in deep learning.
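
As a quick numerical check (a sketch added here, not from the text), the closed form $\sigma(x)(1 - \sigma(x))$ can be compared against a central finite difference:

    import numpy as np

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-5.0, 5.0, 11)
    h = 1e-5
    analytic = sigmoid(x) * (1.0 - sigmoid(x))
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
    print(np.max(np.abs(analytic - numeric)))  # close to zero (finite-difference error only)
    print(analytic.max())                      # 0.25, attained at x = 0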

6.4 Derivative of the tanh Function

Formula: $\displaystyle\dfrac{\partial \tanh(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(1 - \tanh^2(\boldsymbol{x}))$
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$, $\tanh$ is applied element-wise
Proof

We recall the definition of $\tanh$ and its relationship to the sigmoid.

\begin{equation} \tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}} \label{eq:6-4-1} \end{equation}

Applying the quotient rule with $f = e^x - e^{-x}$ and $g = e^x + e^{-x}$:

\begin{equation} \dfrac{d}{dx}\tanh(x) = \dfrac{f'g - fg'}{g^2} = \dfrac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2} \label{eq:6-4-2} \end{equation}

Expanding the numerator:

\begin{equation} (e^x + e^{-x})^2 - (e^x - e^{-x})^2 = 4 e^x e^{-x} = 4 \label{eq:6-4-3} \end{equation}

Here we used the identity $(a+b)^2 - (a-b)^2 = 4ab$ with $a = e^x$, $b = e^{-x}$, so $ab = 1$. Therefore:

\begin{equation} \dfrac{d}{dx}\tanh(x) = \dfrac{4}{(e^x + e^{-x})^2} = 1 - \tanh^2(x) \label{eq:6-4-4} \end{equation}

The last equality can be verified by substituting $\tanh^2(x) = (e^x - e^{-x})^2/(e^x + e^{-x})^2$. Applying element-wise to a vector:

\begin{equation} \dfrac{\partial \tanh(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(1 - \tanh^2(\boldsymbol{x})) \label{eq:6-4-5} \end{equation}

Note: This can also be derived from the sigmoid derivative (6.3) using the relation $\tanh(x) = 2\sigma(2x) - 1$. Since $\tanh'(0) = 1$, the vanishing gradient problem is milder than with the sigmoid.
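
A short numerical check (our addition) of both the derivative $1 - \tanh^2(x)$ and the relation $\tanh(x) = 2\sigma(2x) - 1$ mentioned in the note:

    import numpy as np

    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

    x = np.linspace(-4.0, 4.0, 9)
    h = 1e-5
    numeric = (np.tanh(x + h) - np.tanh(x - h)) / (2.0 * h)
    print(np.allclose(1.0 - np.tanh(x) ** 2, numeric, atol=1e-8))  # derivative formula
    print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))   # relation to sigmoid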

6.5 Derivative of the ReLU Function

Formula: $\displaystyle\dfrac{\partial\, \mathrm{ReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\mathbf{1}_{x_i > 0})$
($\mathrm{ReLU}(x) = \max(0, x)$)
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$; at $x = 0$, the subdifferential is $[0, 1]$ and by convention the value 0 is chosen
Proof

ReLU is a piecewise linear function applied element-wise. We consider each component separately.

\begin{equation} \mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x & (x > 0) \\ 0 & (x \leq 0) \end{cases} \label{eq:6-5-1} \end{equation}

When $x > 0$, $\mathrm{ReLU}(x) = x$ so the derivative is 1. When $x < 0$, $\mathrm{ReLU}(x) = 0$ so the derivative is 0.

\begin{equation} \dfrac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x < 0) \end{cases} = \mathbf{1}_{x > 0} \label{eq:6-5-2} \end{equation}

At $x = 0$, ReLU is not differentiable, but the subdifferential $[0, 1]$ exists. In deep learning implementations, the convention $\mathrm{ReLU}'(0) = 0$ is used.

Applying element-wise to a vector:

\begin{equation} \dfrac{\partial\, \mathrm{ReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\mathbf{1}_{x_i > 0}) \label{eq:6-5-3} \end{equation}

6.6 Derivative of the Leaky ReLU Function

Formula: $\displaystyle\dfrac{\partial\, \mathrm{LeakyReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\mathbf{1}_{x_i > 0} + \alpha \cdot \mathbf{1}_{x_i \leq 0})$
($\mathrm{LeakyReLU}(x) = \max(\alpha x, x)$, $0 < \alpha < 1$)
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$, $\alpha \in (0, 1)$ is a constant (typical value $\alpha = 0.01$)
Proof

Leaky ReLU is a variant of ReLU that has a small gradient $\alpha$ in the negative region.

\begin{equation} \mathrm{LeakyReLU}(x) = \begin{cases} x & (x > 0) \\ \alpha x & (x \leq 0) \end{cases} \label{eq:6-6-1} \end{equation}

Differentiating in each region:

\begin{equation} \dfrac{d}{dx}\mathrm{LeakyReLU}(x) = \begin{cases} 1 & (x > 0) \\ \alpha & (x \leq 0) \end{cases} = \mathbf{1}_{x > 0} + \alpha \cdot \mathbf{1}_{x \leq 0} \label{eq:6-6-2} \end{equation}

At $x = 0$, the left derivative $\alpha$ and the right derivative $1$ differ, so strictly speaking the function is not differentiable there; in practice, the convention $\mathrm{LeakyReLU}'(0) = \alpha$ (with $0 < \alpha < 1$) is used.

Applying element-wise to a vector:

\begin{equation} \dfrac{\partial\, \mathrm{LeakyReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\mathbf{1}_{x_i > 0} + \alpha \cdot \mathbf{1}_{x_i \leq 0}) \label{eq:6-6-3} \end{equation}

Note: When $\alpha = 0$, this reduces to ReLU (6.5). The variant where $\alpha$ is a learnable parameter is called PReLU (Parametric ReLU).
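
A minimal sketch (our own, with illustrative names) of how the ReLU and Leaky ReLU gradients of 6.5 and 6.6 are applied to an upstream gradient; the conventions at $x = 0$ follow the text.

    import numpy as np

    def relu_backward(x, dy):
        # Gradient passes only where x > 0; ReLU'(0) = 0 by convention.
        return dy * (x > 0)

    def leaky_relu_backward(x, dy, alpha=0.01):
        # Gradient is 1 for x > 0 and alpha for x <= 0.
        return dy * np.where(x > 0, 1.0, alpha)

    x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
    dy = np.ones_like(x)
    print(relu_backward(x, dy))        # [0. 0. 0. 1. 1.]
    print(leaky_relu_backward(x, dy))  # 0.01 where x <= 0, 1 where x > 0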

6.7 Derivative of Cross-Entropy Loss (softmax + CE)

Formula: $\displaystyle\dfrac{\partial}{\partial \boldsymbol{x}} \bigl(-\boldsymbol{y}^\top \log \boldsymbol{p}\bigr) = \boldsymbol{p} - \boldsymbol{y}$
($\boldsymbol{p} = \mathrm{softmax}(\boldsymbol{x})$)
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$, $\boldsymbol{y} \in \mathbb{R}^N$ is a one-hot vector ($y_c = 1$, others 0) or a probability distribution ($\displaystyle\sum_i y_i = 1$, $y_i \geq 0$)
Proof

Writing the loss function in component form:

\begin{equation} L = -\boldsymbol{y}^\top \log \boldsymbol{p} = -\displaystyle\sum_{i=0}^{N-1} y_i \log p_i \label{eq:6-7-1} \end{equation}

Taking the partial derivative of $L$ with respect to $x_j$ and applying the chain rule:

\begin{equation} \dfrac{\partial L}{\partial x_j} = -\displaystyle\sum_{i=0}^{N-1} y_i \cdot \dfrac{1}{p_i} \cdot \dfrac{\partial p_i}{\partial x_j} \label{eq:6-7-2} \end{equation}

Substituting the softmax Jacobian (6.2, equation \eqref{eq:6-2-4}): since $\partial p_i / \partial x_j = p_i(\delta_{ij} - p_j)$:

\begin{equation} \dfrac{\partial L}{\partial x_j} = -\displaystyle\sum_{i} y_i \cdot \dfrac{1}{p_i} \cdot p_i(\delta_{ij} - p_j) = -\displaystyle\sum_{i} y_i (\delta_{ij} - p_j) \label{eq:6-7-3} \end{equation}

Expanding the sum: since $\displaystyle\sum_i y_i \delta_{ij} = y_j$:

\begin{equation} \dfrac{\partial L}{\partial x_j} = -y_j + p_j \displaystyle\sum_{i} y_i \label{eq:6-7-4} \end{equation}

When $\boldsymbol{y}$ is a probability distribution, $\displaystyle\sum_i y_i = 1$, so:

\begin{equation} \dfrac{\partial L}{\partial x_j} = p_j - y_j \label{eq:6-7-5} \end{equation}

In vector form:

\begin{equation} \dfrac{\partial L}{\partial \boldsymbol{x}} = \boldsymbol{p} - \boldsymbol{y} \label{eq:6-7-6} \end{equation}

Note: This concise result is specific to the combination of softmax and cross-entropy, and is also numerically stable. In implementations, the log-sum-exp trick is used to compute $\log \mathrm{softmax}$ directly.
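
A small sketch (ours; the function name is illustrative) of the numerically stable loss via log-sum-exp and the gradient $\boldsymbol{p} - \boldsymbol{y}$, verified against a finite difference:

    import numpy as np

    def softmax_ce(x, y):
        # log softmax via the log-sum-exp trick, then L = -y^T log p.
        s = x - np.max(x)
        log_p = s - np.log(np.sum(np.exp(s)))
        return -np.dot(y, log_p), np.exp(log_p) - y   # loss, gradient p - y

    x = np.array([1.0, -0.5, 2.0])
    y = np.array([0.0, 0.0, 1.0])   # one-hot target
    loss, grad = softmax_ce(x, y)

    # Finite-difference check of dL/dx.
    h = 1e-6
    numeric = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x); e[i] = h
        numeric[i] = (softmax_ce(x + e, y)[0] - softmax_ce(x - e, y)[0]) / (2.0 * h)
    print(np.allclose(grad, numeric, atol=1e-6))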

6.8 Derivative of Binary Cross-Entropy Loss (sigmoid + BCE)

Formula: $\displaystyle\dfrac{\partial}{\partial x} \mathrm{BCE}(y, \sigma(x)) = \sigma(x) - y$
($\mathrm{BCE} = -y\log\sigma(x) - (1-y)\log(1-\sigma(x))$)
Conditions: $x \in \mathbb{R}$ (scalar), $y \in \{0, 1\}$ is the label
Proof

Writing the loss function explicitly, let $p = \sigma(x)$.

\begin{equation} L = -y \log p - (1 - y)\log(1 - p) \label{eq:6-8-1} \end{equation}

Applying the chain rule with $dp/dx = \sigma(x)(1 - \sigma(x)) = p(1-p)$ (6.3):

\begin{equation} \dfrac{dL}{dx} = \left(-\dfrac{y}{p} + \dfrac{1 - y}{1 - p}\right) \dfrac{dp}{dx} = \left(-\dfrac{y}{p} + \dfrac{1 - y}{1 - p}\right) p(1 - p) \label{eq:6-8-2} \end{equation}

Simplifying each term:

\begin{equation} \dfrac{dL}{dx} = -y(1 - p) + (1 - y)p = -y + yp + p - yp = p - y \label{eq:6-8-3} \end{equation}

Therefore:

\begin{equation} \dfrac{dL}{dx} = \sigma(x) - y \label{eq:6-8-4} \end{equation}

Note: This has the same $\boldsymbol{p} - \boldsymbol{y}$ form as the multi-class case (6.7). Binary classification can be viewed as a special case of softmax with $N = 2$.

6.9 Derivative of the GELU Function

Formula: $\displaystyle\dfrac{\partial\, \mathrm{GELU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\Phi(\boldsymbol{x}) + \boldsymbol{x} \odot \phi(\boldsymbol{x}))$
($\mathrm{GELU}(x) = x \cdot \Phi(x)$, $\Phi$: standard normal CDF, $\phi$: standard normal PDF)
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$, $\Phi(x) = \dfrac{1}{2}\bigl[1 + \mathrm{erf}(x/\sqrt{2})\bigr]$, $\phi(x) = \dfrac{1}{\sqrt{2\pi}} e^{-x^2/2}$
Proof

We recall the definition of GELU (Gaussian Error Linear Unit).

\begin{equation} \mathrm{GELU}(x) = x \, \Phi(x) \label{eq:6-9-1} \end{equation}

Applying the product rule: since $\Phi'(x) = \phi(x)$ (the derivative of the CDF is the PDF):

\begin{equation} \dfrac{d}{dx}\mathrm{GELU}(x) = \Phi(x) + x \, \phi(x) \label{eq:6-9-2} \end{equation}

Applying element-wise to a vector:

\begin{equation} \dfrac{\partial\, \mathrm{GELU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\Phi(\boldsymbol{x}) + \boldsymbol{x} \odot \phi(\boldsymbol{x})) \label{eq:6-9-3} \end{equation}

Note: GELU is widely used in Transformer models such as BERT and GPT. It has the asymptotic behavior $\mathrm{GELU}(x) \to x$ (identity) as $x \to +\infty$ and $\mathrm{GELU}(x) \to 0$ as $x \to -\infty$, and can be viewed as a smooth approximation of ReLU.
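
A scalar sketch (ours) of GELU and its derivative $\Phi(x) + x\,\phi(x)$, using math.erf for the normal CDF and a finite-difference check:

    import math

    def gelu(x):
        # GELU(x) = x * Phi(x), with Phi the standard normal CDF.
        return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

    def gelu_grad(x):
        phi = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)  # standard normal PDF
        Phi = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))         # standard normal CDF
        return Phi + x * phi

    h = 1e-6
    for x in (-2.0, -0.5, 0.0, 1.5):
        numeric = (gelu(x + h) - gelu(x - h)) / (2.0 * h)
        print(abs(gelu_grad(x) - numeric) < 1e-6)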

6.10 Derivative of the Swish (SiLU) Function

Formula: $\displaystyle\dfrac{\partial\, \mathrm{Swish}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\sigma(\boldsymbol{x}) + \boldsymbol{x} \odot \sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x})))$
($\mathrm{Swish}(x) = x \cdot \sigma(x)$)
Conditions: $\boldsymbol{x} \in \mathbb{R}^N$, $\sigma$ is the sigmoid function
Proof

We recall the definition of Swish (SiLU: Sigmoid Linear Unit).

\begin{equation} \mathrm{Swish}(x) = x \, \sigma(x) \label{eq:6-10-1} \end{equation}

Applying the product rule: since $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ (6.3):

\begin{equation} \dfrac{d}{dx}\mathrm{Swish}(x) = \sigma(x) + x \, \sigma'(x) = \sigma(x) + x \, \sigma(x)(1 - \sigma(x)) \label{eq:6-10-2} \end{equation}

Factoring out $\sigma(x)$, this can also be written as:

\begin{equation} \dfrac{d}{dx}\mathrm{Swish}(x) = \sigma(x)\bigl[1 + x(1 - \sigma(x))\bigr] \label{eq:6-10-3} \end{equation}

Applying element-wise to a vector:

\begin{equation} \dfrac{\partial\, \mathrm{Swish}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\sigma(\boldsymbol{x}) + \boldsymbol{x} \odot \sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x}))) \label{eq:6-10-4} \end{equation}

Note: Swish has a similar shape to GELU. While GELU is $x \cdot \Phi(x)$, Swish is $x \cdot \sigma(x)$; since $\Phi(x)$ is well approximated by $\sigma(1.702\,x)$, the two functions are close. Swish was adopted in EfficientNet and other architectures.

Fully Connected Layer, Batch Normalization, and Attention

Backpropagation formulas for the fundamental building blocks of neural networks.

17.1 Weight Gradient of the Fully Connected Layer

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{W}} = \boldsymbol{X}^\top \displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$
Conditions: Forward pass $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W} + \boldsymbol{1}_N \boldsymbol{b}^\top$, $\boldsymbol{X} \in \mathbb{R}^{N \times D_{\text{in}}}$, $\boldsymbol{W} \in \mathbb{R}^{D_{\text{in}} \times D_{\text{out}}}$
Proof

The forward pass of a fully connected layer is expressed as a matrix product.

\begin{equation}\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W} + \boldsymbol{1}_N \boldsymbol{b}^\top \label{eq:17-1-1}\end{equation}

Writing \eqref{eq:17-1-1} in component form, the $(n, j)$ entry of the output is:

\begin{equation}Y_{nj} = \displaystyle\sum_{k=0}^{D_{\text{in}}-1} X_{nk} W_{kj} + b_j \label{eq:17-1-2}\end{equation}

In \eqref{eq:17-1-2}, $Y_{nj}$ depends on $W_{ij}$ only when $k = i$. Computing the partial derivative:

\begin{equation}\dfrac{\partial Y_{nj}}{\partial W_{ij}} = \dfrac{\partial}{\partial W_{ij}} \left( \displaystyle\sum_{k} X_{nk} W_{kj} + b_j \right) = X_{ni} \label{eq:17-1-3}\end{equation}

The loss $L$ depends on $\boldsymbol{W}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial W_{ij}} = \displaystyle\sum_{n,k} \dfrac{\partial L}{\partial Y_{nk}} \dfrac{\partial Y_{nk}}{\partial W_{ij}} \label{eq:17-1-4}\end{equation}

From \eqref{eq:17-1-3}, $\displaystyle\dfrac{\partial Y_{nk}}{\partial W_{ij}}$ is nonzero only when $k = j$. Thus \eqref{eq:17-1-4} simplifies to:

\begin{equation}\dfrac{\partial L}{\partial W_{ij}} = \displaystyle\sum_n \dfrac{\partial L}{\partial Y_{nj}} \cdot X_{ni} \label{eq:17-1-5}\end{equation}

Interpreting \eqref{eq:17-1-5} as a matrix product entry: since $(\boldsymbol{X}^\top)_{in} = X_{ni}$:

\begin{equation}\dfrac{\partial L}{\partial W_{ij}} = \displaystyle\sum_n (\boldsymbol{X}^\top)_{in} \left(\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{nj} = \left( \boldsymbol{X}^\top \dfrac{\partial L}{\partial \boldsymbol{Y}} \right)_{ij} \label{eq:17-1-6}\end{equation}

Since \eqref{eq:17-1-6} holds for all $(i, j)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{W}} = \boldsymbol{X}^\top \dfrac{\partial L}{\partial \boldsymbol{Y}} \label{eq:17-1-7}\end{equation}

Note: \eqref{eq:17-1-7} is the matrix product of the input $\boldsymbol{X}$ and the output gradient $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$. The gradients from $N$ samples in the batch are automatically accumulated.

17.2 Bias Gradient of the Fully Connected Layer

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{b}} = \displaystyle\sum_{n=0}^{N-1} \left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{n,:}^\top$
Conditions: Forward pass $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W} + \boldsymbol{1}_N \boldsymbol{b}^\top$, $\boldsymbol{b} \in \mathbb{R}^{D_{\text{out}}}$
Proof

From \eqref{eq:17-1-2} of 17.1, the component representation of the output is:

\begin{equation}Y_{nj} = \displaystyle\sum_{k} X_{nk} W_{kj} + b_j \label{eq:17-2-1}\end{equation}

In \eqref{eq:17-2-1}, $Y_{nj}$ depends on $b_i$ only when $j = i$. Computing the partial derivative:

\begin{equation}\dfrac{\partial Y_{nj}}{\partial b_i} = \dfrac{\partial}{\partial b_i} \left( \displaystyle\sum_{k} X_{nk} W_{kj} + b_j \right) = \delta_{ij} \label{eq:17-2-2}\end{equation}

Here $\delta_{ij}$ is the Kronecker delta, equal to 1 when $j = i$ and 0 otherwise.

The loss $L$ depends on $\boldsymbol{b}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial b_i} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \dfrac{\partial Y_{nj}}{\partial b_i} \label{eq:17-2-3}\end{equation}

Substituting \eqref{eq:17-2-2} into \eqref{eq:17-2-3}: $\delta_{ij}$ retains only the $j = i$ terms.

\begin{equation}\dfrac{\partial L}{\partial b_i} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \delta_{ij} = \displaystyle\sum_n \dfrac{\partial L}{\partial Y_{ni}} \label{eq:17-2-4}\end{equation}

\eqref{eq:17-2-4} is a sum over the batch direction ($n = 0, 1, \ldots, N-1$). In vector form, $\left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{n,:}$ denotes the $n$-th row (row vector) of the gradient matrix.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{b}} = \displaystyle\sum_{n=0}^{N-1} \left(\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{n,:}^\top \label{eq:17-2-5}\end{equation}

Note: \eqref{eq:17-2-5} shows that gradient contributions from all samples in the batch are accumulated. In NumPy, this is computed as np.sum(dY, axis=0).

17.3 Input Gradient of the Fully Connected Layer

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{X}} = \displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}} \boldsymbol{W}^\top$
Conditions: Forward pass $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W}$, $\boldsymbol{X} \in \mathbb{R}^{N \times D_{\text{in}}}$
Proof

Writing the forward pass in component form (the bias term is omitted since it does not affect the input gradient):

\begin{equation}Y_{nj} = \displaystyle\sum_{k=0}^{D_{\text{in}}-1} X_{nk} W_{kj} \label{eq:17-3-1}\end{equation}

In \eqref{eq:17-3-1}, $Y_{nj}$ depends on $X_{mi}$ only when $n = m$ and $k = i$. Computing the partial derivative:

\begin{equation}\dfrac{\partial Y_{nj}}{\partial X_{mi}} = \delta_{nm} W_{ij} \label{eq:17-3-2}\end{equation}

Here $\delta_{nm}$ is the Kronecker delta, equal to 1 when $n = m$ and 0 otherwise.

The loss $L$ depends on $\boldsymbol{X}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial X_{mi}} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \dfrac{\partial Y_{nj}}{\partial X_{mi}} \label{eq:17-3-3}\end{equation}

Substituting \eqref{eq:17-3-2} into \eqref{eq:17-3-3}: $\delta_{nm}$ retains only the $n = m$ terms.

\begin{equation}\dfrac{\partial L}{\partial X_{mi}} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \delta_{nm} W_{ij} = \displaystyle\sum_j \dfrac{\partial L}{\partial Y_{mj}} W_{ij} \label{eq:17-3-4}\end{equation}

Interpreting \eqref{eq:17-3-4} as a matrix product entry: since $(\boldsymbol{W}^\top)_{ji} = W_{ij}$:

\begin{equation}\dfrac{\partial L}{\partial X_{mi}} = \displaystyle\sum_j \left(\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{mj} (\boldsymbol{W}^\top)_{ji} = \left( \dfrac{\partial L}{\partial \boldsymbol{Y}} \boldsymbol{W}^\top \right)_{mi} \label{eq:17-3-5}\end{equation}

Since \eqref{eq:17-3-5} holds for all $(m, i)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{X}} = \dfrac{\partial L}{\partial \boldsymbol{Y}} \boldsymbol{W}^\top \label{eq:17-3-6}\end{equation}

Note: \eqref{eq:17-3-6} is the core of backpropagation. It transforms the output gradient $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$ by the transpose of the weight matrix $\boldsymbol{W}^\top$ and propagates it to the previous layer. In multi-layer networks, this operation is repeated at each layer.
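
Together, 17.1–17.3 give the complete backward pass of a fully connected layer. A compact NumPy sketch (our illustration; the function names are hypothetical):

    import numpy as np

    def linear_forward(X, W, b):
        # Y = X W + 1 b^T, with b broadcast over the batch dimension.
        return X @ W + b

    def linear_backward(X, W, dY):
        dW = X.T @ dY        # 17.1: weight gradient
        db = dY.sum(axis=0)  # 17.2: bias gradient
        dX = dY @ W.T        # 17.3: input gradient, passed to the previous layer
        return dW, db, dX

    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 3))    # batch size N = 4, D_in = 3
    W = rng.standard_normal((3, 2))    # D_out = 2
    b = rng.standard_normal(2)
    Y = linear_forward(X, W, b)
    dY = rng.standard_normal((4, 2))   # upstream gradient dL/dY
    dW, db, dX = linear_backward(X, W, dY)
    print(dW.shape, db.shape, dX.shape)  # (3, 2) (2,) (4, 3)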

17.4 Scale Gradient of Batch Normalization

Formula: $\displaystyle\dfrac{\partial L}{\partial \gamma} = \displaystyle\sum_n \displaystyle\dfrac{\partial L}{\partial y_n} \hat{x}_n$
Conditions: $y_n = \gamma \hat{x}_n + \beta$, $\hat{x}_n = \displaystyle\dfrac{x_n - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Proof

The output of batch normalization is obtained by transforming the normalized value $\hat{x}_n$ with scale $\gamma$ and shift $\beta$.

\begin{equation}y_n = \gamma \hat{x}_n + \beta \label{eq:17-4-1}\end{equation}

In \eqref{eq:17-4-1}, $\gamma$ is a parameter shared across all samples. Taking the partial derivative of $y_n$ with respect to $\gamma$:

\begin{equation}\dfrac{\partial y_n}{\partial \gamma} = \dfrac{\partial}{\partial \gamma} (\gamma \hat{x}_n + \beta) = \hat{x}_n \label{eq:17-4-2}\end{equation}

The loss $L$ depends on $\gamma$ through the outputs $y_n$ ($n = 0, 1, \ldots, N-1$). Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial \gamma} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \dfrac{\partial y_n}{\partial \gamma} \label{eq:17-4-3}\end{equation}

Substituting \eqref{eq:17-4-2} into \eqref{eq:17-4-3} gives the final result.

\begin{equation}\dfrac{\partial L}{\partial \gamma} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \hat{x}_n \label{eq:17-4-4}\end{equation}

Note: \eqref{eq:17-4-4} is the inner product (sum over the batch direction) of the output gradient $\displaystyle\dfrac{\partial L}{\partial y_n}$ and the normalized input $\hat{x}_n$.

17.5 Shift Gradient of Batch Normalization

Formula: $\displaystyle\dfrac{\partial L}{\partial \beta} = \displaystyle\sum_n \displaystyle\dfrac{\partial L}{\partial y_n}$
Conditions: $y_n = \gamma \hat{x}_n + \beta$
Proof

From \eqref{eq:17-4-1} of 17.4, the output of batch normalization is:

\begin{equation}y_n = \gamma \hat{x}_n + \beta \label{eq:17-5-1}\end{equation}

In \eqref{eq:17-5-1}, $\beta$ is a parameter shared across all samples. Taking the partial derivative of $y_n$ with respect to $\beta$:

\begin{equation}\dfrac{\partial y_n}{\partial \beta} = \dfrac{\partial}{\partial \beta} (\gamma \hat{x}_n + \beta) = 1 \label{eq:17-5-2}\end{equation}

The loss $L$ depends on $\beta$ through the outputs $y_n$ ($n = 0, 1, \ldots, N-1$). Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial \beta} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \dfrac{\partial y_n}{\partial \beta} \label{eq:17-5-3}\end{equation}

Substituting \eqref{eq:17-5-2} into \eqref{eq:17-5-3} gives the final result.

\begin{equation}\dfrac{\partial L}{\partial \beta} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \cdot 1 = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \label{eq:17-5-4}\end{equation}

Note: \eqref{eq:17-5-4} is simply the sum of the output gradients over the batch direction. It has the same structure as the bias gradient in 17.2.

17.6 Input Gradient of Batch Normalization

Formula: $\displaystyle\dfrac{\partial L}{\partial x_i} = \displaystyle\dfrac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \left( \displaystyle\dfrac{\partial L}{\partial y_i} - \displaystyle\dfrac{1}{N}\displaystyle\sum_j \displaystyle\dfrac{\partial L}{\partial y_j} - \displaystyle\dfrac{\hat{x}_i}{N}\displaystyle\sum_j \displaystyle\dfrac{\partial L}{\partial y_j}\hat{x}_j \right)$
Conditions: $\mu = \displaystyle\dfrac{1}{N}\displaystyle\sum_n x_n$, $\sigma^2 = \displaystyle\dfrac{1}{N}\displaystyle\sum_n (x_n - \mu)^2$, $\hat{x}_n = \displaystyle\dfrac{x_n - \mu}{\sqrt{\sigma^2 + \epsilon}}$
Proof

In batch normalization, the input $x_i$ affects the batch statistics $\mu$, $\sigma^2$ and the normalized value $\hat{x}_i$. The computation graph is as follows.

\begin{equation}\mu = \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} x_n \label{eq:17-6-1}\end{equation}

\begin{equation}\sigma^2 = \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} (x_n - \mu)^2 \label{eq:17-6-2}\end{equation}

\begin{equation}\hat{x}_n = \dfrac{x_n - \mu}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-3}\end{equation}

\begin{equation}y_n = \gamma \hat{x}_n + \beta \label{eq:17-6-4}\end{equation}

[Step 1: Gradient with respect to $\hat{x}_i$]

From \eqref{eq:17-6-4}, taking the partial derivative of $y_i$ with respect to $\hat{x}_i$:

\begin{equation}\dfrac{\partial y_i}{\partial \hat{x}_i} = \gamma \label{eq:17-6-5}\end{equation}

By the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial \hat{x}_i} = \dfrac{\partial L}{\partial y_i} \dfrac{\partial y_i}{\partial \hat{x}_i} = \dfrac{\partial L}{\partial y_i} \gamma \label{eq:17-6-6}\end{equation}

[Step 2: Gradient with respect to $\sigma^2$]

From \eqref{eq:17-6-3}, taking the partial derivative of $\hat{x}_n$ with respect to $\sigma^2$: since $\sqrt{\sigma^2 + \epsilon} = (\sigma^2 + \epsilon)^{1/2}$:

\begin{equation}\dfrac{\partial \hat{x}_n}{\partial \sigma^2} = (x_n - \mu) \cdot \left(-\dfrac{1}{2}\right)(\sigma^2 + \epsilon)^{-3/2} \label{eq:17-6-7}\end{equation}

Applying the chain rule (1.26) over all $n$ and taking the sum:

\begin{equation}\dfrac{\partial L}{\partial \sigma^2} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} \dfrac{\partial \hat{x}_n}{\partial \sigma^2} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} (x_n - \mu) \cdot \left(-\dfrac{1}{2}\right)(\sigma^2 + \epsilon)^{-3/2} \label{eq:17-6-8}\end{equation}

[Step 3: Gradient with respect to $\mu$]

$\mu$ affects $\hat{x}_n$ directly and also indirectly through $\sigma^2$. First, computing the direct effect on $\hat{x}_n$: from \eqref{eq:17-6-3}:

\begin{equation}\dfrac{\partial \hat{x}_n}{\partial \mu} = \dfrac{-1}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-9}\end{equation}

Next, computing the effect through $\sigma^2$: from \eqref{eq:17-6-2}:

\begin{equation}\dfrac{\partial \sigma^2}{\partial \mu} = \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} 2(x_n - \mu)(-1) = \dfrac{-2}{N}\displaystyle\sum_{n=0}^{N-1}(x_n - \mu) \label{eq:17-6-10}\end{equation}

Since $\displaystyle\sum_n (x_n - \mu) = \displaystyle\sum_n x_n - N\mu = N\mu - N\mu = 0$, \eqref{eq:17-6-10} equals 0.

\begin{equation}\dfrac{\partial \sigma^2}{\partial \mu} = 0 \label{eq:17-6-11}\end{equation}

Therefore, the gradient with respect to $\mu$ comes only from the direct path.

\begin{equation}\dfrac{\partial L}{\partial \mu} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} \dfrac{\partial \hat{x}_n}{\partial \mu} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} \cdot \dfrac{-1}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-12}\end{equation}

[Step 4: Gradient with respect to $x_i$]

$x_i$ affects $\hat{x}_i$, $\mu$, and $\sigma^2$. We compute the contribution from each path.

Direct effect on $\hat{x}_i$: from \eqref{eq:17-6-3}:

\begin{equation}\dfrac{\partial \hat{x}_i}{\partial x_i} = \dfrac{1}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-13}\end{equation}

Effect on $\mu$: from \eqref{eq:17-6-1}:

\begin{equation}\dfrac{\partial \mu}{\partial x_i} = \dfrac{1}{N} \label{eq:17-6-14}\end{equation}

Effect on $\sigma^2$: from \eqref{eq:17-6-2}:

\begin{equation}\dfrac{\partial \sigma^2}{\partial x_i} = \dfrac{1}{N} \cdot 2(x_i - \mu) = \dfrac{2(x_i - \mu)}{N} \label{eq:17-6-15}\end{equation}

Summing all paths using the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{\partial L}{\partial \hat{x}_i} \dfrac{\partial \hat{x}_i}{\partial x_i} + \dfrac{\partial L}{\partial \mu} \dfrac{\partial \mu}{\partial x_i} + \dfrac{\partial L}{\partial \sigma^2} \dfrac{\partial \sigma^2}{\partial x_i} \label{eq:17-6-16}\end{equation}

Substituting \eqref{eq:17-6-6}, \eqref{eq:17-6-12}, \eqref{eq:17-6-8}, \eqref{eq:17-6-13}, \eqref{eq:17-6-14}, \eqref{eq:17-6-15} into \eqref{eq:17-6-16}:

\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{\partial L}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}} + \dfrac{\partial L}{\partial \mu} \cdot \dfrac{1}{N} + \dfrac{\partial L}{\partial \sigma^2} \cdot \dfrac{2(x_i - \mu)}{N} \label{eq:17-6-17}\end{equation}

Simplifying \eqref{eq:17-6-17}: from \eqref{eq:17-6-6}, $\displaystyle\dfrac{\partial L}{\partial \hat{x}_i} = \gamma \displaystyle\dfrac{\partial L}{\partial y_i}$. Also, from $\hat{x}_i = \displaystyle\dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, we have $(x_i - \mu) = \hat{x}_i \sqrt{\sigma^2 + \epsilon}$. Substituting these and simplifying gives the final result.

\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \left( \dfrac{\partial L}{\partial y_i} - \dfrac{1}{N}\displaystyle\sum_{j=0}^{N-1} \dfrac{\partial L}{\partial y_j} - \dfrac{\hat{x}_i}{N}\displaystyle\sum_{j=0}^{N-1} \dfrac{\partial L}{\partial y_j}\hat{x}_j \right) \label{eq:17-6-18}\end{equation}

Note: The three terms in \eqref{eq:17-6-18} represent contributions from the direct path, the mean path, and the variance path, respectively. This complex gradient accounts for the indirect dependency through the batch statistics $\mu$ and $\sigma^2$, not just the normalized value $\hat{x}_i$.
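
Combining 17.4–17.6, the backward pass of batch normalization for a single feature can be sketched as follows (our code, written directly from the equations above; names are illustrative):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        mu = x.mean()
        var = x.var()                        # biased variance, 1/N as in (17-6-2)
        xhat = (x - mu) / np.sqrt(var + eps)
        return gamma * xhat + beta, xhat, var

    def batchnorm_backward(dy, xhat, var, gamma, eps=1e-5):
        N = dy.shape[0]
        dgamma = np.sum(dy * xhat)           # 17.4
        dbeta = np.sum(dy)                   # 17.5
        # 17.6: direct path, mean path, and variance path combined.
        dx = gamma / np.sqrt(var + eps) * (
            dy - dbeta / N - xhat * np.sum(dy * xhat) / N)
        return dx, dgamma, dbeta

    rng = np.random.default_rng(1)
    x = rng.standard_normal(8)
    dy = rng.standard_normal(8)
    y, xhat, var = batchnorm_forward(x, gamma=1.5, beta=0.2)
    dx, dgamma, dbeta = batchnorm_backward(dy, xhat, var, gamma=1.5)
    print(dx.shape, dgamma, dbeta)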

17.7 Input Gradient of Layer Normalization

Formula: $\displaystyle\dfrac{\partial L}{\partial x_i} = \displaystyle\dfrac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \left( \displaystyle\dfrac{\partial L}{\partial y_i} - \displaystyle\dfrac{1}{D}\displaystyle\sum_j \displaystyle\dfrac{\partial L}{\partial y_j} - \displaystyle\dfrac{\hat{x}_i}{D}\displaystyle\sum_j \displaystyle\dfrac{\partial L}{\partial y_j}\hat{x}_j \right)$
Conditions: $\mu$, $\sigma^2$ are within-sample statistics (mean and variance over the feature dimension), $D$ is the feature dimension
Proof

In layer normalization, the statistics are computed over the feature dimension rather than the batch dimension. The computation within a single sample is as follows.

\begin{equation}\mu = \dfrac{1}{D}\displaystyle\sum_{d=0}^{D-1} x_d \label{eq:17-7-1}\end{equation}

\begin{equation}\sigma^2 = \dfrac{1}{D}\displaystyle\sum_{d=0}^{D-1} (x_d - \mu)^2 \label{eq:17-7-2}\end{equation}

\begin{equation}\hat{x}_d = \dfrac{x_d - \mu}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-7-3}\end{equation}

\begin{equation}y_d = \gamma_d \hat{x}_d + \beta_d \label{eq:17-7-4}\end{equation}

The derivation follows the same pattern as 17.6. The only difference from batch normalization is the direction over which statistics are computed.

In the derivation from \eqref{eq:17-6-1} through \eqref{eq:17-6-18} of 17.6, we replace the batch size $N$ with the feature dimension $D$. Each step of the computation is otherwise identical, and the final result is:

\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \left( \dfrac{\partial L}{\partial y_i} - \dfrac{1}{D}\displaystyle\sum_{j=0}^{D-1} \dfrac{\partial L}{\partial y_j} - \dfrac{\hat{x}_i}{D}\displaystyle\sum_{j=0}^{D-1} \dfrac{\partial L}{\partial y_j}\hat{x}_j \right) \label{eq:17-7-5}\end{equation}

Note: Layer Normalization is standard in Transformers. In practice the scale and shift are per-dimension vectors $\gamma_d$, $\beta_d$, as written in \eqref{eq:17-7-4}; in that case $\displaystyle\dfrac{\partial L}{\partial \hat{x}_j} = \gamma_j \dfrac{\partial L}{\partial y_j}$, so each $\dfrac{\partial L}{\partial y_j}$ in \eqref{eq:17-7-5} is replaced by $\gamma_j \dfrac{\partial L}{\partial y_j}$ and the common prefactor $\gamma$ is dropped. Since the statistics do not depend on batch size, layer normalization is well suited to processing variable-length sequences.
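
A per-sample sketch (ours) that follows the note above, with per-dimension scale and shift vectors:

    import numpy as np

    def layernorm_forward(x, gamma, beta, eps=1e-5):
        mu, var = x.mean(), x.var()           # statistics over the feature dimension D
        xhat = (x - mu) / np.sqrt(var + eps)
        return gamma * xhat + beta, xhat, var

    def layernorm_backward(dy, xhat, var, gamma, eps=1e-5):
        dxhat = dy * gamma                    # dL/dxhat_j = gamma_j dL/dy_j
        dx = (dxhat - dxhat.mean()
              - xhat * np.mean(dxhat * xhat)) / np.sqrt(var + eps)
        dgamma = dy * xhat                    # per-dimension scale gradient (one sample)
        dbeta = dy.copy()                     # per-dimension shift gradient (one sample)
        return dx, dgamma, dbeta

    rng = np.random.default_rng(2)
    D = 6
    x, dy = rng.standard_normal(D), rng.standard_normal(D)
    gamma, beta = np.ones(D), np.zeros(D)
    y, xhat, var = layernorm_forward(x, gamma, beta)
    dx, dgamma, dbeta = layernorm_backward(dy, xhat, var, gamma)
    print(dx.shape, dgamma.shape, dbeta.shape)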

17.8 Value Gradient of Attention

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{V}} = \boldsymbol{A}^\top \displaystyle\dfrac{\partial L}{\partial \boldsymbol{O}}$
Conditions: $\boldsymbol{O} = \boldsymbol{A}\boldsymbol{V}$, $\boldsymbol{A} = \text{softmax}(\boldsymbol{Q}\boldsymbol{K}^\top / \sqrt{d_k}) \in \mathbb{R}^{n \times n}$, $\boldsymbol{V} \in \mathbb{R}^{n \times d_v}$
Proof

The attention output is defined as the matrix product of attention weights $\boldsymbol{A}$ and values $\boldsymbol{V}$.

\begin{equation}\boldsymbol{O} = \boldsymbol{A}\boldsymbol{V} \label{eq:17-8-1}\end{equation}

Writing \eqref{eq:17-8-1} in component form, the $(i, j)$ entry of the output is:

\begin{equation}O_{ij} = \displaystyle\sum_{k=0}^{n-1} A_{ik} V_{kj} \label{eq:17-8-2}\end{equation}

In \eqref{eq:17-8-2}, $O_{ij}$ depends on $V_{pq}$ only when $k = p$ and $j = q$. Computing the partial derivative:

\begin{equation}\dfrac{\partial O_{ij}}{\partial V_{pq}} = A_{ip} \delta_{jq} \label{eq:17-8-3}\end{equation}

Here $\delta_{jq}$ is the Kronecker delta, equal to 1 when $j = q$ and 0 otherwise.

The loss $L$ depends on $\boldsymbol{V}$ through the output $\boldsymbol{O}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial V_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} \dfrac{\partial O_{ij}}{\partial V_{pq}} \label{eq:17-8-4}\end{equation}

Substituting \eqref{eq:17-8-3} into \eqref{eq:17-8-4}: $\delta_{jq}$ retains only the $j = q$ terms.

\begin{equation}\dfrac{\partial L}{\partial V_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} A_{ip} \delta_{jq} = \displaystyle\sum_i A_{ip} \dfrac{\partial L}{\partial O_{iq}} \label{eq:17-8-5}\end{equation}

Interpreting \eqref{eq:17-8-5} as a matrix product entry: since $(\boldsymbol{A}^\top)_{pi} = A_{ip}$:

\begin{equation}\dfrac{\partial L}{\partial V_{pq}} = \displaystyle\sum_i (\boldsymbol{A}^\top)_{pi} \left(\dfrac{\partial L}{\partial \boldsymbol{O}}\right)_{iq} = \left( \boldsymbol{A}^\top \dfrac{\partial L}{\partial \boldsymbol{O}} \right)_{pq} \label{eq:17-8-6}\end{equation}

Since \eqref{eq:17-8-6} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{V}} = \boldsymbol{A}^\top \dfrac{\partial L}{\partial \boldsymbol{O}} \label{eq:17-8-7}\end{equation}

Note: \eqref{eq:17-8-7} has the same structure as the weight gradient in 17.1. The transpose of the attention weights $\boldsymbol{A}$ plays the role of the input $\boldsymbol{X}$.

17.9 Gradient of Attention Weights

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{A}} = \displaystyle\dfrac{\partial L}{\partial \boldsymbol{O}} \boldsymbol{V}^\top$
Conditions: $\boldsymbol{O} = \boldsymbol{A}\boldsymbol{V}$
Proof

From \eqref{eq:17-8-2} of 17.8, the component representation of the attention output is:

\begin{equation}O_{ij} = \displaystyle\sum_{k=0}^{n-1} A_{ik} V_{kj} \label{eq:17-9-1}\end{equation}

In \eqref{eq:17-9-1}, $O_{ij}$ depends on $A_{pq}$ only when $i = p$ and $k = q$. Computing the partial derivative:

\begin{equation}\dfrac{\partial O_{ij}}{\partial A_{pq}} = \delta_{ip} V_{qj} \label{eq:17-9-2}\end{equation}

Here $\delta_{ip}$ is the Kronecker delta.

The loss $L$ depends on $\boldsymbol{A}$ through the output $\boldsymbol{O}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial A_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} \dfrac{\partial O_{ij}}{\partial A_{pq}} \label{eq:17-9-3}\end{equation}

Substituting \eqref{eq:17-9-2} into \eqref{eq:17-9-3}: $\delta_{ip}$ retains only the $i = p$ terms.

\begin{equation}\dfrac{\partial L}{\partial A_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} \delta_{ip} V_{qj} = \displaystyle\sum_j \dfrac{\partial L}{\partial O_{pj}} V_{qj} \label{eq:17-9-4}\end{equation}

Interpreting \eqref{eq:17-9-4} as a matrix product entry: since $(\boldsymbol{V}^\top)_{qj} = V_{qj}$:

\begin{equation}\dfrac{\partial L}{\partial A_{pq}} = \displaystyle\sum_j \left(\dfrac{\partial L}{\partial \boldsymbol{O}}\right)_{pj} (\boldsymbol{V}^\top)_{qj} = \left( \dfrac{\partial L}{\partial \boldsymbol{O}} \boldsymbol{V}^\top \right)_{pq} \label{eq:17-9-5}\end{equation}

Since \eqref{eq:17-9-5} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{A}} = \dfrac{\partial L}{\partial \boldsymbol{O}} \boldsymbol{V}^\top \label{eq:17-9-6}\end{equation}

Note: \eqref{eq:17-9-6} has the same structure as the input gradient in 17.3. This gradient is backpropagated to the Query and Key through the softmax.

17.10 Score Gradient (pre-softmax)

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{S}} = \boldsymbol{A} \odot \left( \displaystyle\dfrac{\partial L}{\partial \boldsymbol{A}} - \text{rowsum}\left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{A}} \odot \boldsymbol{A}\right) \boldsymbol{1}^\top \right)$
Conditions: $\boldsymbol{A} = \text{softmax}(\boldsymbol{S})$ (softmax applied row-wise), $\boldsymbol{S} = \boldsymbol{Q}\boldsymbol{K}^\top / \sqrt{d_k}$
Proof

The attention weights $\boldsymbol{A}$ are computed as the row-wise softmax of the scores $\boldsymbol{S}$.

\begin{equation}\boldsymbol{A} = \text{softmax}(\boldsymbol{S}) \label{eq:17-10-1}\end{equation}

From the definition of softmax, the entry $A_{ij}$ of row $i$ is:

\begin{equation}A_{ij} = \dfrac{e^{S_{ij}}}{\displaystyle\sum_{l=0}^{n-1} e^{S_{il}}} \label{eq:17-10-2}\end{equation}

We compute the Jacobian of the softmax. Taking the partial derivative of $A_{ij}$ with respect to $S_{ik}$, separating the cases $j = k$ and $j \neq k$:

When $j = k$: applying the quotient rule (1.28):

\begin{equation}\dfrac{\partial A_{ij}}{\partial S_{ij}} = \dfrac{e^{S_{ij}} \displaystyle\sum_l e^{S_{il}} - e^{S_{ij}} e^{S_{ij}}}{(\displaystyle\sum_l e^{S_{il}})^2} = A_{ij} - A_{ij}^2 = A_{ij}(1 - A_{ij}) \label{eq:17-10-3}\end{equation}

When $j \neq k$:

\begin{equation}\dfrac{\partial A_{ij}}{\partial S_{ik}} = \dfrac{0 \cdot \displaystyle\sum_l e^{S_{il}} - e^{S_{ij}} e^{S_{ik}}}{(\displaystyle\sum_l e^{S_{il}})^2} = -A_{ij} A_{ik} \label{eq:17-10-4}\end{equation}

Unifying \eqref{eq:17-10-3} and \eqref{eq:17-10-4} using the Kronecker delta:

\begin{equation}\dfrac{\partial A_{ij}}{\partial S_{ik}} = A_{ij}(\delta_{jk} - A_{ik}) \label{eq:17-10-5}\end{equation}

The loss $L$ depends on $\boldsymbol{S}$ through $\boldsymbol{A}$. Applying the chain rule (1.26) for row $i$:

\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = \displaystyle\sum_{j=0}^{n-1} \dfrac{\partial L}{\partial A_{ij}} \dfrac{\partial A_{ij}}{\partial S_{ik}} \label{eq:17-10-6}\end{equation}

Substituting \eqref{eq:17-10-5} into \eqref{eq:17-10-6}:

\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = \displaystyle\sum_j \dfrac{\partial L}{\partial A_{ij}} A_{ij}(\delta_{jk} - A_{ik}) \label{eq:17-10-7}\end{equation}

Expanding \eqref{eq:17-10-7}: splitting into the first term where $\delta_{jk}$ retains only $j = k$, and the second term which is a sum over all $j$:

\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = \dfrac{\partial L}{\partial A_{ik}} A_{ik} - A_{ik} \displaystyle\sum_j \dfrac{\partial L}{\partial A_{ij}} A_{ij} \label{eq:17-10-8}\end{equation}

Factoring out $A_{ik}$ from \eqref{eq:17-10-8}:

\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = A_{ik} \left( \dfrac{\partial L}{\partial A_{ik}} - \displaystyle\sum_j \dfrac{\partial L}{\partial A_{ij}} A_{ij} \right) \label{eq:17-10-9}\end{equation}

Writing \eqref{eq:17-10-9} in matrix form: $\displaystyle\sum_j \displaystyle\dfrac{\partial L}{\partial A_{ij}} A_{ij}$ is the row $i$ sum of $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{A}} \odot \boldsymbol{A}$, denoted $\text{rowsum}(\cdot)$.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{S}} = \boldsymbol{A} \odot \left( \dfrac{\partial L}{\partial \boldsymbol{A}} - \text{rowsum}\left(\dfrac{\partial L}{\partial \boldsymbol{A}} \odot \boldsymbol{A}\right) \boldsymbol{1}^\top \right) \label{eq:17-10-10}\end{equation}

Note: In \eqref{eq:17-10-10}, $\boldsymbol{1}^\top$ is a row vector, and $\text{rowsum}(\cdot) \boldsymbol{1}^\top$ broadcasts each row sum to all columns. This formula is used in efficient implementations such as FlashAttention.
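
A NumPy sketch (ours) of \eqref{eq:17-10-10}, applying the row-wise softmax backward to a matrix and checking one row against the explicit Jacobian from 6.2:

    import numpy as np

    def softmax_rows(S):
        e = np.exp(S - S.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    def softmax_rows_backward(A, dA):
        # dS = A * (dA - rowsum(dA * A)); keepdims broadcasts each row sum.
        return A * (dA - np.sum(dA * A, axis=1, keepdims=True))

    rng = np.random.default_rng(3)
    S = rng.standard_normal((3, 5))
    dA = rng.standard_normal((3, 5))
    A = softmax_rows(S)
    dS = softmax_rows_backward(A, dA)

    # Check one row against the Jacobian diag(a) - a a^T (which is symmetric).
    a = A[0]
    J = np.diag(a) - np.outer(a, a)
    print(np.allclose(dS[0], J @ dA[0]))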

17.11 Query Gradient

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Q}} = \displaystyle\dfrac{1}{\sqrt{d_k}} \displaystyle\dfrac{\partial L}{\partial \boldsymbol{S}} \boldsymbol{K}$
Conditions: $\boldsymbol{S} = \boldsymbol{Q}\boldsymbol{K}^\top / \sqrt{d_k}$
Proof

The attention score $\boldsymbol{S}$ is the scaled inner product of Query and Key.

\begin{equation}\boldsymbol{S} = \dfrac{1}{\sqrt{d_k}} \boldsymbol{Q}\boldsymbol{K}^\top \label{eq:17-11-1}\end{equation}

Writing \eqref{eq:17-11-1} in component form, the $(i, j)$ entry of the score is:

\begin{equation}S_{ij} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{k=0}^{d_k-1} Q_{ik} K_{jk} \label{eq:17-11-2}\end{equation}

In \eqref{eq:17-11-2}, $S_{ij}$ depends on $Q_{pq}$ only when $i = p$ and $k = q$. Computing the partial derivative:

\begin{equation}\dfrac{\partial S_{ij}}{\partial Q_{pq}} = \dfrac{1}{\sqrt{d_k}} \delta_{ip} K_{jq} \label{eq:17-11-3}\end{equation}

The loss $L$ depends on $\boldsymbol{Q}$ through the scores $\boldsymbol{S}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial Q_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} \dfrac{\partial S_{ij}}{\partial Q_{pq}} \label{eq:17-11-4}\end{equation}

Substituting \eqref{eq:17-11-3} into \eqref{eq:17-11-4}: $\delta_{ip}$ retains only the $i = p$ terms.

\begin{equation}\dfrac{\partial L}{\partial Q_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} \delta_{ip} K_{jq} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_j \dfrac{\partial L}{\partial S_{pj}} K_{jq} \label{eq:17-11-5}\end{equation}

Interpreting \eqref{eq:17-11-5} as a matrix product entry:

\begin{equation}\dfrac{\partial L}{\partial Q_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_j \left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)_{pj} K_{jq} = \dfrac{1}{\sqrt{d_k}} \left( \dfrac{\partial L}{\partial \boldsymbol{S}} \boldsymbol{K} \right)_{pq} \label{eq:17-11-6}\end{equation}

Since \eqref{eq:17-11-6} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{Q}} = \dfrac{1}{\sqrt{d_k}} \dfrac{\partial L}{\partial \boldsymbol{S}} \boldsymbol{K} \label{eq:17-11-7}\end{equation}

Note: In \eqref{eq:17-11-7}, the scaling factor $\displaystyle\dfrac{1}{\sqrt{d_k}}$ appears in the backward pass just as in the forward pass.

17.12 Key Gradient

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{K}} = \displaystyle\dfrac{1}{\sqrt{d_k}} \left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top \boldsymbol{Q}$
Conditions: $\boldsymbol{S} = \boldsymbol{Q}\boldsymbol{K}^\top / \sqrt{d_k}$
Proof

From \eqref{eq:17-11-2} of 17.11, the component representation of the score is:

\begin{equation}S_{ij} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{k=0}^{d_k-1} Q_{ik} K_{jk} \label{eq:17-12-1}\end{equation}

In \eqref{eq:17-12-1}, $S_{ij}$ depends on $K_{pq}$ only when $j = p$ and $k = q$. Computing the partial derivative:

\begin{equation}\dfrac{\partial S_{ij}}{\partial K_{pq}} = \dfrac{1}{\sqrt{d_k}} Q_{iq} \delta_{jp} \label{eq:17-12-2}\end{equation}

The loss $L$ depends on $\boldsymbol{K}$ through the scores $\boldsymbol{S}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial K_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} \dfrac{\partial S_{ij}}{\partial K_{pq}} \label{eq:17-12-3}\end{equation}

Substituting \eqref{eq:17-12-2} into \eqref{eq:17-12-3}: $\delta_{jp}$ retains only the $j = p$ terms.

\begin{equation}\dfrac{\partial L}{\partial K_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} Q_{iq} \delta_{jp} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_i \dfrac{\partial L}{\partial S_{ip}} Q_{iq} \label{eq:17-12-4}\end{equation}

Interpreting \eqref{eq:17-12-4} as a matrix product entry: since $\left(\left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top\right)_{pi} = \left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{S}}\right)_{ip}$:

\begin{equation}\dfrac{\partial L}{\partial K_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_i \left(\left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top\right)_{pi} Q_{iq} = \dfrac{1}{\sqrt{d_k}} \left( \left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top \boldsymbol{Q} \right)_{pq} \label{eq:17-12-5}\end{equation}

Since \eqref{eq:17-12-5} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{K}} = \dfrac{1}{\sqrt{d_k}} \left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top \boldsymbol{Q} \label{eq:17-12-6}\end{equation}

Note: Comparing \eqref{eq:17-12-6} with \eqref{eq:17-11-7}, we see that the gradients of Query and Key have a symmetric structure. This reflects the fact that Query and Key are related by transposition in $\boldsymbol{S} = \boldsymbol{Q}\boldsymbol{K}^\top$.
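
Putting 17.8–17.12 together, the full backward pass of scaled dot-product attention can be sketched as follows (our illustration; the function names are hypothetical):

    import numpy as np

    def attention_forward(Q, K, V):
        d_k = Q.shape[1]
        S = Q @ K.T / np.sqrt(d_k)
        e = np.exp(S - S.max(axis=1, keepdims=True))
        A = e / e.sum(axis=1, keepdims=True)     # row-wise softmax
        return A @ V, A

    def attention_backward(dO, A, Q, K, V):
        d_k = Q.shape[1]
        dV = A.T @ dO                                           # 17.8
        dA = dO @ V.T                                           # 17.9
        dS = A * (dA - np.sum(dA * A, axis=1, keepdims=True))   # 17.10
        dQ = dS @ K / np.sqrt(d_k)                              # 17.11
        dK = dS.T @ Q / np.sqrt(d_k)                            # 17.12
        return dQ, dK, dV

    rng = np.random.default_rng(4)
    n, d_k, d_v = 4, 8, 5
    Q, K = rng.standard_normal((n, d_k)), rng.standard_normal((n, d_k))
    V = rng.standard_normal((n, d_v))
    O, A = attention_forward(Q, K, V)
    dQ, dK, dV = attention_backward(np.ones_like(O), A, Q, K, V)
    print(dQ.shape, dK.shape, dV.shape)  # (4, 8) (4, 8) (4, 5)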

17.13 Filter Gradient of the Convolutional Layer

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{F}} = \boldsymbol{X} \star \displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$
Conditions: $\boldsymbol{Y} = \boldsymbol{X} * \boldsymbol{F}$ (convolution), $\star$ denotes cross-correlation
Proof

Consider a 2D convolution. The output $\boldsymbol{Y}$ is computed by convolving the input $\boldsymbol{X}$ with the filter $\boldsymbol{F}$.

\begin{equation}\boldsymbol{Y} = \boldsymbol{X} * \boldsymbol{F} \label{eq:17-13-1}\end{equation}

Writing \eqref{eq:17-13-1} in component form with filter size $H \times W$:

\begin{equation}Y_{ij} = \displaystyle\sum_{m=0}^{H-1} \displaystyle\sum_{n=0}^{W-1} X_{i+m, j+n} F_{mn} \label{eq:17-13-2}\end{equation}

In \eqref{eq:17-13-2}, $Y_{ij}$ depends on $F_{pq}$ only when $m = p$ and $n = q$. Computing the partial derivative:

\begin{equation}\dfrac{\partial Y_{ij}}{\partial F_{pq}} = X_{i+p, j+q} \label{eq:17-13-3}\end{equation}

The loss $L$ depends on $\boldsymbol{F}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial F_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial Y_{ij}} \dfrac{\partial Y_{ij}}{\partial F_{pq}} \label{eq:17-13-4}\end{equation}

Substituting \eqref{eq:17-13-3} into \eqref{eq:17-13-4}:

\begin{equation}\dfrac{\partial L}{\partial F_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial Y_{ij}} X_{i+p, j+q} \label{eq:17-13-5}\end{equation}

\eqref{eq:17-13-5} is the cross-correlation of the input $\boldsymbol{X}$ and the output gradient $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{F}} = \boldsymbol{X} \star \dfrac{\partial L}{\partial \boldsymbol{Y}} \label{eq:17-13-6}\end{equation}

Note: In \eqref{eq:17-13-6}, the cross-correlation $\star$ differs from convolution $*$ in that it does not flip the filter. Deep learning frameworks typically use cross-correlation in the forward pass.

17.14 Input Gradient of the Convolutional Layer

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{X}} = \displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}} *_{\text{full}} \text{rot}_{180}(\boldsymbol{F})$
Conditions: $*_{\text{full}}$ is a "full" convolution (with zero padding), $\text{rot}_{180}$ is a 180-degree rotation
Proof

From \eqref{eq:17-13-2} of 17.13, the component representation of the convolution is:

\begin{equation}Y_{ij} = \displaystyle\sum_{m=0}^{H-1} \displaystyle\sum_{n=0}^{W-1} X_{i+m, j+n} F_{mn} \label{eq:17-14-1}\end{equation}

In \eqref{eq:17-14-1}, we consider the condition under which the input element $X_{pq}$ contributes to the output $Y_{ij}$. $X_{pq} = X_{i+m, j+n}$ when $i = p - m$ and $j = q - n$.

For the output indices $(i, j) = (p - m, q - n)$ to lie in the valid output range, we need $0 \leq p - m$ and $0 \leq q - n$ (together with the corresponding upper bounds). Since several filter offsets $(m, n)$ can satisfy these conditions, $X_{pq}$ may contribute to multiple output elements.

Computing the partial derivative: when $X_{pq}$ contributes to $Y_{ij}$:

\begin{equation}\dfrac{\partial Y_{ij}}{\partial X_{pq}} = F_{p-i, q-j} \label{eq:17-14-2}\end{equation}

provided that $(p-i, q-j)$ is within the valid range $[0, H-1] \times [0, W-1]$ of the filter.

The loss $L$ depends on $\boldsymbol{X}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial X_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial Y_{ij}} \dfrac{\partial Y_{ij}}{\partial X_{pq}} \label{eq:17-14-3}\end{equation}

Substituting \eqref{eq:17-14-2} into \eqref{eq:17-14-3}: the sum is over all $(i, j)$ to which $X_{pq}$ contributes.

\begin{equation}\dfrac{\partial L}{\partial X_{pq}} = \displaystyle\sum_{i,j: (p-i, q-j) \in [0, H-1] \times [0, W-1]} \dfrac{\partial L}{\partial Y_{ij}} F_{p-i, q-j} \label{eq:17-14-4}\end{equation}

Interpreting \eqref{eq:17-14-4}: performing the change of variables $m' = p - i$, $n' = q - j$, we get $i = p - m'$, $j = q - n'$, and the sum is over $m' \in [0, H-1]$, $n' \in [0, W-1]$.

\begin{equation}\dfrac{\partial L}{\partial X_{pq}} = \displaystyle\sum_{m'=0}^{H-1} \displaystyle\sum_{n'=0}^{W-1} \dfrac{\partial L}{\partial Y_{p-m', q-n'}} F_{m', n'} \label{eq:17-14-5}\end{equation}

\eqref{eq:17-14-5} corresponds to a "full" convolution of the output gradient $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$ with the 180-degree rotated filter $\text{rot}_{180}(\boldsymbol{F})$.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{X}} = \dfrac{\partial L}{\partial \boldsymbol{Y}} *_{\text{full}} \text{rot}_{180}(\boldsymbol{F}) \label{eq:17-14-6}\end{equation}

Note: In \eqref{eq:17-14-6}, the 180-degree rotation of the filter is defined as $(\text{rot}_{180}(\boldsymbol{F}))_{mn} = F_{H-1-m, W-1-n}$. A "full" convolution pads with zeros so that the output is larger than the input.
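
A loop-based sketch (ours) of 17.13 and 17.14 for a single-channel 2D convolution without padding or stride; the scatter form used below for the input gradient is equivalent to the full convolution with the 180-degree rotated filter:

    import numpy as np

    def conv2d_valid(X, F):
        # Cross-correlation, as used in deep learning forward passes.
        H, W = F.shape
        out = np.zeros((X.shape[0] - H + 1, X.shape[1] - W + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(X[i:i + H, j:j + W] * F)
        return out

    def conv2d_backward(X, F, dY):
        H, W = F.shape
        # 17.13: filter gradient = cross-correlation of X with dY.
        dF = np.zeros_like(F)
        for p in range(H):
            for q in range(W):
                dF[p, q] = np.sum(dY * X[p:p + dY.shape[0], q:q + dY.shape[1]])
        # 17.14: input gradient; scattering dY[i, j] * F over each input window
        # is equivalent to the full convolution of dY with rot180(F).
        dX = np.zeros_like(X)
        for i in range(dY.shape[0]):
            for j in range(dY.shape[1]):
                dX[i:i + H, j:j + W] += dY[i, j] * F
        return dF, dX

    rng = np.random.default_rng(5)
    X, F = rng.standard_normal((5, 5)), rng.standard_normal((3, 3))
    dY = np.ones_like(conv2d_valid(X, F))
    dF, dX = conv2d_backward(X, F, dY)
    print(dF.shape, dX.shape)  # (3, 3) (5, 5)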

17.15 Gradient of Max Pooling

Formula: $\displaystyle\dfrac{\partial L}{\partial X_{ij}} = \displaystyle\dfrac{\partial L}{\partial Y_k} \cdot \mathbf{1}_{X_{ij} = \max}$
Conditions: $Y_k = \max_{(i,j) \in \text{pool}_k} X_{ij}$
Proof

Max pooling selects the maximum value within each pooling region.

\begin{equation}Y_k = \max_{(i,j) \in \text{pool}_k} X_{ij} \label{eq:17-15-1}\end{equation}

In \eqref{eq:17-15-1}, let $(i^*, j^*)$ be the position that achieves the maximum.

\begin{equation}(i^*, j^*) = \arg\max_{(i,j) \in \text{pool}_k} X_{ij} \label{eq:17-15-2}\end{equation}

Consider the derivative of the max function. $Y_k$ depends only on the element $X_{i^*, j^*}$ that achieves the maximum, and does not depend on other elements (locally). The partial derivative is therefore:

\begin{equation}\dfrac{\partial Y_k}{\partial X_{ij}} = \begin{cases} 1 & \text{if } (i, j) = (i^*, j^*) \\ 0 & \text{otherwise} \end{cases} \label{eq:17-15-3}\end{equation}

\eqref{eq:17-15-3} can be written using the indicator function $\mathbf{1}_{X_{ij} = \max}$:

\begin{equation}\dfrac{\partial Y_k}{\partial X_{ij}} = \mathbf{1}_{X_{ij} = \max\{X_{pq}: (p,q) \in \text{pool}_k\}} \label{eq:17-15-4}\end{equation}

The loss $L$ depends on $X_{ij}$ through the output $Y_k$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{\partial L}{\partial Y_k} \dfrac{\partial Y_k}{\partial X_{ij}} \label{eq:17-15-5}\end{equation}

Substituting \eqref{eq:17-15-4} into \eqref{eq:17-15-5} gives the final result.

\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{\partial L}{\partial Y_k} \cdot \mathbf{1}_{X_{ij} = \max} \label{eq:17-15-6}\end{equation}

Note: From \eqref{eq:17-15-6}, the gradient propagates only to the position of the maximum value and not to other positions. When multiple elements share the same maximum value, the behavior depends on the implementation (only the first element, or distributed equally).
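
A small sketch (ours) of a 2×2 max-pooling backward pass that routes each upstream gradient to the arg-max position of its window; ties go to the first maximum, matching one of the conventions mentioned above.

    import numpy as np

    def maxpool2x2_backward(X, dY):
        # dL/dX is nonzero only at the arg-max of each 2x2 window, as in (17-15-6).
        dX = np.zeros_like(X)
        for i in range(dY.shape[0]):
            for j in range(dY.shape[1]):
                window = X[2 * i:2 * i + 2, 2 * j:2 * j + 2]
                p, q = np.unravel_index(np.argmax(window), window.shape)
                dX[2 * i + p, 2 * j + q] += dY[i, j]
        return dX

    X = np.array([[1.0, 3.0, 2.0, 0.0],
                  [4.0, 2.0, 1.0, 5.0],
                  [0.0, 1.0, 2.0, 2.0],
                  [3.0, 0.0, 1.0, 1.0]])
    # Gradient 1 appears only at the window maxima (values 4, 5, 3, 2).
    print(maxpool2x2_backward(X, np.ones((2, 2))))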

17.16 Gradient of Average Pooling

Formula: $\displaystyle\dfrac{\partial L}{\partial X_{ij}} = \displaystyle\dfrac{1}{|\text{pool}|} \displaystyle\dfrac{\partial L}{\partial Y_k}$
Conditions: $Y_k = \displaystyle\dfrac{1}{|\text{pool}_k|} \displaystyle\sum_{(i,j) \in \text{pool}_k} X_{ij}$, $(i,j) \in \text{pool}_k$
Proof

Average pooling computes the mean value within each pooling region.

\begin{equation}Y_k = \dfrac{1}{|\text{pool}_k|} \displaystyle\sum_{(i,j) \in \text{pool}_k} X_{ij} \label{eq:17-16-1}\end{equation}

Here $|\text{pool}_k|$ is the number of elements in the pooling region.

In \eqref{eq:17-16-1}, $Y_k$ depends equally on all elements within the pooling region. Computing the partial derivative:

\begin{equation}\dfrac{\partial Y_k}{\partial X_{ij}} = \dfrac{1}{|\text{pool}_k|} \quad \text{for all } (i, j) \in \text{pool}_k \label{eq:17-16-2}\end{equation}

The loss $L$ depends on $X_{ij}$ through the output $Y_k$. Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{\partial L}{\partial Y_k} \dfrac{\partial Y_k}{\partial X_{ij}} \label{eq:17-16-3}\end{equation}

Substituting \eqref{eq:17-16-2} into \eqref{eq:17-16-3} gives the final result.

\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{1}{|\text{pool}_k|} \dfrac{\partial L}{\partial Y_k} \label{eq:17-16-4}\end{equation}

Note: From \eqref{eq:17-16-4}, average pooling distributes the gradient equally to all elements in the pooling region. This contrasts with 17.15 max pooling: max pooling is winner-takes-all, while average pooling distributes equally.

17.17 Gradient of the Embedding Layer

Formula: $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{E}_{i,:}} = \displaystyle\sum_{n: \text{idx}_n = i} \displaystyle\dfrac{\partial L}{\partial \boldsymbol{o}_n}$
Conditions: $\boldsymbol{o}_n = \boldsymbol{E}_{\text{idx}_n, :}$ (lookup operation), $\boldsymbol{E} \in \mathbb{R}^{V \times d}$ is the embedding matrix
Proof

The embedding layer functions as a lookup table. It outputs the row of the embedding matrix corresponding to the input index $\text{idx}_n$.

\begin{equation}\boldsymbol{o}_n = \boldsymbol{E}_{\text{idx}_n, :} \label{eq:17-17-1}\end{equation}

Here $\boldsymbol{E} \in \mathbb{R}^{V \times d}$ is the embedding matrix with vocabulary size $V$ and embedding dimension $d$.

In \eqref{eq:17-17-1}, the output $\boldsymbol{o}_n$ is a copy of the $\text{idx}_n$-th row of the embedding matrix. Considering the partial derivative:

\begin{equation}\dfrac{\partial (\boldsymbol{o}_n)_j}{\partial E_{ik}} = \begin{cases} 1 & \text{if } i = \text{idx}_n \text{ and } j = k \\ 0 & \text{otherwise} \end{cases} \label{eq:17-17-2}\end{equation}

\eqref{eq:17-17-2} can be written using the Kronecker delta:

\begin{equation}\dfrac{\partial (\boldsymbol{o}_n)_j}{\partial E_{ik}} = \delta_{i, \text{idx}_n} \delta_{jk} \label{eq:17-17-3}\end{equation}

The loss $L$ depends on $\boldsymbol{E}$ through the outputs $\boldsymbol{o}_n$ ($n = 0, 1, \ldots, N-1$). Applying the chain rule (1.26):

\begin{equation}\dfrac{\partial L}{\partial E_{ik}} = \displaystyle\sum_n \displaystyle\sum_j \dfrac{\partial L}{\partial (\boldsymbol{o}_n)_j} \dfrac{\partial (\boldsymbol{o}_n)_j}{\partial E_{ik}} \label{eq:17-17-4}\end{equation}

Substituting \eqref{eq:17-17-3} into \eqref{eq:17-17-4}: $\delta_{i, \text{idx}_n}$ retains only the terms with $\text{idx}_n = i$, and $\delta_{jk}$ retains only $j = k$.

\begin{equation}\dfrac{\partial L}{\partial E_{ik}} = \displaystyle\sum_{n: \text{idx}_n = i} \dfrac{\partial L}{\partial (\boldsymbol{o}_n)_k} \label{eq:17-17-5}\end{equation}

Writing \eqref{eq:17-17-5} in row vector form gives the final result.

\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{E}_{i,:}} = \displaystyle\sum_{n: \text{idx}_n = i} \dfrac{\partial L}{\partial \boldsymbol{o}_n} \label{eq:17-17-6}\end{equation}

Note: From \eqref{eq:17-17-6}, the gradient of each row of the embedding matrix is the sum of the gradients from all positions that referenced that row. When the same index is referenced multiple times within a batch, the gradients accumulate. In implementation, this is efficiently handled as a sparse gradient.
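A minimal NumPy sketch of the accumulation in \eqref{eq:17-17-6} (the vocabulary size, embedding dimension, and index array are assumptions); np.add.at performs the scatter-add so that repeated indices accumulate rather than overwrite:

import numpy as np

V, d = 5, 3                      # assumed vocabulary size and embedding dimension
E = np.random.randn(V, d)        # embedding matrix
idx = np.array([2, 0, 2, 4])     # batch of indices; index 2 is referenced twice

# Forward: lookup (eq. 17-17-1).
O = E[idx]                       # shape (N, d)

# Backward: scatter-add the upstream gradients dL/do_n into the referenced rows.
dL_dO = np.ones_like(O)          # some upstream gradient
dL_dE = np.zeros_like(E)
np.add.at(dL_dE, idx, dL_dO)     # repeated indices accumulate (eq. 17-17-6)

print(dL_dE)  # row 2 receives the sum of two gradients, rows 1 and 3 stay zero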

17.18 Gradient of L2 Regularization

Formula: $\displaystyle\dfrac{\partial}{\partial \boldsymbol{W}} \displaystyle\dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W}$
Conditions: $\|\boldsymbol{W}\|_F^2 = \displaystyle\sum_{ij} W_{ij}^2$ (squared Frobenius norm)
Proof

The L2 regularization term is the squared Frobenius norm of the weight matrix multiplied by a scaling coefficient.

\begin{equation}R = \dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \dfrac{\lambda}{2} \displaystyle\sum_{i,j} W_{ij}^2 \label{eq:17-18-1}\end{equation}

Here $\lambda > 0$ is a hyperparameter controlling the regularization strength.

Taking the partial derivative of \eqref{eq:17-18-1} with respect to $W_{pq}$: only the $(i, j) = (p, q)$ term in the sum contains $W_{pq}$.

\begin{equation}\dfrac{\partial R}{\partial W_{pq}} = \dfrac{\partial}{\partial W_{pq}} \left( \dfrac{\lambda}{2} \displaystyle\sum_{i,j} W_{ij}^2 \right) = \dfrac{\lambda}{2} \cdot 2W_{pq} = \lambda W_{pq} \label{eq:17-18-2}\end{equation}

Since \eqref{eq:17-18-2} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial}{\partial \boldsymbol{W}} \dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W} \label{eq:17-18-3}\end{equation}

Note: From \eqref{eq:17-18-3}, the gradient of the total loss with L2 regularization (weight decay) is $\displaystyle\dfrac{\partial L_{\text{total}}}{\partial \boldsymbol{W}} = \displaystyle\dfrac{\partial L}{\partial \boldsymbol{W}} + \lambda \boldsymbol{W}$. This has the effect of pulling weights toward the origin, suppressing overfitting.
For a more concise derivation, see 12.10.
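A small finite-difference check of \eqref{eq:17-18-3} (a sketch; the matrix size, $\lambda$, and the probed entry are arbitrary choices):

import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4)
lam = 0.1

R = lambda W: 0.5 * lam * np.sum(W ** 2)    # (lambda/2) * ||W||_F^2
grad = lam * W                              # eq. 17-18-3

# Finite-difference check on one entry.
eps = 1e-6
Wp = W.copy(); Wp[1, 2] += eps
print((R(Wp) - R(W)) / eps, grad[1, 2])     # the two values agree to ~1e-6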

17.19 Subgradient of L1 Regularization

Formula: $\displaystyle\dfrac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \text{sign}(\boldsymbol{W})$
Conditions: $\|\boldsymbol{W}\|_1 = \displaystyle\sum_{ij} |W_{ij}|$, $\text{sign}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases}$ (at $x = 0$, any value in $[-1, 1]$ is a subgradient)
Proof

The L1 regularization term is the L1 norm of the weight matrix multiplied by a scaling coefficient.

\begin{equation}R = \lambda\|\boldsymbol{W}\|_1 = \lambda \displaystyle\sum_{i,j} |W_{ij}| \label{eq:17-19-1}\end{equation}

Consider the derivative of the absolute value function $|x|$, separating the cases $x > 0$ and $x < 0$.

\begin{equation}\dfrac{d|x|}{dx} = \begin{cases} 1 & x > 0 \\ -1 & x < 0 \end{cases} \label{eq:17-19-2}\end{equation}

At $x = 0$, the absolute value function is not differentiable. However, a subgradient still exists there, and it can take any value in the interval $[-1, 1]$.

\begin{equation}\partial |x| \Big|_{x=0} = [-1, 1] \label{eq:17-19-3}\end{equation}

Unifying \eqref{eq:17-19-2} and \eqref{eq:17-19-3} using the sign function $\text{sign}(x)$: in practice, $\text{sign}(0) = 0$ is commonly used.

\begin{equation}\dfrac{d|x|}{dx} = \text{sign}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases} \label{eq:17-19-4}\end{equation}

Taking the partial derivative of \eqref{eq:17-19-1} with respect to $W_{pq}$: only the $(i, j) = (p, q)$ term in the sum contains $|W_{pq}|$.

\begin{equation}\dfrac{\partial R}{\partial W_{pq}} = \lambda \dfrac{\partial |W_{pq}|}{\partial W_{pq}} = \lambda \cdot \text{sign}(W_{pq}) \label{eq:17-19-5}\end{equation}

Since \eqref{eq:17-19-5} holds for all $(p, q)$, we obtain the final result in matrix form.

\begin{equation}\dfrac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \text{sign}(\boldsymbol{W}) \label{eq:17-19-6}\end{equation}

Note: From \eqref{eq:17-19-6}, L1 regularization promotes sparsity. The magnitude of the regularization gradient stays at $\lambda$ no matter how small $|W_{ij}|$ becomes, and its sign flips as $W_{ij}$ crosses 0, so the weight is always pushed back toward 0 and tends to "stick" exactly at 0. This makes it likely for many weights to become exactly zero.
For a more concise derivation, see 12.11.
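A sketch of the subgradient computation with the common choice $\text{sign}(0) = 0$ (the weight values and $\lambda$ are arbitrary; np.sign already returns 0 at 0):

import numpy as np

W = np.array([[ 0.5, -1.2,  0.0],
              [-0.3,  0.0,  2.0]])
lam = 0.01

# Subgradient of lambda * ||W||_1 (eq. 17-19-6); np.sign(0.0) returns 0.
dR_dW = lam * np.sign(W)
print(dR_dW)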

17.20 KL Divergence for Gaussian Distributions

Formula: $D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \| \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})) = \displaystyle\dfrac{1}{2}\displaystyle\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2)$
Conditions: $\mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2))$ is a multivariate normal distribution with diagonal covariance
Proof

The definition of KL divergence is:

\begin{equation}D_{\text{KL}}(p \| q) = \mathbb{E}_p[\log p - \log q] = \displaystyle\int p(x) \log \dfrac{p(x)}{q(x)} dx \label{eq:17-20-1}\end{equation}

First, consider the one-dimensional case. Let $p = \mathcal{N}(\mu, \sigma^2)$ and $q = \mathcal{N}(0, 1)$.

Writing the log of the normal distributions:

\begin{equation}\log p(x) = -\dfrac{1}{2}\log(2\pi\sigma^2) - \dfrac{(x-\mu)^2}{2\sigma^2} \label{eq:17-20-2}\end{equation}

\begin{equation}\log q(x) = -\dfrac{1}{2}\log(2\pi) - \dfrac{x^2}{2} \label{eq:17-20-3}\end{equation}

Computing the difference of \eqref{eq:17-20-2} and \eqref{eq:17-20-3}:

\begin{equation}\log p(x) - \log q(x) = -\dfrac{1}{2}\log\sigma^2 - \dfrac{(x-\mu)^2}{2\sigma^2} + \dfrac{x^2}{2} \label{eq:17-20-4}\end{equation}

Computing the expectation of \eqref{eq:17-20-4} under $p$, using the properties of expectations:

\begin{equation}\mathbb{E}_p[(x-\mu)^2] = \sigma^2 \label{eq:17-20-5}\end{equation}

\begin{equation}\mathbb{E}_p[x^2] = \text{Var}(x) + (\mathbb{E}_p[x])^2 = \sigma^2 + \mu^2 \label{eq:17-20-6}\end{equation}

Substituting \eqref{eq:17-20-5} and \eqref{eq:17-20-6} into the expectation of \eqref{eq:17-20-4}:

\begin{equation}D_{\text{KL}} = \mathbb{E}_p[\log p - \log q] = -\dfrac{1}{2}\log\sigma^2 - \dfrac{\sigma^2}{2\sigma^2} + \dfrac{\sigma^2 + \mu^2}{2} \label{eq:17-20-7}\end{equation}

Simplifying \eqref{eq:17-20-7}:

\begin{equation}D_{\text{KL}} = -\dfrac{1}{2}\log\sigma^2 - \dfrac{1}{2} + \dfrac{\sigma^2 + \mu^2}{2} = \dfrac{1}{2}(\mu^2 + \sigma^2 - 1 - \log\sigma^2) \label{eq:17-20-8}\end{equation}

In the multi-dimensional case with independent dimensions, the KL divergence is the sum over each dimension.

\begin{equation}D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \| \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})) = \dfrac{1}{2}\displaystyle\sum_{i=1}^{d} (\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2) \label{eq:17-20-9}\end{equation}

Note: \eqref{eq:17-20-9} is used as the regularization term in the loss function of VAE (Variational Autoencoder). It has the effect of pushing the distribution of latent variables toward the standard normal distribution.
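The closed form \eqref{eq:17-20-9} can be sanity-checked against a Monte Carlo estimate of the definition \eqref{eq:17-20-1} (a sketch; the dimensionality, parameter values, and sample count are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0, 0.2])
sigma = np.array([0.8, 1.5, 0.3])

# Closed form (eq. 17-20-9).
kl_closed = 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

# Monte Carlo estimate of E_p[log p - log q] with q = N(0, I).
x = mu + sigma * rng.standard_normal((200_000, 3))
log_p = np.sum(-0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2), axis=1)
log_q = np.sum(-0.5 * np.log(2 * np.pi) - x**2 / 2, axis=1)
kl_mc = np.mean(log_p - log_q)

print(kl_closed, kl_mc)  # the two values agree up to sampling error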

17.21 Gradient of KL Divergence with Respect to the Mean

Formula: $\displaystyle\dfrac{\partial D_{\text{KL}}}{\partial \mu_i} = \mu_i$
Conditions: $D_{\text{KL}} = \displaystyle\dfrac{1}{2}\displaystyle\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2)$
Proof

From \eqref{eq:17-20-9} of 17.20, the KL divergence is:

\begin{equation}D_{\text{KL}} = \dfrac{1}{2}\displaystyle\sum_{j=1}^{d} (\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2) \label{eq:17-21-1}\end{equation}

In \eqref{eq:17-21-1}, the only term involving $\mu_i$ is $\displaystyle\dfrac{1}{2}\mu_i^2$. Terms for other dimensions $j \neq i$ do not depend on $\mu_i$.

Taking the partial derivative of \eqref{eq:17-21-1} with respect to $\mu_i$:

\begin{equation}\dfrac{\partial D_{\text{KL}}}{\partial \mu_i} = \dfrac{\partial}{\partial \mu_i} \dfrac{1}{2}\mu_i^2 = \dfrac{1}{2} \cdot 2\mu_i = \mu_i \label{eq:17-21-2}\end{equation}

Note: From \eqref{eq:17-21-2}, the gradient is 0 when $\mu_i = 0$, at which point the KL divergence is minimized with respect to $\mu_i$. This coincides with the mean 0 of the prior distribution $\mathcal{N}(0, 1)$.

17.22 Gradient of KL Divergence with Respect to the Standard Deviation

Formula: $\displaystyle\dfrac{\partial D_{\text{KL}}}{\partial \sigma_i} = \sigma_i - \displaystyle\dfrac{1}{\sigma_i}$
Conditions: $D_{\text{KL}} = \displaystyle\dfrac{1}{2}\displaystyle\sum_i (\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2)$
Proof

From \eqref{eq:17-21-1} of 17.21, the KL divergence is:

\begin{equation}D_{\text{KL}} = \dfrac{1}{2}\displaystyle\sum_{j=1}^{d} (\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2) \label{eq:17-22-1}\end{equation}

In \eqref{eq:17-22-1}, the terms involving $\sigma_i$ are $\displaystyle\dfrac{1}{2}(\sigma_i^2 - \log\sigma_i^2)$. Terms for other dimensions $j \neq i$ do not depend on $\sigma_i$.

Noting that $\log\sigma_i^2 = 2\log\sigma_i$, we take the partial derivative with respect to $\sigma_i$:

\begin{equation}\dfrac{\partial}{\partial \sigma_i} \sigma_i^2 = 2\sigma_i \label{eq:17-22-2}\end{equation}

\begin{equation}\dfrac{\partial}{\partial \sigma_i} \log\sigma_i^2 = \dfrac{\partial}{\partial \sigma_i} (2\log\sigma_i) = \dfrac{2}{\sigma_i} \label{eq:17-22-3}\end{equation}

Using \eqref{eq:17-22-2} and \eqref{eq:17-22-3} to take the partial derivative of $D_{\text{KL}}$ with respect to $\sigma_i$:

\begin{equation}\dfrac{\partial D_{\text{KL}}}{\partial \sigma_i} = \dfrac{1}{2}\left(2\sigma_i - \dfrac{2}{\sigma_i}\right) = \sigma_i - \dfrac{1}{\sigma_i} \label{eq:17-22-4}\end{equation}

Note: From \eqref{eq:17-22-4}, the gradient is $1 - 1 = 0$ when $\sigma_i = 1$, at which point the KL divergence is minimized with respect to $\sigma_i$. This coincides with the standard deviation 1 of the prior distribution $\mathcal{N}(0, 1)$. In VAE, $\log\sigma^2$ is often directly output for numerical stability, in which case the gradient becomes $\displaystyle\dfrac{1}{2}(e^{\log\sigma^2} - 1)$.
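A sketch of the $\log\sigma^2$ parametrization mentioned in the note (parameter values are arbitrary); the gradient $\frac{1}{2}(e^{\log\sigma^2} - 1)$ is checked by finite differences:

import numpy as np

mu = np.array([0.3, -0.7])
logvar = np.array([0.5, -1.0])           # the network outputs log(sigma^2) directly

def kl(mu, logvar):
    # 0.5 * sum(mu^2 + sigma^2 - 1 - log sigma^2) with sigma^2 = exp(logvar)
    return 0.5 * np.sum(mu**2 + np.exp(logvar) - 1.0 - logvar)

grad_mu = mu                              # eq. 17-21-2
grad_logvar = 0.5 * (np.exp(logvar) - 1.0)

# Finite-difference check of the logvar gradient for one component.
eps = 1e-6
lv = logvar.copy(); lv[0] += eps
print((kl(mu, lv) - kl(mu, logvar)) / eps, grad_logvar[0])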

12.5 Derivatives of Regularization

Gradient computation for regularization terms commonly used in machine learning.

12.10 L2 Regularization (Weight Decay)

Formula: $\displaystyle\dfrac{\partial}{\partial \boldsymbol{W}} \dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W}$
Conditions: $\boldsymbol{W} \in \mathbb{R}^{M \times N}$, $\lambda > 0$ is the regularization parameter
Proof

The squared Frobenius norm is $\|\boldsymbol{W}\|_F^2 = \mathrm{tr}(\boldsymbol{W}^\top \boldsymbol{W}) = \displaystyle\sum_{i,j} W_{ij}^2$.

\begin{equation} \dfrac{\partial}{\partial W_{kl}} \dfrac{\lambda}{2} \displaystyle\sum_{i,j} W_{ij}^2 = \dfrac{\lambda}{2} \cdot 2 W_{kl} = \lambda W_{kl} \label{eq:12-10-1} \end{equation}

Collecting all entries in matrix form:

\begin{equation} \dfrac{\partial}{\partial \boldsymbol{W}} \dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W} \label{eq:12-10-2} \end{equation}

Note: L2 regularization shrinks the weights by a factor of $1 - \eta\lambda$ at each gradient descent step, which is why it is also called weight decay.
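A sketch showing that the plain SGD step on the regularized loss is exactly a multiplicative shrinkage by $1 - \eta\lambda$ followed by the usual data-gradient update (the learning rate, $\lambda$, and sizes are arbitrary):

import numpy as np

np.random.seed(1)
W = np.random.randn(4, 4)
eta, lam = 0.1, 0.01
dL_dW = np.random.randn(4, 4)            # some data-loss gradient

# SGD step on L + (lambda/2)||W||_F^2 ...
W_new = W - eta * (dL_dW + lam * W)
# ... equals shrinking W by (1 - eta*lambda) and then the usual update.
W_decay = (1 - eta * lam) * W - eta * dL_dW

print(np.allclose(W_new, W_decay))       # True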

12.11 L1 Regularization (Subgradient)

Formula: $\displaystyle\dfrac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \mathrm{sign}(\boldsymbol{W})$
Conditions: $\|\boldsymbol{W}\|_1 = \displaystyle\sum_{i,j}|W_{ij}|$; subgradient $[-1, 1]$ at $W_{ij} = 0$
Proof

Since $\|\boldsymbol{W}\|_1 = \displaystyle\sum_{i,j} |W_{ij}|$, each component $|W_{kl}|$ can be differentiated independently of the others.

\begin{equation} \dfrac{\partial |W_{kl}|}{\partial W_{kl}} = \begin{cases} 1 & (W_{kl} > 0) \\ -1 & (W_{kl} < 0) \\ [-1, 1] & (W_{kl} = 0) \end{cases} = \mathrm{sign}(W_{kl}) \label{eq:12-11-1} \end{equation}

Here, when $W_{kl} = 0$, the absolute value function is not differentiable, but the subdifferential $\partial|W_{kl}| = [-1, 1]$ exists. In practice, the approximation $\mathrm{sign}(0) = 0$ is used.

Collecting all entries:

\begin{equation} \dfrac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \mathrm{sign}(\boldsymbol{W}) \label{eq:12-11-2} \end{equation}

Note: L1 regularization promotes sparse solutions. Instead of approximating $\mathrm{sign}(0) = 0$, using the soft thresholding operator from the proximal gradient method is theoretically more appropriate.
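A sketch of the soft thresholding operator mentioned in the note, i.e. the proximal operator of $\tau\|\cdot\|_1$ (the weight values and threshold are arbitrary):

import numpy as np

def soft_threshold(W, tau):
    # prox of tau * ||.||_1: shrink toward 0 and clamp small entries to exactly 0
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)

W = np.array([0.8, -0.05, 0.0, -1.3, 0.02])
print(soft_threshold(W, tau=0.1))   # small entries become exactly zero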

12.12 LASSO Gradient (Regression with L1 Regularization)

Formula: $\displaystyle\dfrac{\partial}{\partial \boldsymbol{\alpha}}\left(\dfrac{1}{2}\|\boldsymbol{x} - \boldsymbol{D}\boldsymbol{\alpha}\|^2 + \lambda\|\boldsymbol{\alpha}\|_1\right) = \boldsymbol{D}^\top(\boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}) + \lambda \cdot \mathrm{sign}(\boldsymbol{\alpha})$
Conditions: $\boldsymbol{x} \in \mathbb{R}^M$, $\boldsymbol{D} \in \mathbb{R}^{M \times N}$, $\boldsymbol{\alpha} \in \mathbb{R}^N$, $\lambda > 0$. Subgradient at components where $\alpha_i = 0$.
Proof

We decompose the objective function as $L = L_{\text{data}} + L_{\text{reg}}$.

\begin{equation} L_{\text{data}} = \dfrac{1}{2}\|\boldsymbol{x} - \boldsymbol{D}\boldsymbol{\alpha}\|^2, \qquad L_{\text{reg}} = \lambda\|\boldsymbol{\alpha}\|_1 \label{eq:12-12-1} \end{equation}

The gradient of the data term is computed similarly to 12.9. Letting $\boldsymbol{r} = \boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}$:

\begin{equation} \dfrac{\partial L_{\text{data}}}{\partial \boldsymbol{\alpha}} = \boldsymbol{D}^\top(\boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}) \label{eq:12-12-2} \end{equation}

The subgradient of the regularization term follows from 12.11:

\begin{equation} \dfrac{\partial L_{\text{reg}}}{\partial \boldsymbol{\alpha}} = \lambda \cdot \mathrm{sign}(\boldsymbol{\alpha}) \label{eq:12-12-3} \end{equation}

Combining \eqref{eq:12-12-2} and \eqref{eq:12-12-3} gives the formula.

\begin{equation} \dfrac{\partial L}{\partial \boldsymbol{\alpha}} = \boldsymbol{D}^\top(\boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}) + \lambda \cdot \mathrm{sign}(\boldsymbol{\alpha}) \label{eq:12-12-4} \end{equation}

Note: LASSO (Least Absolute Shrinkage and Selection Operator) is the foundation of sparse coding and compressed sensing. For handling the non-differentiable points, ISTA (Iterative Shrinkage-Thresholding Algorithm) and its accelerated version FISTA are used.
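A minimal ISTA sketch for the LASSO objective, combining the data gradient \eqref{eq:12-12-2} with a soft thresholding step (the problem dimensions, $\lambda$, and iteration count are arbitrary choices, not from the text):

import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 50
D = rng.standard_normal((M, N))
x = rng.standard_normal(M)
lam = 0.5

# Step size 1/L, with L an upper bound on the Lipschitz constant of the data gradient.
L = np.linalg.norm(D, 2) ** 2
alpha = np.zeros(N)
for _ in range(500):
    grad = D.T @ (D @ alpha - x)                                 # eq. 12-12-2
    z = alpha - grad / L
    alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)    # soft thresholding

print(np.count_nonzero(alpha), "nonzero coefficients out of", N)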

12.6 Application: Regularization for Image Reconstruction

Gradient computation for regularization methods applied to ill-posed problems, used in medical image reconstruction and signal processing.

12.13 Gradient of Tikhonov Regularization

Formula: $\displaystyle\dfrac{\partial J}{\partial \boldsymbol{x}} = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}) + 2\lambda\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x}$
Conditions: $J(\boldsymbol{x}) = \|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|^2 + \lambda\|\boldsymbol{L}\boldsymbol{x}\|^2$
Proof

Computing the gradient of the first term $\|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|^2 = (\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y})^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y})$:

\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}}\|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}) \label{eq:12-13-1} \end{equation}

Computing the gradient of the second term $\lambda\|\boldsymbol{L}\boldsymbol{x}\|^2 = \lambda\boldsymbol{x}^\top\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x}$:

\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}}\lambda\|\boldsymbol{L}\boldsymbol{x}\|^2 = 2\lambda\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x} \label{eq:12-13-2} \end{equation}

Combining \eqref{eq:12-13-1} and \eqref{eq:12-13-2} gives the formula.

Note: In medical image reconstruction, $\boldsymbol{A}$ is the forward operator, e.g. the Radon transform (CT) or the Fourier transform (MRI), and $\boldsymbol{L}$ is typically a differential operator.
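A sketch checking the gradient formula against a finite difference on a small random problem (the sizes, $\lambda$, and the choice of $\boldsymbol{L}$ as a first-difference operator are assumptions):

import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 6
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
L_op = np.eye(N, k=1)[:-1] - np.eye(N)[:-1]      # first-difference operator, shape (N-1, N)
lam = 0.3
x = rng.standard_normal(N)

J = lambda x: np.sum((A @ x - y)**2) + lam * np.sum((L_op @ x)**2)
grad = 2 * A.T @ (A @ x - y) + 2 * lam * L_op.T @ (L_op @ x)     # eq. 12-13

eps = 1e-6
e0 = np.zeros(N); e0[0] = eps
print((J(x + e0) - J(x)) / eps, grad[0])         # the two values agree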

12.14 Tikhonov Regularization Solution

Formula: $\boldsymbol{x}^* = (\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L})^{-1}\boldsymbol{A}^\top\boldsymbol{y}$
Conditions: $\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L}$ is nonsingular
Proof

From the optimality condition $\nabla J = 0$:

\begin{equation} 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}) + 2\lambda\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x} = \boldsymbol{0} \label{eq:12-14-1} \end{equation}

Rearranging \eqref{eq:12-14-1}:

\begin{equation} (\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L})\boldsymbol{x} = \boldsymbol{A}^\top\boldsymbol{y} \label{eq:12-14-2} \end{equation}

When $\lambda > 0$, both $\boldsymbol{A}^\top\boldsymbol{A}$ and $\lambda\boldsymbol{L}^\top\boldsymbol{L}$ are positive semidefinite, and their sum is positive definite whenever the null spaces of $\boldsymbol{A}$ and $\boldsymbol{L}$ intersect only at $\boldsymbol{0}$ (for example, when $\boldsymbol{L}$ has full column rank). The matrix is then nonsingular, giving the solution.

\begin{equation} \boldsymbol{x}^* = (\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L})^{-1}\boldsymbol{A}^\top\boldsymbol{y} \label{eq:12-14-3} \end{equation}

Note: When $\boldsymbol{L} = \boldsymbol{I}$, this reduces to standard Tikhonov regularization. Even when $\boldsymbol{A}$ is ill-conditioned, $\lambda$ stabilizes the solution.
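A sketch computing the closed-form minimizer \eqref{eq:12-14-3} via a linear solve rather than an explicit inverse, and verifying that the gradient from 12.13 vanishes at the solution (the problem sizes, $\lambda$, and $\boldsymbol{L} = \boldsymbol{I}$ are assumptions):

import numpy as np

rng = np.random.default_rng(0)
M, N = 8, 6
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)
L = np.eye(N)                                   # standard Tikhonov: L = I
lam = 0.3

# Closed-form minimizer (eq. 12-14-3), computed with a linear solve.
x_star = np.linalg.solve(A.T @ A + lam * L.T @ L, A.T @ y)

# The gradient (12.13) vanishes at the solution.
grad = 2 * A.T @ (A @ x_star - y) + 2 * lam * L.T @ (L @ x_star)
print(np.allclose(grad, 0.0))                   # True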

12.15 Subgradient of Total Variation Regularization

Formula: $\displaystyle\dfrac{\partial}{\partial \boldsymbol{x}}\text{TV}(\boldsymbol{x}) = -\text{div}\left(\displaystyle\dfrac{\nabla \boldsymbol{x}}{|\nabla \boldsymbol{x}|}\right)$
Conditions: $\text{TV}(\boldsymbol{x}) = \displaystyle\int |\nabla \boldsymbol{x}| \, d\Omega$ (continuous form)
Proof

The isotropic total variation is defined as:

\begin{equation} \text{TV}(\boldsymbol{x}) = \displaystyle\int_\Omega |\nabla \boldsymbol{x}| \, d\Omega = \displaystyle\int_\Omega \sqrt{|\partial_1 x|^2 + |\partial_2 x|^2} \, d\Omega \label{eq:12-15-1} \end{equation}

Using the calculus of variations, we perturb $\boldsymbol{x}$ to $\boldsymbol{x} + \varepsilon\boldsymbol{\phi}$, differentiate with respect to $\varepsilon$, and evaluate at $\varepsilon = 0$.

\begin{equation} \dfrac{d}{d\varepsilon}\text{TV}(\boldsymbol{x} + \varepsilon\boldsymbol{\phi})\bigg|_{\varepsilon=0} = \displaystyle\int_\Omega \dfrac{\nabla \boldsymbol{x} \cdot \nabla \boldsymbol{\phi}}{|\nabla \boldsymbol{x}|} \, d\Omega \label{eq:12-15-2} \end{equation}

Applying integration by parts (Green's theorem) and assuming the boundary terms vanish:

\begin{equation} = -\displaystyle\int_\Omega \text{div}\left(\dfrac{\nabla \boldsymbol{x}}{|\nabla \boldsymbol{x}|}\right) \boldsymbol{\phi} \, d\Omega \label{eq:12-15-3} \end{equation}

Therefore $\delta \text{TV} / \delta \boldsymbol{x} = -\text{div}(\nabla \boldsymbol{x} / |\nabla \boldsymbol{x}|)$.

Note: Non-differentiable where $|\nabla \boldsymbol{x}| = 0$. In practice, a smooth approximation $|\nabla \boldsymbol{x}|_\varepsilon = \sqrt{|\nabla \boldsymbol{x}|^2 + \varepsilon^2}$ is used.
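A discrete sketch of the smoothed TV subgradient from the note, using forward differences for the gradient and a matching discrete divergence (the image, $\varepsilon$, and the replicate boundary convention are assumptions; other boundary handling is possible):

import numpy as np

def grad2d(u):
    # Forward differences with replicate boundary (last difference is 0).
    gx = np.diff(u, axis=0, append=u[-1:, :])
    gy = np.diff(u, axis=1, append=u[:, -1:])
    return gx, gy

def div2d(px, py):
    # Discrete divergence (negative adjoint of grad2d under this boundary convention).
    dx = np.diff(px, axis=0, prepend=np.zeros((1, px.shape[1])))
    dy = np.diff(py, axis=1, prepend=np.zeros((py.shape[0], 1)))
    return dx + dy

def tv_subgradient(u, eps=1e-3):
    # -div( grad u / |grad u|_eps ) with the smooth approximation from the note.
    gx, gy = grad2d(u)
    norm = np.sqrt(gx**2 + gy**2 + eps**2)
    return -div2d(gx / norm, gy / norm)

u = np.zeros((8, 8)); u[:, 4:] = 1.0            # a step edge
print(tv_subgradient(u).round(2))               # the subgradient concentrates at the edge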