Neural Networks
Backpropagation formulas for each layer in deep learning.
All formulas on this page use the denominator layout. See Layout conventions for details.
Derivatives for Neural Networks
We derive the backpropagation formulas for the layer types used in deep learning. The gradient of the loss $L$ with respect to each parameter is computed, and the parameters are updated via gradient descent.
Activation and Loss Functions
Derivative formulas for activation functions and loss functions in neural networks. The proof numbering from Chapter 6 is preserved.
6.2 Jacobian of the Softmax Function
($\boldsymbol{p} = \mathrm{softmax}(\boldsymbol{x})$)
Proof
The $i$-th output of the softmax function is defined as follows.
\begin{equation} p_i = \dfrac{e^{x_i}}{S}, \qquad S = \displaystyle\sum_{k=0}^{N-1} e^{x_k} \label{eq:6-2-1} \end{equation}
We compute $\partial p_i / \partial x_j$ by cases. Case 1: $i = j$. Applying the quotient rule:
\begin{equation} \dfrac{\partial p_i}{\partial x_i} = \dfrac{e^{x_i} \cdot S - e^{x_i} \cdot e^{x_i}}{S^2} = \dfrac{e^{x_i}}{S} - \dfrac{e^{x_i}}{S} \cdot \dfrac{e^{x_i}}{S} = p_i - p_i^2 = p_i(1 - p_i) \label{eq:6-2-2} \end{equation}
Case 2: $i \neq j$. Since the numerator $e^{x_i}$ does not depend on $x_j$, only the denominator is differentiated.
\begin{equation} \dfrac{\partial p_i}{\partial x_j} = \dfrac{0 \cdot S - e^{x_i} \cdot e^{x_j}}{S^2} = -\dfrac{e^{x_i}}{S} \cdot \dfrac{e^{x_j}}{S} = -p_i \, p_j \label{eq:6-2-3} \end{equation}
Combining both cases using the Kronecker delta $\delta_{ij}$:
\begin{equation} \dfrac{\partial p_i}{\partial x_j} = p_i(\delta_{ij} - p_j) \label{eq:6-2-4} \end{equation}
Writing this in matrix form, since the $(i, j)$ entry of the Jacobian matrix is $p_i \delta_{ij} - p_i p_j$:
\begin{equation} \dfrac{\partial \boldsymbol{p}}{\partial \boldsymbol{x}} = \mathrm{diag}(\boldsymbol{p}) - \boldsymbol{p}\boldsymbol{p}^\top \label{eq:6-2-5} \end{equation}
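As a numerical sanity check, \eqref{eq:6-2-5} can be compared against finite differences. Below is a minimal NumPy sketch (variable names are illustrative, not from any particular library):

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))  # shift by the max for numerical stability
    return e / e.sum()

x = np.array([0.5, -1.0, 2.0])
p = softmax(x)

# Analytic Jacobian: diag(p) - p p^T, equation (6-2-5)
J = np.diag(p) - np.outer(p, p)

# Central-difference approximation, column by column
eps = 1e-6
J_num = np.stack(
    [(softmax(x + eps * np.eye(3)[j]) - softmax(x - eps * np.eye(3)[j])) / (2 * eps)
     for j in range(3)],
    axis=1,
)
assert np.allclose(J, J_num, atol=1e-6)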
6.3 Derivative of the Sigmoid Function
($\sigma(x) = 1 / (1 + e^{-x})$)
Proof
Since $\sigma$ is applied element-wise, the Jacobian matrix is diagonal. It suffices to compute the derivative of the component $\sigma(x_i)$.
\begin{equation} \sigma(x) = \dfrac{1}{1 + e^{-x}} = (1 + e^{-x})^{-1} \label{eq:6-3-1} \end{equation}
Applying the chain rule: let $u = 1 + e^{-x}$, so $\sigma = u^{-1}$:
\begin{equation} \dfrac{d\sigma}{dx} = -u^{-2} \cdot \dfrac{du}{dx} = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \dfrac{e^{-x}}{(1 + e^{-x})^2} \label{eq:6-3-2} \end{equation}
Rewriting in terms of $\sigma(x)$: from $\sigma(x) = 1/(1+e^{-x})$ we obtain the following relation.
\begin{equation} 1 - \sigma(x) = 1 - \dfrac{1}{1+e^{-x}} = \dfrac{e^{-x}}{1+e^{-x}} \label{eq:6-3-3} \end{equation}
Therefore:
\begin{equation} \dfrac{d\sigma}{dx} = \dfrac{1}{1+e^{-x}} \cdot \dfrac{e^{-x}}{1+e^{-x}} = \sigma(x)(1 - \sigma(x)) \label{eq:6-3-4} \end{equation}
Applying element-wise to a vector, the Jacobian matrix becomes diagonal.
\begin{equation} \dfrac{\partial \sigma(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x}))) \label{eq:6-3-5} \end{equation}
6.4 Derivative of the tanh Function
Proof
We recall the definition of $\tanh$.
\begin{equation} \tanh(x) = \dfrac{e^x - e^{-x}}{e^x + e^{-x}} \label{eq:6-4-1} \end{equation}
Applying the quotient rule with $f = e^x - e^{-x}$ and $g = e^x + e^{-x}$:
\begin{equation} \dfrac{d}{dx}\tanh(x) = \dfrac{f'g - fg'}{g^2} = \dfrac{(e^x + e^{-x})(e^x + e^{-x}) - (e^x - e^{-x})(e^x - e^{-x})}{(e^x + e^{-x})^2} \label{eq:6-4-2} \end{equation}
Expanding the numerator:
\begin{equation} (e^x + e^{-x})^2 - (e^x - e^{-x})^2 = 4 e^x e^{-x} = 4 \label{eq:6-4-3} \end{equation}
Here we used the identity $(a+b)^2 - (a-b)^2 = 4ab$ with $a = e^x$, $b = e^{-x}$, so $ab = 1$. Therefore:
\begin{equation} \dfrac{d}{dx}\tanh(x) = \dfrac{4}{(e^x + e^{-x})^2} = 1 - \tanh^2(x) \label{eq:6-4-4} \end{equation}
The last equality can be verified by substituting $\tanh^2(x) = (e^x - e^{-x})^2/(e^x + e^{-x})^2$. Applying element-wise to a vector:
\begin{equation} \dfrac{\partial \tanh(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(1 - \tanh^2(\boldsymbol{x})) \label{eq:6-4-5} \end{equation}
6.5 Derivative of the ReLU Function
($\mathrm{ReLU}(x) = \max(0, x)$)
Proof
ReLU is a piecewise linear function applied element-wise. We consider each component separately.
\begin{equation} \mathrm{ReLU}(x) = \max(0, x) = \begin{cases} x & (x > 0) \\ 0 & (x \leq 0) \end{cases} \label{eq:6-5-1} \end{equation}
When $x > 0$, $\mathrm{ReLU}(x) = x$ so the derivative is 1. When $x < 0$, $\mathrm{ReLU}(x) = 0$ so the derivative is 0.
\begin{equation} \dfrac{d}{dx}\mathrm{ReLU}(x) = \begin{cases} 1 & (x > 0) \\ 0 & (x < 0) \end{cases} = \mathbf{1}_{x > 0} \label{eq:6-5-2} \end{equation}
At $x = 0$, ReLU is not differentiable, but the subdifferential $[0, 1]$ exists. In deep learning implementations, the convention $\mathrm{ReLU}'(0) = 0$ is used.
Applying element-wise to a vector:
\begin{equation} \dfrac{\partial\, \mathrm{ReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\mathbf{1}_{x_i > 0}) \label{eq:6-5-3} \end{equation}
6.6 Derivative of the Leaky ReLU Function
($\mathrm{LeakyReLU}(x) = \max(\alpha x, x)$, $0 < \alpha < 1$)
Proof
Leaky ReLU is a variant of ReLU that has a small gradient $\alpha$ in the negative region.
\begin{equation} \mathrm{LeakyReLU}(x) = \begin{cases} x & (x > 0) \\ \alpha x & (x \leq 0) \end{cases} \label{eq:6-6-1} \end{equation}
Differentiating in each region:
\begin{equation} \dfrac{d}{dx}\mathrm{LeakyReLU}(x) = \begin{cases} 1 & (x > 0) \\ \alpha & (x \leq 0) \end{cases} = \mathbf{1}_{x > 0} + \alpha \cdot \mathbf{1}_{x \leq 0} \label{eq:6-6-2} \end{equation}
At $x = 0$, the left derivative $\alpha$ and the right derivative $1$ differ, so the function is not differentiable there in the strict sense; in practice, the convention $\mathrm{LeakyReLU}'(0) = \alpha$ is used ($0 < \alpha < 1$).
Applying element-wise to a vector:
\begin{equation} \dfrac{\partial\, \mathrm{LeakyReLU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\mathbf{1}_{x_i > 0} + \alpha \cdot \mathbf{1}_{x_i \leq 0}) \label{eq:6-6-3} \end{equation}
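The derivatives in 6.3 through 6.6 all act element-wise, so each Jacobian is diagonal and is stored as a vector in practice; backpropagation then reduces to an element-wise product with the upstream gradient. A minimal NumPy sketch of the four rules (function names are illustrative):

import numpy as np

def dsigmoid(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)                # sigma(x)(1 - sigma(x)), equation (6-3-4)

def dtanh(x):
    return 1.0 - np.tanh(x) ** 2        # equation (6-4-4)

def drelu(x):
    return (x > 0).astype(x.dtype)      # equation (6-5-2), with ReLU'(0) = 0

def dleaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # equation (6-6-2), derivative alpha at 0

# Applying the diagonal Jacobian is an element-wise product:
x = np.array([-2.0, 0.0, 3.0])
dL_dy = np.array([0.1, -0.2, 0.3])      # upstream gradient
dL_dx = dL_dy * drelu(x)                # diag(1_{x_i > 0}) applied to dL/dy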
6.7 Derivative of Cross-Entropy Loss (softmax + CE)
($\boldsymbol{p} = \mathrm{softmax}(\boldsymbol{x})$)
Proof
Writing the loss function in component form:
\begin{equation} L = -\boldsymbol{y}^\top \log \boldsymbol{p} = -\displaystyle\sum_{i=0}^{N-1} y_i \log p_i \label{eq:6-7-1} \end{equation}
Taking the partial derivative of $L$ with respect to $x_j$ and applying the chain rule:
\begin{equation} \dfrac{\partial L}{\partial x_j} = -\displaystyle\sum_{i=0}^{N-1} y_i \cdot \dfrac{1}{p_i} \cdot \dfrac{\partial p_i}{\partial x_j} \label{eq:6-7-2} \end{equation}
Substituting the softmax Jacobian (6.2, equation \eqref{eq:6-2-4}): since $\partial p_i / \partial x_j = p_i(\delta_{ij} - p_j)$:
\begin{equation} \dfrac{\partial L}{\partial x_j} = -\displaystyle\sum_{i} y_i \cdot \dfrac{1}{p_i} \cdot p_i(\delta_{ij} - p_j) = -\displaystyle\sum_{i} y_i (\delta_{ij} - p_j) \label{eq:6-7-3} \end{equation}
Expanding the sum: since $\displaystyle\sum_i y_i \delta_{ij} = y_j$:
\begin{equation} \dfrac{\partial L}{\partial x_j} = -y_j + p_j \displaystyle\sum_{i} y_i \label{eq:6-7-4} \end{equation}
When $\boldsymbol{y}$ is a probability distribution, $\displaystyle\sum_i y_i = 1$, so:
\begin{equation} \dfrac{\partial L}{\partial x_j} = p_j - y_j \label{eq:6-7-5} \end{equation}
In vector form:
\begin{equation} \dfrac{\partial L}{\partial \boldsymbol{x}} = \boldsymbol{p} - \boldsymbol{y} \label{eq:6-7-6} \end{equation}
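The cancellation to $\boldsymbol{p} - \boldsymbol{y}$ is one reason softmax and cross-entropy are usually fused into a single layer in implementations. A minimal sketch with an illustrative one-hot target:

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

x = np.array([1.0, 2.0, 0.5])
y = np.array([0.0, 1.0, 0.0])   # one-hot target

p = softmax(x)
loss = -np.sum(y * np.log(p))
grad = p - y                    # equation (6-7-6)

# Central-difference check on the first coordinate
eps = 1e-6
e0 = eps * np.eye(3)[0]
num = (-np.sum(y * np.log(softmax(x + e0)))
       + np.sum(y * np.log(softmax(x - e0)))) / (2 * eps)
assert np.isclose(grad[0], num)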
6.8 Derivative of Binary Cross-Entropy Loss (sigmoid + BCE)
($\mathrm{BCE} = -y\log\sigma(x) - (1-y)\log(1-\sigma(x))$)
Proof
Writing the loss function explicitly, let $p = \sigma(x)$.
\begin{equation} L = -y \log p - (1 - y)\log(1 - p) \label{eq:6-8-1} \end{equation}
Applying the chain rule with $dp/dx = \sigma(x)(1 - \sigma(x)) = p(1-p)$ (6.3):
\begin{equation} \dfrac{dL}{dx} = \left(-\dfrac{y}{p} + \dfrac{1 - y}{1 - p}\right) \dfrac{dp}{dx} = \left(-\dfrac{y}{p} + \dfrac{1 - y}{1 - p}\right) p(1 - p) \label{eq:6-8-2} \end{equation}
Simplifying each term:
\begin{equation} \dfrac{dL}{dx} = -y(1 - p) + (1 - y)p = -y + yp + p - yp = p - y \label{eq:6-8-3} \end{equation}
Therefore:
\begin{equation} \dfrac{dL}{dx} = \sigma(x) - y \label{eq:6-8-4} \end{equation}
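As with softmax + cross-entropy, fusing the sigmoid into the loss yields both a simple gradient and a numerically stable loss. A sketch (the logaddexp form of the loss is an algebraic rearrangement of \eqref{eq:6-8-1}, not part of the derivation above):

import numpy as np

def bce_with_logits_grad(x, y):
    return 1.0 / (1.0 + np.exp(-x)) - y           # sigma(x) - y, equation (6-8-4)

def bce_with_logits_loss(x, y):
    # Stable form: L = log(1 + e^{-x}) + (1 - y) x
    return np.logaddexp(0.0, -x) + (1.0 - y) * x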
6.9 Derivative of the GELU Function
($\mathrm{GELU}(x) = x \cdot \Phi(x)$, $\Phi$: standard normal CDF, $\phi$: standard normal PDF)
Proof
We recall the definition of GELU (Gaussian Error Linear Unit).
\begin{equation} \mathrm{GELU}(x) = x \, \Phi(x) \label{eq:6-9-1} \end{equation}
Applying the product rule: since $\Phi'(x) = \phi(x)$ (the derivative of the CDF is the PDF):
\begin{equation} \dfrac{d}{dx}\mathrm{GELU}(x) = \Phi(x) + x \, \phi(x) \label{eq:6-9-2} \end{equation}
Applying element-wise to a vector:
\begin{equation} \dfrac{\partial\, \mathrm{GELU}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\Phi(\boldsymbol{x}) + \boldsymbol{x} \odot \phi(\boldsymbol{x})) \label{eq:6-9-3} \end{equation}
6.10 Derivative of the Swish (SiLU) Function
($\mathrm{Swish}(x) = x \cdot \sigma(x)$)
Proof
We recall the definition of Swish (SiLU: Sigmoid Linear Unit).
\begin{equation} \mathrm{Swish}(x) = x \, \sigma(x) \label{eq:6-10-1} \end{equation}
Applying the product rule: since $\sigma'(x) = \sigma(x)(1 - \sigma(x))$ (6.3):
\begin{equation} \dfrac{d}{dx}\mathrm{Swish}(x) = \sigma(x) + x \, \sigma'(x) = \sigma(x) + x \, \sigma(x)(1 - \sigma(x)) \label{eq:6-10-2} \end{equation}
Factoring out $\sigma(x)$, this can also be written as:
\begin{equation} \dfrac{d}{dx}\mathrm{Swish}(x) = \sigma(x)\bigl[1 + x(1 - \sigma(x))\bigr] \label{eq:6-10-3} \end{equation}
Applying element-wise to a vector:
\begin{equation} \dfrac{\partial\, \mathrm{Swish}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \mathrm{diag}(\sigma(\boldsymbol{x}) + \boldsymbol{x} \odot \sigma(\boldsymbol{x}) \odot (1 - \sigma(\boldsymbol{x}))) \label{eq:6-10-4} \end{equation}
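A sketch of both derivatives; it assumes SciPy is available for $\Phi$ and $\phi$ (erf-based formulas work equally well):

import numpy as np
from scipy.stats import norm

def dgelu(x):
    return norm.cdf(x) + x * norm.pdf(x)   # Phi(x) + x phi(x), equation (6-9-2)

def dswish(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 + x * (1.0 - s))       # equation (6-10-3)

x = np.linspace(-3.0, 3.0, 7)
print(dgelu(x))
print(dswish(x))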
Fully Connected Layer, Batch Normalization, and Attention
Backpropagation formulas for the fundamental building blocks of neural networks.
17.1 Weight Gradient of the Fully Connected Layer
Proof
The forward pass of a fully connected layer is expressed as a matrix product.
\begin{equation}\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W} + \boldsymbol{1}_N \boldsymbol{b}^\top \label{eq:17-1-1}\end{equation}
Writing \eqref{eq:17-1-1} in component form, the $(n, j)$ entry of the output is:
\begin{equation}Y_{nj} = \displaystyle\sum_{k=0}^{D_{\text{in}}-1} X_{nk} W_{kj} + b_j \label{eq:17-1-2}\end{equation}
In \eqref{eq:17-1-2}, $Y_{nj}$ depends on $W_{ij}$ only when $k = i$. Computing the partial derivative:
\begin{equation}\dfrac{\partial Y_{nj}}{\partial W_{ij}} = \dfrac{\partial}{\partial W_{ij}} \left( \displaystyle\sum_{k} X_{nk} W_{kj} + b_j \right) = X_{ni} \label{eq:17-1-3}\end{equation}
The loss $L$ depends on $\boldsymbol{W}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial W_{ij}} = \displaystyle\sum_{n,k} \dfrac{\partial L}{\partial Y_{nk}} \dfrac{\partial Y_{nk}}{\partial W_{ij}} \label{eq:17-1-4}\end{equation}
From \eqref{eq:17-1-3}, $\displaystyle\dfrac{\partial Y_{nk}}{\partial W_{ij}}$ is nonzero only when $k = j$. Thus \eqref{eq:17-1-4} simplifies to:
\begin{equation}\dfrac{\partial L}{\partial W_{ij}} = \displaystyle\sum_n \dfrac{\partial L}{\partial Y_{nj}} \cdot X_{ni} \label{eq:17-1-5}\end{equation}
Interpreting \eqref{eq:17-1-5} as a matrix product entry: since $(\boldsymbol{X}^\top)_{in} = X_{ni}$:
\begin{equation}\dfrac{\partial L}{\partial W_{ij}} = \displaystyle\sum_n (\boldsymbol{X}^\top)_{in} \left(\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{nj} = \left( \boldsymbol{X}^\top \dfrac{\partial L}{\partial \boldsymbol{Y}} \right)_{ij} \label{eq:17-1-6}\end{equation}
Since \eqref{eq:17-1-6} holds for all $(i, j)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{W}} = \boldsymbol{X}^\top \dfrac{\partial L}{\partial \boldsymbol{Y}} \label{eq:17-1-7}\end{equation}
17.2 Bias Gradient of the Fully Connected Layer
Proof
From \eqref{eq:17-1-2} of 17.1, the component representation of the output is:
\begin{equation}Y_{nj} = \displaystyle\sum_{k} X_{nk} W_{kj} + b_j \label{eq:17-2-1}\end{equation}
In \eqref{eq:17-2-1}, $Y_{nj}$ depends on $b_i$ only when $j = i$. Computing the partial derivative:
\begin{equation}\dfrac{\partial Y_{nj}}{\partial b_i} = \dfrac{\partial}{\partial b_i} \left( \displaystyle\sum_{k} X_{nk} W_{kj} + b_j \right) = \delta_{ij} \label{eq:17-2-2}\end{equation}
Here $\delta_{ij}$ is the Kronecker delta, equal to 1 when $j = i$ and 0 otherwise.
The loss $L$ depends on $\boldsymbol{b}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial b_i} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \dfrac{\partial Y_{nj}}{\partial b_i} \label{eq:17-2-3}\end{equation}
Substituting \eqref{eq:17-2-2} into \eqref{eq:17-2-3}: $\delta_{ij}$ retains only the $j = i$ terms.
\begin{equation}\dfrac{\partial L}{\partial b_i} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \delta_{ij} = \displaystyle\sum_n \dfrac{\partial L}{\partial Y_{ni}} \label{eq:17-2-4}\end{equation}
\eqref{eq:17-2-4} is a sum over the batch direction ($n = 0, 1, \ldots, N-1$). In vector form, $\left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{n,:}$ denotes the $n$-th row (row vector) of the gradient matrix.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{b}} = \displaystyle\sum_{n=0}^{N-1} \left(\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{n,:}^\top \label{eq:17-2-5}\end{equation}
In NumPy, this batch-direction sum corresponds to np.sum(dY, axis=0).
17.3 Input Gradient of the Fully Connected Layer
Proof
Writing the forward pass in component form (the bias term is omitted since it does not affect the input gradient):
\begin{equation}Y_{nj} = \displaystyle\sum_{k=0}^{D_{\text{in}}-1} X_{nk} W_{kj} \label{eq:17-3-1}\end{equation}
In \eqref{eq:17-3-1}, $Y_{nj}$ depends on $X_{mi}$ only when $n = m$ and $k = i$. Computing the partial derivative:
\begin{equation}\dfrac{\partial Y_{nj}}{\partial X_{mi}} = \delta_{nm} W_{ij} \label{eq:17-3-2}\end{equation}
Here $\delta_{nm}$ is the Kronecker delta, equal to 1 when $n = m$ and 0 otherwise.
The loss $L$ depends on $\boldsymbol{X}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial X_{mi}} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \dfrac{\partial Y_{nj}}{\partial X_{mi}} \label{eq:17-3-3}\end{equation}
Substituting \eqref{eq:17-3-2} into \eqref{eq:17-3-3}: $\delta_{nm}$ retains only the $n = m$ terms.
\begin{equation}\dfrac{\partial L}{\partial X_{mi}} = \displaystyle\sum_{n,j} \dfrac{\partial L}{\partial Y_{nj}} \delta_{nm} W_{ij} = \displaystyle\sum_j \dfrac{\partial L}{\partial Y_{mj}} W_{ij} \label{eq:17-3-4}\end{equation}
Interpreting \eqref{eq:17-3-4} as a matrix product entry: since $(\boldsymbol{W}^\top)_{ji} = W_{ij}$:
\begin{equation}\dfrac{\partial L}{\partial X_{mi}} = \displaystyle\sum_j \left(\dfrac{\partial L}{\partial \boldsymbol{Y}}\right)_{mj} (\boldsymbol{W}^\top)_{ji} = \left( \dfrac{\partial L}{\partial \boldsymbol{Y}} \boldsymbol{W}^\top \right)_{mi} \label{eq:17-3-5}\end{equation}
Since \eqref{eq:17-3-5} holds for all $(m, i)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{X}} = \dfrac{\partial L}{\partial \boldsymbol{Y}} \boldsymbol{W}^\top \label{eq:17-3-6}\end{equation}
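Putting 17.1 through 17.3 together, the entire backward pass of a fully connected layer is three matrix products. A minimal NumPy sketch with $\boldsymbol{X} \in \mathbb{R}^{N \times D_{\text{in}}}$ and $\boldsymbol{W} \in \mathbb{R}^{D_{\text{in}} \times D_{\text{out}}}$ (random data, for shape illustration only):

import numpy as np

rng = np.random.default_rng(0)
N, D_in, D_out = 4, 5, 3
X = rng.normal(size=(N, D_in))
W = rng.normal(size=(D_in, D_out))
b = rng.normal(size=D_out)

Y = X @ W + b                  # forward pass, equation (17-1-1)
dY = rng.normal(size=Y.shape)  # upstream gradient dL/dY

dW = X.T @ dY                  # equation (17-1-7)
db = dY.sum(axis=0)            # equation (17-2-5)
dX = dY @ W.T                  # equation (17-3-6)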
17.4 Scale Gradient of Batch Normalization
Proof
The output of batch normalization is obtained by transforming the normalized value $\hat{x}_n$ with scale $\gamma$ and shift $\beta$.
\begin{equation}y_n = \gamma \hat{x}_n + \beta \label{eq:17-4-1}\end{equation}
In \eqref{eq:17-4-1}, $\gamma$ is a parameter shared across all samples. Taking the partial derivative of $y_n$ with respect to $\gamma$:
\begin{equation}\dfrac{\partial y_n}{\partial \gamma} = \dfrac{\partial}{\partial \gamma} (\gamma \hat{x}_n + \beta) = \hat{x}_n \label{eq:17-4-2}\end{equation}
The loss $L$ depends on $\gamma$ through the outputs $y_n$ ($n = 0, 1, \ldots, N-1$). Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial \gamma} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \dfrac{\partial y_n}{\partial \gamma} \label{eq:17-4-3}\end{equation}
Substituting \eqref{eq:17-4-2} into \eqref{eq:17-4-3} gives the final result.
\begin{equation}\dfrac{\partial L}{\partial \gamma} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \hat{x}_n \label{eq:17-4-4}\end{equation}
17.5 Shift Gradient of Batch Normalization
Proof
From \eqref{eq:17-4-1} of 17.4, the output of batch normalization is:
\begin{equation}y_n = \gamma \hat{x}_n + \beta \label{eq:17-5-1}\end{equation}
In \eqref{eq:17-5-1}, $\beta$ is a parameter shared across all samples. Taking the partial derivative of $y_n$ with respect to $\beta$:
\begin{equation}\dfrac{\partial y_n}{\partial \beta} = \dfrac{\partial}{\partial \beta} (\gamma \hat{x}_n + \beta) = 1 \label{eq:17-5-2}\end{equation}
The loss $L$ depends on $\beta$ through the outputs $y_n$ ($n = 0, 1, \ldots, N-1$). Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial \beta} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \dfrac{\partial y_n}{\partial \beta} \label{eq:17-5-3}\end{equation}
Substituting \eqref{eq:17-5-2} into \eqref{eq:17-5-3} gives the final result.
\begin{equation}\dfrac{\partial L}{\partial \beta} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \cdot 1 = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial y_n} \label{eq:17-5-4}\end{equation}
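Both batch-normalization parameter gradients are single reductions over the batch. A sketch for one feature (random data, names illustrative):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=8)                           # one feature over a batch of 8
xhat = (x - x.mean()) / np.sqrt(x.var() + 1e-5)  # normalized values from the forward pass
dY = rng.normal(size=8)                          # upstream gradient dL/dy

dgamma = np.sum(dY * xhat)                       # equation (17-4-4)
dbeta = np.sum(dY)                               # equation (17-5-4)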
17.6 Input Gradient of Batch Normalization
Proof
In batch normalization, the input $x_i$ affects the batch statistics $\mu$, $\sigma^2$ and the normalized value $\hat{x}_i$. The computation graph is as follows.
\begin{equation}\mu = \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} x_n \label{eq:17-6-1}\end{equation}
\begin{equation}\sigma^2 = \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} (x_n - \mu)^2 \label{eq:17-6-2}\end{equation}
\begin{equation}\hat{x}_n = \dfrac{x_n - \mu}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-3}\end{equation}
\begin{equation}y_n = \gamma \hat{x}_n + \beta \label{eq:17-6-4}\end{equation}
[Step 1: Gradient with respect to $\hat{x}_i$]
From \eqref{eq:17-6-4}, taking the partial derivative of $y_i$ with respect to $\hat{x}_i$:
\begin{equation}\dfrac{\partial y_i}{\partial \hat{x}_i} = \gamma \label{eq:17-6-5}\end{equation}
By the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial \hat{x}_i} = \dfrac{\partial L}{\partial y_i} \dfrac{\partial y_i}{\partial \hat{x}_i} = \dfrac{\partial L}{\partial y_i} \gamma \label{eq:17-6-6}\end{equation}
[Step 2: Gradient with respect to $\sigma^2$]
From \eqref{eq:17-6-3}, taking the partial derivative of $\hat{x}_n$ with respect to $\sigma^2$: since $\sqrt{\sigma^2 + \epsilon} = (\sigma^2 + \epsilon)^{1/2}$:
\begin{equation}\dfrac{\partial \hat{x}_n}{\partial \sigma^2} = (x_n - \mu) \cdot \left(-\dfrac{1}{2}\right)(\sigma^2 + \epsilon)^{-3/2} \label{eq:17-6-7}\end{equation}
Applying the chain rule (1.26) over all $n$ and taking the sum:
\begin{equation}\dfrac{\partial L}{\partial \sigma^2} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} \dfrac{\partial \hat{x}_n}{\partial \sigma^2} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} (x_n - \mu) \cdot \left(-\dfrac{1}{2}\right)(\sigma^2 + \epsilon)^{-3/2} \label{eq:17-6-8}\end{equation}
[Step 3: Gradient with respect to $\mu$]
$\mu$ affects $\hat{x}_n$ directly and also indirectly through $\sigma^2$. First, computing the direct effect on $\hat{x}_n$: from \eqref{eq:17-6-3}:
\begin{equation}\dfrac{\partial \hat{x}_n}{\partial \mu} = \dfrac{-1}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-9}\end{equation}
Next, computing the effect through $\sigma^2$: from \eqref{eq:17-6-2}:
\begin{equation}\dfrac{\partial \sigma^2}{\partial \mu} = \dfrac{1}{N}\displaystyle\sum_{n=0}^{N-1} 2(x_n - \mu)(-1) = \dfrac{-2}{N}\displaystyle\sum_{n=0}^{N-1}(x_n - \mu) \label{eq:17-6-10}\end{equation}
Since $\displaystyle\sum_n (x_n - \mu) = \displaystyle\sum_n x_n - N\mu = N\mu - N\mu = 0$, \eqref{eq:17-6-10} equals 0.
\begin{equation}\dfrac{\partial \sigma^2}{\partial \mu} = 0 \label{eq:17-6-11}\end{equation}
Therefore, the gradient with respect to $\mu$ comes only from the direct path.
\begin{equation}\dfrac{\partial L}{\partial \mu} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} \dfrac{\partial \hat{x}_n}{\partial \mu} = \displaystyle\sum_{n=0}^{N-1} \dfrac{\partial L}{\partial \hat{x}_n} \cdot \dfrac{-1}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-12}\end{equation}
[Step 4: Gradient with respect to $x_i$]
$x_i$ affects $\hat{x}_i$, $\mu$, and $\sigma^2$. We compute the contribution from each path.
Direct effect on $\hat{x}_i$: from \eqref{eq:17-6-3}:
\begin{equation}\dfrac{\partial \hat{x}_i}{\partial x_i} = \dfrac{1}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-6-13}\end{equation}
Effect on $\mu$: from \eqref{eq:17-6-1}:
\begin{equation}\dfrac{\partial \mu}{\partial x_i} = \dfrac{1}{N} \label{eq:17-6-14}\end{equation}
Effect on $\sigma^2$: from \eqref{eq:17-6-2}:
\begin{equation}\dfrac{\partial \sigma^2}{\partial x_i} = \dfrac{1}{N} \cdot 2(x_i - \mu) = \dfrac{2(x_i - \mu)}{N} \label{eq:17-6-15}\end{equation}
Summing all paths using the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{\partial L}{\partial \hat{x}_i} \dfrac{\partial \hat{x}_i}{\partial x_i} + \dfrac{\partial L}{\partial \mu} \dfrac{\partial \mu}{\partial x_i} + \dfrac{\partial L}{\partial \sigma^2} \dfrac{\partial \sigma^2}{\partial x_i} \label{eq:17-6-16}\end{equation}
Substituting \eqref{eq:17-6-6}, \eqref{eq:17-6-12}, \eqref{eq:17-6-8}, \eqref{eq:17-6-13}, \eqref{eq:17-6-14}, \eqref{eq:17-6-15} into \eqref{eq:17-6-16}:
\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{\partial L}{\partial \hat{x}_i} \cdot \dfrac{1}{\sqrt{\sigma^2 + \epsilon}} + \dfrac{\partial L}{\partial \mu} \cdot \dfrac{1}{N} + \dfrac{\partial L}{\partial \sigma^2} \cdot \dfrac{2(x_i - \mu)}{N} \label{eq:17-6-17}\end{equation}
Simplifying \eqref{eq:17-6-17}: from \eqref{eq:17-6-6}, $\displaystyle\dfrac{\partial L}{\partial \hat{x}_i} = \gamma \displaystyle\dfrac{\partial L}{\partial y_i}$. Also, from $\hat{x}_i = \displaystyle\dfrac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$, we have $(x_i - \mu) = \hat{x}_i \sqrt{\sigma^2 + \epsilon}$. Substituting these and simplifying gives the final result.
\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{\gamma}{\sqrt{\sigma^2 + \epsilon}} \left( \dfrac{\partial L}{\partial y_i} - \dfrac{1}{N}\displaystyle\sum_{j=0}^{N-1} \dfrac{\partial L}{\partial y_j} - \dfrac{\hat{x}_i}{N}\displaystyle\sum_{j=0}^{N-1} \dfrac{\partial L}{\partial y_j}\hat{x}_j \right) \label{eq:17-6-18}\end{equation}
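Equation \eqref{eq:17-6-18} vectorizes directly. A minimal sketch for one feature over a batch (names illustrative):

import numpy as np

def batchnorm_backward(dY, x, gamma, eps=1e-5):
    # Equation (17-6-18): the xhat, mu, and sigma^2 paths combined.
    var = x.var()
    xhat = (x - x.mean()) / np.sqrt(var + eps)
    return gamma / np.sqrt(var + eps) * (
        dY - dY.mean() - xhat * np.mean(dY * xhat)
    )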
17.7 Input Gradient of Layer Normalization
Proof
In layer normalization, the statistics are computed over the feature dimension rather than the batch dimension. The computation within a single sample is as follows.
\begin{equation}\mu = \dfrac{1}{D}\displaystyle\sum_{d=0}^{D-1} x_d \label{eq:17-7-1}\end{equation}
\begin{equation}\sigma^2 = \dfrac{1}{D}\displaystyle\sum_{d=0}^{D-1} (x_d - \mu)^2 \label{eq:17-7-2}\end{equation}
\begin{equation}\hat{x}_d = \dfrac{x_d - \mu}{\sqrt{\sigma^2 + \epsilon}} \label{eq:17-7-3}\end{equation}
\begin{equation}y_d = \gamma_d \hat{x}_d + \beta_d \label{eq:17-7-4}\end{equation}
The derivation follows the same pattern as 17.6; the statistics are simply computed over the feature direction instead of the batch direction.
In the derivation from \eqref{eq:17-6-1} through \eqref{eq:17-6-18} of 17.6, we replace the batch size $N$ with the feature dimension $D$. One caveat: the scale is now per-dimension, so Step 1 gives $\dfrac{\partial L}{\partial \hat{x}_j} = \gamma_j \dfrac{\partial L}{\partial y_j}$, and $\gamma_j$ remains inside the sums. The final result is:
\begin{equation}\dfrac{\partial L}{\partial x_i} = \dfrac{1}{\sqrt{\sigma^2 + \epsilon}} \left( \gamma_i \dfrac{\partial L}{\partial y_i} - \dfrac{1}{D}\displaystyle\sum_{j=0}^{D-1} \gamma_j \dfrac{\partial L}{\partial y_j} - \dfrac{\hat{x}_i}{D}\displaystyle\sum_{j=0}^{D-1} \gamma_j \dfrac{\partial L}{\partial y_j}\hat{x}_j \right) \label{eq:17-7-5}\end{equation}
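The same code as batch normalization, with the reduction axis moved to the feature dimension, gives the layer-normalization backward pass. A sketch for inputs of shape $(N, D)$ with a length-$D$ scale vector:

import numpy as np

def layernorm_backward(dY, x, gamma, eps=1e-5):
    # Equation (17-7-5); statistics are per row.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    xhat = (x - mu) / np.sqrt(var + eps)
    g = dY * gamma                      # dL/dxhat: gamma_j stays inside the sums
    return (g - g.mean(axis=-1, keepdims=True)
              - xhat * (g * xhat).mean(axis=-1, keepdims=True)) / np.sqrt(var + eps)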
17.8 Value Gradient of Attention
Proof
The attention output is defined as the matrix product of attention weights $\boldsymbol{A}$ and values $\boldsymbol{V}$.
\begin{equation}\boldsymbol{O} = \boldsymbol{A}\boldsymbol{V} \label{eq:17-8-1}\end{equation}
Writing \eqref{eq:17-8-1} in component form, the $(i, j)$ entry of the output is:
\begin{equation}O_{ij} = \displaystyle\sum_{k=0}^{n-1} A_{ik} V_{kj} \label{eq:17-8-2}\end{equation}
In \eqref{eq:17-8-2}, $O_{ij}$ depends on $V_{pq}$ only when $k = p$ and $j = q$. Computing the partial derivative:
\begin{equation}\dfrac{\partial O_{ij}}{\partial V_{pq}} = A_{ip} \delta_{jq} \label{eq:17-8-3}\end{equation}
Here $\delta_{jq}$ is the Kronecker delta, equal to 1 when $j = q$ and 0 otherwise.
The loss $L$ depends on $\boldsymbol{V}$ through the output $\boldsymbol{O}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial V_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} \dfrac{\partial O_{ij}}{\partial V_{pq}} \label{eq:17-8-4}\end{equation}
Substituting \eqref{eq:17-8-3} into \eqref{eq:17-8-4}: $\delta_{jq}$ retains only the $j = q$ terms.
\begin{equation}\dfrac{\partial L}{\partial V_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} A_{ip} \delta_{jq} = \displaystyle\sum_i A_{ip} \dfrac{\partial L}{\partial O_{iq}} \label{eq:17-8-5}\end{equation}
Interpreting \eqref{eq:17-8-5} as a matrix product entry: since $(\boldsymbol{A}^\top)_{pi} = A_{ip}$:
\begin{equation}\dfrac{\partial L}{\partial V_{pq}} = \displaystyle\sum_i (\boldsymbol{A}^\top)_{pi} \left(\dfrac{\partial L}{\partial \boldsymbol{O}}\right)_{iq} = \left( \boldsymbol{A}^\top \dfrac{\partial L}{\partial \boldsymbol{O}} \right)_{pq} \label{eq:17-8-6}\end{equation}
Since \eqref{eq:17-8-6} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{V}} = \boldsymbol{A}^\top \dfrac{\partial L}{\partial \boldsymbol{O}} \label{eq:17-8-7}\end{equation}
17.9 Gradient of Attention Weights
Proof
From \eqref{eq:17-8-2} of 17.8, the component representation of the attention output is:
\begin{equation}O_{ij} = \displaystyle\sum_{k=0}^{n-1} A_{ik} V_{kj} \label{eq:17-9-1}\end{equation}
In \eqref{eq:17-9-1}, $O_{ij}$ depends on $A_{pq}$ only when $i = p$ and $k = q$. Computing the partial derivative:
\begin{equation}\dfrac{\partial O_{ij}}{\partial A_{pq}} = \delta_{ip} V_{qj} \label{eq:17-9-2}\end{equation}
Here $\delta_{ip}$ is the Kronecker delta.
The loss $L$ depends on $\boldsymbol{A}$ through the output $\boldsymbol{O}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial A_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} \dfrac{\partial O_{ij}}{\partial A_{pq}} \label{eq:17-9-3}\end{equation}
Substituting \eqref{eq:17-9-2} into \eqref{eq:17-9-3}: $\delta_{ip}$ retains only the $i = p$ terms.
\begin{equation}\dfrac{\partial L}{\partial A_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial O_{ij}} \delta_{ip} V_{qj} = \displaystyle\sum_j \dfrac{\partial L}{\partial O_{pj}} V_{qj} \label{eq:17-9-4}\end{equation}
Interpreting \eqref{eq:17-9-4} as a matrix product entry: since $(\boldsymbol{V}^\top)_{qj} = V_{qj}$:
\begin{equation}\dfrac{\partial L}{\partial A_{pq}} = \displaystyle\sum_j \left(\dfrac{\partial L}{\partial \boldsymbol{O}}\right)_{pj} (\boldsymbol{V}^\top)_{qj} = \left( \dfrac{\partial L}{\partial \boldsymbol{O}} \boldsymbol{V}^\top \right)_{pq} \label{eq:17-9-5}\end{equation}
Since \eqref{eq:17-9-5} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{A}} = \dfrac{\partial L}{\partial \boldsymbol{O}} \boldsymbol{V}^\top \label{eq:17-9-6}\end{equation}
17.10 Score Gradient (pre-softmax)
Proof
The attention weights $\boldsymbol{A}$ are computed as the row-wise softmax of the scores $\boldsymbol{S}$.
\begin{equation}\boldsymbol{A} = \text{softmax}(\boldsymbol{S}) \label{eq:17-10-1}\end{equation}
From the definition of softmax, the entry $A_{ij}$ of row $i$ is:
\begin{equation}A_{ij} = \dfrac{e^{S_{ij}}}{\displaystyle\sum_{l=0}^{n-1} e^{S_{il}}} \label{eq:17-10-2}\end{equation}
We compute the Jacobian of the softmax. Taking the partial derivative of $A_{ij}$ with respect to $S_{ik}$, separating the cases $j = k$ and $j \neq k$:
When $j = k$: applying the quotient rule (1.28):
\begin{equation}\dfrac{\partial A_{ij}}{\partial S_{ij}} = \dfrac{e^{S_{ij}} \displaystyle\sum_l e^{S_{il}} - e^{S_{ij}} e^{S_{ij}}}{(\displaystyle\sum_l e^{S_{il}})^2} = A_{ij} - A_{ij}^2 = A_{ij}(1 - A_{ij}) \label{eq:17-10-3}\end{equation}
When $j \neq k$:
\begin{equation}\dfrac{\partial A_{ij}}{\partial S_{ik}} = \dfrac{0 \cdot \displaystyle\sum_l e^{S_{il}} - e^{S_{ij}} e^{S_{ik}}}{(\displaystyle\sum_l e^{S_{il}})^2} = -A_{ij} A_{ik} \label{eq:17-10-4}\end{equation}
Unifying \eqref{eq:17-10-3} and \eqref{eq:17-10-4} using the Kronecker delta:
\begin{equation}\dfrac{\partial A_{ij}}{\partial S_{ik}} = A_{ij}(\delta_{jk} - A_{ik}) \label{eq:17-10-5}\end{equation}
The loss $L$ depends on $\boldsymbol{S}$ through $\boldsymbol{A}$. Applying the chain rule (1.26) for row $i$:
\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = \displaystyle\sum_{j=0}^{n-1} \dfrac{\partial L}{\partial A_{ij}} \dfrac{\partial A_{ij}}{\partial S_{ik}} \label{eq:17-10-6}\end{equation}
Substituting \eqref{eq:17-10-5} into \eqref{eq:17-10-6}:
\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = \displaystyle\sum_j \dfrac{\partial L}{\partial A_{ij}} A_{ij}(\delta_{jk} - A_{ik}) \label{eq:17-10-7}\end{equation}
Expanding \eqref{eq:17-10-7}: splitting into the first term where $\delta_{jk}$ retains only $j = k$, and the second term which is a sum over all $j$:
\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = \dfrac{\partial L}{\partial A_{ik}} A_{ik} - A_{ik} \displaystyle\sum_j \dfrac{\partial L}{\partial A_{ij}} A_{ij} \label{eq:17-10-8}\end{equation}
Factoring out $A_{ik}$ from \eqref{eq:17-10-8}:
\begin{equation}\dfrac{\partial L}{\partial S_{ik}} = A_{ik} \left( \dfrac{\partial L}{\partial A_{ik}} - \displaystyle\sum_j \dfrac{\partial L}{\partial A_{ij}} A_{ij} \right) \label{eq:17-10-9}\end{equation}
Writing \eqref{eq:17-10-9} in matrix form: $\displaystyle\sum_j \displaystyle\dfrac{\partial L}{\partial A_{ij}} A_{ij}$ is the row $i$ sum of $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{A}} \odot \boldsymbol{A}$, denoted $\text{rowsum}(\cdot)$.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{S}} = \boldsymbol{A} \odot \left( \dfrac{\partial L}{\partial \boldsymbol{A}} - \text{rowsum}\left(\dfrac{\partial L}{\partial \boldsymbol{A}} \odot \boldsymbol{A}\right) \boldsymbol{1}^\top \right) \label{eq:17-10-10}\end{equation}
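In NumPy, \eqref{eq:17-10-10} is a one-liner; keepdims plays the role of the $\boldsymbol{1}^\top$ broadcast. A sketch:

import numpy as np

def score_grad(dA, A):
    # dL/dS = A * (dL/dA - rowsum(dL/dA * A)), equation (17-10-10)
    return A * (dA - (dA * A).sum(axis=-1, keepdims=True))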
17.11 Query Gradient
Proof
The attention score $\boldsymbol{S}$ is the scaled inner product of Query and Key.
\begin{equation}\boldsymbol{S} = \dfrac{1}{\sqrt{d_k}} \boldsymbol{Q}\boldsymbol{K}^\top \label{eq:17-11-1}\end{equation}
Writing \eqref{eq:17-11-1} in component form, the $(i, j)$ entry of the score is:
\begin{equation}S_{ij} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{k=0}^{d_k-1} Q_{ik} K_{jk} \label{eq:17-11-2}\end{equation}
In \eqref{eq:17-11-2}, $S_{ij}$ depends on $Q_{pq}$ only when $i = p$ and $k = q$. Computing the partial derivative:
\begin{equation}\dfrac{\partial S_{ij}}{\partial Q_{pq}} = \dfrac{1}{\sqrt{d_k}} \delta_{ip} K_{jq} \label{eq:17-11-3}\end{equation}
The loss $L$ depends on $\boldsymbol{Q}$ through the scores $\boldsymbol{S}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial Q_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} \dfrac{\partial S_{ij}}{\partial Q_{pq}} \label{eq:17-11-4}\end{equation}
Substituting \eqref{eq:17-11-3} into \eqref{eq:17-11-4}: $\delta_{ip}$ retains only the $i = p$ terms.
\begin{equation}\dfrac{\partial L}{\partial Q_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} \delta_{ip} K_{jq} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_j \dfrac{\partial L}{\partial S_{pj}} K_{jq} \label{eq:17-11-5}\end{equation}
Interpreting \eqref{eq:17-11-5} as a matrix product entry:
\begin{equation}\dfrac{\partial L}{\partial Q_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_j \left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)_{pj} K_{jq} = \dfrac{1}{\sqrt{d_k}} \left( \dfrac{\partial L}{\partial \boldsymbol{S}} \boldsymbol{K} \right)_{pq} \label{eq:17-11-6}\end{equation}
Since \eqref{eq:17-11-6} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{Q}} = \dfrac{1}{\sqrt{d_k}} \dfrac{\partial L}{\partial \boldsymbol{S}} \boldsymbol{K} \label{eq:17-11-7}\end{equation}
17.12 Key Gradient
Proof
From \eqref{eq:17-11-2} of 17.11, the component representation of the score is:
\begin{equation}S_{ij} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{k=0}^{d_k-1} Q_{ik} K_{jk} \label{eq:17-12-1}\end{equation}
In \eqref{eq:17-12-1}, $S_{ij}$ depends on $K_{pq}$ only when $j = p$ and $k = q$. Computing the partial derivative:
\begin{equation}\dfrac{\partial S_{ij}}{\partial K_{pq}} = \dfrac{1}{\sqrt{d_k}} Q_{iq} \delta_{jp} \label{eq:17-12-2}\end{equation}
The loss $L$ depends on $\boldsymbol{K}$ through the scores $\boldsymbol{S}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial K_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} \dfrac{\partial S_{ij}}{\partial K_{pq}} \label{eq:17-12-3}\end{equation}
Substituting \eqref{eq:17-12-2} into \eqref{eq:17-12-3}: $\delta_{jp}$ retains only the $j = p$ terms.
\begin{equation}\dfrac{\partial L}{\partial K_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial S_{ij}} Q_{iq} \delta_{jp} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_i \dfrac{\partial L}{\partial S_{ip}} Q_{iq} \label{eq:17-12-4}\end{equation}
Interpreting \eqref{eq:17-12-4} as a matrix product entry: since $\left(\left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top\right)_{pi} = \left(\displaystyle\dfrac{\partial L}{\partial \boldsymbol{S}}\right)_{ip}$:
\begin{equation}\dfrac{\partial L}{\partial K_{pq}} = \dfrac{1}{\sqrt{d_k}} \displaystyle\sum_i \left(\left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top\right)_{pi} Q_{iq} = \dfrac{1}{\sqrt{d_k}} \left( \left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top \boldsymbol{Q} \right)_{pq} \label{eq:17-12-5}\end{equation}
Since \eqref{eq:17-12-5} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{K}} = \dfrac{1}{\sqrt{d_k}} \left(\dfrac{\partial L}{\partial \boldsymbol{S}}\right)^\top \boldsymbol{Q} \label{eq:17-12-6}\end{equation}
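Chaining 17.8 through 17.12 gives the complete backward pass of scaled dot-product attention. A minimal sketch (single head, no masking or batching):

import numpy as np

def attention_backward(dO, Q, K, V, A):
    # A = softmax(Q K^T / sqrt(d_k)) row-wise; O = A V; dO is dL/dO.
    d_k = Q.shape[-1]
    dV = A.T @ dO                                    # equation (17-8-7)
    dA = dO @ V.T                                    # equation (17-9-6)
    dS = A * (dA - (dA * A).sum(-1, keepdims=True))  # equation (17-10-10)
    dQ = dS @ K / np.sqrt(d_k)                       # equation (17-11-7)
    dK = dS.T @ Q / np.sqrt(d_k)                     # equation (17-12-6)
    return dQ, dK, dV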
17.13 Filter Gradient of the Convolutional Layer
Proof
Consider a 2D convolution. The output $\boldsymbol{Y}$ is computed by convolving the input $\boldsymbol{X}$ with the filter $\boldsymbol{F}$.
\begin{equation}\boldsymbol{Y} = \boldsymbol{X} * \boldsymbol{F} \label{eq:17-13-1}\end{equation}
Writing \eqref{eq:17-13-1} in component form with filter size $H \times W$ (following the deep-learning convention, $*$ denotes cross-correlation, i.e. the filter is not flipped):
\begin{equation}Y_{ij} = \displaystyle\sum_{m=0}^{H-1} \displaystyle\sum_{n=0}^{W-1} X_{i+m, j+n} F_{mn} \label{eq:17-13-2}\end{equation}
In \eqref{eq:17-13-2}, $Y_{ij}$ depends on $F_{pq}$ only when $m = p$ and $n = q$. Computing the partial derivative:
\begin{equation}\dfrac{\partial Y_{ij}}{\partial F_{pq}} = X_{i+p, j+q} \label{eq:17-13-3}\end{equation}
The loss $L$ depends on $\boldsymbol{F}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial F_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial Y_{ij}} \dfrac{\partial Y_{ij}}{\partial F_{pq}} \label{eq:17-13-4}\end{equation}
Substituting \eqref{eq:17-13-3} into \eqref{eq:17-13-4}:
\begin{equation}\dfrac{\partial L}{\partial F_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial Y_{ij}} X_{i+p, j+q} \label{eq:17-13-5}\end{equation}
\eqref{eq:17-13-5} is the cross-correlation of the input $\boldsymbol{X}$ and the output gradient $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{F}} = \boldsymbol{X} \star \dfrac{\partial L}{\partial \boldsymbol{Y}} \label{eq:17-13-6}\end{equation}
17.14 Input Gradient of the Convolutional Layer
Proof
From \eqref{eq:17-13-2} of 17.13, the component representation of the convolution is:
\begin{equation}Y_{ij} = \displaystyle\sum_{m=0}^{H-1} \displaystyle\sum_{n=0}^{W-1} X_{i+m, j+n} F_{mn} \label{eq:17-14-1}\end{equation}
In \eqref{eq:17-14-1}, we consider which outputs the input element $X_{pq}$ contributes to: $X_{pq}$ appears as $X_{i+m, j+n}$ exactly when $i = p - m$ and $j = q - n$.
As $(m, n)$ ranges over the filter, several valid output indices $(i, j) = (p - m, q - n)$ can arise (those with $0 \leq p - m$, $0 \leq q - n$, and within the output size), so $X_{pq}$ generally contributes to multiple output elements.
Computing the partial derivative: when $X_{pq}$ contributes to $Y_{ij}$:
\begin{equation}\dfrac{\partial Y_{ij}}{\partial X_{pq}} = F_{p-i, q-j} \label{eq:17-14-2}\end{equation}
provided that $(p-i, q-j)$ is within the valid range $[0, H-1] \times [0, W-1]$ of the filter.
The loss $L$ depends on $\boldsymbol{X}$ through the output $\boldsymbol{Y}$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial X_{pq}} = \displaystyle\sum_{i,j} \dfrac{\partial L}{\partial Y_{ij}} \dfrac{\partial Y_{ij}}{\partial X_{pq}} \label{eq:17-14-3}\end{equation}
Substituting \eqref{eq:17-14-2} into \eqref{eq:17-14-3}: the sum is over all $(i, j)$ to which $X_{pq}$ contributes.
\begin{equation}\dfrac{\partial L}{\partial X_{pq}} = \displaystyle\sum_{i,j: (p-i, q-j) \in [0, H-1] \times [0, W-1]} \dfrac{\partial L}{\partial Y_{ij}} F_{p-i, q-j} \label{eq:17-14-4}\end{equation}
Interpreting \eqref{eq:17-14-4}: performing the change of variables $m' = p - i$, $n' = q - j$, we get $i = p - m'$, $j = q - n'$, and the sum is over $m' \in [0, H-1]$, $n' \in [0, W-1]$.
\begin{equation}\dfrac{\partial L}{\partial X_{pq}} = \displaystyle\sum_{m'=0}^{H-1} \displaystyle\sum_{n'=0}^{W-1} \dfrac{\partial L}{\partial Y_{p-m', q-n'}} F_{m', n'} \label{eq:17-14-5}\end{equation}
\eqref{eq:17-14-5} corresponds to a "full" convolution of the output gradient $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{Y}}$ with the 180-degree rotated filter $\text{rot}_{180}(\boldsymbol{F})$, where out-of-range entries of the output gradient are treated as zero.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{X}} = \dfrac{\partial L}{\partial \boldsymbol{Y}} *_{\text{full}} \text{rot}_{180}(\boldsymbol{F}) \label{eq:17-14-6}\end{equation}
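Both convolution gradients can be checked with scipy.signal, whose correlate2d (no filter flip, matching the forward pass here) and convolve2d (with flip, i.e. the $\text{rot}_{180}$) match the two formulas. A sketch under those assumptions:

import numpy as np
from scipy.signal import correlate2d, convolve2d

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 6))
F = rng.normal(size=(3, 3))

Y = correlate2d(X, F, mode='valid')    # forward pass, equation (17-13-2)
dY = rng.normal(size=Y.shape)          # upstream gradient dL/dY

dF = correlate2d(X, dY, mode='valid')  # equation (17-13-6)
dX = convolve2d(dY, F, mode='full')    # equation (17-14-6): full conv with flipped F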
17.15 Gradient of Max Pooling
Proof
Max pooling selects the maximum value within each pooling region.
\begin{equation}Y_k = \max_{(i,j) \in \text{pool}_k} X_{ij} \label{eq:17-15-1}\end{equation}
In \eqref{eq:17-15-1}, let $(i^*, j^*)$ be the position that achieves the maximum.
\begin{equation}(i^*, j^*) = \arg\max_{(i,j) \in \text{pool}_k} X_{ij} \label{eq:17-15-2}\end{equation}
Consider the derivative of the max function. $Y_k$ depends only on the element $X_{i^*, j^*}$ that achieves the maximum, and does not depend on other elements (locally). The partial derivative is therefore:
\begin{equation}\dfrac{\partial Y_k}{\partial X_{ij}} = \begin{cases} 1 & \text{if } (i, j) = (i^*, j^*) \\ 0 & \text{otherwise} \end{cases} \label{eq:17-15-3}\end{equation}
\eqref{eq:17-15-3} can be written using the indicator function $\mathbf{1}_{X_{ij} = \max}$:
\begin{equation}\dfrac{\partial Y_k}{\partial X_{ij}} = \mathbf{1}_{X_{ij} = \max\{X_{pq}: (p,q) \in \text{pool}_k\}} \label{eq:17-15-4}\end{equation}
The loss $L$ depends on $X_{ij}$ through the output $Y_k$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{\partial L}{\partial Y_k} \dfrac{\partial Y_k}{\partial X_{ij}} \label{eq:17-15-5}\end{equation}
Substituting \eqref{eq:17-15-4} into \eqref{eq:17-15-5} gives the final result.
\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{\partial L}{\partial Y_k} \cdot \mathbf{1}_{X_{ij} = \max} \label{eq:17-15-6}\end{equation}
17.16 Gradient of Average Pooling
Proof
Average pooling computes the mean value within each pooling region.
\begin{equation}Y_k = \dfrac{1}{|\text{pool}_k|} \displaystyle\sum_{(i,j) \in \text{pool}_k} X_{ij} \label{eq:17-16-1}\end{equation}
Here $|\text{pool}_k|$ is the number of elements in the pooling region.
In \eqref{eq:17-16-1}, $Y_k$ depends equally on all elements within the pooling region. Computing the partial derivative:
\begin{equation}\dfrac{\partial Y_k}{\partial X_{ij}} = \dfrac{1}{|\text{pool}_k|} \quad \text{for all } (i, j) \in \text{pool}_k \label{eq:17-16-2}\end{equation}
The loss $L$ depends on $X_{ij}$ through the output $Y_k$. Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{\partial L}{\partial Y_k} \dfrac{\partial Y_k}{\partial X_{ij}} \label{eq:17-16-3}\end{equation}
Substituting \eqref{eq:17-16-2} into \eqref{eq:17-16-3} gives the final result.
\begin{equation}\dfrac{\partial L}{\partial X_{ij}} = \dfrac{1}{|\text{pool}_k|} \dfrac{\partial L}{\partial Y_k} \label{eq:17-16-4}\end{equation}
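For non-overlapping $2 \times 2$ windows, both pooling backward rules reduce to reshapes. A minimal sketch (ties in the max send gradient to every maximizer here, whereas most libraries route it to a single argmax):

import numpy as np

def maxpool2x2_backward(dY, X):
    H, W = X.shape
    Xb = X.reshape(H // 2, 2, W // 2, 2)
    Y = Xb.max(axis=(1, 3))                 # forward, equation (17-15-1)
    mask = (Xb == Y[:, None, :, None])      # indicator 1_{X = max}, equation (17-15-4)
    return (mask * dY[:, None, :, None]).reshape(H, W)

def avgpool2x2_backward(dY, X):
    # Each input in a window receives dY / |pool| = dY / 4, equation (17-16-4)
    return np.repeat(np.repeat(dY, 2, axis=0), 2, axis=1) / 4.0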
17.17 Gradient of the Embedding Layer
Proof
The embedding layer functions as a lookup table. It outputs the row of the embedding matrix corresponding to the input index $\text{idx}_n$.
\begin{equation}\boldsymbol{o}_n = \boldsymbol{E}_{\text{idx}_n, :} \label{eq:17-17-1}\end{equation}
Here $\boldsymbol{E} \in \mathbb{R}^{V \times d}$ is the embedding matrix with vocabulary size $V$ and embedding dimension $d$.
In \eqref{eq:17-17-1}, the output $\boldsymbol{o}_n$ is a copy of the $\text{idx}_n$-th row of the embedding matrix. Considering the partial derivative:
\begin{equation}\dfrac{\partial (\boldsymbol{o}_n)_j}{\partial E_{ik}} = \begin{cases} 1 & \text{if } i = \text{idx}_n \text{ and } j = k \\ 0 & \text{otherwise} \end{cases} \label{eq:17-17-2}\end{equation}
\eqref{eq:17-17-2} can be written using the Kronecker delta:
\begin{equation}\dfrac{\partial (\boldsymbol{o}_n)_j}{\partial E_{ik}} = \delta_{i, \text{idx}_n} \delta_{jk} \label{eq:17-17-3}\end{equation}
The loss $L$ depends on $\boldsymbol{E}$ through the outputs $\boldsymbol{o}_n$ ($n = 0, 1, \ldots, N-1$). Applying the chain rule (1.26):
\begin{equation}\dfrac{\partial L}{\partial E_{ik}} = \displaystyle\sum_n \displaystyle\sum_j \dfrac{\partial L}{\partial (\boldsymbol{o}_n)_j} \dfrac{\partial (\boldsymbol{o}_n)_j}{\partial E_{ik}} \label{eq:17-17-4}\end{equation}
Substituting \eqref{eq:17-17-3} into \eqref{eq:17-17-4}: $\delta_{i, \text{idx}_n}$ retains only the terms with $\text{idx}_n = i$, and $\delta_{jk}$ retains only $j = k$.
\begin{equation}\dfrac{\partial L}{\partial E_{ik}} = \displaystyle\sum_{n: \text{idx}_n = i} \dfrac{\partial L}{\partial (\boldsymbol{o}_n)_k} \label{eq:17-17-5}\end{equation}
Writing \eqref{eq:17-17-5} in row vector form gives the final result.
\begin{equation}\dfrac{\partial L}{\partial \boldsymbol{E}_{i,:}} = \displaystyle\sum_{n: \text{idx}_n = i} \dfrac{\partial L}{\partial \boldsymbol{o}_n} \label{eq:17-17-6}\end{equation}
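Because the same index can appear several times in a batch, the row gradients must be accumulated rather than assigned; NumPy's unbuffered np.add.at handles the duplicates. A sketch:

import numpy as np

V, d = 10, 4                     # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))
idx = np.array([3, 7, 3])        # note the repeated index 3

O = E[idx]                       # forward lookup, equation (17-17-1)
dO = np.ones_like(O)             # upstream gradient dL/dO

dE = np.zeros_like(E)
np.add.at(dE, idx, dO)           # equation (17-17-6): rows with idx_n = i accumulate
assert np.allclose(dE[3], 2.0)   # row 3 received two contributions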
17.18 Gradient of L2 Regularization
Proof
The L2 regularization term is the squared Frobenius norm of the weight matrix multiplied by a scaling coefficient.
\begin{equation}R = \dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \dfrac{\lambda}{2} \displaystyle\sum_{i,j} W_{ij}^2 \label{eq:17-18-1}\end{equation}
Here $\lambda > 0$ is a hyperparameter controlling the regularization strength.
Taking the partial derivative of \eqref{eq:17-18-1} with respect to $W_{pq}$: only the $(i, j) = (p, q)$ term in the sum contains $W_{pq}$.
\begin{equation}\dfrac{\partial R}{\partial W_{pq}} = \dfrac{\partial}{\partial W_{pq}} \left( \dfrac{\lambda}{2} \displaystyle\sum_{i,j} W_{ij}^2 \right) = \dfrac{\lambda}{2} \cdot 2W_{pq} = \lambda W_{pq} \label{eq:17-18-2}\end{equation}
Since \eqref{eq:17-18-2} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial}{\partial \boldsymbol{W}} \dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W} \label{eq:17-18-3}\end{equation}
For a more concise derivation, see 12.10.
17.19 Subgradient of L1 Regularization
Proof
The L1 regularization term is the L1 norm of the weight matrix multiplied by a scaling coefficient.
\begin{equation}R = \lambda\|\boldsymbol{W}\|_1 = \lambda \displaystyle\sum_{i,j} |W_{ij}| \label{eq:17-19-1}\end{equation}
Consider the derivative of the absolute value function $|x|$, separating the cases $x > 0$ and $x < 0$.
\begin{equation}\dfrac{d|x|}{dx} = \begin{cases} 1 & x > 0 \\ -1 & x < 0 \end{cases} \label{eq:17-19-2}\end{equation}
At $x = 0$, the absolute value function is not differentiable. However, using the concept of subgradients, the subgradient at $x = 0$ can take any value in the interval $[-1, 1]$.
\begin{equation}\partial |x| \Big|_{x=0} = [-1, 1] \label{eq:17-19-3}\end{equation}
Unifying \eqref{eq:17-19-2} and \eqref{eq:17-19-3} using the sign function $\text{sign}(x)$: in practice, $\text{sign}(0) = 0$ is commonly used.
\begin{equation}\dfrac{d|x|}{dx} = \text{sign}(x) = \begin{cases} 1 & x > 0 \\ 0 & x = 0 \\ -1 & x < 0 \end{cases} \label{eq:17-19-4}\end{equation}
Taking the partial derivative of \eqref{eq:17-19-1} with respect to $W_{pq}$: only the $(i, j) = (p, q)$ term in the sum contains $|W_{pq}|$.
\begin{equation}\dfrac{\partial R}{\partial W_{pq}} = \lambda \dfrac{\partial |W_{pq}|}{\partial W_{pq}} = \lambda \cdot \text{sign}(W_{pq}) \label{eq:17-19-5}\end{equation}
Since \eqref{eq:17-19-5} holds for all $(p, q)$, we obtain the final result in matrix form.
\begin{equation}\dfrac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \text{sign}(\boldsymbol{W}) \label{eq:17-19-6}\end{equation}
For a more concise derivation, see 12.11.
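Both regularization gradients are one-liners; np.sign already follows the $\text{sign}(0) = 0$ convention. A sketch:

import numpy as np

def l2_grad(W, lam):
    return lam * W               # equation (17-18-3)

def l1_subgrad(W, lam):
    return lam * np.sign(W)      # equation (17-19-6); np.sign(0.0) == 0.0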
17.20 KL Divergence for Gaussian Distributions
Proof
The definition of KL divergence is:
\begin{equation}D_{\text{KL}}(p \| q) = \mathbb{E}_p[\log p - \log q] = \displaystyle\int p(x) \log \dfrac{p(x)}{q(x)} dx \label{eq:17-20-1}\end{equation}
First, consider the one-dimensional case. Let $p = \mathcal{N}(\mu, \sigma^2)$ and $q = \mathcal{N}(0, 1)$.
Writing the log of the normal distributions:
\begin{equation}\log p(x) = -\dfrac{1}{2}\log(2\pi\sigma^2) - \dfrac{(x-\mu)^2}{2\sigma^2} \label{eq:17-20-2}\end{equation}
\begin{equation}\log q(x) = -\dfrac{1}{2}\log(2\pi) - \dfrac{x^2}{2} \label{eq:17-20-3}\end{equation}
Computing the difference of \eqref{eq:17-20-2} and \eqref{eq:17-20-3}:
\begin{equation}\log p(x) - \log q(x) = -\dfrac{1}{2}\log\sigma^2 - \dfrac{(x-\mu)^2}{2\sigma^2} + \dfrac{x^2}{2} \label{eq:17-20-4}\end{equation}
Computing the expectation of \eqref{eq:17-20-4} under $p$, using the properties of expectations:
\begin{equation}\mathbb{E}_p[(x-\mu)^2] = \sigma^2 \label{eq:17-20-5}\end{equation}
\begin{equation}\mathbb{E}_p[x^2] = \text{Var}(x) + (\mathbb{E}_p[x])^2 = \sigma^2 + \mu^2 \label{eq:17-20-6}\end{equation}
Substituting \eqref{eq:17-20-5} and \eqref{eq:17-20-6} into the expectation of \eqref{eq:17-20-4}:
\begin{equation}D_{\text{KL}} = \mathbb{E}_p[\log p - \log q] = -\dfrac{1}{2}\log\sigma^2 - \dfrac{\sigma^2}{2\sigma^2} + \dfrac{\sigma^2 + \mu^2}{2} \label{eq:17-20-7}\end{equation}
Simplifying \eqref{eq:17-20-7}:
\begin{equation}D_{\text{KL}} = -\dfrac{1}{2}\log\sigma^2 - \dfrac{1}{2} + \dfrac{\sigma^2 + \mu^2}{2} = \dfrac{1}{2}(\mu^2 + \sigma^2 - 1 - \log\sigma^2) \label{eq:17-20-8}\end{equation}
In the multi-dimensional case with independent dimensions, the KL divergence is the sum over each dimension.
\begin{equation}D_{\text{KL}}(\mathcal{N}(\boldsymbol{\mu}, \text{diag}(\boldsymbol{\sigma}^2)) \| \mathcal{N}(\boldsymbol{0}, \boldsymbol{I})) = \dfrac{1}{2}\displaystyle\sum_{i=1}^{d} (\mu_i^2 + \sigma_i^2 - 1 - \log\sigma_i^2) \label{eq:17-20-9}\end{equation}
17.21 Gradient of KL Divergence with Respect to the Mean
Proof
From \eqref{eq:17-20-9} of 17.20, the KL divergence is:
\begin{equation}D_{\text{KL}} = \dfrac{1}{2}\displaystyle\sum_{j=1}^{d} (\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2) \label{eq:17-21-1}\end{equation}
In \eqref{eq:17-21-1}, the only term involving $\mu_i$ is $\displaystyle\dfrac{1}{2}\mu_i^2$. Terms for other dimensions $j \neq i$ do not depend on $\mu_i$.
Taking the partial derivative of \eqref{eq:17-21-1} with respect to $\mu_i$:
\begin{equation}\dfrac{\partial D_{\text{KL}}}{\partial \mu_i} = \dfrac{\partial}{\partial \mu_i} \dfrac{1}{2}\mu_i^2 = \dfrac{1}{2} \cdot 2\mu_i = \mu_i \label{eq:17-21-2}\end{equation}
17.22 Gradient of KL Divergence with Respect to the Standard Deviation
Proof
From \eqref{eq:17-21-1} of 17.21, the KL divergence is:
\begin{equation}D_{\text{KL}} = \dfrac{1}{2}\displaystyle\sum_{j=1}^{d} (\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2) \label{eq:17-22-1}\end{equation}
In \eqref{eq:17-22-1}, the terms involving $\sigma_i$ are $\displaystyle\dfrac{1}{2}(\sigma_i^2 - \log\sigma_i^2)$. Terms for other dimensions $j \neq i$ do not depend on $\sigma_i$.
Noting that $\log\sigma_i^2 = 2\log\sigma_i$, we take the partial derivative with respect to $\sigma_i$:
\begin{equation}\dfrac{\partial}{\partial \sigma_i} \sigma_i^2 = 2\sigma_i \label{eq:17-22-2}\end{equation}
\begin{equation}\dfrac{\partial}{\partial \sigma_i} \log\sigma_i^2 = \dfrac{\partial}{\partial \sigma_i} (2\log\sigma_i) = \dfrac{2}{\sigma_i} \label{eq:17-22-3}\end{equation}
Using \eqref{eq:17-22-2} and \eqref{eq:17-22-3} to take the partial derivative of $D_{\text{KL}}$ with respect to $\sigma_i$:
\begin{equation}\dfrac{\partial D_{\text{KL}}}{\partial \sigma_i} = \dfrac{1}{2}\left(2\sigma_i - \dfrac{2}{\sigma_i}\right) = \sigma_i - \dfrac{1}{\sigma_i} \label{eq:17-22-4}\end{equation}
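A sketch that evaluates \eqref{eq:17-20-9} and checks the gradients of 17.21 and 17.22 by finite differences:

import numpy as np

def kl_gaussian(mu, sigma):
    # D_KL(N(mu, diag(sigma^2)) || N(0, I)), equation (17-20-9)
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - np.log(sigma**2))

mu = np.array([0.3, -1.2])
sigma = np.array([0.8, 1.5])

dmu = mu                         # equation (17-21-2)
dsigma = sigma - 1.0 / sigma     # equation (17-22-4)

eps = 1e-6
e0 = np.array([eps, 0.0])
num = (kl_gaussian(mu, sigma + e0) - kl_gaussian(mu, sigma - e0)) / (2 * eps)
assert np.isclose(dsigma[0], num)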
12.5 Derivatives of Regularization
Gradient computation for regularization terms commonly used in machine learning.
12.10 L2 Regularization (Weight Decay)
Proof
The squared Frobenius norm is $\|\boldsymbol{W}\|_F^2 = \mathrm{tr}(\boldsymbol{W}^\top \boldsymbol{W}) = \displaystyle\sum_{i,j} W_{ij}^2$.
\begin{equation} \dfrac{\partial}{\partial W_{kl}} \dfrac{\lambda}{2} \displaystyle\sum_{i,j} W_{ij}^2 = \dfrac{\lambda}{2} \cdot 2 W_{kl} = \lambda W_{kl} \label{eq:12-10-1} \end{equation}
Collecting all entries in matrix form:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{W}} \dfrac{\lambda}{2}\|\boldsymbol{W}\|_F^2 = \lambda \boldsymbol{W} \label{eq:12-10-2} \end{equation}
12.11 L1 Regularization (Subgradient)
Proof
Since $\|\boldsymbol{W}\|_1 = \displaystyle\sum_{i,j} |W_{ij}|$, each component $|W_{kl}|$ can be differentiated independently of the others.
\begin{equation} \dfrac{\partial |W_{kl}|}{\partial W_{kl}} = \begin{cases} 1 & (W_{kl} > 0) \\ -1 & (W_{kl} < 0) \\ [-1, 1] & (W_{kl} = 0) \end{cases} = \mathrm{sign}(W_{kl}) \label{eq:12-11-1} \end{equation}
Here, when $W_{kl} = 0$, the absolute value function is not differentiable, but the subdifferential $\partial|W_{kl}| = [-1, 1]$ exists. In practice, the approximation $\mathrm{sign}(0) = 0$ is used.
Collecting all entries:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{W}} \lambda\|\boldsymbol{W}\|_1 = \lambda \cdot \mathrm{sign}(\boldsymbol{W}) \label{eq:12-11-2} \end{equation}
12.12 LASSO Gradient (Regression with L1 Regularization)
Proof
We decompose the objective function as $L = L_{\text{data}} + L_{\text{reg}}$.
\begin{equation} L_{\text{data}} = \dfrac{1}{2}\|\boldsymbol{x} - \boldsymbol{D}\boldsymbol{\alpha}\|^2, \qquad L_{\text{reg}} = \lambda\|\boldsymbol{\alpha}\|_1 \label{eq:12-12-1} \end{equation}
The gradient of the data term is computed similarly to 12.9; writing the residual as $\boldsymbol{r} = \boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}$, it equals $\boldsymbol{D}^\top\boldsymbol{r}$:
\begin{equation} \dfrac{\partial L_{\text{data}}}{\partial \boldsymbol{\alpha}} = \boldsymbol{D}^\top(\boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}) \label{eq:12-12-2} \end{equation}
The subgradient of the regularization term follows from 12.11:
\begin{equation} \dfrac{\partial L_{\text{reg}}}{\partial \boldsymbol{\alpha}} = \lambda \cdot \mathrm{sign}(\boldsymbol{\alpha}) \label{eq:12-12-3} \end{equation}
Combining \eqref{eq:12-12-2} and \eqref{eq:12-12-3} gives the formula.
\begin{equation} \dfrac{\partial L}{\partial \boldsymbol{\alpha}} = \boldsymbol{D}^\top(\boldsymbol{D}\boldsymbol{\alpha} - \boldsymbol{x}) + \lambda \cdot \mathrm{sign}(\boldsymbol{\alpha}) \label{eq:12-12-4} \end{equation}
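A sketch of \eqref{eq:12-12-4} as a function (names illustrative):

import numpy as np

def lasso_subgrad(D, x, alpha, lam):
    # D^T (D alpha - x) + lam * sign(alpha), equation (12-12-4)
    return D.T @ (D @ alpha - x) + lam * np.sign(alpha)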
12.6 Application: Regularization for Image Reconstruction
Gradient computation for regularization methods applied to ill-posed problems, used in medical image reconstruction and signal processing.
12.13 Gradient of Tikhonov Regularization
Proof
Computing the gradient of the first term $\|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|^2 = (\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y})^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y})$:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}}\|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}) \label{eq:12-13-1} \end{equation}
Computing the gradient of the second term $\lambda\|\boldsymbol{L}\boldsymbol{x}\|^2 = \lambda\boldsymbol{x}^\top\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x}$:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}}\lambda\|\boldsymbol{L}\boldsymbol{x}\|^2 = 2\lambda\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x} \label{eq:12-13-2} \end{equation}
Combining \eqref{eq:12-13-1} and \eqref{eq:12-13-2} gives the formula.
12.14 Tikhonov Regularization Solution
Proof
From the optimality condition $\nabla J = 0$:
\begin{equation} 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}) + 2\lambda\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x} = \boldsymbol{0} \label{eq:12-14-1} \end{equation}
Rearranging \eqref{eq:12-14-1}:
\begin{equation} (\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L})\boldsymbol{x} = \boldsymbol{A}^\top\boldsymbol{y} \label{eq:12-14-2} \end{equation}
When $\lambda > 0$, $\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L}$ is a sum of positive semidefinite matrices; provided the null spaces satisfy $\mathcal{N}(\boldsymbol{A}) \cap \mathcal{N}(\boldsymbol{L}) = \{\boldsymbol{0}\}$ (e.g. when $\boldsymbol{L} = \boldsymbol{I}$), the sum is positive definite and hence nonsingular, giving the solution.
\begin{equation} \boldsymbol{x}^* = (\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L})^{-1}\boldsymbol{A}^\top\boldsymbol{y} \label{eq:12-14-3} \end{equation}
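Since the system matrix in \eqref{eq:12-14-2} is symmetric positive definite, solving the linear system is preferable to forming the inverse in \eqref{eq:12-14-3}. A sketch with the illustrative choice $\boldsymbol{L} = \boldsymbol{I}$:

import numpy as np

rng = np.random.default_rng(0)
m, n = 20, 8
A = rng.normal(size=(m, n))
y = rng.normal(size=m)
L = np.eye(n)                  # illustrative regularization operator
lam = 0.1

# Equation (12-14-3), computed without an explicit inverse
x_star = np.linalg.solve(A.T @ A + lam * L.T @ L, A.T @ y)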
12.15 Subgradient of Total Variation Regularization
Proof
The isotropic total variation is defined as:
\begin{equation} \text{TV}(\boldsymbol{x}) = \displaystyle\int_\Omega |\nabla \boldsymbol{x}| \, d\Omega = \displaystyle\int_\Omega \sqrt{|\partial_1 x|^2 + |\partial_2 x|^2} \, d\Omega \label{eq:12-15-1} \end{equation}
Using the calculus of variations, we perturb $\boldsymbol{x}$ to $\boldsymbol{x} + \varepsilon\boldsymbol{\phi}$, differentiate with respect to $\varepsilon$, and evaluate at $\varepsilon = 0$.
\begin{equation} \dfrac{d}{d\varepsilon}\text{TV}(\boldsymbol{x} + \varepsilon\boldsymbol{\phi})\bigg|_{\varepsilon=0} = \displaystyle\int_\Omega \dfrac{\nabla \boldsymbol{x} \cdot \nabla \boldsymbol{\phi}}{|\nabla \boldsymbol{x}|} \, d\Omega \label{eq:12-15-2} \end{equation}
Applying integration by parts (Green's theorem) and assuming the boundary terms vanish:
\begin{equation} = -\displaystyle\int_\Omega \text{div}\left(\dfrac{\nabla \boldsymbol{x}}{|\nabla \boldsymbol{x}|}\right) \boldsymbol{\phi} \, d\Omega \label{eq:12-15-3} \end{equation}
Therefore $\delta \text{TV} / \delta \boldsymbol{x} = -\text{div}(\nabla \boldsymbol{x} / |\nabla \boldsymbol{x}|)$, understood where $|\nabla \boldsymbol{x}| \neq 0$.
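A discrete sketch of this subgradient, using forward differences with periodic boundaries (so that the divergence below is the exact adjoint) and a small $\varepsilon$ in the norm to avoid division by zero where $\nabla \boldsymbol{x} = \boldsymbol{0}$; both choices are assumptions of the sketch, not part of the derivation:

import numpy as np

def tv_subgrad(x, eps=1e-8):
    # -div(grad x / |grad x|) on a 2D grid, periodic boundaries.
    gx = np.roll(x, -1, axis=0) - x             # forward difference, axis 0
    gy = np.roll(x, -1, axis=1) - x             # forward difference, axis 1
    norm = np.sqrt(gx**2 + gy**2 + eps)
    px, py = gx / norm, gy / norm
    div = (px - np.roll(px, 1, axis=0)) + (py - np.roll(py, 1, axis=1))
    return -div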