Proofs Chapter 12: Derivatives of Norms (Basic Formulas)
This chapter proves differentiation formulas for vector norms and for the matrix Frobenius norm. Derivatives of norms are fundamental tools in optimization and statistical learning: the least-squares gradient in linear regression, subgradients in L1/L2 regularization, nuclear-norm minimization in matrix completion, and sparse recovery in signal processing all rely on them. We cover practical formulas ranging from the derivative of the 2-norm to the derivative of a normalized vector.
Prerequisites: Chapter 2 (Differentiation of a scalar by a vector), Chapter 5 (Differentiation of the trace). Related chapters: Chapter 10 (Differentiation of quadratic forms), Chapter 15 (Differentiation of special matrices).
12. Derivatives of Norms
The formulas in this chapter apply under the following conditions unless otherwise noted:
- All formulas are based on denominator layout
- Norm $\|\boldsymbol{x}\|$ is not differentiable at $\boldsymbol{x} = \boldsymbol{0}$ (subgradient is used when applicable)
- The squared norm $\|\boldsymbol{x}\|^2$ is differentiable everywhere
We derive differentiation formulas for vector norms and matrix norms, particularly the Frobenius norm.
12.1 Derivative of Vector 2-Norm
Proof
Introduce an auxiliary variable $\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a}$.
\begin{equation}\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a} \label{eq:12-1-1}\end{equation}
Since $\boldsymbol{a}$ is a constant vector, the derivative of $\boldsymbol{u}$ with respect to $\boldsymbol{x}$ is the identity matrix.
\begin{equation}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:12-1-2}\end{equation}
From the definition of the vector 2-norm, $\|\boldsymbol{u}\|_2$ can be expressed as follows.
\begin{equation}\|\boldsymbol{u}\|_2 = \sqrt{\boldsymbol{u}^\top\boldsymbol{u}} = \sqrt{\sum_{i=0}^{n-1} u_i^2} \label{eq:12-1-3}\end{equation}
Let $f = \|\boldsymbol{u}\|_2 = \sqrt{\boldsymbol{u}^\top\boldsymbol{u}}$. Define the inner function $g = \boldsymbol{u}^\top\boldsymbol{u}$.
\begin{equation}g = \boldsymbol{u}^\top\boldsymbol{u} \label{eq:12-1-4}\end{equation}
Then $f = \sqrt{g}$. To apply the chain rule (Ref. 1.26), we first compute the derivative of $f$ with respect to $g$.
\begin{equation}\frac{\partial f}{\partial g} = \frac{\partial}{\partial g} \sqrt{g} = \frac{1}{2\sqrt{g}} = \frac{1}{2\|\boldsymbol{u}\|_2} \label{eq:12-1-5}\end{equation}
Next, we compute the derivative of $g = \boldsymbol{u}^\top\boldsymbol{u}$ with respect to $\boldsymbol{u}$. Expand $g$ component-wise.
\begin{equation}g = \sum_{i=0}^{n-1} u_i^2 \label{eq:12-1-6}\end{equation}
Take the partial derivative of $g$ with respect to $u_j$.
\begin{equation}\frac{\partial g}{\partial u_j} = \frac{\partial}{\partial u_j} \sum_{i=0}^{n-1} u_i^2 = 2u_j \label{eq:12-1-7}\end{equation}
Write Eq. \eqref{eq:12-1-7} in vector form.
\begin{equation}\frac{\partial g}{\partial \boldsymbol{u}} = 2\boldsymbol{u} \label{eq:12-1-8}\end{equation}
Apply the chain rule (Ref. 1.26). The derivative of $f$ with respect to $\boldsymbol{u}$ is obtained as follows.
\begin{equation}\frac{\partial f}{\partial \boldsymbol{u}} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial \boldsymbol{u}} \label{eq:12-1-9}\end{equation}
Substitute Eqs. \eqref{eq:12-1-5} and \eqref{eq:12-1-8} into Eq. \eqref{eq:12-1-9}.
\begin{equation}\frac{\partial f}{\partial \boldsymbol{u}} = \frac{1}{2\|\boldsymbol{u}\|_2} \cdot 2\boldsymbol{u} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \label{eq:12-1-10}\end{equation}
Further apply the chain rule (Ref. 1.26) to find the derivative of $f$ with respect to $\boldsymbol{x}$.
\begin{equation}\frac{\partial f}{\partial \boldsymbol{x}} = \frac{\partial f}{\partial \boldsymbol{u}} \cdot \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} \label{eq:12-1-11}\end{equation}
Substitute Eqs. \eqref{eq:12-1-2} and \eqref{eq:12-1-10} into Eq. \eqref{eq:12-1-11}.
\begin{equation}\frac{\partial f}{\partial \boldsymbol{x}} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \cdot \boldsymbol{I} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \label{eq:12-1-12}\end{equation}
From Eq. \eqref{eq:12-1-1}, substitute $\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a}$ to obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|_2 = \frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} \label{eq:12-1-13}\end{equation}
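Equation \eqref{eq:12-1-13} lends itself to a quick numerical sanity check (not part of the proof). The sketch below, assuming NumPy is available and using arbitrary random data, compares the analytic gradient against central finite differences.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
a = rng.normal(size=5)

# Analytic gradient, Eq. (12-1-13): (x - a) / ||x - a||_2
grad_analytic = (x - a) / np.linalg.norm(x - a)

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_numeric = np.zeros_like(x)
for j in range(x.size):
    e = np.zeros_like(x)
    e[j] = eps
    grad_numeric[j] = (np.linalg.norm(x + e - a) - np.linalg.norm(x - e - a)) / (2 * eps)
```

The gradient is a unit vector pointing from $\boldsymbol{a}$ toward $\boldsymbol{x}$; the check breaks down near $\boldsymbol{x} = \boldsymbol{a}$, where the norm is not differentiable.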
12.2 Derivative of Normalized Vector
Proof
Introduce auxiliary variables.
\begin{equation}\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a} \label{eq:12-2-1}\end{equation}
\begin{equation}r = \|\boldsymbol{u}\|_2 \label{eq:12-2-2}\end{equation}
Define the normalized vector $\hat{\boldsymbol{u}}$.
\begin{equation}\hat{\boldsymbol{u}} = \frac{\boldsymbol{u}}{r} \label{eq:12-2-3}\end{equation}
We find the derivative of $\hat{\boldsymbol{u}}$ with respect to $\boldsymbol{x}$. This is the derivative of a vector divided by a scalar, and we apply the quotient rule (Ref. 1.28).
The derivative of $\boldsymbol{u}$ with respect to $\boldsymbol{x}$ is the identity matrix.
\begin{equation}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:12-2-4}\end{equation}
From Eq. \eqref{eq:12-1-10} in section 12.1, the derivative of $r = \|\boldsymbol{u}\|_2$ with respect to $\boldsymbol{u}$ is
\begin{equation}\frac{\partial r}{\partial \boldsymbol{u}} = \frac{\boldsymbol{u}}{r} = \hat{\boldsymbol{u}} \label{eq:12-2-5}\end{equation}
By the chain rule (Ref. 1.26), we compute the derivative of $r$ with respect to $\boldsymbol{x}$.
\begin{equation}\frac{\partial r}{\partial \boldsymbol{x}} = \frac{\partial r}{\partial \boldsymbol{u}} \cdot \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \hat{\boldsymbol{u}} \cdot \boldsymbol{I} = \frac{\boldsymbol{u}}{r} \label{eq:12-2-6}\end{equation}
Apply the quotient rule (Ref. 1.28) to the vector-scalar quotient $\hat{\boldsymbol{u}} = \boldsymbol{u} / r$:
\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{1}{r} \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} - \frac{\boldsymbol{u}}{r^2} \left(\frac{\partial r}{\partial \boldsymbol{x}}\right)^\top \label{eq:12-2-7}\end{equation}
Substitute Eqs. \eqref{eq:12-2-4} and \eqref{eq:12-2-6} into Eq. \eqref{eq:12-2-7}.
\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{1}{r} \boldsymbol{I} - \frac{\boldsymbol{u}}{r^2} \left(\frac{\boldsymbol{u}}{r}\right)^\top \label{eq:12-2-8}\end{equation}
Simplify Eq. \eqref{eq:12-2-8}.
\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{\boldsymbol{I}}{r} - \frac{\boldsymbol{u} \boldsymbol{u}^\top}{r^3} \label{eq:12-2-9}\end{equation}
Substitute Eqs. \eqref{eq:12-2-1} and \eqref{eq:12-2-2} to obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} = \frac{\boldsymbol{I}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} - \frac{(\boldsymbol{x} - \boldsymbol{a})(\boldsymbol{x} - \boldsymbol{a})^\top}{\|\boldsymbol{x} - \boldsymbol{a}\|_2^3} \label{eq:12-2-10}\end{equation}
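The Jacobian in Eq. \eqref{eq:12-2-10} can also be verified numerically. A minimal sketch, assuming NumPy and arbitrary data; row $i$ of the numerical Jacobian holds the partial derivatives with respect to $x_i$, matching denominator layout (the matrix is symmetric, so the layout choice does not affect the comparison).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
x = rng.normal(size=n)
a = rng.normal(size=n)
u = x - a
r = np.linalg.norm(u)

# Analytic Jacobian, Eq. (12-2-10): I/r - u u^T / r^3
J_analytic = np.eye(n) / r - np.outer(u, u) / r**3

def normalize(v):
    return v / np.linalg.norm(v)

# Central finite differences of the normalized vector
eps = 1e-6
J_numeric = np.zeros((n, n))
for i in range(n):
    e = np.zeros(n)
    e[i] = eps
    J_numeric[i] = (normalize(x + e - a) - normalize(x - e - a)) / (2 * eps)
```

Note that the Jacobian annihilates $\boldsymbol{u}$ itself: moving along $\boldsymbol{u}$ changes only the length of $\boldsymbol{u}$, not its direction, so $(\boldsymbol{I}/r - \boldsymbol{u}\boldsymbol{u}^\top/r^3)\,\boldsymbol{u} = \boldsymbol{0}$.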
12.3 Derivative of Squared 2-Norm
Proof
Recall the definition of the squared 2-norm.
\begin{equation}\|\boldsymbol{x}\|_2^2 = \boldsymbol{x}^\top\boldsymbol{x} \label{eq:12-3-1}\end{equation}
Expand Eq. \eqref{eq:12-3-1} component-wise.
\begin{equation}\|\boldsymbol{x}\|_2^2 = \sum_{i=0}^{n-1} x_i^2 \label{eq:12-3-2}\end{equation}
Take the partial derivative of $\|\boldsymbol{x}\|_2^2$ with respect to $x_j$. In the sum of Eq. \eqref{eq:12-3-2}, only the term with $i = j$ contains $x_j$.
\begin{equation}\frac{\partial}{\partial x_j} \|\boldsymbol{x}\|_2^2 = \frac{\partial}{\partial x_j} \sum_{i=0}^{n-1} x_i^2 = \frac{\partial}{\partial x_j} x_j^2 = 2x_j \label{eq:12-3-3}\end{equation}
Equation \eqref{eq:12-3-3} holds for all $j = 0, \ldots, n-1$. In denominator layout, the gradient collects these partial derivatives into a column vector.
\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = \begin{pmatrix} 2x_0 \\ 2x_1 \\ \vdots \\ 2x_{n-1} \end{pmatrix} = 2\boldsymbol{x} \label{eq:12-3-4}\end{equation}
From Eq. \eqref{eq:12-3-4}, we obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\boldsymbol{x} \label{eq:12-3-5}\end{equation}
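Because the squared norm is a quadratic polynomial, a central difference reproduces Eq. \eqref{eq:12-3-5} exactly up to rounding. A short check, assuming NumPy and arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=6)

grad_analytic = 2 * x  # Eq. (12-3-5)

# Central finite differences of ||x||_2^2 = x^T x
eps = 1e-6
grad_numeric = np.zeros_like(x)
for j in range(x.size):
    e = np.zeros_like(x)
    e[j] = eps
    grad_numeric[j] = ((x + e) @ (x + e) - (x - e) @ (x - e)) / (2 * eps)
```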
12.4 Squared Frobenius Norm
Proof
Recall the definition of the squared Frobenius norm.
\begin{equation}\|\boldsymbol{X}\|_F^2 = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) \label{eq:12-4-1}\end{equation}
Expand Eq. \eqref{eq:12-4-1} component-wise for $\boldsymbol{X} \in \mathbb{R}^{m \times n}$.
\begin{equation}\|\boldsymbol{X}\|_F^2 = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} X_{ij}^2 \label{eq:12-4-2}\end{equation}
Take the partial derivative of $\|\boldsymbol{X}\|_F^2$ with respect to component $X_{pq}$. In the double sum of Eq. \eqref{eq:12-4-2}, only the term with $(i, j) = (p, q)$ contains $X_{pq}$.
\begin{equation}\frac{\partial}{\partial X_{pq}} \|\boldsymbol{X}\|_F^2 = \frac{\partial}{\partial X_{pq}} \sum_{i,j} X_{ij}^2 = \frac{\partial}{\partial X_{pq}} X_{pq}^2 = 2X_{pq} \label{eq:12-4-3}\end{equation}
Equation \eqref{eq:12-4-3} holds for all $(p, q)$, so in matrix form we have
\begin{equation}\left(\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2\right)_{pq} = 2X_{pq} \label{eq:12-4-4}\end{equation}
From Eq. \eqref{eq:12-4-4}, we obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X} \label{eq:12-4-5}\end{equation}
Reference: F.G. Frobenius (1881) "Über die Darstellung der endlichen Gruppen durch lineare Substitutionen". The Frobenius norm is also called the Hilbert-Schmidt norm.
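Equation \eqref{eq:12-4-5} can be sanity-checked by perturbing one matrix entry at a time. A minimal sketch, assuming NumPy and arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 4))

grad_analytic = 2 * X  # Eq. (12-4-5)

def fro_sq(M):
    return np.trace(M.T @ M)  # ||M||_F^2 via the trace, Eq. (12-4-1)

# Central finite differences, one entry (p, q) at a time
eps = 1e-6
grad_numeric = np.zeros_like(X)
for p in range(X.shape[0]):
    for q in range(X.shape[1]):
        Xp = X.copy(); Xp[p, q] += eps
        Xm = X.copy(); Xm[p, q] -= eps
        grad_numeric[p, q] = (fro_sq(Xp) - fro_sq(Xm)) / (2 * eps)
```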
12.5 Frobenius Norm
Proof
The Frobenius norm is the square root of its square.
\begin{equation}\|\boldsymbol{X}\|_F = \sqrt{\|\boldsymbol{X}\|_F^2} \label{eq:12-5-1}\end{equation}
Define an inner function.
\begin{equation}g = \|\boldsymbol{X}\|_F^2 \label{eq:12-5-2}\end{equation}
Then $f = \|\boldsymbol{X}\|_F = \sqrt{g}$. To apply the chain rule (Ref. 1.26), we first compute the derivative of $f$ with respect to $g$.
\begin{equation}\frac{\partial f}{\partial g} = \frac{\partial}{\partial g} \sqrt{g} = \frac{1}{2\sqrt{g}} = \frac{1}{2\|\boldsymbol{X}\|_F} \label{eq:12-5-3}\end{equation}
From Eq. \eqref{eq:12-4-5} in section 12.4, the derivative of $g = \|\boldsymbol{X}\|_F^2$ is
\begin{equation}\frac{\partial g}{\partial \boldsymbol{X}} = 2\boldsymbol{X} \label{eq:12-5-4}\end{equation}
Apply the chain rule (Ref. 1.26).
\begin{equation}\frac{\partial f}{\partial \boldsymbol{X}} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial \boldsymbol{X}} \label{eq:12-5-5}\end{equation}
Substitute Eqs. \eqref{eq:12-5-3} and \eqref{eq:12-5-4} into Eq. \eqref{eq:12-5-5}.
\begin{equation}\frac{\partial f}{\partial \boldsymbol{X}} = \frac{1}{2\|\boldsymbol{X}\|_F} \cdot 2\boldsymbol{X} = \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F} \label{eq:12-5-6}\end{equation}
From Eq. \eqref{eq:12-5-6}, we obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F = \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F} \label{eq:12-5-7}\end{equation}
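A numerical check of Eq. \eqref{eq:12-5-7}, assuming NumPy and arbitrary data. The analytic gradient has unit Frobenius norm, which the test below also confirms.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 3))

# Analytic gradient, Eq. (12-5-7): X / ||X||_F
grad_analytic = X / np.linalg.norm(X, 'fro')

# Central finite differences of ||X||_F, one entry at a time
eps = 1e-6
grad_numeric = np.zeros_like(X)
for p in range(X.shape[0]):
    for q in range(X.shape[1]):
        Xp = X.copy(); Xp[p, q] += eps
        Xm = X.copy(); Xm[p, q] -= eps
        grad_numeric[p, q] = (np.linalg.norm(Xp, 'fro') - np.linalg.norm(Xm, 'fro')) / (2 * eps)
```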
12.6 Difference of Frobenius Norms
Proof
Introduce an auxiliary variable.
\begin{equation}\boldsymbol{U} = \boldsymbol{X} - \boldsymbol{A} \label{eq:12-6-1}\end{equation}
Since $\boldsymbol{A}$ is a constant matrix, the derivative of $\boldsymbol{U}$ with respect to $\boldsymbol{X}$ is the identity transformation. That is, for each component
\begin{equation}\frac{\partial U_{pq}}{\partial X_{rs}} = \delta_{pr}\delta_{qs} \label{eq:12-6-2}\end{equation}
Expand $\|\boldsymbol{U}\|_F^2$ component-wise.
\begin{equation}\|\boldsymbol{U}\|_F^2 = \sum_{i,j} U_{ij}^2 = \sum_{i,j} (X_{ij} - A_{ij})^2 \label{eq:12-6-3}\end{equation}
Take the partial derivative of $\|\boldsymbol{U}\|_F^2$ with respect to component $X_{pq}$. In the double sum of Eq. \eqref{eq:12-6-3}, only the term with $(i, j) = (p, q)$ contains $X_{pq}$.
\begin{equation}\frac{\partial}{\partial X_{pq}} \|\boldsymbol{U}\|_F^2 = \frac{\partial}{\partial X_{pq}} (X_{pq} - A_{pq})^2 \label{eq:12-6-4}\end{equation}
Compute the right-hand side of Eq. \eqref{eq:12-6-4}. The derivative of $(X_{pq} - A_{pq})^2$ by the chain rule (Ref. 1.26) is
\begin{equation}\frac{\partial}{\partial X_{pq}} (X_{pq} - A_{pq})^2 = 2(X_{pq} - A_{pq}) \cdot 1 = 2(X_{pq} - A_{pq}) \label{eq:12-6-5}\end{equation}
Equation \eqref{eq:12-6-5} holds for all $(p, q)$, so in matrix form we have
\begin{equation}\left(\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2\right)_{pq} = 2(X_{pq} - A_{pq}) = 2U_{pq} \label{eq:12-6-6}\end{equation}
From Eq. \eqref{eq:12-6-6}, we obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A}) \label{eq:12-6-7}\end{equation}
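Equation \eqref{eq:12-6-7} can be verified the same way as the other squared norms. A minimal sketch, assuming NumPy and arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(3, 4))
A = rng.normal(size=(3, 4))

grad_analytic = 2 * (X - A)  # Eq. (12-6-7)

def f(Xv):
    return np.linalg.norm(Xv - A, 'fro') ** 2

# Central finite differences, one entry (p, q) at a time
eps = 1e-6
grad_numeric = np.zeros_like(X)
for p in range(X.shape[0]):
    for q in range(X.shape[1]):
        Xp = X.copy(); Xp[p, q] += eps
        Xm = X.copy(); Xm[p, q] -= eps
        grad_numeric[p, q] = (f(Xp) - f(Xm)) / (2 * eps)
```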
12.7 Linear Regression Residual (Left Multiplication)
Proof
Introduce an auxiliary variable.
\begin{equation}\boldsymbol{U} = \boldsymbol{A}\boldsymbol{X} - \boldsymbol{B} \label{eq:12-7-1}\end{equation}
Let $\boldsymbol{A} \in \mathbb{R}^{M \times N}$, $\boldsymbol{X} \in \mathbb{R}^{N \times P}$, and $\boldsymbol{B} \in \mathbb{R}^{M \times P}$, so that $\boldsymbol{U} \in \mathbb{R}^{M \times P}$. The squared Frobenius norm is expressed using the trace.
\begin{equation}\|\boldsymbol{U}\|_F^2 = \text{tr}(\boldsymbol{U}^\top \boldsymbol{U}) \label{eq:12-7-2}\end{equation}
From Eq. \eqref{eq:12-4-5} in section 12.4, the derivative of $\|\boldsymbol{U}\|_F^2$ with respect to $\boldsymbol{U}$ is
\begin{equation}\frac{\partial}{\partial \boldsymbol{U}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{U} \label{eq:12-7-3}\end{equation}
Next, we compute the derivative of $\boldsymbol{U}$ with respect to $\boldsymbol{X}$. From Eq. \eqref{eq:12-7-1}, $U_{kl} = \sum_{n=0}^{N-1} A_{kn} X_{nl} - B_{kl}$.
\begin{equation}U_{kl} = \sum_{n=0}^{N-1} A_{kn} X_{nl} - B_{kl} \label{eq:12-7-4}\end{equation}
Take the partial derivative of $U_{kl}$ with respect to $X_{ij}$. In Eq. \eqref{eq:12-7-4}, only the term with $n = i$ and $l = j$ contains $X_{ij}$.
\begin{equation}\frac{\partial U_{kl}}{\partial X_{ij}} = A_{ki} \delta_{lj} \label{eq:12-7-5}\end{equation}
Here $\delta_{lj}$ is the Kronecker delta, which is 1 when $l = j$ and 0 otherwise.
Apply the chain rule. The derivative of $f = \|\boldsymbol{U}\|_F^2$ with respect to $X_{ij}$ is
\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} \frac{\partial f}{\partial U_{kl}} \frac{\partial U_{kl}}{\partial X_{ij}} \label{eq:12-7-6}\end{equation}
Substitute $\frac{\partial f}{\partial U_{kl}} = 2U_{kl}$ from Eq. \eqref{eq:12-7-3} into Eq. \eqref{eq:12-7-6}.
\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} 2U_{kl} \cdot A_{ki} \delta_{lj} \label{eq:12-7-7}\end{equation}
In Eq. \eqref{eq:12-7-7}, due to $\delta_{lj}$, only the term with $l = j$ survives.
\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k} 2U_{kj} A_{ki} = 2\sum_{k} A_{ki} U_{kj} \label{eq:12-7-8}\end{equation}
In Eq. \eqref{eq:12-7-8}, $\sum_k A_{ki} U_{kj}$ is the $(i, j)$ element of the matrix product $\boldsymbol{A}^\top \boldsymbol{U}$.
\begin{equation}\sum_{k} A_{ki} U_{kj} = (\boldsymbol{A}^\top \boldsymbol{U})_{ij} \label{eq:12-7-9}\end{equation}
Substitute Eq. \eqref{eq:12-7-9} into Eq. \eqref{eq:12-7-8}.
\begin{equation}\frac{\partial f}{\partial X_{ij}} = 2(\boldsymbol{A}^\top \boldsymbol{U})_{ij} \label{eq:12-7-10}\end{equation}
Equation \eqref{eq:12-7-10} holds for all $(i, j)$, so in matrix form we have
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{A}^\top \boldsymbol{U} \label{eq:12-7-11}\end{equation}
Substitute $\boldsymbol{U} = \boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}$ from Eq. \eqref{eq:12-7-1} to obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}\|_F^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}) \label{eq:12-7-12}\end{equation}
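Equation \eqref{eq:12-7-12} can be checked numerically with matrices of compatible but distinct shapes, which also guards against transposition mistakes. A sketch assuming NumPy and arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, P = 4, 3, 2
A = rng.normal(size=(M, N))
X = rng.normal(size=(N, P))
B = rng.normal(size=(M, P))

grad_analytic = 2 * A.T @ (A @ X - B)  # Eq. (12-7-12), shape N x P like X

def f(Xv):
    return np.linalg.norm(Xv @ np.eye(P) * 0 + A @ Xv - B, 'fro') ** 2 if False else np.linalg.norm(A @ Xv - B, 'fro') ** 2

# Central finite differences, one entry (i, j) of X at a time
eps = 1e-6
grad_numeric = np.zeros_like(X)
for i in range(N):
    for j in range(P):
        Xp = X.copy(); Xp[i, j] += eps
        Xm = X.copy(); Xm[i, j] -= eps
        grad_numeric[i, j] = (f(Xp) - f(Xm)) / (2 * eps)
```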
12.8 Linear Regression Residual (Right Multiplication)
Proof
Introduce an auxiliary variable.
\begin{equation}\boldsymbol{U} = \boldsymbol{X}\boldsymbol{A} - \boldsymbol{B} \label{eq:12-8-1}\end{equation}
Let $\boldsymbol{X} \in \mathbb{R}^{M \times N}$, $\boldsymbol{A} \in \mathbb{R}^{N \times P}$, and $\boldsymbol{B} \in \mathbb{R}^{M \times P}$, so that $\boldsymbol{U} \in \mathbb{R}^{M \times P}$. From Eq. \eqref{eq:12-8-1}, $U_{kl} = \sum_{n=0}^{N-1} X_{kn} A_{nl} - B_{kl}$.
\begin{equation}U_{kl} = \sum_{n=0}^{N-1} X_{kn} A_{nl} - B_{kl} \label{eq:12-8-2}\end{equation}
Take the partial derivative of $U_{kl}$ with respect to $X_{ij}$. In Eq. \eqref{eq:12-8-2}, only the term with $k = i$ and $n = j$ contains $X_{ij}$.
\begin{equation}\frac{\partial U_{kl}}{\partial X_{ij}} = \delta_{ki} A_{jl} \label{eq:12-8-3}\end{equation}
Apply the chain rule. The derivative of $f = \|\boldsymbol{U}\|_F^2$ with respect to $X_{ij}$ is
\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} \frac{\partial f}{\partial U_{kl}} \frac{\partial U_{kl}}{\partial X_{ij}} \label{eq:12-8-4}\end{equation}
Substitute $\frac{\partial f}{\partial U_{kl}} = 2U_{kl}$ and Eq. \eqref{eq:12-8-3} into Eq. \eqref{eq:12-8-4}.
\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} 2U_{kl} \cdot \delta_{ki} A_{jl} \label{eq:12-8-5}\end{equation}
In Eq. \eqref{eq:12-8-5}, due to $\delta_{ki}$, only the term with $k = i$ survives.
\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{l} 2U_{il} A_{jl} = 2\sum_{l} U_{il} A_{jl} \label{eq:12-8-6}\end{equation}
In Eq. \eqref{eq:12-8-6}, $\sum_l U_{il} A_{jl}$ is the $(i, j)$ element of the matrix product $\boldsymbol{U} \boldsymbol{A}^\top$.
\begin{equation}\sum_{l} U_{il} A_{jl} = \sum_{l} U_{il} (A^\top)_{lj} = (\boldsymbol{U} \boldsymbol{A}^\top)_{ij} \label{eq:12-8-7}\end{equation}
Substitute Eq. \eqref{eq:12-8-7} into Eq. \eqref{eq:12-8-6}.
\begin{equation}\frac{\partial f}{\partial X_{ij}} = 2(\boldsymbol{U} \boldsymbol{A}^\top)_{ij} \label{eq:12-8-8}\end{equation}
Equation \eqref{eq:12-8-8} holds for all $(i, j)$, so in matrix form we have
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{U} \boldsymbol{A}^\top \label{eq:12-8-9}\end{equation}
Substitute $\boldsymbol{U} = \boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}$ from Eq. \eqref{eq:12-8-1} to obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F^2 = 2(\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B})\boldsymbol{A}^\top \label{eq:12-8-10}\end{equation}
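Equation \eqref{eq:12-8-10} mirrors the left-multiplication case, with $\boldsymbol{A}^\top$ now appearing on the right. A numerical sketch assuming NumPy and arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(7)
M, N, P = 3, 4, 2
X = rng.normal(size=(M, N))
A = rng.normal(size=(N, P))
B = rng.normal(size=(M, P))

grad_analytic = 2 * (X @ A - B) @ A.T  # Eq. (12-8-10), shape M x N like X

def f(Xv):
    return np.linalg.norm(Xv @ A - B, 'fro') ** 2

# Central finite differences, one entry (i, j) of X at a time
eps = 1e-6
grad_numeric = np.zeros_like(X)
for i in range(M):
    for j in range(N):
        Xp = X.copy(); Xp[i, j] += eps
        Xm = X.copy(); Xm[i, j] -= eps
        grad_numeric[i, j] = (f(Xp) - f(Xm)) / (2 * eps)
```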
12.9 Regression Weight Gradient
Proof
Define the residual vector.
\begin{equation}\boldsymbol{r} = \boldsymbol{X}\boldsymbol{w} - \boldsymbol{y} \label{eq:12-9-1}\end{equation}
Here $\boldsymbol{X} \in \mathbb{R}^{N \times D}$, $\boldsymbol{w} \in \mathbb{R}^{D}$, and $\boldsymbol{y} \in \mathbb{R}^{N}$, so $\boldsymbol{r} \in \mathbb{R}^{N}$. The loss function is the squared 2-norm of the residual.
\begin{equation}L = \|\boldsymbol{r}\|_2^2 = \boldsymbol{r}^\top \boldsymbol{r} \label{eq:12-9-2}\end{equation}
Expand Eq. \eqref{eq:12-9-2} component-wise.
\begin{equation}L = \sum_{i=0}^{N-1} r_i^2 \label{eq:12-9-3}\end{equation}
From Eq. \eqref{eq:12-9-1}, each residual component is expressed as
\begin{equation}r_i = \sum_{j=0}^{D-1} X_{ij} w_j - y_i \label{eq:12-9-4}\end{equation}
Take the partial derivative of $r_i$ with respect to $w_k$. In Eq. \eqref{eq:12-9-4}, only the term with $j = k$ contains $w_k$.
\begin{equation}\frac{\partial r_i}{\partial w_k} = X_{ik} \label{eq:12-9-5}\end{equation}
Apply the chain rule to compute the derivative of $L$ with respect to $w_k$.
\begin{equation}\frac{\partial L}{\partial w_k} = \sum_{i=0}^{N-1} \frac{\partial L}{\partial r_i} \frac{\partial r_i}{\partial w_k} \label{eq:12-9-6}\end{equation}
From Eq. \eqref{eq:12-9-3}, $\frac{\partial L}{\partial r_i} = 2r_i$. Substitute this into Eq. \eqref{eq:12-9-6}.
\begin{equation}\frac{\partial L}{\partial w_k} = \sum_{i=0}^{N-1} 2r_i \cdot X_{ik} = 2\sum_{i=0}^{N-1} X_{ik} r_i \label{eq:12-9-7}\end{equation}
In Eq. \eqref{eq:12-9-7}, $\sum_i X_{ik} r_i$ is the $k$-th element of the matrix-vector product $\boldsymbol{X}^\top \boldsymbol{r}$.
\begin{equation}\sum_{i=0}^{N-1} X_{ik} r_i = \sum_{i=0}^{N-1} (X^\top)_{ki} r_i = (\boldsymbol{X}^\top \boldsymbol{r})_k \label{eq:12-9-8}\end{equation}
Substitute Eq. \eqref{eq:12-9-8} into Eq. \eqref{eq:12-9-7}.
\begin{equation}\frac{\partial L}{\partial w_k} = 2(\boldsymbol{X}^\top \boldsymbol{r})_k \label{eq:12-9-9}\end{equation}
Equation \eqref{eq:12-9-9} holds for all $k = 0, \ldots, D-1$, so in vector form we have
\begin{equation}\frac{\partial L}{\partial \boldsymbol{w}} = 2\boldsymbol{X}^\top \boldsymbol{r} \label{eq:12-9-10}\end{equation}
Substitute $\boldsymbol{r} = \boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}$ from Eq. \eqref{eq:12-9-1} to obtain the final result.
\begin{equation}\frac{\partial}{\partial \boldsymbol{w}} \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|_2^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) \label{eq:12-9-11}\end{equation}
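Setting the gradient in Eq. \eqref{eq:12-9-11} to zero yields the normal equations $\boldsymbol{X}^\top\boldsymbol{X}\boldsymbol{w} = \boldsymbol{X}^\top\boldsymbol{y}$, so the gradient vanishes at the least-squares solution. The sketch below, assuming NumPy and arbitrary data, checks the formula against finite differences and confirms that the gradient is (numerically) zero at the solution returned by `np.linalg.lstsq`.

```python
import numpy as np

rng = np.random.default_rng(8)
N, D = 20, 3
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

grad_analytic = 2 * X.T @ (X @ w - y)  # Eq. (12-9-11)

def L(wv):
    r = X @ wv - y
    return r @ r  # ||Xw - y||_2^2

# Central finite differences, one weight at a time
eps = 1e-6
grad_numeric = np.zeros_like(w)
for k in range(D):
    e = np.zeros(D)
    e[k] = eps
    grad_numeric[k] = (L(w + e) - L(w - e)) / (2 * eps)

# At the least-squares solution the gradient vanishes (normal equations)
w_star = np.linalg.lstsq(X, y, rcond=None)[0]
grad_at_optimum = 2 * X.T @ (X @ w_star - y)
```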
References
- Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.
- Magnus, J. R., & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). Wiley.
- Matrix calculus. Wikipedia.