Proofs Chapter 12: Derivatives of Norms (Basic Formulas)

This chapter proves the differentiation formulas for vector norms and the matrix Frobenius norm. Derivatives of norms are fundamental formulas directly applicable to optimization and statistical learning: the least squares gradient in linear regression, subgradient computation in L1/L2 regularization, nuclear norm minimization in matrix completion, and sparse recovery in signal processing. We cover practical formulas ranging from the derivative of the 2-norm and the normalized vector to Frobenius-norm and regression-residual gradients.

Prerequisites: Chapter 2 (Differentiation of a scalar by a vector), Chapter 5 (Differentiation of trace). Related chapters: Chapter 10 (Differentiation of quadratic forms), Chapter 15 (Differentiation of special matrices).

12. Derivatives of Norms

Prerequisites for this chapter
The formulas in this chapter apply under the following conditions unless otherwise noted:
  • All formulas are based on denominator layout
  • The norm $\|\boldsymbol{x}\|$ is not differentiable at $\boldsymbol{x} = \boldsymbol{0}$ (a subgradient is used where applicable)
  • The squared norm $\|\boldsymbol{x}\|^2$ is differentiable everywhere

We derive differentiation formulas for vector norms and matrix norms, particularly the Frobenius norm.

12.1 Derivative of Vector 2-Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|_2 = \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2}$
Conditions: $\boldsymbol{a}$ is a constant vector, $\boldsymbol{x} \neq \boldsymbol{a}$
Proof

Introduce an auxiliary variable $\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a}$.

\begin{equation}\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a} \label{eq:12-1-1}\end{equation}

Since $\boldsymbol{a}$ is a constant vector, the derivative of $\boldsymbol{u}$ with respect to $\boldsymbol{x}$ is the identity matrix.

\begin{equation}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:12-1-2}\end{equation}

From the definition of the vector 2-norm, $\|\boldsymbol{u}\|_2$ can be expressed as follows.

\begin{equation}\|\boldsymbol{u}\|_2 = \sqrt{\boldsymbol{u}^\top\boldsymbol{u}} = \sqrt{\sum_{i=0}^{n-1} u_i^2} \label{eq:12-1-3}\end{equation}

Let $f = \|\boldsymbol{u}\|_2 = \sqrt{\boldsymbol{u}^\top\boldsymbol{u}}$. Define the inner function $g = \boldsymbol{u}^\top\boldsymbol{u}$.

\begin{equation}g = \boldsymbol{u}^\top\boldsymbol{u} \label{eq:12-1-4}\end{equation}

Then $f = \sqrt{g}$. To apply the chain rule (Ref. 1.26), we first compute the derivative of $f$ with respect to $g$.

\begin{equation}\frac{\partial f}{\partial g} = \frac{\partial}{\partial g} \sqrt{g} = \frac{1}{2\sqrt{g}} = \frac{1}{2\|\boldsymbol{u}\|_2} \label{eq:12-1-5}\end{equation}

Next, we compute the derivative of $g = \boldsymbol{u}^\top\boldsymbol{u}$ with respect to $\boldsymbol{u}$. Expand $g$ component-wise.

\begin{equation}g = \sum_{i=0}^{n-1} u_i^2 \label{eq:12-1-6}\end{equation}

Take the partial derivative of $g$ with respect to $u_j$.

\begin{equation}\frac{\partial g}{\partial u_j} = \frac{\partial}{\partial u_j} \sum_{i=0}^{n-1} u_i^2 = 2u_j \label{eq:12-1-7}\end{equation}

Write Eq. \eqref{eq:12-1-7} in vector form.

\begin{equation}\frac{\partial g}{\partial \boldsymbol{u}} = 2\boldsymbol{u} \label{eq:12-1-8}\end{equation}

Apply the chain rule (Ref. 1.26). The derivative of $f$ with respect to $\boldsymbol{u}$ is obtained as follows.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{u}} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial \boldsymbol{u}} \label{eq:12-1-9}\end{equation}

Substitute Eqs. \eqref{eq:12-1-5} and \eqref{eq:12-1-8} into Eq. \eqref{eq:12-1-9}.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{u}} = \frac{1}{2\|\boldsymbol{u}\|_2} \cdot 2\boldsymbol{u} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \label{eq:12-1-10}\end{equation}

Further apply the chain rule (Ref. 1.26) to find the derivative of $f$ with respect to $\boldsymbol{x}$.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{x}} = \frac{\partial f}{\partial \boldsymbol{u}} \cdot \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} \label{eq:12-1-11}\end{equation}

Substitute Eqs. \eqref{eq:12-1-2} and \eqref{eq:12-1-10} into Eq. \eqref{eq:12-1-11}.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{x}} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \cdot \boldsymbol{I} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \label{eq:12-1-12}\end{equation}

From Eq. \eqref{eq:12-1-1}, substitute $\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a}$ to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|_2 = \frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} \label{eq:12-1-13}\end{equation}

Note: The right-hand side of Eq. \eqref{eq:12-1-13} is the normalized form of the vector $\boldsymbol{x} - \boldsymbol{a}$. When $\boldsymbol{a} = \boldsymbol{0}$, we have $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2 = \displaystyle\frac{\boldsymbol{x}}{\|\boldsymbol{x}\|_2}$, which is equal to the unit vector $\hat{\boldsymbol{x}}$.
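
As an illustrative check of Eq. \eqref{eq:12-1-13}, the gradient can be compared against central finite differences. The NumPy sketch below uses arbitrary random vectors; the dimension, seed, and tolerance are chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
a = rng.normal(size=5)

# analytic gradient from Eq. (12-1-13): (x - a) / ||x - a||_2
grad = (x - a) / np.linalg.norm(x - a)

# central finite differences as an independent check
eps = 1e-6
fd = np.array([
    (np.linalg.norm(x + eps * e - a) - np.linalg.norm(x - eps * e - a)) / (2 * eps)
    for e in np.eye(5)
])

print(np.allclose(grad, fd, atol=1e-6))  # True
```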

12.2 Derivative of Normalized Vector

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} = \displaystyle\frac{\boldsymbol{I}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} - \displaystyle\frac{(\boldsymbol{x} - \boldsymbol{a})(\boldsymbol{x} - \boldsymbol{a})^\top}{\|\boldsymbol{x} - \boldsymbol{a}\|_2^3}$
Conditions: $\boldsymbol{a}$ is a constant vector, $\boldsymbol{x} \neq \boldsymbol{a}$
Proof

Introduce auxiliary variables.

\begin{equation}\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a} \label{eq:12-2-1}\end{equation}

\begin{equation}r = \|\boldsymbol{u}\|_2 \label{eq:12-2-2}\end{equation}

Define the normalized vector $\hat{\boldsymbol{u}}$.

\begin{equation}\hat{\boldsymbol{u}} = \frac{\boldsymbol{u}}{r} \label{eq:12-2-3}\end{equation}

We find the derivative of $\hat{\boldsymbol{u}}$ with respect to $\boldsymbol{x}$. This is the derivative of a vector divided by a scalar, and we apply the quotient rule (Ref. 1.28).

The derivative of $\boldsymbol{u}$ with respect to $\boldsymbol{x}$ is the identity matrix.

\begin{equation}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:12-2-4}\end{equation}

From Eq. \eqref{eq:12-1-10} in section 12.1, the derivative of $r = \|\boldsymbol{u}\|_2$ with respect to $\boldsymbol{u}$ is

\begin{equation}\frac{\partial r}{\partial \boldsymbol{u}} = \frac{\boldsymbol{u}}{r} = \hat{\boldsymbol{u}} \label{eq:12-2-5}\end{equation}

By the chain rule (Ref. 1.26), we compute the derivative of $r$ with respect to $\boldsymbol{x}$.

\begin{equation}\frac{\partial r}{\partial \boldsymbol{x}} = \frac{\partial r}{\partial \boldsymbol{u}} \cdot \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \hat{\boldsymbol{u}} \cdot \boldsymbol{I} = \frac{\boldsymbol{u}}{r} \label{eq:12-2-6}\end{equation}

Apply the quotient rule (Ref. 1.28) to the vector-scalar quotient $\hat{\boldsymbol{u}} = \boldsymbol{u} / r$:

\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{1}{r} \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} - \frac{\boldsymbol{u}}{r^2} \left(\frac{\partial r}{\partial \boldsymbol{x}}\right)^\top \label{eq:12-2-7}\end{equation}

Substitute Eqs. \eqref{eq:12-2-4} and \eqref{eq:12-2-6} into Eq. \eqref{eq:12-2-7}.

\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{1}{r} \boldsymbol{I} - \frac{\boldsymbol{u}}{r^2} \left(\frac{\boldsymbol{u}}{r}\right)^\top \label{eq:12-2-8}\end{equation}

Simplify Eq. \eqref{eq:12-2-8}.

\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{\boldsymbol{I}}{r} - \frac{\boldsymbol{u} \boldsymbol{u}^\top}{r^3} \label{eq:12-2-9}\end{equation}

Substitute Eqs. \eqref{eq:12-2-1} and \eqref{eq:12-2-2} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} = \frac{\boldsymbol{I}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} - \frac{(\boldsymbol{x} - \boldsymbol{a})(\boldsymbol{x} - \boldsymbol{a})^\top}{\|\boldsymbol{x} - \boldsymbol{a}\|_2^3} \label{eq:12-2-10}\end{equation}

Note: Equation \eqref{eq:12-2-10} can be written as $\displaystyle\frac{1}{r}(\boldsymbol{I} - \hat{\boldsymbol{u}}\hat{\boldsymbol{u}}^\top)$, where $\boldsymbol{I} - \hat{\boldsymbol{u}}\hat{\boldsymbol{u}}^\top$ is the projection matrix onto the subspace orthogonal to $\hat{\boldsymbol{u}}$. This means that changes in the normalized vector occur only in the orthogonal direction to the original vector.
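
As an illustrative check of Eq. \eqref{eq:12-2-10} and the projection form $\frac{1}{r}(\boldsymbol{I} - \hat{\boldsymbol{u}}\hat{\boldsymbol{u}}^\top)$, the sketch below compares the analytic matrix with a finite-difference Jacobian; since the result is symmetric, the layout convention does not affect the comparison. Dimensions and random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
a = rng.normal(size=4)
u = x - a
r = np.linalg.norm(u)
u_hat = u / r

# analytic form: (I - u_hat u_hat^T) / r
J = (np.eye(4) - np.outer(u_hat, u_hat)) / r

# finite-difference Jacobian, one column per perturbed coordinate of x
eps = 1e-6
cols = []
for e in np.eye(4):
    f_plus = (x + eps * e - a) / np.linalg.norm(x + eps * e - a)
    f_minus = (x - eps * e - a) / np.linalg.norm(x - eps * e - a)
    cols.append((f_plus - f_minus) / (2 * eps))
fd = np.column_stack(cols)

print(np.allclose(J, fd, atol=1e-6))  # True
```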

12.3 Derivative of Squared 2-Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\boldsymbol{x}$
Conditions: $\boldsymbol{x} \in \mathbb{R}^n$
Proof

Recall the definition of the squared 2-norm.

\begin{equation}\|\boldsymbol{x}\|_2^2 = \boldsymbol{x}^\top\boldsymbol{x} \label{eq:12-3-1}\end{equation}

Expand Eq. \eqref{eq:12-3-1} component-wise.

\begin{equation}\|\boldsymbol{x}\|_2^2 = \sum_{i=0}^{n-1} x_i^2 \label{eq:12-3-2}\end{equation}

Take the partial derivative of $\|\boldsymbol{x}\|_2^2$ with respect to $x_j$. In the sum of Eq. \eqref{eq:12-3-2}, only the term with $i = j$ contains $x_j$.

\begin{equation}\frac{\partial}{\partial x_j} \|\boldsymbol{x}\|_2^2 = \frac{\partial}{\partial x_j} \sum_{i=0}^{n-1} x_i^2 = \frac{\partial}{\partial x_j} x_j^2 = 2x_j \label{eq:12-3-3}\end{equation}

Equation \eqref{eq:12-3-3} holds for all $j = 0, \ldots, n-1$. In denominator layout, the gradient is arranged as a column vector.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = \begin{pmatrix} 2x_0 \\ 2x_1 \\ \vdots \\ 2x_{n-1} \end{pmatrix} = 2\boldsymbol{x} \label{eq:12-3-4}\end{equation}

From Eq. \eqref{eq:12-3-4}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\boldsymbol{x} \label{eq:12-3-5}\end{equation}

Note: Equation \eqref{eq:12-3-5} can also be derived as a special case of 12.1. Applying the chain rule to $\|\boldsymbol{x}\|_2^2 = (\|\boldsymbol{x}\|_2)^2$ gives $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\|\boldsymbol{x}\|_2 \cdot \displaystyle\frac{\boldsymbol{x}}{\|\boldsymbol{x}\|_2} = 2\boldsymbol{x}$.
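
As an illustrative check of Eq. \eqref{eq:12-3-5}, a finite-difference comparison is immediate; the vector below is arbitrary.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
eps = 1e-6
fd = np.array([
    (np.sum((x + eps * e) ** 2) - np.sum((x - eps * e) ** 2)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(fd, 2 * x))  # True, matching Eq. (12-3-5)
```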

12.4 Squared Frobenius Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is a real matrix, $\|\boldsymbol{X}\|_F^2 = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = \sum_{i,j} X_{ij}^2$
Proof

Recall the definition of the squared Frobenius norm.

\begin{equation}\|\boldsymbol{X}\|_F^2 = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) \label{eq:12-4-1}\end{equation}

Expand Eq. \eqref{eq:12-4-1} component-wise, with $\boldsymbol{X} \in \mathbb{R}^{m \times n}$.

\begin{equation}\|\boldsymbol{X}\|_F^2 = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} X_{ij}^2 \label{eq:12-4-2}\end{equation}

Take the partial derivative of $\|\boldsymbol{X}\|_F^2$ with respect to component $X_{pq}$. In the double sum of Eq. \eqref{eq:12-4-2}, only the term with $(i, j) = (p, q)$ contains $X_{pq}$.

\begin{equation}\frac{\partial}{\partial X_{pq}} \|\boldsymbol{X}\|_F^2 = \frac{\partial}{\partial X_{pq}} \sum_{i,j} X_{ij}^2 = \frac{\partial}{\partial X_{pq}} X_{pq}^2 = 2X_{pq} \label{eq:12-4-3}\end{equation}

Equation \eqref{eq:12-4-3} holds for all $(p, q)$, so in matrix form we have

\begin{equation}\left(\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2\right)_{pq} = 2X_{pq} \label{eq:12-4-4}\end{equation}

From Eq. \eqref{eq:12-4-4}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X} \label{eq:12-4-5}\end{equation}

Note: Equation \eqref{eq:12-4-5} can also be derived from Ref. 3.8 as $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = 2\boldsymbol{X}$. The squared Frobenius norm is $\|\boldsymbol{X}\|_F^2 = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top)$, and both are equal by the cyclic property of the trace.
Reference: F.G. Frobenius (1881) "Über die Darstellung der endlichen Gruppen durch lineare Substitutionen". The Frobenius norm is also called the Hilbert-Schmidt norm.
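
As an illustrative check of Eq. \eqref{eq:12-4-5}, the sketch below perturbs each entry of an arbitrary random matrix and compares the finite-difference result with $2\boldsymbol{X}$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))

eps = 1e-6
fd = np.zeros_like(X)
for p in range(X.shape[0]):
    for q in range(X.shape[1]):
        E = np.zeros_like(X)
        E[p, q] = eps
        # ||X||_F^2 is the sum of squared entries, Eq. (12-4-2)
        fd[p, q] = (np.sum((X + E) ** 2) - np.sum((X - E) ** 2)) / (2 * eps)

print(np.allclose(fd, 2 * X, atol=1e-6))  # True
```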

12.5 Frobenius Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F = \displaystyle\frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F}$
Conditions: $\boldsymbol{X} \neq \boldsymbol{O}$ (non-zero matrix)
Proof

The Frobenius norm is the square root of its square.

\begin{equation}\|\boldsymbol{X}\|_F = \sqrt{\|\boldsymbol{X}\|_F^2} \label{eq:12-5-1}\end{equation}

Define an inner function.

\begin{equation}g = \|\boldsymbol{X}\|_F^2 \label{eq:12-5-2}\end{equation}

Then $f = \|\boldsymbol{X}\|_F = \sqrt{g}$. To apply the chain rule (Ref. 1.26), we first compute the derivative of $f$ with respect to $g$.

\begin{equation}\frac{\partial f}{\partial g} = \frac{\partial}{\partial g} \sqrt{g} = \frac{1}{2\sqrt{g}} = \frac{1}{2\|\boldsymbol{X}\|_F} \label{eq:12-5-3}\end{equation}

From Eq. \eqref{eq:12-4-5} in section 12.4, the derivative of $g = \|\boldsymbol{X}\|_F^2$ is

\begin{equation}\frac{\partial g}{\partial \boldsymbol{X}} = 2\boldsymbol{X} \label{eq:12-5-4}\end{equation}

Apply the chain rule (Ref. 1.26).

\begin{equation}\frac{\partial f}{\partial \boldsymbol{X}} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial \boldsymbol{X}} \label{eq:12-5-5}\end{equation}

Substitute Eqs. \eqref{eq:12-5-3} and \eqref{eq:12-5-4} into Eq. \eqref{eq:12-5-5}.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{X}} = \frac{1}{2\|\boldsymbol{X}\|_F} \cdot 2\boldsymbol{X} = \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F} \label{eq:12-5-6}\end{equation}

From Eq. \eqref{eq:12-5-6}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F = \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F} \label{eq:12-5-7}\end{equation}

Note: Equation \eqref{eq:12-5-7} is the matrix version of the vector 2-norm derivative formula \eqref{eq:12-1-13} (with $\boldsymbol{a} = \boldsymbol{0}$). The gradient of the Frobenius norm is the normalized matrix.
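
As an illustrative check of Eq. \eqref{eq:12-5-7}, the sketch below compares $\boldsymbol{X}/\|\boldsymbol{X}\|_F$ with an entry-wise finite-difference gradient of the (unsquared) Frobenius norm; the matrix is arbitrary random data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 3))

grad = X / np.linalg.norm(X)  # np.linalg.norm defaults to the Frobenius norm for matrices

eps = 1e-6
fd = np.zeros_like(X)
for p in range(3):
    for q in range(3):
        E = np.zeros_like(X)
        E[p, q] = eps
        fd[p, q] = (np.linalg.norm(X + E) - np.linalg.norm(X - E)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-6))  # True
```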

12.6 Squared Frobenius Norm of a Difference

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A})$
Conditions: $\boldsymbol{A}$ is a constant matrix
Proof

Introduce an auxiliary variable.

\begin{equation}\boldsymbol{U} = \boldsymbol{X} - \boldsymbol{A} \label{eq:12-6-1}\end{equation}

Since $\boldsymbol{A}$ is a constant matrix, the derivative of $\boldsymbol{U}$ with respect to $\boldsymbol{X}$ is the identity transformation. That is, for each component

\begin{equation}\frac{\partial U_{pq}}{\partial X_{rs}} = \delta_{pr}\delta_{qs} \label{eq:12-6-2}\end{equation}

Expand $\|\boldsymbol{U}\|_F^2$ component-wise.

\begin{equation}\|\boldsymbol{U}\|_F^2 = \sum_{i,j} U_{ij}^2 = \sum_{i,j} (X_{ij} - A_{ij})^2 \label{eq:12-6-3}\end{equation}

Take the partial derivative of $\|\boldsymbol{U}\|_F^2$ with respect to component $X_{pq}$. In the double sum of Eq. \eqref{eq:12-6-3}, only the term with $(i, j) = (p, q)$ contains $X_{pq}$.

\begin{equation}\frac{\partial}{\partial X_{pq}} \|\boldsymbol{U}\|_F^2 = \frac{\partial}{\partial X_{pq}} (X_{pq} - A_{pq})^2 \label{eq:12-6-4}\end{equation}

Compute the right-hand side of Eq. \eqref{eq:12-6-4}. The derivative of $(X_{pq} - A_{pq})^2$ by the chain rule (Ref. 1.26) is

\begin{equation}\frac{\partial}{\partial X_{pq}} (X_{pq} - A_{pq})^2 = 2(X_{pq} - A_{pq}) \cdot 1 = 2(X_{pq} - A_{pq}) \label{eq:12-6-5}\end{equation}

Equation \eqref{eq:12-6-5} holds for all $(p, q)$, so in matrix form we have

\begin{equation}\left(\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2\right)_{pq} = 2(X_{pq} - A_{pq}) = 2U_{pq} \label{eq:12-6-6}\end{equation}

From Eq. \eqref{eq:12-6-6}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A}) \label{eq:12-6-7}\end{equation}

Note: Equation \eqref{eq:12-6-7} is important in the matrix version of least squares. The condition for minimizing $\|\boldsymbol{X} - \boldsymbol{A}\|_F^2$ is that the gradient equals zero, i.e., $\boldsymbol{X} = \boldsymbol{A}$. This is consistent with the fact that $\|\boldsymbol{X} - \boldsymbol{A}\|_F^2 \geq 0$ achieves its minimum value of 0 when $\boldsymbol{X} = \boldsymbol{A}$.
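
As an illustrative check of Eq. \eqref{eq:12-6-7}, the sketch below verifies the gradient entry-wise with finite differences; the matrices are arbitrary random data.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 3))
A = rng.normal(size=(2, 3))

grad = 2 * (X - A)  # Eq. (12-6-7)

eps = 1e-6
fd = np.zeros_like(X)
for p in range(2):
    for q in range(3):
        E = np.zeros_like(X)
        E[p, q] = eps
        fd[p, q] = (np.sum((X + E - A) ** 2) - np.sum((X - E - A) ** 2)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-6))  # True
```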

12.7 Linear Regression Residual (Left Multiplication)

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}\|_F^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B})$
Conditions: $\boldsymbol{A}$ is an $M \times N$ constant matrix, $\boldsymbol{X}$ is an $N \times P$ matrix, $\boldsymbol{B}$ is an $M \times P$ constant matrix
Proof

Introduce an auxiliary variable.

\begin{equation}\boldsymbol{U} = \boldsymbol{A}\boldsymbol{X} - \boldsymbol{B} \label{eq:12-7-1}\end{equation}

$\boldsymbol{U} \in \mathbb{R}^{M \times P}$. The squared Frobenius norm is expressed using the trace.

\begin{equation}\|\boldsymbol{U}\|_F^2 = \text{tr}(\boldsymbol{U}^\top \boldsymbol{U}) \label{eq:12-7-2}\end{equation}

From Eq. \eqref{eq:12-4-5} in section 12.4, the derivative of $\|\boldsymbol{U}\|_F^2$ with respect to $\boldsymbol{U}$ is

\begin{equation}\frac{\partial}{\partial \boldsymbol{U}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{U} \label{eq:12-7-3}\end{equation}

Next, we compute the derivative of $\boldsymbol{U}$ with respect to $\boldsymbol{X}$. From Eq. \eqref{eq:12-7-1}, each component of $\boldsymbol{U}$ is

\begin{equation}U_{kl} = \sum_{n=0}^{N-1} A_{kn} X_{nl} - B_{kl} \label{eq:12-7-4}\end{equation}

Take the partial derivative of $U_{kl}$ with respect to $X_{ij}$. In Eq. \eqref{eq:12-7-4}, only the term with $n = i$ and $l = j$ contains $X_{ij}$.

\begin{equation}\frac{\partial U_{kl}}{\partial X_{ij}} = A_{ki} \delta_{lj} \label{eq:12-7-5}\end{equation}

Here $\delta_{lj}$ is the Kronecker delta, which is 1 when $l = j$ and 0 otherwise.

Apply the chain rule. The derivative of $f = \|\boldsymbol{U}\|_F^2$ with respect to $X_{ij}$ is

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} \frac{\partial f}{\partial U_{kl}} \frac{\partial U_{kl}}{\partial X_{ij}} \label{eq:12-7-6}\end{equation}

Substitute $\frac{\partial f}{\partial U_{kl}} = 2U_{kl}$ from Eq. \eqref{eq:12-7-3} into Eq. \eqref{eq:12-7-6}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} 2U_{kl} \cdot A_{ki} \delta_{lj} \label{eq:12-7-7}\end{equation}

In Eq. \eqref{eq:12-7-7}, due to $\delta_{lj}$, only the term with $l = j$ survives.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k} 2U_{kj} A_{ki} = 2\sum_{k} A_{ki} U_{kj} \label{eq:12-7-8}\end{equation}

In Eq. \eqref{eq:12-7-8}, $\sum_k A_{ki} U_{kj}$ is the $(i, j)$ element of the matrix product $\boldsymbol{A}^\top \boldsymbol{U}$.

\begin{equation}\sum_{k} A_{ki} U_{kj} = (\boldsymbol{A}^\top \boldsymbol{U})_{ij} \label{eq:12-7-9}\end{equation}

Substitute Eq. \eqref{eq:12-7-9} into Eq. \eqref{eq:12-7-8}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = 2(\boldsymbol{A}^\top \boldsymbol{U})_{ij} \label{eq:12-7-10}\end{equation}

Equation \eqref{eq:12-7-10} holds for all $(i, j)$, so in matrix form we have

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{A}^\top \boldsymbol{U} \label{eq:12-7-11}\end{equation}

Substitute $\boldsymbol{U} = \boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}$ from Eq. \eqref{eq:12-7-1} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}\|_F^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}) \label{eq:12-7-12}\end{equation}

Note: Equation \eqref{eq:12-7-12} is used to derive the least squares solution for linear regression $\boldsymbol{A}\boldsymbol{X} \approx \boldsymbol{B}$. Setting the gradient to zero yields the normal equations $\boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{X} = \boldsymbol{A}^\top \boldsymbol{B}$, and if $\boldsymbol{A}^\top \boldsymbol{A}$ is invertible, their solution is $\boldsymbol{X} = (\boldsymbol{A}^\top \boldsymbol{A})^{-1} \boldsymbol{A}^\top \boldsymbol{B}$.
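
As an illustrative check of Eq. \eqref{eq:12-7-12} and the normal equations in the note above, the sketch below compares the analytic gradient with a finite-difference estimate for one entry, and confirms that solving $\boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{X} = \boldsymbol{A}^\top \boldsymbol{B}$ matches NumPy's least-squares routine; the shapes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, P = 6, 4, 3
A = rng.normal(size=(M, N))
B = rng.normal(size=(M, P))
X = rng.normal(size=(N, P))

grad = 2 * A.T @ (A @ X - B)  # Eq. (12-7-12)

# finite-difference check of one arbitrary entry, here (1, 2)
eps = 1e-6
E = np.zeros_like(X)
E[1, 2] = eps
fd = (np.sum((A @ (X + E) - B) ** 2) - np.sum((A @ (X - E) - B) ** 2)) / (2 * eps)
print(np.isclose(grad[1, 2], fd, atol=1e-5))  # True

# stationary point: the solution of the normal equations matches lstsq
X_star = np.linalg.solve(A.T @ A, A.T @ B)
print(np.allclose(X_star, np.linalg.lstsq(A, B, rcond=None)[0]))  # True
```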

12.8 Linear Regression Residual (Right Multiplication)

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F^2 = 2(\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B})\boldsymbol{A}^\top$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times P$ constant matrix, $\boldsymbol{B}$ is an $M \times P$ constant matrix
Proof

Introduce an auxiliary variable.

\begin{equation}\boldsymbol{U} = \boldsymbol{X}\boldsymbol{A} - \boldsymbol{B} \label{eq:12-8-1}\end{equation}

$\boldsymbol{U} \in \mathbb{R}^{M \times P}$. From Eq. \eqref{eq:12-8-1}, each component of $\boldsymbol{U}$ is

\begin{equation}U_{kl} = \sum_{n=0}^{N-1} X_{kn} A_{nl} - B_{kl} \label{eq:12-8-2}\end{equation}

Take the partial derivative of $U_{kl}$ with respect to $X_{ij}$. In Eq. \eqref{eq:12-8-2}, only the term with $k = i$ and $n = j$ contains $X_{ij}$.

\begin{equation}\frac{\partial U_{kl}}{\partial X_{ij}} = \delta_{ki} A_{jl} \label{eq:12-8-3}\end{equation}

Apply the chain rule. The derivative of $f = \|\boldsymbol{U}\|_F^2$ with respect to $X_{ij}$ is

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} \frac{\partial f}{\partial U_{kl}} \frac{\partial U_{kl}}{\partial X_{ij}} \label{eq:12-8-4}\end{equation}

Substitute $\frac{\partial f}{\partial U_{kl}} = 2U_{kl}$ (from Eq. \eqref{eq:12-4-5}) and Eq. \eqref{eq:12-8-3} into Eq. \eqref{eq:12-8-4}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} 2U_{kl} \cdot \delta_{ki} A_{jl} \label{eq:12-8-5}\end{equation}

In Eq. \eqref{eq:12-8-5}, due to $\delta_{ki}$, only the term with $k = i$ survives.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{l} 2U_{il} A_{jl} = 2\sum_{l} U_{il} A_{jl} \label{eq:12-8-6}\end{equation}

In Eq. \eqref{eq:12-8-6}, $\sum_l U_{il} A_{jl}$ is the $(i, j)$ element of the matrix product $\boldsymbol{U} \boldsymbol{A}^\top$.

\begin{equation}\sum_{l} U_{il} A_{jl} = \sum_{l} U_{il} (A^\top)_{lj} = (\boldsymbol{U} \boldsymbol{A}^\top)_{ij} \label{eq:12-8-7}\end{equation}

Substitute Eq. \eqref{eq:12-8-7} into Eq. \eqref{eq:12-8-6}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = 2(\boldsymbol{U} \boldsymbol{A}^\top)_{ij} \label{eq:12-8-8}\end{equation}

Equation \eqref{eq:12-8-8} holds for all $(i, j)$, so in matrix form we have

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{U} \boldsymbol{A}^\top \label{eq:12-8-9}\end{equation}

Substitute $\boldsymbol{U} = \boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}$ from Eq. \eqref{eq:12-8-1} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F^2 = 2(\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B})\boldsymbol{A}^\top \label{eq:12-8-10}\end{equation}

Note: Equation \eqref{eq:12-8-10} is the right-multiplication counterpart of 12.7 and can be obtained from it by transposing, since $\|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F = \|\boldsymbol{A}^\top\boldsymbol{X}^\top - \boldsymbol{B}^\top\|_F$. The position of $\boldsymbol{A}$ relative to $\boldsymbol{X}$ determines the side on which $\boldsymbol{A}^\top$ appears: if $\boldsymbol{A}$ multiplies $\boldsymbol{X}$ from the left, $\boldsymbol{A}^\top$ appears on the left of the residual; if it multiplies from the right, $\boldsymbol{A}^\top$ appears on the right.
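
As an illustrative check of Eq. \eqref{eq:12-8-10}, the sketch below compares the analytic gradient with an entry-wise finite-difference gradient; the shapes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, P = 3, 4, 5
X = rng.normal(size=(M, N))
A = rng.normal(size=(N, P))
B = rng.normal(size=(M, P))

grad = 2 * (X @ A - B) @ A.T  # Eq. (12-8-10)

eps = 1e-6
fd = np.zeros_like(X)
for i in range(M):
    for j in range(N):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (np.sum(((X + E) @ A - B) ** 2)
                    - np.sum(((X - E) @ A - B) ** 2)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-5))  # True
```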

12.9 Regression Weight Gradient

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{w}} \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|_2^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$
Conditions: $\boldsymbol{X}$ is an $N \times D$ constant data matrix, $\boldsymbol{w}$ is a $D$-dimensional weight vector, $\boldsymbol{y}$ is an $N$-dimensional constant target vector
Proof

Define the residual vector.

\begin{equation}\boldsymbol{r} = \boldsymbol{X}\boldsymbol{w} - \boldsymbol{y} \label{eq:12-9-1}\end{equation}

$\boldsymbol{r} \in \mathbb{R}^N$. The loss function is the squared norm of the residual.

\begin{equation}L = \|\boldsymbol{r}\|_2^2 = \boldsymbol{r}^\top \boldsymbol{r} \label{eq:12-9-2}\end{equation}

Expand Eq. \eqref{eq:12-9-2} component-wise.

\begin{equation}L = \sum_{i=0}^{N-1} r_i^2 \label{eq:12-9-3}\end{equation}

From Eq. \eqref{eq:12-9-1}, each residual component is expressed as

\begin{equation}r_i = \sum_{j=0}^{D-1} X_{ij} w_j - y_i \label{eq:12-9-4}\end{equation}

Take the partial derivative of $r_i$ with respect to $w_k$. In Eq. \eqref{eq:12-9-4}, only the term with $j = k$ contains $w_k$.

\begin{equation}\frac{\partial r_i}{\partial w_k} = X_{ik} \label{eq:12-9-5}\end{equation}

Apply the chain rule to compute the derivative of $L$ with respect to $w_k$.

\begin{equation}\frac{\partial L}{\partial w_k} = \sum_{i=0}^{N-1} \frac{\partial L}{\partial r_i} \frac{\partial r_i}{\partial w_k} \label{eq:12-9-6}\end{equation}

From Eq. \eqref{eq:12-9-3}, $\frac{\partial L}{\partial r_i} = 2r_i$. Substitute this into Eq. \eqref{eq:12-9-6}.

\begin{equation}\frac{\partial L}{\partial w_k} = \sum_{i=0}^{N-1} 2r_i \cdot X_{ik} = 2\sum_{i=0}^{N-1} X_{ik} r_i \label{eq:12-9-7}\end{equation}

In Eq. \eqref{eq:12-9-7}, $\sum_i X_{ik} r_i$ is the $k$-th element of the matrix product $\boldsymbol{X}^\top \boldsymbol{r}$.

\begin{equation}\sum_{i=0}^{N-1} X_{ik} r_i = \sum_{i=0}^{N-1} (X^\top)_{ki} r_i = (\boldsymbol{X}^\top \boldsymbol{r})_k \label{eq:12-9-8}\end{equation}

Substitute Eq. \eqref{eq:12-9-8} into Eq. \eqref{eq:12-9-7}.

\begin{equation}\frac{\partial L}{\partial w_k} = 2(\boldsymbol{X}^\top \boldsymbol{r})_k \label{eq:12-9-9}\end{equation}

Equation \eqref{eq:12-9-9} holds for all $k = 0, \ldots, D-1$, so in vector form we have

\begin{equation}\frac{\partial L}{\partial \boldsymbol{w}} = 2\boldsymbol{X}^\top \boldsymbol{r} \label{eq:12-9-10}\end{equation}

Substitute $\boldsymbol{r} = \boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}$ from Eq. \eqref{eq:12-9-1} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{w}} \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|_2^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) \label{eq:12-9-11}\end{equation}

Note: Equation \eqref{eq:12-9-11} is one of the most fundamental gradients in machine learning, used in gradient descent for linear regression. Setting the gradient to zero yields the normal equation $\boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{w} = \boldsymbol{X}^\top \boldsymbol{y}$, and if $\boldsymbol{X}^\top \boldsymbol{X}$ is invertible, the optimal solution is $\boldsymbol{w}^* = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$. When $\boldsymbol{X}^\top \boldsymbol{X}$ is not invertible, the pseudo-inverse $\boldsymbol{X}^+ = (\boldsymbol{X}^\top \boldsymbol{X})^+ \boldsymbol{X}^\top$ gives the minimum-norm solution $\boldsymbol{w}^* = \boldsymbol{X}^+ \boldsymbol{y}$.
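
As an illustrative sketch of how Eq. \eqref{eq:12-9-11} is used in practice, the code below runs plain gradient descent with this gradient on synthetic data and confirms convergence to the normal-equation solution; the learning rate, iteration count, and data are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=N)

# gradient descent with grad = 2 X^T (X w - y), Eq. (12-9-11)
w = np.zeros(D)
lr = 1e-3
for _ in range(5000):
    w -= lr * 2 * X.T @ (X @ w - y)

# closed-form solution of the normal equation X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_star, atol=1e-6))  # True
```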

References

  • Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.
  • Magnus, J. R., & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). Wiley.
  • Matrix calculus - Wikipedia