Proofs Chapter 12: Derivatives of Norms (Basic Formulas)

This chapter proves the differentiation formulas for vector norms and the matrix Frobenius norm. Derivatives of norms are fundamental formulas directly applicable to optimization and statistical learning: the least squares gradient in linear regression, subgradient computation in L1/L2 regularization, nuclear norm minimization in matrix completion, and sparse recovery in signal processing. We cover practical formulas ranging from the derivative of the 2-norm and the normalized vector to Frobenius-norm and regression-residual gradients.

Prerequisites: Chapter 2 (Differentiation of a scalar by a vector), Chapter 5 (Differentiation of trace). Related chapters: Chapter 10 (Differentiation of quadratic forms), Chapter 15 (Differentiation of special matrices).

12. Derivatives of Norms

Prerequisites for this chapter
The formulas in this chapter apply under the following conditions unless otherwise noted:
  • All formulas are based on denominator layout
  • The norm $\|\boldsymbol{x}\|$ is not differentiable at $\boldsymbol{x} = \boldsymbol{0}$ (a subgradient is used where applicable)
  • The squared norm $\|\boldsymbol{x}\|^2$ is differentiable everywhere

We derive differentiation formulas for vector norms and matrix norms, particularly the Frobenius norm.

12.1 Derivative of Vector 2-Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|_2 = \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2}$
Conditions: $\boldsymbol{a}$ is a constant vector, $\boldsymbol{x} \neq \boldsymbol{a}$
Proof

Introduce an auxiliary variable $\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a}$.

\begin{equation}\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a} \label{eq:12-1-1}\end{equation}

Since $\boldsymbol{a}$ is a constant vector, the derivative of $\boldsymbol{u}$ with respect to $\boldsymbol{x}$ is the identity matrix.

\begin{equation}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:12-1-2}\end{equation}

From the definition of the vector 2-norm, $\|\boldsymbol{u}\|_2$ can be expressed as follows.

\begin{equation}\|\boldsymbol{u}\|_2 = \sqrt{\boldsymbol{u}^\top\boldsymbol{u}} = \sqrt{\sum_{i=0}^{n-1} u_i^2} \label{eq:12-1-3}\end{equation}

Let $f = \|\boldsymbol{u}\|_2 = \sqrt{\boldsymbol{u}^\top\boldsymbol{u}}$. Define the inner function $g = \boldsymbol{u}^\top\boldsymbol{u}$.

\begin{equation}g = \boldsymbol{u}^\top\boldsymbol{u} \label{eq:12-1-4}\end{equation}

Then $f = \sqrt{g}$. To apply the chain rule (Ref. 1.26), we first compute the derivative of $f$ with respect to $g$.

\begin{equation}\frac{\partial f}{\partial g} = \frac{\partial}{\partial g} \sqrt{g} = \frac{1}{2\sqrt{g}} = \frac{1}{2\|\boldsymbol{u}\|_2} \label{eq:12-1-5}\end{equation}

Next, we compute the derivative of $g = \boldsymbol{u}^\top\boldsymbol{u}$ with respect to $\boldsymbol{u}$. Expand $g$ component-wise.

\begin{equation}g = \sum_{i=0}^{n-1} u_i^2 \label{eq:12-1-6}\end{equation}

Take the partial derivative of $g$ with respect to $u_j$.

\begin{equation}\frac{\partial g}{\partial u_j} = \frac{\partial}{\partial u_j} \sum_{i=0}^{n-1} u_i^2 = 2u_j \label{eq:12-1-7}\end{equation}

Write Eq. \eqref{eq:12-1-7} in vector form.

\begin{equation}\frac{\partial g}{\partial \boldsymbol{u}} = 2\boldsymbol{u} \label{eq:12-1-8}\end{equation}

Apply the chain rule (Ref. 1.26). The derivative of $f$ with respect to $\boldsymbol{u}$ is obtained as follows.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{u}} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial \boldsymbol{u}} \label{eq:12-1-9}\end{equation}

Substitute Eqs. \eqref{eq:12-1-5} and \eqref{eq:12-1-8} into Eq. \eqref{eq:12-1-9}.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{u}} = \frac{1}{2\|\boldsymbol{u}\|_2} \cdot 2\boldsymbol{u} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \label{eq:12-1-10}\end{equation}

Further apply the chain rule (Ref. 1.26) to find the derivative of $f$ with respect to $\boldsymbol{x}$.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{x}} = \frac{\partial f}{\partial \boldsymbol{u}} \cdot \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} \label{eq:12-1-11}\end{equation}

Substitute Eqs. \eqref{eq:12-1-2} and \eqref{eq:12-1-10} into Eq. \eqref{eq:12-1-11}.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{x}} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \cdot \boldsymbol{I} = \frac{\boldsymbol{u}}{\|\boldsymbol{u}\|_2} \label{eq:12-1-12}\end{equation}

From Eq. \eqref{eq:12-1-1}, substitute $\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a}$ to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|_2 = \frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} \label{eq:12-1-13}\end{equation}

Note: The right-hand side of Eq. \eqref{eq:12-1-13} is the normalized form of the vector $\boldsymbol{x} - \boldsymbol{a}$. When $\boldsymbol{a} = \boldsymbol{0}$, we have $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2 = \displaystyle\frac{\boldsymbol{x}}{\|\boldsymbol{x}\|_2}$, which is equal to the unit vector $\hat{\boldsymbol{x}}$.
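
As an illustrative check of Eq. \eqref{eq:12-1-13}, the gradient can be compared against central finite differences. The NumPy sketch below uses arbitrary random vectors; the dimension, seed, and tolerance are chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)
a = rng.normal(size=5)

# analytic gradient from Eq. (12-1-13): (x - a) / ||x - a||_2
grad = (x - a) / np.linalg.norm(x - a)

# central finite differences as an independent check
eps = 1e-6
fd = np.array([
    (np.linalg.norm(x + eps * e - a) - np.linalg.norm(x - eps * e - a)) / (2 * eps)
    for e in np.eye(5)
])

print(np.allclose(grad, fd, atol=1e-6))  # True
```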

12.2 Derivative of Normalized Vector

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \displaystyle\frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} = \displaystyle\frac{\boldsymbol{I}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} - \displaystyle\frac{(\boldsymbol{x} - \boldsymbol{a})(\boldsymbol{x} - \boldsymbol{a})^\top}{\|\boldsymbol{x} - \boldsymbol{a}\|_2^3}$
Conditions: $\boldsymbol{a}$ is a constant vector, $\boldsymbol{x} \neq \boldsymbol{a}$
Proof

Introduce auxiliary variables.

\begin{equation}\boldsymbol{u} = \boldsymbol{x} - \boldsymbol{a} \label{eq:12-2-1}\end{equation}

\begin{equation}r = \|\boldsymbol{u}\|_2 \label{eq:12-2-2}\end{equation}

Define the normalized vector $\hat{\boldsymbol{u}}$.

\begin{equation}\hat{\boldsymbol{u}} = \frac{\boldsymbol{u}}{r} \label{eq:12-2-3}\end{equation}

We find the derivative of $\hat{\boldsymbol{u}}$ with respect to $\boldsymbol{x}$. This is the derivative of a vector divided by a scalar, and we apply the quotient rule (Ref. 1.28).

The derivative of $\boldsymbol{u}$ with respect to $\boldsymbol{x}$ is the identity matrix.

\begin{equation}\frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:12-2-4}\end{equation}

From Eq. \eqref{eq:12-1-10} in section 12.1, the derivative of $r = \|\boldsymbol{u}\|_2$ with respect to $\boldsymbol{u}$ is

\begin{equation}\frac{\partial r}{\partial \boldsymbol{u}} = \frac{\boldsymbol{u}}{r} = \hat{\boldsymbol{u}} \label{eq:12-2-5}\end{equation}

By the chain rule (Ref. 1.26), we compute the derivative of $r$ with respect to $\boldsymbol{x}$.

\begin{equation}\frac{\partial r}{\partial \boldsymbol{x}} = \frac{\partial r}{\partial \boldsymbol{u}} \cdot \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} = \hat{\boldsymbol{u}} \cdot \boldsymbol{I} = \frac{\boldsymbol{u}}{r} \label{eq:12-2-6}\end{equation}

Apply the quotient rule (Ref. 1.28) to the vector-scalar quotient $\hat{\boldsymbol{u}} = \boldsymbol{u} / r$:

\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{1}{r} \frac{\partial \boldsymbol{u}}{\partial \boldsymbol{x}} - \frac{\boldsymbol{u}}{r^2} \left(\frac{\partial r}{\partial \boldsymbol{x}}\right)^\top \label{eq:12-2-7}\end{equation}

Substitute Eqs. \eqref{eq:12-2-4} and \eqref{eq:12-2-6} into Eq. \eqref{eq:12-2-7}.

\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{1}{r} \boldsymbol{I} - \frac{\boldsymbol{u}}{r^2} \left(\frac{\boldsymbol{u}}{r}\right)^\top \label{eq:12-2-8}\end{equation}

Simplify Eq. \eqref{eq:12-2-8}.

\begin{equation}\frac{\partial \hat{\boldsymbol{u}}}{\partial \boldsymbol{x}} = \frac{\boldsymbol{I}}{r} - \frac{\boldsymbol{u} \boldsymbol{u}^\top}{r^3} \label{eq:12-2-9}\end{equation}

Substitute Eqs. \eqref{eq:12-2-1} and \eqref{eq:12-2-2} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \frac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} = \frac{\boldsymbol{I}}{\|\boldsymbol{x} - \boldsymbol{a}\|_2} - \frac{(\boldsymbol{x} - \boldsymbol{a})(\boldsymbol{x} - \boldsymbol{a})^\top}{\|\boldsymbol{x} - \boldsymbol{a}\|_2^3} \label{eq:12-2-10}\end{equation}

Note: Equation \eqref{eq:12-2-10} can be written as $\displaystyle\frac{1}{r}(\boldsymbol{I} - \hat{\boldsymbol{u}}\hat{\boldsymbol{u}}^\top)$, where $\boldsymbol{I} - \hat{\boldsymbol{u}}\hat{\boldsymbol{u}}^\top$ is the projection matrix onto the subspace orthogonal to $\hat{\boldsymbol{u}}$. This means that changes in the normalized vector occur only in the orthogonal direction to the original vector.
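
As an illustrative check of Eq. \eqref{eq:12-2-10} and the projection form $\frac{1}{r}(\boldsymbol{I} - \hat{\boldsymbol{u}}\hat{\boldsymbol{u}}^\top)$, the sketch below compares the analytic matrix with a finite-difference Jacobian; since the result is symmetric, the layout convention does not affect the comparison. Dimensions and random data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=4)
a = rng.normal(size=4)
u = x - a
r = np.linalg.norm(u)
u_hat = u / r

# analytic form: (I - u_hat u_hat^T) / r
J = (np.eye(4) - np.outer(u_hat, u_hat)) / r

# finite-difference Jacobian, one column per perturbed coordinate of x
eps = 1e-6
cols = []
for e in np.eye(4):
    f_plus = (x + eps * e - a) / np.linalg.norm(x + eps * e - a)
    f_minus = (x - eps * e - a) / np.linalg.norm(x - eps * e - a)
    cols.append((f_plus - f_minus) / (2 * eps))
fd = np.column_stack(cols)

print(np.allclose(J, fd, atol=1e-6))  # True
```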

12.3 Derivative of Squared 2-Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\boldsymbol{x}$
Conditions: $\boldsymbol{x} \in \mathbb{R}^n$
Proof

Recall the definition of the squared 2-norm.

\begin{equation}\|\boldsymbol{x}\|_2^2 = \boldsymbol{x}^\top\boldsymbol{x} \label{eq:12-3-1}\end{equation}

Expand Eq. \eqref{eq:12-3-1} component-wise.

\begin{equation}\|\boldsymbol{x}\|_2^2 = \sum_{i=0}^{n-1} x_i^2 \label{eq:12-3-2}\end{equation}

Take the partial derivative of $\|\boldsymbol{x}\|_2^2$ with respect to $x_j$. In the sum of Eq. \eqref{eq:12-3-2}, only the term with $i = j$ contains $x_j$.

\begin{equation}\frac{\partial}{\partial x_j} \|\boldsymbol{x}\|_2^2 = \frac{\partial}{\partial x_j} \sum_{i=0}^{n-1} x_i^2 = \frac{\partial}{\partial x_j} x_j^2 = 2x_j \label{eq:12-3-3}\end{equation}

Equation \eqref{eq:12-3-3} holds for all $j = 0, \ldots, n-1$. In denominator layout, the gradient is arranged as a column vector.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = \begin{pmatrix} 2x_0 \\ 2x_1 \\ \vdots \\ 2x_{n-1} \end{pmatrix} = 2\boldsymbol{x} \label{eq:12-3-4}\end{equation}

From Eq. \eqref{eq:12-3-4}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\boldsymbol{x} \label{eq:12-3-5}\end{equation}

Note: Equation \eqref{eq:12-3-5} can also be derived as a special case of 12.1. Applying the chain rule to $\|\boldsymbol{x}\|_2^2 = (\|\boldsymbol{x}\|_2)^2$ gives $\displaystyle\frac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x}\|_2^2 = 2\|\boldsymbol{x}\|_2 \cdot \displaystyle\frac{\boldsymbol{x}}{\|\boldsymbol{x}\|_2} = 2\boldsymbol{x}$.
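
As an illustrative check of Eq. \eqref{eq:12-3-5}, a finite-difference comparison is immediate; the vector below is arbitrary.

```python
import numpy as np

x = np.array([1.0, -2.0, 3.0])
eps = 1e-6
fd = np.array([
    (np.sum((x + eps * e) ** 2) - np.sum((x - eps * e) ** 2)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(fd, 2 * x))  # True, matching Eq. (12-3-5)
```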

12.4 Squared Frobenius Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X}$
Conditions: $\boldsymbol{X}$ is a real matrix, $\|\boldsymbol{X}\|_F^2 = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = \sum_{i,j} X_{ij}^2$
Proof

Recall the definition of the squared Frobenius norm.

\begin{equation}\|\boldsymbol{X}\|_F^2 = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) \label{eq:12-4-1}\end{equation}

Expand Eq. \eqref{eq:12-4-1} component-wise, with $\boldsymbol{X} \in \mathbb{R}^{m \times n}$.

\begin{equation}\|\boldsymbol{X}\|_F^2 = \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} X_{ij}^2 \label{eq:12-4-2}\end{equation}

Take the partial derivative of $\|\boldsymbol{X}\|_F^2$ with respect to component $X_{pq}$. In the double sum of Eq. \eqref{eq:12-4-2}, only the term with $(i, j) = (p, q)$ contains $X_{pq}$.

\begin{equation}\frac{\partial}{\partial X_{pq}} \|\boldsymbol{X}\|_F^2 = \frac{\partial}{\partial X_{pq}} \sum_{i,j} X_{ij}^2 = \frac{\partial}{\partial X_{pq}} X_{pq}^2 = 2X_{pq} \label{eq:12-4-3}\end{equation}

Equation \eqref{eq:12-4-3} holds for all $(p, q)$, so in matrix form we have

\begin{equation}\left(\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2\right)_{pq} = 2X_{pq} \label{eq:12-4-4}\end{equation}

From Eq. \eqref{eq:12-4-4}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F^2 = 2\boldsymbol{X} \label{eq:12-4-5}\end{equation}

Note: Equation \eqref{eq:12-4-5} can also be derived from Ref. 3.8 as $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top) = 2\boldsymbol{X}$. The squared Frobenius norm is $\|\boldsymbol{X}\|_F^2 = \text{tr}(\boldsymbol{X}^\top\boldsymbol{X}) = \text{tr}(\boldsymbol{X}\boldsymbol{X}^\top)$, and both are equal by the cyclic property of the trace.
Reference: F.G. Frobenius (1881) "Über die Darstellung der endlichen Gruppen durch lineare Substitutionen". The Frobenius norm is also called the Hilbert-Schmidt norm.
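
As an illustrative check of Eq. \eqref{eq:12-4-5}, the sketch below perturbs each entry of an arbitrary random matrix and compares the finite-difference result with $2\boldsymbol{X}$.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 4))

eps = 1e-6
fd = np.zeros_like(X)
for p in range(X.shape[0]):
    for q in range(X.shape[1]):
        E = np.zeros_like(X)
        E[p, q] = eps
        # ||X||_F^2 is the sum of squared entries, Eq. (12-4-2)
        fd[p, q] = (np.sum((X + E) ** 2) - np.sum((X - E) ** 2)) / (2 * eps)

print(np.allclose(fd, 2 * X, atol=1e-6))  # True
```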

12.5 Frobenius Norm

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F = \displaystyle\frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F}$
Conditions: $\boldsymbol{X} \neq \boldsymbol{O}$ (non-zero matrix)
Proof

The Frobenius norm is the square root of its square.

\begin{equation}\|\boldsymbol{X}\|_F = \sqrt{\|\boldsymbol{X}\|_F^2} \label{eq:12-5-1}\end{equation}

Define an inner function.

\begin{equation}g = \|\boldsymbol{X}\|_F^2 \label{eq:12-5-2}\end{equation}

Then $f = \|\boldsymbol{X}\|_F = \sqrt{g}$. To apply the chain rule (Ref. 1.26), we first compute the derivative of $f$ with respect to $g$.

\begin{equation}\frac{\partial f}{\partial g} = \frac{\partial}{\partial g} \sqrt{g} = \frac{1}{2\sqrt{g}} = \frac{1}{2\|\boldsymbol{X}\|_F} \label{eq:12-5-3}\end{equation}

From Eq. \eqref{eq:12-4-5} in section 12.4, the derivative of $g = \|\boldsymbol{X}\|_F^2$ is

\begin{equation}\frac{\partial g}{\partial \boldsymbol{X}} = 2\boldsymbol{X} \label{eq:12-5-4}\end{equation}

Apply the chain rule (Ref. 1.26).

\begin{equation}\frac{\partial f}{\partial \boldsymbol{X}} = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial \boldsymbol{X}} \label{eq:12-5-5}\end{equation}

Substitute Eqs. \eqref{eq:12-5-3} and \eqref{eq:12-5-4} into Eq. \eqref{eq:12-5-5}.

\begin{equation}\frac{\partial f}{\partial \boldsymbol{X}} = \frac{1}{2\|\boldsymbol{X}\|_F} \cdot 2\boldsymbol{X} = \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F} \label{eq:12-5-6}\end{equation}

From Eq. \eqref{eq:12-5-6}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\|_F = \frac{\boldsymbol{X}}{\|\boldsymbol{X}\|_F} \label{eq:12-5-7}\end{equation}

Note: Equation \eqref{eq:12-5-7} is the matrix version of the vector 2-norm derivative formula \eqref{eq:12-1-13} (with $\boldsymbol{a} = \boldsymbol{0}$). The gradient of the Frobenius norm is the normalized matrix.
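
As an illustrative check of Eq. \eqref{eq:12-5-7}, the sketch below compares $\boldsymbol{X}/\|\boldsymbol{X}\|_F$ with an entry-wise finite-difference gradient of the (unsquared) Frobenius norm; the matrix is arbitrary random data.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(3, 3))

grad = X / np.linalg.norm(X)  # np.linalg.norm defaults to the Frobenius norm for matrices

eps = 1e-6
fd = np.zeros_like(X)
for p in range(3):
    for q in range(3):
        E = np.zeros_like(X)
        E[p, q] = eps
        fd[p, q] = (np.linalg.norm(X + E) - np.linalg.norm(X - E)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-6))  # True
```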

12.6 Squared Frobenius Norm of a Difference

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A})$
Conditions: $\boldsymbol{A}$ is a constant matrix
Proof

Introduce an auxiliary variable.

\begin{equation}\boldsymbol{U} = \boldsymbol{X} - \boldsymbol{A} \label{eq:12-6-1}\end{equation}

Since $\boldsymbol{A}$ is a constant matrix, the derivative of $\boldsymbol{U}$ with respect to $\boldsymbol{X}$ is the identity transformation. That is, for each component

\begin{equation}\frac{\partial U_{pq}}{\partial X_{rs}} = \delta_{pr}\delta_{qs} \label{eq:12-6-2}\end{equation}

Expand $\|\boldsymbol{U}\|_F^2$ component-wise.

\begin{equation}\|\boldsymbol{U}\|_F^2 = \sum_{i,j} U_{ij}^2 = \sum_{i,j} (X_{ij} - A_{ij})^2 \label{eq:12-6-3}\end{equation}

Take the partial derivative of $\|\boldsymbol{U}\|_F^2$ with respect to component $X_{pq}$. In the double sum of Eq. \eqref{eq:12-6-3}, only the term with $(i, j) = (p, q)$ contains $X_{pq}$.

\begin{equation}\frac{\partial}{\partial X_{pq}} \|\boldsymbol{U}\|_F^2 = \frac{\partial}{\partial X_{pq}} (X_{pq} - A_{pq})^2 \label{eq:12-6-4}\end{equation}

Compute the right-hand side of Eq. \eqref{eq:12-6-4}. The derivative of $(X_{pq} - A_{pq})^2$ by the chain rule (Ref. 1.26) is

\begin{equation}\frac{\partial}{\partial X_{pq}} (X_{pq} - A_{pq})^2 = 2(X_{pq} - A_{pq}) \cdot 1 = 2(X_{pq} - A_{pq}) \label{eq:12-6-5}\end{equation}

Equation \eqref{eq:12-6-5} holds for all $(p, q)$, so in matrix form we have

\begin{equation}\left(\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2\right)_{pq} = 2(X_{pq} - A_{pq}) = 2U_{pq} \label{eq:12-6-6}\end{equation}

From Eq. \eqref{eq:12-6-6}, we obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X} - \boldsymbol{A}\|_F^2 = 2(\boldsymbol{X} - \boldsymbol{A}) \label{eq:12-6-7}\end{equation}

Note: Equation \eqref{eq:12-6-7} is important in the matrix version of least squares. The condition for minimizing $\|\boldsymbol{X} - \boldsymbol{A}\|_F^2$ is that the gradient equals zero, i.e., $\boldsymbol{X} = \boldsymbol{A}$. This is consistent with the fact that $\|\boldsymbol{X} - \boldsymbol{A}\|_F^2 \geq 0$ achieves its minimum value of 0 when $\boldsymbol{X} = \boldsymbol{A}$.
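
As an illustrative check of Eq. \eqref{eq:12-6-7}, the sketch below verifies the gradient entry-wise with finite differences; the matrices are arbitrary random data.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(2, 3))
A = rng.normal(size=(2, 3))

grad = 2 * (X - A)  # Eq. (12-6-7)

eps = 1e-6
fd = np.zeros_like(X)
for p in range(2):
    for q in range(3):
        E = np.zeros_like(X)
        E[p, q] = eps
        fd[p, q] = (np.sum((X + E - A) ** 2) - np.sum((X - E - A) ** 2)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-6))  # True
```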

12.7 Linear Regression Residual (Left Multiplication)

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}\|_F^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B})$
Conditions: $\boldsymbol{A}$ is an $M \times N$ constant matrix, $\boldsymbol{X}$ is an $N \times P$ matrix, $\boldsymbol{B}$ is an $M \times P$ constant matrix
Proof

Introduce an auxiliary variable.

\begin{equation}\boldsymbol{U} = \boldsymbol{A}\boldsymbol{X} - \boldsymbol{B} \label{eq:12-7-1}\end{equation}

$\boldsymbol{U} \in \mathbb{R}^{M \times P}$. The squared Frobenius norm is expressed using the trace.

\begin{equation}\|\boldsymbol{U}\|_F^2 = \text{tr}(\boldsymbol{U}^\top \boldsymbol{U}) \label{eq:12-7-2}\end{equation}

From Eq. \eqref{eq:12-4-5} in section 12.4, the derivative of $\|\boldsymbol{U}\|_F^2$ with respect to $\boldsymbol{U}$ is

\begin{equation}\frac{\partial}{\partial \boldsymbol{U}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{U} \label{eq:12-7-3}\end{equation}

Next, we compute the derivative of $\boldsymbol{U}$ with respect to $\boldsymbol{X}$. From Eq. \eqref{eq:12-7-1}, each component of $\boldsymbol{U}$ is

\begin{equation}U_{kl} = \sum_{n=0}^{N-1} A_{kn} X_{nl} - B_{kl} \label{eq:12-7-4}\end{equation}

Take the partial derivative of $U_{kl}$ with respect to $X_{ij}$. In Eq. \eqref{eq:12-7-4}, only the term with $n = i$ and $l = j$ contains $X_{ij}$.

\begin{equation}\frac{\partial U_{kl}}{\partial X_{ij}} = A_{ki} \delta_{lj} \label{eq:12-7-5}\end{equation}

Here $\delta_{lj}$ is the Kronecker delta, which is 1 when $l = j$ and 0 otherwise.

Apply the chain rule. The derivative of $f = \|\boldsymbol{U}\|_F^2$ with respect to $X_{ij}$ is

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} \frac{\partial f}{\partial U_{kl}} \frac{\partial U_{kl}}{\partial X_{ij}} \label{eq:12-7-6}\end{equation}

Substitute $\frac{\partial f}{\partial U_{kl}} = 2U_{kl}$ from Eq. \eqref{eq:12-7-3} into Eq. \eqref{eq:12-7-6}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} 2U_{kl} \cdot A_{ki} \delta_{lj} \label{eq:12-7-7}\end{equation}

In Eq. \eqref{eq:12-7-7}, due to $\delta_{lj}$, only the term with $l = j$ survives.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k} 2U_{kj} A_{ki} = 2\sum_{k} A_{ki} U_{kj} \label{eq:12-7-8}\end{equation}

In Eq. \eqref{eq:12-7-8}, $\sum_k A_{ki} U_{kj}$ is the $(i, j)$ element of the matrix product $\boldsymbol{A}^\top \boldsymbol{U}$.

\begin{equation}\sum_{k} A_{ki} U_{kj} = (\boldsymbol{A}^\top \boldsymbol{U})_{ij} \label{eq:12-7-9}\end{equation}

Substitute Eq. \eqref{eq:12-7-9} into Eq. \eqref{eq:12-7-8}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = 2(\boldsymbol{A}^\top \boldsymbol{U})_{ij} \label{eq:12-7-10}\end{equation}

Equation \eqref{eq:12-7-10} holds for all $(i, j)$, so in matrix form we have

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{A}^\top \boldsymbol{U} \label{eq:12-7-11}\end{equation}

Substitute $\boldsymbol{U} = \boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}$ from Eq. \eqref{eq:12-7-1} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}\|_F^2 = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{X} - \boldsymbol{B}) \label{eq:12-7-12}\end{equation}

Note: Equation \eqref{eq:12-7-12} is used to derive the least squares solution for linear regression $\boldsymbol{A}\boldsymbol{X} \approx \boldsymbol{B}$. Setting the gradient to zero yields the normal equations $\boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{X} = \boldsymbol{A}^\top \boldsymbol{B}$, and if $\boldsymbol{A}^\top \boldsymbol{A}$ is invertible, their solution is $\boldsymbol{X} = (\boldsymbol{A}^\top \boldsymbol{A})^{-1} \boldsymbol{A}^\top \boldsymbol{B}$.
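
As an illustrative check of Eq. \eqref{eq:12-7-12} and the normal equations in the note above, the sketch below compares the analytic gradient with a finite-difference estimate for one entry, and confirms that solving $\boldsymbol{A}^\top \boldsymbol{A} \boldsymbol{X} = \boldsymbol{A}^\top \boldsymbol{B}$ matches NumPy's least-squares routine; the shapes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
M, N, P = 6, 4, 3
A = rng.normal(size=(M, N))
B = rng.normal(size=(M, P))
X = rng.normal(size=(N, P))

grad = 2 * A.T @ (A @ X - B)  # Eq. (12-7-12)

# finite-difference check of one arbitrary entry, here (1, 2)
eps = 1e-6
E = np.zeros_like(X)
E[1, 2] = eps
fd = (np.sum((A @ (X + E) - B) ** 2) - np.sum((A @ (X - E) - B) ** 2)) / (2 * eps)
print(np.isclose(grad[1, 2], fd, atol=1e-5))  # True

# stationary point: the solution of the normal equations matches lstsq
X_star = np.linalg.solve(A.T @ A, A.T @ B)
print(np.allclose(X_star, np.linalg.lstsq(A, B, rcond=None)[0]))  # True
```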

12.8 Linear Regression Residual (Right Multiplication)

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F^2 = 2(\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B})\boldsymbol{A}^\top$
Conditions: $\boldsymbol{X}$ is an $M \times N$ matrix, $\boldsymbol{A}$ is an $N \times P$ constant matrix, $\boldsymbol{B}$ is an $M \times P$ constant matrix
Proof

Introduce an auxiliary variable.

\begin{equation}\boldsymbol{U} = \boldsymbol{X}\boldsymbol{A} - \boldsymbol{B} \label{eq:12-8-1}\end{equation}

$\boldsymbol{U} \in \mathbb{R}^{M \times P}$. From Eq. \eqref{eq:12-8-1}, each component of $\boldsymbol{U}$ is

\begin{equation}U_{kl} = \sum_{n=0}^{N-1} X_{kn} A_{nl} - B_{kl} \label{eq:12-8-2}\end{equation}

Take the partial derivative of $U_{kl}$ with respect to $X_{ij}$. In Eq. \eqref{eq:12-8-2}, only the term with $k = i$ and $n = j$ contains $X_{ij}$.

\begin{equation}\frac{\partial U_{kl}}{\partial X_{ij}} = \delta_{ki} A_{jl} \label{eq:12-8-3}\end{equation}

Apply the chain rule. The derivative of $f = \|\boldsymbol{U}\|_F^2$ with respect to $X_{ij}$ is

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} \frac{\partial f}{\partial U_{kl}} \frac{\partial U_{kl}}{\partial X_{ij}} \label{eq:12-8-4}\end{equation}

Substitute $\frac{\partial f}{\partial U_{kl}} = 2U_{kl}$ (from Eq. \eqref{eq:12-4-5}) and Eq. \eqref{eq:12-8-3} into Eq. \eqref{eq:12-8-4}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{k,l} 2U_{kl} \cdot \delta_{ki} A_{jl} \label{eq:12-8-5}\end{equation}

In Eq. \eqref{eq:12-8-5}, due to $\delta_{ki}$, only the term with $k = i$ survives.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = \sum_{l} 2U_{il} A_{jl} = 2\sum_{l} U_{il} A_{jl} \label{eq:12-8-6}\end{equation}

In Eq. \eqref{eq:12-8-6}, $\sum_l U_{il} A_{jl}$ is the $(i, j)$ element of the matrix product $\boldsymbol{U} \boldsymbol{A}^\top$.

\begin{equation}\sum_{l} U_{il} A_{jl} = \sum_{l} U_{il} (A^\top)_{lj} = (\boldsymbol{U} \boldsymbol{A}^\top)_{ij} \label{eq:12-8-7}\end{equation}

Substitute Eq. \eqref{eq:12-8-7} into Eq. \eqref{eq:12-8-6}.

\begin{equation}\frac{\partial f}{\partial X_{ij}} = 2(\boldsymbol{U} \boldsymbol{A}^\top)_{ij} \label{eq:12-8-8}\end{equation}

Equation \eqref{eq:12-8-8} holds for all $(i, j)$, so in matrix form we have

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{U}\|_F^2 = 2\boldsymbol{U} \boldsymbol{A}^\top \label{eq:12-8-9}\end{equation}

Substitute $\boldsymbol{U} = \boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}$ from Eq. \eqref{eq:12-8-1} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{X}} \|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F^2 = 2(\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B})\boldsymbol{A}^\top \label{eq:12-8-10}\end{equation}

Note: Equation \eqref{eq:12-8-10} is the right-multiplication counterpart of 12.7 and can be obtained from it by transposing, since $\|\boldsymbol{X}\boldsymbol{A} - \boldsymbol{B}\|_F = \|\boldsymbol{A}^\top\boldsymbol{X}^\top - \boldsymbol{B}^\top\|_F$. The position of $\boldsymbol{A}$ relative to $\boldsymbol{X}$ determines the side on which $\boldsymbol{A}^\top$ appears: if $\boldsymbol{A}$ multiplies $\boldsymbol{X}$ from the left, $\boldsymbol{A}^\top$ appears on the left of the residual; if it multiplies from the right, $\boldsymbol{A}^\top$ appears on the right.
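
As an illustrative check of Eq. \eqref{eq:12-8-10}, the sketch below compares the analytic gradient with an entry-wise finite-difference gradient; the shapes and data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
M, N, P = 3, 4, 5
X = rng.normal(size=(M, N))
A = rng.normal(size=(N, P))
B = rng.normal(size=(M, P))

grad = 2 * (X @ A - B) @ A.T  # Eq. (12-8-10)

eps = 1e-6
fd = np.zeros_like(X)
for i in range(M):
    for j in range(N):
        E = np.zeros_like(X)
        E[i, j] = eps
        fd[i, j] = (np.sum(((X + E) @ A - B) ** 2)
                    - np.sum(((X - E) @ A - B) ** 2)) / (2 * eps)

print(np.allclose(grad, fd, atol=1e-5))  # True
```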

12.9 Regression Weight Gradient

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{w}} \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|_2^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y})$
Conditions: $\boldsymbol{X}$ is an $N \times D$ constant data matrix, $\boldsymbol{w}$ is a $D$-dimensional weight vector, $\boldsymbol{y}$ is an $N$-dimensional constant target vector
Proof

Define the residual vector.

\begin{equation}\boldsymbol{r} = \boldsymbol{X}\boldsymbol{w} - \boldsymbol{y} \label{eq:12-9-1}\end{equation}

$\boldsymbol{r} \in \mathbb{R}^N$. The loss function is the squared norm of the residual.

\begin{equation}L = \|\boldsymbol{r}\|_2^2 = \boldsymbol{r}^\top \boldsymbol{r} \label{eq:12-9-2}\end{equation}

Expand Eq. \eqref{eq:12-9-2} component-wise.

\begin{equation}L = \sum_{i=0}^{N-1} r_i^2 \label{eq:12-9-3}\end{equation}

From Eq. \eqref{eq:12-9-1}, each residual component is expressed as

\begin{equation}r_i = \sum_{j=0}^{D-1} X_{ij} w_j - y_i \label{eq:12-9-4}\end{equation}

Take the partial derivative of $r_i$ with respect to $w_k$. In Eq. \eqref{eq:12-9-4}, only the term with $j = k$ contains $w_k$.

\begin{equation}\frac{\partial r_i}{\partial w_k} = X_{ik} \label{eq:12-9-5}\end{equation}

Apply the chain rule to compute the derivative of $L$ with respect to $w_k$.

\begin{equation}\frac{\partial L}{\partial w_k} = \sum_{i=0}^{N-1} \frac{\partial L}{\partial r_i} \frac{\partial r_i}{\partial w_k} \label{eq:12-9-6}\end{equation}

From Eq. \eqref{eq:12-9-3}, $\frac{\partial L}{\partial r_i} = 2r_i$. Substitute this into Eq. \eqref{eq:12-9-6}.

\begin{equation}\frac{\partial L}{\partial w_k} = \sum_{i=0}^{N-1} 2r_i \cdot X_{ik} = 2\sum_{i=0}^{N-1} X_{ik} r_i \label{eq:12-9-7}\end{equation}

In Eq. \eqref{eq:12-9-7}, $\sum_i X_{ik} r_i$ is the $k$-th element of the matrix product $\boldsymbol{X}^\top \boldsymbol{r}$.

\begin{equation}\sum_{i=0}^{N-1} X_{ik} r_i = \sum_{i=0}^{N-1} (X^\top)_{ki} r_i = (\boldsymbol{X}^\top \boldsymbol{r})_k \label{eq:12-9-8}\end{equation}

Substitute Eq. \eqref{eq:12-9-8} into Eq. \eqref{eq:12-9-7}.

\begin{equation}\frac{\partial L}{\partial w_k} = 2(\boldsymbol{X}^\top \boldsymbol{r})_k \label{eq:12-9-9}\end{equation}

Equation \eqref{eq:12-9-9} holds for all $k = 0, \ldots, D-1$, so in vector form we have

\begin{equation}\frac{\partial L}{\partial \boldsymbol{w}} = 2\boldsymbol{X}^\top \boldsymbol{r} \label{eq:12-9-10}\end{equation}

Substitute $\boldsymbol{r} = \boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}$ from Eq. \eqref{eq:12-9-1} to obtain the final result.

\begin{equation}\frac{\partial}{\partial \boldsymbol{w}} \|\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}\|_2^2 = 2\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{w} - \boldsymbol{y}) \label{eq:12-9-11}\end{equation}

Note: Equation \eqref{eq:12-9-11} is one of the most fundamental gradients in machine learning, used in gradient descent for linear regression. Setting the gradient to zero yields the normal equation $\boldsymbol{X}^\top \boldsymbol{X} \boldsymbol{w} = \boldsymbol{X}^\top \boldsymbol{y}$, and if $\boldsymbol{X}^\top \boldsymbol{X}$ is invertible, the optimal solution is $\boldsymbol{w}^* = (\boldsymbol{X}^\top \boldsymbol{X})^{-1} \boldsymbol{X}^\top \boldsymbol{y}$. When $\boldsymbol{X}^\top \boldsymbol{X}$ is not invertible, the pseudo-inverse $\boldsymbol{X}^+ = (\boldsymbol{X}^\top \boldsymbol{X})^+ \boldsymbol{X}^\top$ gives the minimum-norm solution $\boldsymbol{w}^* = \boldsymbol{X}^+ \boldsymbol{y}$.
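
As an illustrative sketch of how Eq. \eqref{eq:12-9-11} is used in practice, the code below runs plain gradient descent with this gradient on synthetic data and confirms convergence to the normal-equation solution; the learning rate, iteration count, and data are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(7)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=N)

# gradient descent with grad = 2 X^T (X w - y), Eq. (12-9-11)
w = np.zeros(D)
lr = 1e-3
for _ in range(5000):
    w -= lr * 2 * X.T @ (X @ w - y)

# closed-form solution of the normal equation X^T X w = X^T y
w_star = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(w, w_star, atol=1e-6))  # True
```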

References

  • Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.
  • Magnus, J. R., & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). Wiley.
  • Matrix calculus - Wikipedia