Proofs Chapter 2: Scalar by Vector Derivatives
In this chapter, we rigorously prove formulas for differentiating scalar-valued functions with respect to vectors. The gradient plays a central role throughout machine learning and statistics, including backpropagation in neural networks, optimization algorithms (gradient descent, Newton's method), and maximum likelihood estimation of statistical models. We begin with the derivative of a constant vector and proceed to linear functions, inner products, quadratic forms, norms, and derivatives of exponential and logarithmic functions. All proofs follow the denominator layout convention.
Prerequisites: Basic formulas from Chapter 1 (Scalar Derivatives of Single Variable). Chapters that use results from this chapter: Chapter 5 (Trace Derivatives), Chapter 10 (Quadratic Form Derivatives), Chapter 12 (Norm Derivatives).
Unless otherwise stated, the formulas in this chapter hold under the following conditions:
- All formulas are based on the denominator layout
- Differentiating a scalar $f$ with respect to a vector $\boldsymbol{x} \in \mathbb{R}^N$ yields a column vector $\dfrac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^N$
- Functions are defined and differentiable on an open set; singular points (e.g., $\boldsymbol{x} = \boldsymbol{a}$ for the norm) are noted individually
2.1 Derivative of a Constant
Proof
First, we recall the definition of the gradient vector. In the denominator layout, differentiating a scalar with respect to a vector $\boldsymbol{x}$ yields a column vector whose components are the partial derivatives with respect to each component of $\boldsymbol{x}$. Applying this definition to the scalar $a$:
\begin{equation} \dfrac{\partial a}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial a}{\partial x_0} \\[1em] \dfrac{\partial a}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial a}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-1-1} \end{equation}
Here, note that $a$ is a constant. A constant is a value that does not depend on any of the variables $x_0, x_1, \ldots, x_{N-1}$.
Differentiating a value that does not depend on a variable with respect to that variable yields 0. This is a fundamental property of differentiation. Therefore, for each component we have:
\begin{equation} \dfrac{\partial a}{\partial x_k} = 0 \quad (k = 0, 1, \ldots, N-1) \label{eq:2-1-2} \end{equation}
Substituting \eqref{eq:2-1-2} into \eqref{eq:2-1-1}, all components become 0.
\begin{equation} \dfrac{\partial a}{\partial \boldsymbol{x}} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \label{eq:2-1-3} \end{equation}
A vector whose components are all 0 is called the zero vector $\boldsymbol{0}$. Therefore, the final result is:
\begin{equation} \dfrac{\partial a}{\partial \boldsymbol{x}} = \boldsymbol{0} \label{eq:2-1-4} \end{equation}
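As an illustrative numerical check (not part of the proof), the result can be compared against central finite differences. The helper `numerical_gradient` and the sample values below are assumptions chosen for this sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = 3.5                             # example constant scalar
x = np.array([1.0, -2.0, 0.5])      # example evaluation point
grad = numerical_gradient(lambda v: a, x)
# f does not depend on x, so every finite difference is 0:
# grad is the zero vector, in agreement with the result above.
```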
2.2 Derivative of an Inner Product
Proof
First, we recall the definition of the inner product $\boldsymbol{a}^\top \boldsymbol{x}$. It is expressed as the product of the row vector $\boldsymbol{a}^\top$ and the column vector $\boldsymbol{x}$.
\begin{equation} \boldsymbol{a}^\top \boldsymbol{x} = \begin{pmatrix} a_0 & a_1 & \cdots & a_{N-1} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} \label{eq:2-2-1} \end{equation}
Computing the matrix product: the product of a row vector and a column vector is a scalar obtained by summing the products of corresponding components.
\begin{equation} \boldsymbol{a}^\top \boldsymbol{x} = a_0 x_0 + a_1 x_1 + \cdots + a_{N-1} x_{N-1} \label{eq:2-2-2} \end{equation}
Expressing this concisely using summation notation:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{x} = \sum_{n=0}^{N-1} a_n x_n \label{eq:2-2-3} \end{equation}
Next, we take the partial derivative of this expression with respect to $x_k$ (the $k$-th component). In the sum, only the term $a_k x_k$ contains $x_k$; the other terms ($a_n x_n$ for $n \neq k$) do not depend on $x_k$.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} a_n x_n = \dfrac{\partial}{\partial x_k} (a_0 x_0 + \cdots + a_k x_k + \cdots + a_{N-1} x_{N-1}) \label{eq:2-2-4} \end{equation}
Differentiating terms that do not depend on $x_k$ yields 0. Therefore, only the $a_k x_k$ term remains.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} a_n x_n = 0 + \cdots + \dfrac{\partial}{\partial x_k}(a_k x_k) + \cdots + 0 \label{eq:2-2-5} \end{equation}
Since $a_k$ is a constant, it can be factored out of the derivative. Since $\dfrac{\partial x_k}{\partial x_k} = 1$, we obtain:
\begin{equation} \dfrac{\partial}{\partial x_k}(a_k x_k) = a_k \cdot \dfrac{\partial x_k}{\partial x_k} = a_k \cdot 1 = a_k \label{eq:2-2-6} \end{equation}
Therefore, the partial derivative of the inner product with respect to $x_k$ is:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{a}^\top \boldsymbol{x}) = a_k \label{eq:2-2-7} \end{equation}
Assembling the result \eqref{eq:2-2-7} for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{a}^\top \boldsymbol{x}) = \begin{pmatrix} \dfrac{\partial}{\partial x_0} (\boldsymbol{a}^\top \boldsymbol{x}) \\[1em] \dfrac{\partial}{\partial x_1} (\boldsymbol{a}^\top \boldsymbol{x}) \\[1em] \vdots \\[0.5em] \dfrac{\partial}{\partial x_{N-1}} (\boldsymbol{a}^\top \boldsymbol{x}) \end{pmatrix} = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{N-1} \end{pmatrix} \label{eq:2-2-8} \end{equation}
The vector on the right-hand side is $\boldsymbol{a}$ itself. Therefore, the final result is:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{a}^\top \boldsymbol{x}) = \boldsymbol{a} \label{eq:2-2-9} \end{equation}
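As a quick numerical sanity check (the helper and the sample vectors are illustrative assumptions, not part of the proof), the gradient of the inner product should approximate $\boldsymbol{a}$:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([2.0, -1.0, 0.5])      # example coefficient vector
x = np.array([0.3, 1.2, -0.7])      # example evaluation point
grad = numerical_gradient(lambda v: a @ v, x)
# grad approximates a, matching d(a^T x)/dx = a
```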
2.3 Derivative of the Squared 2-Norm
Proof
First, we write $\boldsymbol{x}^\top \boldsymbol{x}$ in component form. It is the product of the row vector $\boldsymbol{x}^\top$ and the column vector $\boldsymbol{x}$.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = \begin{pmatrix} x_0 & x_1 & \cdots & x_{N-1} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} \label{eq:2-3-1} \end{equation}
Computing the matrix product: multiplying corresponding components and summing gives the sum of squares of each component.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = x_0 \cdot x_0 + x_1 \cdot x_1 + \cdots + x_{N-1} \cdot x_{N-1} \label{eq:2-3-2} \end{equation}
Simplifying, this can be written as:
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = x_0^2 + x_1^2 + \cdots + x_{N-1}^2 \label{eq:2-3-3} \end{equation}
Expressing this concisely using summation notation, it equals the squared 2-norm $\|\boldsymbol{x}\|_2^2$.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = \sum_{n=0}^{N-1} x_n^2 = \|\boldsymbol{x}\|_2^2 \label{eq:2-3-4} \end{equation}
Next, we take the partial derivative of this expression with respect to $x_k$ (the $k$-th component). In the sum, only the term $x_k^2$ contains $x_k$.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} x_n^2 = \dfrac{\partial}{\partial x_k} (x_0^2 + \cdots + x_k^2 + \cdots + x_{N-1}^2) \label{eq:2-3-5} \end{equation}
Differentiating terms that do not depend on $x_k$ ($x_n^2$ for $n \neq k$) yields 0. Therefore, only the $x_k^2$ term remains.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} x_n^2 = 0 + \cdots + \dfrac{\partial}{\partial x_k}(x_k^2) + \cdots + 0 \label{eq:2-3-6} \end{equation}
Applying the power rule (1.18): $\dfrac{d}{dx}(x^n) = n x^{n-1}$ with $n = 2$ gives:
\begin{equation} \dfrac{\partial}{\partial x_k}(x_k^2) = 2 x_k^{2-1} = 2x_k \label{eq:2-3-7} \end{equation}
Assembling the result \eqref{eq:2-3-7} for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = \begin{pmatrix} 2x_0 \\ 2x_1 \\ \vdots \\ 2x_{N-1} \end{pmatrix} \label{eq:2-3-8} \end{equation}
Factoring out the common factor 2:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = 2 \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} \label{eq:2-3-9} \end{equation}
The column vector on the right-hand side is $\boldsymbol{x}$ itself. Therefore, the final result is:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = 2\boldsymbol{x} \label{eq:2-3-10} \end{equation}
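The same kind of illustrative finite-difference check (sample point and helper are assumptions for the sketch) confirms the factor of 2:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([0.3, 1.2, -0.7])      # example evaluation point
grad = numerical_gradient(lambda v: v @ v, x)
expected = 2 * x                    # d(x^T x)/dx = 2x
```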
2.4 Derivative of a Bilinear Form
Proof
First, we compute $\boldsymbol{A}\boldsymbol{x}$ in components. Since $\boldsymbol{A}$ is an $M \times N$ matrix and $\boldsymbol{x}$ is an $N$-dimensional column vector, the result is an $M$-dimensional column vector.
\begin{equation} (\boldsymbol{A}\boldsymbol{x})_i = \sum_{j=0}^{N-1} A_{ij} x_j \quad (i = 0, 1, \ldots, M-1) \label{eq:2-4-1} \end{equation}
That is, the $i$-th component of $\boldsymbol{A}\boldsymbol{x}$ is the inner product of the $i$-th row of $\boldsymbol{A}$ with $\boldsymbol{x}$.
Next, we compute $\boldsymbol{b}^\top (\boldsymbol{A}\boldsymbol{x})$. This is the inner product of two $M$-dimensional vectors.
\begin{equation} \boldsymbol{b}^\top (\boldsymbol{A}\boldsymbol{x}) = \sum_{i=0}^{M-1} b_i \cdot (\boldsymbol{A}\boldsymbol{x})_i \label{eq:2-4-2} \end{equation}
Substituting \eqref{eq:2-4-1} into \eqref{eq:2-4-2}:
\begin{equation} \boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{M-1} b_i \cdot \left( \sum_{j=0}^{N-1} A_{ij} x_j \right) \label{eq:2-4-3} \end{equation}
Since $b_i$ does not depend on $j$, it can be moved inside the inner sum, combining the expression into a single double sum.
\begin{equation} \boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} x_j \label{eq:2-4-4} \end{equation}
Taking the partial derivative of this expression with respect to $x_k$ (the $k$-th component). Since differentiation is a linear operator, it commutes with the summation.
\begin{equation} \dfrac{\partial}{\partial x_k} \left( \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} x_j \right) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} \dfrac{\partial x_j}{\partial x_k} \label{eq:2-4-5} \end{equation}
Here, $\dfrac{\partial x_j}{\partial x_k}$ equals the Kronecker delta, which is 1 when $j = k$ and 0 otherwise.
\begin{equation} \dfrac{\partial x_j}{\partial x_k} = \delta_{jk} = \begin{cases} 1 & (j = k) \\ 0 & (j \neq k) \end{cases} \label{eq:2-4-6} \end{equation}
Substituting \eqref{eq:2-4-6} into \eqref{eq:2-4-5}. By $\delta_{jk}$, only the $j = k$ term survives.
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} \delta_{jk} \label{eq:2-4-7} \end{equation}
Summing over $j$, the Kronecker delta selects only the $j = k$ term.
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{M-1} b_i A_{ik} \label{eq:2-4-8} \end{equation}
We interpret this result as a matrix product. $\sum_{i=0}^{M-1} b_i A_{ik}$ is the inner product of the row vector $\boldsymbol{b}^\top$ with the $k$-th column of $\boldsymbol{A}$. We verify that this equals $(\boldsymbol{A}^\top \boldsymbol{b})_k$.
The $k$-th row of $\boldsymbol{A}^\top$ is the transpose of the $k$-th column of $\boldsymbol{A}$. Using the definition of the transpose $(\boldsymbol{A}^\top)_{ki} = A_{ik}$:
\begin{equation} (\boldsymbol{A}^\top \boldsymbol{b})_k = \sum_{i=0}^{M-1} (\boldsymbol{A}^\top)_{ki} b_i = \sum_{i=0}^{M-1} A_{ik} b_i \label{eq:2-4-9} \end{equation}
Rearranging the order of the product, we see it matches \eqref{eq:2-4-8}.
\begin{equation} (\boldsymbol{A}^\top \boldsymbol{b})_k = \sum_{i=0}^{M-1} b_i A_{ik} \label{eq:2-4-10} \end{equation}
Therefore, the partial derivative with respect to $x_k$ is:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = (\boldsymbol{A}^\top \boldsymbol{b})_k \label{eq:2-4-11} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = \boldsymbol{A}^\top \boldsymbol{b} \label{eq:2-4-12} \end{equation}
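As an illustrative check with a non-square matrix (the random seed, shapes, and helper are assumptions for this sketch), the numerical gradient of $\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}$ should approximate $\boldsymbol{A}^\top \boldsymbol{b}$:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))     # M x N with M = 2, N = 3
b = rng.standard_normal(2)          # M-dimensional
x = rng.standard_normal(3)          # N-dimensional
grad = numerical_gradient(lambda v: b @ (A @ v), x)
expected = A.T @ b                  # d(b^T A x)/dx = A^T b
```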
2.5 Derivative of a Quadratic Form
Proof
First, we write the quadratic form $\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}$ in component form. The $i$-th component of $\boldsymbol{A}\boldsymbol{x}$ is $\sum_{j=0}^{N-1} A_{ij} x_j$.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} = \boldsymbol{x}^\top (\boldsymbol{A} \boldsymbol{x}) \label{eq:2-5-1} \end{equation}
Expanding according to the definition of the inner product:
\begin{equation} \boldsymbol{x}^\top (\boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{N-1} x_i \cdot (\boldsymbol{A}\boldsymbol{x})_i \label{eq:2-5-2} \end{equation}
Substituting the definition of $(\boldsymbol{A}\boldsymbol{x})_i$:
\begin{equation} \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{N-1} x_i \cdot \left( \sum_{j=0}^{N-1} A_{ij} x_j \right) \label{eq:2-5-3} \end{equation}
Since $x_i$ does not depend on $j$, it can be moved inside the inner sum, combining the expression into a double sum:
\begin{equation} \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} x_i x_j \label{eq:2-5-4} \end{equation}
Taking the partial derivative of this expression with respect to $x_k$ (the $k$-th component). In each term $A_{ij} x_i x_j$ of the double sum, $x_k$ appears when $i = k$ or $j = k$.
Using the product rule (Leibniz rule, 1.25), and noting that $\dfrac{\partial x_i}{\partial x_k} = \delta_{ik}$ (Kronecker delta):
\begin{equation} \dfrac{\partial (x_i x_j)}{\partial x_k} = \dfrac{\partial x_i}{\partial x_k} \cdot x_j + x_i \cdot \dfrac{\partial x_j}{\partial x_k} = \delta_{ik} x_j + x_i \delta_{jk} \label{eq:2-5-5} \end{equation}
Applying this result to the double sum. Since $A_{ij}$ is a constant, it can be factored out of the derivative.
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} (\delta_{ik} x_j + x_i \delta_{jk}) \label{eq:2-5-6} \end{equation}
Splitting the sum into two parts:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} \delta_{ik} x_j + \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} x_i \delta_{jk} \label{eq:2-5-7} \end{equation}
Computing the first term. Since $\delta_{ik} = 1$ only when $i = k$, summing over $i$ selects only the $i = k$ term.
\begin{equation} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} \delta_{ik} x_j = \sum_{j=0}^{N-1} A_{kj} x_j \label{eq:2-5-8} \end{equation}
This is $(\boldsymbol{A}\boldsymbol{x})_k$, i.e., the $k$-th component of $\boldsymbol{A}\boldsymbol{x}$.
\begin{equation} \sum_{j=0}^{N-1} A_{kj} x_j = (\boldsymbol{A}\boldsymbol{x})_k \label{eq:2-5-9} \end{equation}
Computing the second term. Since $\delta_{jk} = 1$ only when $j = k$, summing over $j$ selects only the $j = k$ term:
\begin{equation} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} x_i \delta_{jk} = \sum_{i=0}^{N-1} A_{ik} x_i \label{eq:2-5-10} \end{equation}
We interpret the second term as a component of $\boldsymbol{A}^\top \boldsymbol{x}$. Using the definition of the transpose $(\boldsymbol{A}^\top)_{ki} = A_{ik}$:
\begin{equation} (\boldsymbol{A}^\top \boldsymbol{x})_k = \sum_{i=0}^{N-1} (\boldsymbol{A}^\top)_{ki} x_i = \sum_{i=0}^{N-1} A_{ik} x_i \label{eq:2-5-11} \end{equation}
Comparing \eqref{eq:2-5-10} and \eqref{eq:2-5-11}, we see they are equal.
\begin{equation} \sum_{i=0}^{N-1} A_{ik} x_i = (\boldsymbol{A}^\top \boldsymbol{x})_k \label{eq:2-5-12} \end{equation}
Substituting \eqref{eq:2-5-9} and \eqref{eq:2-5-12} into \eqref{eq:2-5-7}:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = (\boldsymbol{A}\boldsymbol{x})_k + (\boldsymbol{A}^\top \boldsymbol{x})_k \label{eq:2-5-13} \end{equation}
By the distributivity of matrix-vector multiplication over matrix addition, the two terms on the right-hand side combine:
\begin{equation} (\boldsymbol{A}\boldsymbol{x})_k + (\boldsymbol{A}^\top \boldsymbol{x})_k = ((\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x})_k \label{eq:2-5-14} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = (\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x} \label{eq:2-5-15} \end{equation}
Remark 2: When $\boldsymbol{A} = \boldsymbol{I}$ (the identity matrix), $\boldsymbol{I} + \boldsymbol{I}^\top = 2\boldsymbol{I}$, so $$\dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = 2\boldsymbol{I}\boldsymbol{x} = 2\boldsymbol{x}$$ which is consistent with formula (2.3).
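A deliberately non-symmetric $\boldsymbol{A}$ makes the $\boldsymbol{A} + \boldsymbol{A}^\top$ structure visible in a numerical check; the seed, helper, and sample point below are assumptions for illustration:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))     # deliberately non-symmetric
x = rng.standard_normal(3)
grad = numerical_gradient(lambda v: v @ A @ v, x)
expected = (A + A.T) @ x            # d(x^T A x)/dx = (A + A^T) x
```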
2.6 Derivative of the 2-Norm
Proof
First, we write the 2-norm $\|\boldsymbol{x} - \boldsymbol{a}\|$ according to its definition.
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\| = \sqrt{(\boldsymbol{x} - \boldsymbol{a})^\top (\boldsymbol{x} - \boldsymbol{a})} \label{eq:2-6-1} \end{equation}
Expanding in component form:
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\| = \sqrt{\sum_{i=0}^{N-1} (x_i - a_i)^2} \label{eq:2-6-2} \end{equation}
For notational convenience, let $u = \|\boldsymbol{x} - \boldsymbol{a}\|^2$.
\begin{equation} u = \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \sum_{i=0}^{N-1} (x_i - a_i)^2 \label{eq:2-6-3} \end{equation}
Then $\|\boldsymbol{x} - \boldsymbol{a}\| = \sqrt{u} = u^{1/2}$.
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\| = u^{1/2} \label{eq:2-6-4} \end{equation}
Taking the partial derivative of $u^{1/2}$ with respect to $x_k$, using the chain rule (1.26):
\begin{equation} \dfrac{\partial}{\partial x_k} (u^{1/2}) = \dfrac{d}{du}(u^{1/2}) \cdot \dfrac{\partial u}{\partial x_k} \label{eq:2-6-5} \end{equation}
First, we compute $\dfrac{d}{du}(u^{1/2})$. Applying the power rule (1.19) $\dfrac{d}{du}(u^n) = n u^{n-1}$:
\begin{equation} \dfrac{d}{du}(u^{1/2}) = \dfrac{1}{2} u^{1/2 - 1} = \dfrac{1}{2} u^{-1/2} \label{eq:2-6-6} \end{equation}
Expressing this in terms of the original variables:
\begin{equation} \dfrac{1}{2} u^{-1/2} = \dfrac{1}{2\sqrt{u}} = \dfrac{1}{2\|\boldsymbol{x} - \boldsymbol{a}\|} \label{eq:2-6-7} \end{equation}
Next, we compute $\dfrac{\partial u}{\partial x_k}$. From the definition of $u$ in \eqref{eq:2-6-3}:
\begin{equation} \dfrac{\partial u}{\partial x_k} = \dfrac{\partial}{\partial x_k} \sum_{i=0}^{N-1} (x_i - a_i)^2 \label{eq:2-6-8} \end{equation}
In the sum, only the term $(x_k - a_k)^2$ contains $x_k$. The other terms do not depend on $x_k$, so differentiating them yields 0.
\begin{equation} \dfrac{\partial u}{\partial x_k} = \dfrac{\partial}{\partial x_k} (x_k - a_k)^2 \label{eq:2-6-9} \end{equation}
Differentiating $(x_k - a_k)^2$ with respect to $x_k$, applying the chain rule (1.26):
\begin{equation} \dfrac{\partial}{\partial x_k} (x_k - a_k)^2 = 2(x_k - a_k) \cdot \dfrac{\partial}{\partial x_k}(x_k - a_k) \label{eq:2-6-10} \end{equation}
Since $a_k$ is a constant, $\dfrac{\partial}{\partial x_k}(x_k - a_k) = 1$.
\begin{equation} \dfrac{\partial}{\partial x_k} (x_k - a_k)^2 = 2(x_k - a_k) \cdot 1 = 2(x_k - a_k) \label{eq:2-6-11} \end{equation}
Substituting \eqref{eq:2-6-7} and \eqref{eq:2-6-11} into \eqref{eq:2-6-5}:
\begin{equation} \dfrac{\partial}{\partial x_k} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{1}{2\|\boldsymbol{x} - \boldsymbol{a}\|} \cdot 2(x_k - a_k) \label{eq:2-6-12} \end{equation}
The 2 in the numerator and the 2 in the denominator cancel.
\begin{equation} \dfrac{\partial}{\partial x_k} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{x_k - a_k}{\|\boldsymbol{x} - \boldsymbol{a}\|} \label{eq:2-6-13} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\| = \begin{pmatrix} \dfrac{x_0 - a_0}{\|\boldsymbol{x} - \boldsymbol{a}\|} \\[1em] \dfrac{x_1 - a_1}{\|\boldsymbol{x} - \boldsymbol{a}\|} \\[1em] \vdots \\[0.5em] \dfrac{x_{N-1} - a_{N-1}}{\|\boldsymbol{x} - \boldsymbol{a}\|} \end{pmatrix} \label{eq:2-6-14} \end{equation}
Since the denominator $\|\boldsymbol{x} - \boldsymbol{a}\|$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{1}{\|\boldsymbol{x} - \boldsymbol{a}\|} \begin{pmatrix} x_0 - a_0 \\ x_1 - a_1 \\ \vdots \\ x_{N-1} - a_{N-1} \end{pmatrix} \label{eq:2-6-15} \end{equation}
The column vector on the right-hand side is $\boldsymbol{x} - \boldsymbol{a}$ itself. Therefore, the final result is:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|} \label{eq:2-6-16} \end{equation}
Remark 2: When $\boldsymbol{a} = \boldsymbol{0}$, this becomes $\dfrac{\boldsymbol{x}}{\|\boldsymbol{x}\|}$.
Non-differentiability at $\boldsymbol{x} = \boldsymbol{a}$: At $\boldsymbol{x} = \boldsymbol{a}$, the denominator becomes 0 and the gradient is undefined. Geometrically, the norm function has a "kink" at this point, so no unique tangent direction exists. In the context of optimization, instead of the gradient at this point, the subdifferential (set of subgradients) $\{\boldsymbol{g} : \|\boldsymbol{g}\| \leq 1\}$ is used.
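Away from the singular point $\boldsymbol{x} = \boldsymbol{a}$, the formula can be checked numerically; the helper and sample vectors are illustrative assumptions:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([1.0, 0.0, -1.0])
x = np.array([2.0, 1.5, 0.5])       # chosen away from a, where the norm is differentiable
grad = numerical_gradient(lambda v: np.linalg.norm(v - a), x)
expected = (x - a) / np.linalg.norm(x - a)
# expected is a unit vector: the gradient of the norm has length 1 off the singular point
```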
2.7 Derivative of the Squared 2-Norm
Proof
First, we express $\|\boldsymbol{x} - \boldsymbol{a}\|^2$ in inner product form.
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = (\boldsymbol{x} - \boldsymbol{a})^\top (\boldsymbol{x} - \boldsymbol{a}) \label{eq:2-7-1} \end{equation}
For notational convenience, define the vector-valued function $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{x} - \boldsymbol{a}$.
\begin{equation} \boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{x} - \boldsymbol{a} \label{eq:2-7-2} \end{equation}
Then \eqref{eq:2-7-1} can be written as:
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{f}(\boldsymbol{x}) \label{eq:2-7-3} \end{equation}
Since $\boldsymbol{a}$ is a constant, the Jacobian matrix of $\boldsymbol{f}(\boldsymbol{x})$ is the identity matrix.
\begin{equation} \dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:2-7-4} \end{equation}
From formula (2.3), $\dfrac{\partial}{\partial \boldsymbol{f}} (\boldsymbol{f}^\top \boldsymbol{f}) = 2\boldsymbol{f}$.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{f}} (\boldsymbol{f}^\top \boldsymbol{f}) = 2\boldsymbol{f} \label{eq:2-7-5} \end{equation}
Applying the chain rule (1.26):
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \cdot \dfrac{\partial}{\partial \boldsymbol{f}} (\boldsymbol{f}^\top \boldsymbol{f}) \label{eq:2-7-6} \end{equation}
Substituting \eqref{eq:2-7-4} and \eqref{eq:2-7-5} into \eqref{eq:2-7-6}:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \boldsymbol{I} \cdot 2\boldsymbol{f}(\boldsymbol{x}) \label{eq:2-7-7} \end{equation}
The product of the identity matrix and a vector is the vector itself.
\begin{equation} \boldsymbol{I} \cdot 2\boldsymbol{f}(\boldsymbol{x}) = 2\boldsymbol{f}(\boldsymbol{x}) \label{eq:2-7-8} \end{equation}
Substituting the definition of $\boldsymbol{f}(\boldsymbol{x})$ from \eqref{eq:2-7-2} gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = 2(\boldsymbol{x} - \boldsymbol{a}) \label{eq:2-7-9} \end{equation}
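Unlike the norm itself, the squared norm is differentiable everywhere, which a finite-difference sketch (helper and sample vectors assumed for illustration) also reflects:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([1.0, 0.0, -1.0])
x = np.array([2.0, 1.5, 0.5])
grad = numerical_gradient(lambda v: np.sum((v - a) ** 2), x)
expected = 2 * (x - a)              # d||x - a||^2 / dx = 2(x - a)
```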
2.8 Derivative of the Inner Product (Dot Product)
Proof
We write the inner product $\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})$ in component form. Let $f_m(\boldsymbol{x})$ denote the $m$-th component of $\boldsymbol{f}(\boldsymbol{x})$, and $g_m(\boldsymbol{x})$ the $m$-th component of $\boldsymbol{g}(\boldsymbol{x})$.
\begin{equation} \boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x}) = \sum_{m=0}^{M-1} f_m(\boldsymbol{x}) g_m(\boldsymbol{x}) \label{eq:2-8-1} \end{equation}
Taking the partial derivative of this expression with respect to $x_n$ (the $n$-th component of $\boldsymbol{x}$):
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \dfrac{\partial}{\partial x_n} \sum_{m=0}^{M-1} f_m(\boldsymbol{x}) g_m(\boldsymbol{x}) \label{eq:2-8-2} \end{equation}
Since differentiation is a linear operator, it commutes with the summation.
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \sum_{m=0}^{M-1} \dfrac{\partial}{\partial x_n} (f_m(\boldsymbol{x}) g_m(\boldsymbol{x})) \label{eq:2-8-3} \end{equation}
Applying the product rule (Leibniz rule, 1.25) to each term $f_m(\boldsymbol{x}) g_m(\boldsymbol{x})$:
\begin{equation} \dfrac{\partial}{\partial x_n} (f_m(\boldsymbol{x}) g_m(\boldsymbol{x})) = \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) + f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} \label{eq:2-8-4} \end{equation}
Substituting \eqref{eq:2-8-4} into \eqref{eq:2-8-3}:
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \sum_{m=0}^{M-1} \left( \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) + f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} \right) \label{eq:2-8-5} \end{equation}
Splitting the sum into two parts:
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \sum_{m=0}^{M-1} \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) + \sum_{m=0}^{M-1} f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} \label{eq:2-8-6} \end{equation}
Interpreting the first term as a matrix product. $\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}}$ is an $N \times M$ matrix whose $(n, m)$ entry is $\dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n}$.
\begin{equation} \left(\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}}\right)_{nm} = \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} \label{eq:2-8-7} \end{equation}
The $n$-th component of the product of this matrix with $\boldsymbol{g}(\boldsymbol{x})$ equals the first sum.
\begin{equation} \left(\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x})\right)_n = \sum_{m=0}^{M-1} \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) \label{eq:2-8-8} \end{equation}
Similarly, the second term can be interpreted as:
\begin{equation} \left(\dfrac{\partial \boldsymbol{g}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x})\right)_n = \sum_{m=0}^{M-1} \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} f_m(\boldsymbol{x}) \label{eq:2-8-9} \end{equation}
Since scalar multiplication is commutative, $f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} = \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} f_m(\boldsymbol{x})$.
Replacing the first and second terms of \eqref{eq:2-8-6} with \eqref{eq:2-8-8} and \eqref{eq:2-8-9}:
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \left(\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x})\right)_n + \left(\dfrac{\partial \boldsymbol{g}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x})\right)_n \label{eq:2-8-10} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x}) + \dfrac{\partial \boldsymbol{g}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x}) \label{eq:2-8-11} \end{equation}
- $\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \in \mathbb{R}^{N \times M}$ ($N$-by-$M$ matrix)
- $\boldsymbol{g}(\boldsymbol{x}) \in \mathbb{R}^M$ ($M$-dimensional column vector)
- Product $\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x}) \in \mathbb{R}^N$ ($N$-dimensional column vector)
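The shapes above can be exercised numerically with the linear choices $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x}$ and $\boldsymbol{g}(\boldsymbol{x}) = \boldsymbol{B}\boldsymbol{x}$, whose denominator-layout Jacobians are $\boldsymbol{A}^\top$ and $\boldsymbol{B}^\top$. The seed, shapes, and helper are assumptions for this sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(2)
M, N = 4, 3
A = rng.standard_normal((M, N))
B = rng.standard_normal((M, N))
x = rng.standard_normal(N)
# f(x) = A x, g(x) = B x; denominator-layout Jacobians are A.T and B.T (N x M)
grad = numerical_gradient(lambda v: (A @ v) @ (B @ v), x)
expected = A.T @ (B @ x) + B.T @ (A @ x)
```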
2.9 Product Rule for Scalar Functions
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of the product $f(\boldsymbol{x}) g(\boldsymbol{x})$ with respect to $x_n$.
Applying the standard product rule (Leibniz rule, 1.25):
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial x_n} = \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} \cdot g(\boldsymbol{x}) + f(\boldsymbol{x}) \cdot \dfrac{\partial g(\boldsymbol{x})}{\partial x_n} \label{eq:2-9-1} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} g(\boldsymbol{x}) + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} g(\boldsymbol{x}) + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} g(\boldsymbol{x}) + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-9-2} \end{equation}
Rewriting this as a sum of two vectors:
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} g(\boldsymbol{x}) \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} g(\boldsymbol{x}) \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} g(\boldsymbol{x}) \end{pmatrix} + \begin{pmatrix} \displaystyle f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-9-3} \end{equation}
Since $f(\boldsymbol{x})$ and $g(\boldsymbol{x})$ are scalars, they can be factored out as common factors from each component of the vectors.
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = g(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} + f(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-9-4} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-9-5} \end{equation}
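The product rule can be illustrated with the two scalar functions already derived in this chapter, $f(\boldsymbol{x}) = \boldsymbol{a}^\top \boldsymbol{x}$ and $g(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{x}$; the helper and sample values are assumptions for the sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([1.0, -0.5, 2.0])
x = np.array([0.4, 1.1, -0.3])
f = lambda v: a @ v                 # gradient: a
g = lambda v: v @ v                 # gradient: 2v
grad = numerical_gradient(lambda v: f(v) * g(v), x)
expected = g(x) * a + f(x) * 2 * x  # g * df/dx + f * dg/dx
```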
2.10 Sum and Difference Rule
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $f + g$ with respect to $x_n$.
By the linearity of differentiation (1.24), the derivative of a sum equals the sum of derivatives.
\begin{equation} \dfrac{\partial (f + g)}{\partial x_n} = \dfrac{\partial f}{\partial x_n} + \dfrac{\partial g}{\partial x_n} \label{eq:2-10-1} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial (f + g)}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f}{\partial x_0} + \dfrac{\partial g}{\partial x_0} \\[1em] \dfrac{\partial f}{\partial x_1} + \dfrac{\partial g}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f}{\partial x_{N-1}} + \dfrac{\partial g}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-10-2} \end{equation}
Rewriting this as a sum of two vectors:
\begin{equation} \dfrac{\partial (f + g)}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f}{\partial x_0} \\[1em] \dfrac{\partial f}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f}{\partial x_{N-1}} \end{pmatrix} + \begin{pmatrix} \dfrac{\partial g}{\partial x_0} \\[1em] \dfrac{\partial g}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial g}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-10-3} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial (f + g)}{\partial \boldsymbol{x}} = \dfrac{\partial f}{\partial \boldsymbol{x}} + \dfrac{\partial g}{\partial \boldsymbol{x}} \label{eq:2-10-4} \end{equation}
The same argument applies to the difference $f - g$. By the linearity of differentiation, the derivative of a difference equals the difference of derivatives.
\begin{equation} \dfrac{\partial (f - g)}{\partial \boldsymbol{x}} = \dfrac{\partial f}{\partial \boldsymbol{x}} - \dfrac{\partial g}{\partial \boldsymbol{x}} \label{eq:2-10-5} \end{equation}
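The same kind of finite-difference check, again with illustrative choices of $f$ and $g$, confirms both \eqref{eq:2-10-4} and \eqref{eq:2-10-5}.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choices: f(x) = x.x (gradient 2x), g(x) = sum(x) (gradient of ones).
f = lambda v: v @ v
g = lambda v: v.sum()
x = np.array([0.5, -1.0, 2.0])
grad_f, grad_g = 2 * x, np.ones_like(x)

sum_ok = np.allclose(num_grad(lambda v: f(v) + g(v), x), grad_f + grad_g)
diff_ok = np.allclose(num_grad(lambda v: f(v) - g(v), x), grad_f - grad_g)
print(sum_ok, diff_ok)  # True True
```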
2.11 Scalar Multiple Rule
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $c \cdot f$ with respect to $x_n$.
Since $c$ is a constant that does not depend on $x_n$, it can be factored out of the derivative. This is part of the linearity of differentiation (1.24).
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial x_n} = c \cdot \dfrac{\partial f}{\partial x_n} \label{eq:2-11-1} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial \boldsymbol{x}} = \begin{pmatrix} \displaystyle c \cdot \dfrac{\partial f}{\partial x_0} \\[1em] \displaystyle c \cdot \dfrac{\partial f}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle c \cdot \dfrac{\partial f}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-11-2} \end{equation}
Since $c$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial \boldsymbol{x}} = c \cdot \begin{pmatrix} \dfrac{\partial f}{\partial x_0} \\[1em] \dfrac{\partial f}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-11-3} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial \boldsymbol{x}} = c \dfrac{\partial f}{\partial \boldsymbol{x}} \label{eq:2-11-4} \end{equation}
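A quick numerical check of \eqref{eq:2-11-4}; the function $f$ and the constant $c$ below are illustrative assumptions.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x with gradient 2x, scaled by c = 3.
c = 3.0
f = lambda v: v @ v
x = np.array([0.5, -1.0, 2.0])

analytic = c * 2 * x                      # c * grad f
numeric = num_grad(lambda v: c * f(v), x)
print(np.allclose(analytic, numeric))  # True
```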
2.12 Quotient Rule
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $f(\boldsymbol{x})/g(\boldsymbol{x})$ with respect to $x_n$.
Applying the standard quotient rule (1.28):
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{g(\boldsymbol{x}) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} - f(\boldsymbol{x}) \cdot \dfrac{\partial g(\boldsymbol{x})}{\partial x_n}}{g(\boldsymbol{x})^2} \label{eq:2-12-1} \end{equation}
Factoring out $\dfrac{1}{g(\boldsymbol{x})^2}$:
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_n} \right) \label{eq:2-12-2} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \begin{pmatrix} \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \right) \\[1em] \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \right) \\[1em] \vdots \\[0.5em] \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \right) \end{pmatrix} \label{eq:2-12-3} \end{equation}
Since $f(\boldsymbol{x})$, $g(\boldsymbol{x})$, and $1/g(\boldsymbol{x})^2$ are scalars, they can be factored out of the vector.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} - f(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \vdots \\[0.5em] \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \right) \label{eq:2-12-4} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial \boldsymbol{x}} \right) \label{eq:2-12-5} \end{equation}
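The quotient rule \eqref{eq:2-12-5} can likewise be verified numerically, provided $g(\boldsymbol{x}) \neq 0$ at the test point; $f$ and $g$ below are illustrative choices.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choices: f(x) = x.x, g(x) = sum(x); g must be nonzero at x.
f = lambda v: v @ v
g = lambda v: v.sum()
x = np.array([0.5, -1.0, 2.0])            # g(x) = 1.5 != 0

grad_f, grad_g = 2 * x, np.ones_like(x)
analytic = (g(x) * grad_f - f(x) * grad_g) / g(x) ** 2
numeric = num_grad(lambda v: f(v) / g(v), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```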
2.13 Derivative of the Reciprocal
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $1/f(\boldsymbol{x})$ with respect to $x_n$.
Since $1/f(\boldsymbol{x}) = f(\boldsymbol{x})^{-1}$, we rewrite the reciprocal as a power so that the chain rule (1.26) can be applied.
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = \dfrac{\partial}{\partial x_n} (f(\boldsymbol{x})^{-1}) \label{eq:2-13-1} \end{equation}
By the chain rule (1.26), this becomes "derivative of the outer function" times "derivative of the inner function." Letting $t = f(\boldsymbol{x})$:
\begin{equation} \dfrac{\partial}{\partial x_n} (f(\boldsymbol{x})^{-1}) = \dfrac{d}{dt}(t^{-1}) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} \label{eq:2-13-2} \end{equation}
First, we compute $\dfrac{d}{dt}(t^{-1})$. Applying the power rule (1.19) $\dfrac{d}{dt}(t^n) = n t^{n-1}$ with $n = -1$:
\begin{equation} \dfrac{d}{dt}(t^{-1}) = (-1) \cdot t^{-1-1} = -t^{-2} \label{eq:2-13-3} \end{equation}
Rewriting this in fractional form:
\begin{equation} -t^{-2} = -\dfrac{1}{t^2} = -\dfrac{1}{f(\boldsymbol{x})^2} \label{eq:2-13-4} \end{equation}
Substituting \eqref{eq:2-13-4} into \eqref{eq:2-13-2}:
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = -\dfrac{1}{f(\boldsymbol{x})^2} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} \label{eq:2-13-5} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = \begin{pmatrix} \displaystyle -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-13-6} \end{equation}
Since $-\dfrac{1}{f(\boldsymbol{x})^2}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = -\dfrac{1}{f(\boldsymbol{x})^2} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-13-7} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-13-8} \end{equation}
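A numerical sketch of \eqref{eq:2-13-8}, with an illustrative $f$ chosen to be strictly positive so the reciprocal is defined everywhere.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x + 1, strictly positive, with gradient 2x.
f = lambda v: v @ v + 1.0
x = np.array([0.5, -1.0, 2.0])

analytic = -(2 * x) / f(x) ** 2           # -grad f / f^2
numeric = num_grad(lambda v: 1.0 / f(v), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```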
2.14 Power Rule
Proof
Consider the $k$-th component of the gradient vector. We take the partial derivative of $f(\boldsymbol{x})^n$ with respect to $x_k$.
Applying the chain rule (1.26): this becomes "derivative of the outer function" times "derivative of the inner function." Letting $t = f(\boldsymbol{x})$:
\begin{equation} \dfrac{\partial}{\partial x_k} (f(\boldsymbol{x})^n) = \dfrac{d}{dt}(t^n) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-14-1} \end{equation}
First, we compute $\dfrac{d}{dt}(t^n)$. Applying the power rule (1.19):
\begin{equation} \dfrac{d}{dt}(t^n) = n \cdot t^{n-1} \label{eq:2-14-2} \end{equation}
Substituting $t = f(\boldsymbol{x})$ back, i.e., evaluating the derivative at $t = f(\boldsymbol{x})$:
\begin{equation} \left. \dfrac{d}{dt}(t^n) \right|_{t = f(\boldsymbol{x})} = n \cdot f(\boldsymbol{x})^{n-1} \label{eq:2-14-3} \end{equation}
Substituting \eqref{eq:2-14-3} into \eqref{eq:2-14-1}:
\begin{equation} \dfrac{\partial}{\partial x_k} (f(\boldsymbol{x})^n) = n f(\boldsymbol{x})^{n-1} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-14-4} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (f(\boldsymbol{x})^n) = \begin{pmatrix} \displaystyle n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-14-5} \end{equation}
Since $n f(\boldsymbol{x})^{n-1}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (f(\boldsymbol{x})^n) = n f(\boldsymbol{x})^{n-1} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-14-6} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (f(\boldsymbol{x})^n) = n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-14-7} \end{equation}
- When $n = 1/2$: we obtain the derivative of the square root $\sqrt{f(\boldsymbol{x})} = f(\boldsymbol{x})^{1/2}$, valid where $f(\boldsymbol{x}) > 0$. $\dfrac{\partial}{\partial \boldsymbol{x}} \sqrt{f(\boldsymbol{x})} = \dfrac{1}{2\sqrt{f(\boldsymbol{x})}} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}$
- When $n = -1$: we recover the derivative of the reciprocal $1/f(\boldsymbol{x}) = f(\boldsymbol{x})^{-1}$ from (2.13).
- When $n = 2$: we obtain the derivative of the square $f(\boldsymbol{x})^2$. $\dfrac{\partial}{\partial \boldsymbol{x}} f(\boldsymbol{x})^2 = 2f(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}$
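The power rule \eqref{eq:2-14-7} can be checked the same way; the exponent $n = 3$ and the function $f$ below are illustrative assumptions.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x with gradient 2x, raised to the power n = 3.
n = 3
f = lambda v: v @ v
x = np.array([0.5, -1.0, 2.0])

analytic = n * f(x) ** (n - 1) * 2 * x    # n * f^(n-1) * grad f
numeric = num_grad(lambda v: f(v) ** n, x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```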
2.15 Derivative of the Exponential Function
Proof
Consider the $k$-th component of the gradient vector. We take the partial derivative of $e^{f(\boldsymbol{x})}$ with respect to $x_k$.
Applying the chain rule (1.26): this becomes "derivative of the outer function" times "derivative of the inner function."
\begin{equation} \dfrac{\partial}{\partial x_k} (e^{f(\boldsymbol{x})}) = \dfrac{d}{df}(e^f) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-15-1} \end{equation}
Computing $\dfrac{d}{df}(e^f)$: the derivative of the exponential function $e^f$ is $e^f$ itself (1.20).
\begin{equation} \dfrac{d}{df}(e^f) = e^f \label{eq:2-15-2} \end{equation}
Substituting \eqref{eq:2-15-2} into \eqref{eq:2-15-1}:
\begin{equation} \dfrac{\partial}{\partial x_k} (e^{f(\boldsymbol{x})}) = e^{f(\boldsymbol{x})} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-15-3} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (e^{f(\boldsymbol{x})}) = \begin{pmatrix} \displaystyle e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-15-4} \end{equation}
Since $e^{f(\boldsymbol{x})}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (e^{f(\boldsymbol{x})}) = e^{f(\boldsymbol{x})} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-15-5} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (e^{f(\boldsymbol{x})}) = e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-15-6} \end{equation}
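A numerical check of \eqref{eq:2-15-6} with an illustrative $f$; the test point is kept small so $e^{f(\boldsymbol{x})}$ stays moderate.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x with gradient 2x.
f = lambda v: v @ v
x = np.array([0.1, -0.2, 0.3])

analytic = np.exp(f(x)) * 2 * x           # e^f * grad f
numeric = num_grad(lambda v: np.exp(f(v)), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```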
2.16 Derivative of the Logarithmic Function
Proof
Consider the $k$-th component of the gradient vector. We take the partial derivative of $\log f(\boldsymbol{x})$ with respect to $x_k$.
Applying the chain rule (1.26): this becomes "derivative of the outer function" times "derivative of the inner function."
\begin{equation} \dfrac{\partial}{\partial x_k} (\log f(\boldsymbol{x})) = \dfrac{d}{df}(\log f) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-16-1} \end{equation}
Computing $\dfrac{d}{df}(\log f)$: the derivative of the natural logarithm $\log f$ is $\dfrac{1}{f}$ (1.21).
\begin{equation} \dfrac{d}{df}(\log f) = \dfrac{1}{f} \label{eq:2-16-2} \end{equation}
Substituting \eqref{eq:2-16-2} into \eqref{eq:2-16-1}:
\begin{equation} \dfrac{\partial}{\partial x_k} (\log f(\boldsymbol{x})) = \dfrac{1}{f(\boldsymbol{x})} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-16-3} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\log f(\boldsymbol{x})) = \begin{pmatrix} \displaystyle \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-16-4} \end{equation}
Since $\dfrac{1}{f(\boldsymbol{x})}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\log f(\boldsymbol{x})) = \dfrac{1}{f(\boldsymbol{x})} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-16-5} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\log f(\boldsymbol{x})) = \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-16-6} \end{equation}
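Finally, a numerical check of \eqref{eq:2-16-6}, with an illustrative $f$ kept strictly positive so the logarithm is defined.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x + 1, strictly positive, with gradient 2x.
f = lambda v: v @ v + 1.0
x = np.array([0.5, -1.0, 2.0])

analytic = (2 * x) / f(x)                 # grad f / f
numeric = num_grad(lambda v: np.log(f(v)), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```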