Proofs Chapter 2: Scalar by Vector Derivatives
In this chapter, we rigorously prove formulas for differentiating scalar-valued functions with respect to vectors. The gradient plays a central role throughout machine learning and statistics, including backpropagation in neural networks, optimization algorithms (gradient descent, Newton's method), and maximum likelihood estimation of statistical models. We begin with the derivative of a constant vector and proceed to linear functions, inner products, quadratic forms, norms, and derivatives of exponential and logarithmic functions. All proofs follow the denominator layout convention.
Prerequisites: Basic formulas from Chapter 1 (Scalar Derivatives of Single Variable). Chapters that use results from this chapter: Chapter 5 (Trace Derivatives), Chapter 10 (Quadratic Form Derivatives), Chapter 12 (Norm Derivatives).
Unless otherwise stated, the formulas in this chapter hold under the following conditions:
- All formulas are based on the denominator layout
- Differentiating a scalar $f$ with respect to a vector $\boldsymbol{x} \in \mathbb{R}^N$ yields a column vector $\dfrac{\partial f}{\partial \boldsymbol{x}} \in \mathbb{R}^N$
- Functions are defined and differentiable on an open set; singular points (e.g., $\boldsymbol{x} = \boldsymbol{a}$ for the norm) are noted individually
2.1 Derivative of a Constant
Proof
First, we recall the definition of the gradient vector. In the denominator layout, differentiating a scalar with respect to a vector $\boldsymbol{x}$ yields a column vector whose components are the partial derivatives with respect to each component of $\boldsymbol{x}$. Applying this definition to the scalar $a$:
\begin{equation} \dfrac{\partial a}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial a}{\partial x_0} \\[1em] \dfrac{\partial a}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial a}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-1-1} \end{equation}
Here, note that $a$ is a constant. A constant is a value that does not depend on any of the variables $x_0, x_1, \ldots, x_{N-1}$.
Differentiating a value that does not depend on a variable with respect to that variable yields 0. This is a fundamental property of differentiation. Therefore, for each component we have:
\begin{equation} \dfrac{\partial a}{\partial x_k} = 0 \quad (k = 0, 1, \ldots, N-1) \label{eq:2-1-2} \end{equation}
Substituting \eqref{eq:2-1-2} into \eqref{eq:2-1-1}, all components become 0.
\begin{equation} \dfrac{\partial a}{\partial \boldsymbol{x}} = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{pmatrix} \label{eq:2-1-3} \end{equation}
A vector whose components are all 0 is called the zero vector $\boldsymbol{0}$. Therefore, the final result is:
\begin{equation} \dfrac{\partial a}{\partial \boldsymbol{x}} = \boldsymbol{0} \label{eq:2-1-4} \end{equation}
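As an illustrative numerical check (not part of the proof), the result can be compared against central finite differences. The helper `numerical_gradient` and the sample values below are assumptions chosen for this sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = 3.5                             # example constant scalar
x = np.array([1.0, -2.0, 0.5])      # example evaluation point
grad = numerical_gradient(lambda v: a, x)
# f does not depend on x, so every finite difference is 0:
# grad is the zero vector, in agreement with the result above.
```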
2.2 Derivative of an Inner Product
Proof
First, we recall the definition of the inner product $\boldsymbol{a}^\top \boldsymbol{x}$. It is expressed as the product of the row vector $\boldsymbol{a}^\top$ and the column vector $\boldsymbol{x}$.
\begin{equation} \boldsymbol{a}^\top \boldsymbol{x} = \begin{pmatrix} a_0 & a_1 & \cdots & a_{N-1} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} \label{eq:2-2-1} \end{equation}
Computing the matrix product: the product of a row vector and a column vector is a scalar obtained by summing the products of corresponding components.
\begin{equation} \boldsymbol{a}^\top \boldsymbol{x} = a_0 x_0 + a_1 x_1 + \cdots + a_{N-1} x_{N-1} \label{eq:2-2-2} \end{equation}
Expressing this concisely using summation notation:
\begin{equation} \boldsymbol{a}^\top \boldsymbol{x} = \sum_{n=0}^{N-1} a_n x_n \label{eq:2-2-3} \end{equation}
Next, we take the partial derivative of this expression with respect to $x_k$ (the $k$-th component). In the sum, only the term $a_k x_k$ contains $x_k$; the other terms ($a_n x_n$ for $n \neq k$) do not depend on $x_k$.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} a_n x_n = \dfrac{\partial}{\partial x_k} (a_0 x_0 + \cdots + a_k x_k + \cdots + a_{N-1} x_{N-1}) \label{eq:2-2-4} \end{equation}
Differentiating terms that do not depend on $x_k$ yields 0. Therefore, only the $a_k x_k$ term remains.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} a_n x_n = 0 + \cdots + \dfrac{\partial}{\partial x_k}(a_k x_k) + \cdots + 0 \label{eq:2-2-5} \end{equation}
Since $a_k$ is a constant, it can be factored out of the derivative. Since $\dfrac{\partial x_k}{\partial x_k} = 1$, we obtain:
\begin{equation} \dfrac{\partial}{\partial x_k}(a_k x_k) = a_k \cdot \dfrac{\partial x_k}{\partial x_k} = a_k \cdot 1 = a_k \label{eq:2-2-6} \end{equation}
Therefore, the partial derivative of the inner product with respect to $x_k$ is:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{a}^\top \boldsymbol{x}) = a_k \label{eq:2-2-7} \end{equation}
Assembling the result \eqref{eq:2-2-7} for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{a}^\top \boldsymbol{x}) = \begin{pmatrix} \dfrac{\partial}{\partial x_0} (\boldsymbol{a}^\top \boldsymbol{x}) \\[1em] \dfrac{\partial}{\partial x_1} (\boldsymbol{a}^\top \boldsymbol{x}) \\[1em] \vdots \\[0.5em] \dfrac{\partial}{\partial x_{N-1}} (\boldsymbol{a}^\top \boldsymbol{x}) \end{pmatrix} = \begin{pmatrix} a_0 \\ a_1 \\ \vdots \\ a_{N-1} \end{pmatrix} \label{eq:2-2-8} \end{equation}
The vector on the right-hand side is $\boldsymbol{a}$ itself. Therefore, the final result is:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{a}^\top \boldsymbol{x}) = \boldsymbol{a} \label{eq:2-2-9} \end{equation}
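As a quick numerical sanity check (the helper and the sample vectors are illustrative assumptions, not part of the proof), the gradient of the inner product should approximate $\boldsymbol{a}$:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([2.0, -1.0, 0.5])      # example coefficient vector
x = np.array([0.3, 1.2, -0.7])      # example evaluation point
grad = numerical_gradient(lambda v: a @ v, x)
# grad approximates a, matching d(a^T x)/dx = a
```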
2.3 Derivative of the Squared 2-Norm
Proof
First, we write $\boldsymbol{x}^\top \boldsymbol{x}$ in component form. It is the product of the row vector $\boldsymbol{x}^\top$ and the column vector $\boldsymbol{x}$.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = \begin{pmatrix} x_0 & x_1 & \cdots & x_{N-1} \end{pmatrix} \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} \label{eq:2-3-1} \end{equation}
Computing the matrix product: multiplying corresponding components and summing gives the sum of squares of each component.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = x_0 \cdot x_0 + x_1 \cdot x_1 + \cdots + x_{N-1} \cdot x_{N-1} \label{eq:2-3-2} \end{equation}
Simplifying, this can be written as:
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = x_0^2 + x_1^2 + \cdots + x_{N-1}^2 \label{eq:2-3-3} \end{equation}
Expressing this concisely using summation notation, it equals the squared 2-norm $\|\boldsymbol{x}\|_2^2$.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{x} = \sum_{n=0}^{N-1} x_n^2 = \|\boldsymbol{x}\|_2^2 \label{eq:2-3-4} \end{equation}
Next, we take the partial derivative of this expression with respect to $x_k$ (the $k$-th component). In the sum, only the term $x_k^2$ contains $x_k$.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} x_n^2 = \dfrac{\partial}{\partial x_k} (x_0^2 + \cdots + x_k^2 + \cdots + x_{N-1}^2) \label{eq:2-3-5} \end{equation}
Differentiating terms that do not depend on $x_k$ ($x_n^2$ for $n \neq k$) yields 0. Therefore, only the $x_k^2$ term remains.
\begin{equation} \dfrac{\partial}{\partial x_k} \sum_{n=0}^{N-1} x_n^2 = 0 + \cdots + \dfrac{\partial}{\partial x_k}(x_k^2) + \cdots + 0 \label{eq:2-3-6} \end{equation}
Applying the power rule (1.18): $\dfrac{d}{dx}(x^n) = n x^{n-1}$ with $n = 2$ gives:
\begin{equation} \dfrac{\partial}{\partial x_k}(x_k^2) = 2 x_k^{2-1} = 2x_k \label{eq:2-3-7} \end{equation}
Assembling the result \eqref{eq:2-3-7} for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = \begin{pmatrix} 2x_0 \\ 2x_1 \\ \vdots \\ 2x_{N-1} \end{pmatrix} \label{eq:2-3-8} \end{equation}
Factoring out the common factor 2:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = 2 \begin{pmatrix} x_0 \\ x_1 \\ \vdots \\ x_{N-1} \end{pmatrix} \label{eq:2-3-9} \end{equation}
The column vector on the right-hand side is $\boldsymbol{x}$ itself. Therefore, the final result is:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = 2\boldsymbol{x} \label{eq:2-3-10} \end{equation}
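The same kind of illustrative finite-difference check (sample point and helper are assumptions for the sketch) confirms the factor of 2:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

x = np.array([0.3, 1.2, -0.7])      # example evaluation point
grad = numerical_gradient(lambda v: v @ v, x)
expected = 2 * x                    # d(x^T x)/dx = 2x
```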
2.4 Derivative of a Bilinear Form
Proof
First, we compute $\boldsymbol{A}\boldsymbol{x}$ in components. Since $\boldsymbol{A}$ is an $M \times N$ matrix and $\boldsymbol{x}$ is an $N$-dimensional column vector, the result is an $M$-dimensional column vector.
\begin{equation} (\boldsymbol{A}\boldsymbol{x})_i = \sum_{j=0}^{N-1} A_{ij} x_j \quad (i = 0, 1, \ldots, M-1) \label{eq:2-4-1} \end{equation}
That is, the $i$-th component of $\boldsymbol{A}\boldsymbol{x}$ is the inner product of the $i$-th row of $\boldsymbol{A}$ with $\boldsymbol{x}$.
Next, we compute $\boldsymbol{b}^\top (\boldsymbol{A}\boldsymbol{x})$. This is the inner product of two $M$-dimensional vectors.
\begin{equation} \boldsymbol{b}^\top (\boldsymbol{A}\boldsymbol{x}) = \sum_{i=0}^{M-1} b_i \cdot (\boldsymbol{A}\boldsymbol{x})_i \label{eq:2-4-2} \end{equation}
Substituting \eqref{eq:2-4-1} into \eqref{eq:2-4-2}:
\begin{equation} \boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{M-1} b_i \cdot \left( \sum_{j=0}^{N-1} A_{ij} x_j \right) \label{eq:2-4-3} \end{equation}
Since $b_i$ does not depend on $j$, it can be moved inside the inner sum, combining the expression into a single double sum.
\begin{equation} \boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} x_j \label{eq:2-4-4} \end{equation}
Taking the partial derivative of this expression with respect to $x_k$ (the $k$-th component). Since differentiation is a linear operator, it commutes with the summation.
\begin{equation} \dfrac{\partial}{\partial x_k} \left( \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} x_j \right) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} \dfrac{\partial x_j}{\partial x_k} \label{eq:2-4-5} \end{equation}
Here, $\dfrac{\partial x_j}{\partial x_k}$ equals the Kronecker delta, which is 1 when $j = k$ and 0 otherwise.
\begin{equation} \dfrac{\partial x_j}{\partial x_k} = \delta_{jk} = \begin{cases} 1 & (j = k) \\ 0 & (j \neq k) \end{cases} \label{eq:2-4-6} \end{equation}
Substituting \eqref{eq:2-4-6} into \eqref{eq:2-4-5}. By $\delta_{jk}$, only the $j = k$ term survives.
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} b_i A_{ij} \delta_{jk} \label{eq:2-4-7} \end{equation}
Summing over $j$, the Kronecker delta selects only the $j = k$ term.
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{M-1} b_i A_{ik} \label{eq:2-4-8} \end{equation}
We interpret this result as a matrix product. $\sum_{i=0}^{M-1} b_i A_{ik}$ is the inner product of the row vector $\boldsymbol{b}^\top$ with the $k$-th column of $\boldsymbol{A}$. We verify that this equals $(\boldsymbol{A}^\top \boldsymbol{b})_k$.
The $k$-th row of $\boldsymbol{A}^\top$ is the transpose of the $k$-th column of $\boldsymbol{A}$. Using the definition of the transpose $(\boldsymbol{A}^\top)_{ki} = A_{ik}$:
\begin{equation} (\boldsymbol{A}^\top \boldsymbol{b})_k = \sum_{i=0}^{M-1} (\boldsymbol{A}^\top)_{ki} b_i = \sum_{i=0}^{M-1} A_{ik} b_i \label{eq:2-4-9} \end{equation}
Rearranging the order of the product, we see it matches \eqref{eq:2-4-8}.
\begin{equation} (\boldsymbol{A}^\top \boldsymbol{b})_k = \sum_{i=0}^{M-1} b_i A_{ik} \label{eq:2-4-10} \end{equation}
Therefore, the partial derivative with respect to $x_k$ is:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = (\boldsymbol{A}^\top \boldsymbol{b})_k \label{eq:2-4-11} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}) = \boldsymbol{A}^\top \boldsymbol{b} \label{eq:2-4-12} \end{equation}
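As an illustrative check with a non-square matrix (the random seed, shapes, and helper are assumptions for this sketch), the numerical gradient of $\boldsymbol{b}^\top \boldsymbol{A} \boldsymbol{x}$ should approximate $\boldsymbol{A}^\top \boldsymbol{b}$:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))     # M x N with M = 2, N = 3
b = rng.standard_normal(2)          # M-dimensional
x = rng.standard_normal(3)          # N-dimensional
grad = numerical_gradient(lambda v: b @ (A @ v), x)
expected = A.T @ b                  # d(b^T A x)/dx = A^T b
```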
2.5 Derivative of a Quadratic Form
Proof
First, we write the quadratic form $\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}$ in component form. The $i$-th component of $\boldsymbol{A}\boldsymbol{x}$ is $\sum_{j=0}^{N-1} A_{ij} x_j$.
\begin{equation} \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} = \boldsymbol{x}^\top (\boldsymbol{A} \boldsymbol{x}) \label{eq:2-5-1} \end{equation}
Expanding according to the definition of the inner product:
\begin{equation} \boldsymbol{x}^\top (\boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{N-1} x_i \cdot (\boldsymbol{A}\boldsymbol{x})_i \label{eq:2-5-2} \end{equation}
Substituting the definition of $(\boldsymbol{A}\boldsymbol{x})_i$:
\begin{equation} \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{N-1} x_i \cdot \left( \sum_{j=0}^{N-1} A_{ij} x_j \right) \label{eq:2-5-3} \end{equation}
Since $x_i$ does not depend on $j$, it can be moved inside the inner sum, combining the expression into a double sum:
\begin{equation} \boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x} = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} x_i x_j \label{eq:2-5-4} \end{equation}
Taking the partial derivative of this expression with respect to $x_k$ (the $k$-th component). In each term $A_{ij} x_i x_j$ of the double sum, $x_k$ appears when $i = k$ or $j = k$.
Using the product rule (Leibniz rule, 1.25), and noting that $\dfrac{\partial x_i}{\partial x_k} = \delta_{ik}$ (Kronecker delta):
\begin{equation} \dfrac{\partial (x_i x_j)}{\partial x_k} = \dfrac{\partial x_i}{\partial x_k} \cdot x_j + x_i \cdot \dfrac{\partial x_j}{\partial x_k} = \delta_{ik} x_j + x_i \delta_{jk} \label{eq:2-5-5} \end{equation}
Applying this result to the double sum. Since $A_{ij}$ is a constant, it can be factored out of the derivative.
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} (\delta_{ik} x_j + x_i \delta_{jk}) \label{eq:2-5-6} \end{equation}
Splitting the sum into two parts:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} \delta_{ik} x_j + \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} x_i \delta_{jk} \label{eq:2-5-7} \end{equation}
Computing the first term. Since $\delta_{ik} = 1$ only when $i = k$, summing over $i$ selects only the $i = k$ term.
\begin{equation} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} \delta_{ik} x_j = \sum_{j=0}^{N-1} A_{kj} x_j \label{eq:2-5-8} \end{equation}
This is $(\boldsymbol{A}\boldsymbol{x})_k$, i.e., the $k$-th component of $\boldsymbol{A}\boldsymbol{x}$.
\begin{equation} \sum_{j=0}^{N-1} A_{kj} x_j = (\boldsymbol{A}\boldsymbol{x})_k \label{eq:2-5-9} \end{equation}
Computing the second term. Since $\delta_{jk} = 1$ only when $j = k$, summing over $j$ selects only the $j = k$ term:
\begin{equation} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} A_{ij} x_i \delta_{jk} = \sum_{i=0}^{N-1} A_{ik} x_i \label{eq:2-5-10} \end{equation}
We interpret the second term as a component of $\boldsymbol{A}^\top \boldsymbol{x}$. Using the definition of the transpose $(\boldsymbol{A}^\top)_{ki} = A_{ik}$:
\begin{equation} (\boldsymbol{A}^\top \boldsymbol{x})_k = \sum_{i=0}^{N-1} (\boldsymbol{A}^\top)_{ki} x_i = \sum_{i=0}^{N-1} A_{ik} x_i \label{eq:2-5-11} \end{equation}
Comparing \eqref{eq:2-5-10} and \eqref{eq:2-5-11}, we see they are equal.
\begin{equation} \sum_{i=0}^{N-1} A_{ik} x_i = (\boldsymbol{A}^\top \boldsymbol{x})_k \label{eq:2-5-12} \end{equation}
Substituting \eqref{eq:2-5-9} and \eqref{eq:2-5-12} into \eqref{eq:2-5-7}:
\begin{equation} \dfrac{\partial}{\partial x_k} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = (\boldsymbol{A}\boldsymbol{x})_k + (\boldsymbol{A}^\top \boldsymbol{x})_k \label{eq:2-5-13} \end{equation}
By the distributivity of matrix-vector multiplication over matrix addition, the two terms on the right-hand side combine:
\begin{equation} (\boldsymbol{A}\boldsymbol{x})_k + (\boldsymbol{A}^\top \boldsymbol{x})_k = ((\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x})_k \label{eq:2-5-14} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{A} \boldsymbol{x}) = (\boldsymbol{A} + \boldsymbol{A}^\top) \boldsymbol{x} \label{eq:2-5-15} \end{equation}
Remark 2: When $\boldsymbol{A} = \boldsymbol{I}$ (the identity matrix), $\boldsymbol{I} + \boldsymbol{I}^\top = 2\boldsymbol{I}$, so $$\dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{x}^\top \boldsymbol{x}) = 2\boldsymbol{I}\boldsymbol{x} = 2\boldsymbol{x}$$ which is consistent with formula (2.3).
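A deliberately non-symmetric $\boldsymbol{A}$ makes the $\boldsymbol{A} + \boldsymbol{A}^\top$ structure visible in a numerical check; the seed, helper, and sample point below are assumptions for illustration:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))     # deliberately non-symmetric
x = rng.standard_normal(3)
grad = numerical_gradient(lambda v: v @ A @ v, x)
expected = (A + A.T) @ x            # d(x^T A x)/dx = (A + A^T) x
```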
2.6 Derivative of the 2-Norm
Proof
First, we write the 2-norm $\|\boldsymbol{x} - \boldsymbol{a}\|$ according to its definition.
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\| = \sqrt{(\boldsymbol{x} - \boldsymbol{a})^\top (\boldsymbol{x} - \boldsymbol{a})} \label{eq:2-6-1} \end{equation}
Expanding in component form:
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\| = \sqrt{\sum_{i=0}^{N-1} (x_i - a_i)^2} \label{eq:2-6-2} \end{equation}
For notational convenience, let $u = \|\boldsymbol{x} - \boldsymbol{a}\|^2$.
\begin{equation} u = \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \sum_{i=0}^{N-1} (x_i - a_i)^2 \label{eq:2-6-3} \end{equation}
Then $\|\boldsymbol{x} - \boldsymbol{a}\| = \sqrt{u} = u^{1/2}$.
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\| = u^{1/2} \label{eq:2-6-4} \end{equation}
Taking the partial derivative of $u^{1/2}$ with respect to $x_k$, using the chain rule (1.26):
\begin{equation} \dfrac{\partial}{\partial x_k} (u^{1/2}) = \dfrac{d}{du}(u^{1/2}) \cdot \dfrac{\partial u}{\partial x_k} \label{eq:2-6-5} \end{equation}
First, we compute $\dfrac{d}{du}(u^{1/2})$. Applying the power rule (1.19) $\dfrac{d}{du}(u^n) = n u^{n-1}$:
\begin{equation} \dfrac{d}{du}(u^{1/2}) = \dfrac{1}{2} u^{1/2 - 1} = \dfrac{1}{2} u^{-1/2} \label{eq:2-6-6} \end{equation}
Expressing this in terms of the original variables:
\begin{equation} \dfrac{1}{2} u^{-1/2} = \dfrac{1}{2\sqrt{u}} = \dfrac{1}{2\|\boldsymbol{x} - \boldsymbol{a}\|} \label{eq:2-6-7} \end{equation}
Next, we compute $\dfrac{\partial u}{\partial x_k}$. From the definition of $u$ in \eqref{eq:2-6-3}:
\begin{equation} \dfrac{\partial u}{\partial x_k} = \dfrac{\partial}{\partial x_k} \sum_{i=0}^{N-1} (x_i - a_i)^2 \label{eq:2-6-8} \end{equation}
In the sum, only the term $(x_k - a_k)^2$ contains $x_k$. The other terms do not depend on $x_k$, so differentiating them yields 0.
\begin{equation} \dfrac{\partial u}{\partial x_k} = \dfrac{\partial}{\partial x_k} (x_k - a_k)^2 \label{eq:2-6-9} \end{equation}
Differentiating $(x_k - a_k)^2$ with respect to $x_k$, applying the chain rule (1.26):
\begin{equation} \dfrac{\partial}{\partial x_k} (x_k - a_k)^2 = 2(x_k - a_k) \cdot \dfrac{\partial}{\partial x_k}(x_k - a_k) \label{eq:2-6-10} \end{equation}
Since $a_k$ is a constant, $\dfrac{\partial}{\partial x_k}(x_k - a_k) = 1$.
\begin{equation} \dfrac{\partial}{\partial x_k} (x_k - a_k)^2 = 2(x_k - a_k) \cdot 1 = 2(x_k - a_k) \label{eq:2-6-11} \end{equation}
Substituting \eqref{eq:2-6-7} and \eqref{eq:2-6-11} into \eqref{eq:2-6-5}:
\begin{equation} \dfrac{\partial}{\partial x_k} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{1}{2\|\boldsymbol{x} - \boldsymbol{a}\|} \cdot 2(x_k - a_k) \label{eq:2-6-12} \end{equation}
The 2 in the numerator and the 2 in the denominator cancel.
\begin{equation} \dfrac{\partial}{\partial x_k} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{x_k - a_k}{\|\boldsymbol{x} - \boldsymbol{a}\|} \label{eq:2-6-13} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\| = \begin{pmatrix} \dfrac{x_0 - a_0}{\|\boldsymbol{x} - \boldsymbol{a}\|} \\[1em] \dfrac{x_1 - a_1}{\|\boldsymbol{x} - \boldsymbol{a}\|} \\[1em] \vdots \\[0.5em] \dfrac{x_{N-1} - a_{N-1}}{\|\boldsymbol{x} - \boldsymbol{a}\|} \end{pmatrix} \label{eq:2-6-14} \end{equation}
Since the denominator $\|\boldsymbol{x} - \boldsymbol{a}\|$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{1}{\|\boldsymbol{x} - \boldsymbol{a}\|} \begin{pmatrix} x_0 - a_0 \\ x_1 - a_1 \\ \vdots \\ x_{N-1} - a_{N-1} \end{pmatrix} \label{eq:2-6-15} \end{equation}
The column vector on the right-hand side is $\boldsymbol{x} - \boldsymbol{a}$ itself. Therefore, the final result is:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\| = \dfrac{\boldsymbol{x} - \boldsymbol{a}}{\|\boldsymbol{x} - \boldsymbol{a}\|} \label{eq:2-6-16} \end{equation}
Remark 2: When $\boldsymbol{a} = \boldsymbol{0}$, this becomes $\dfrac{\boldsymbol{x}}{\|\boldsymbol{x}\|}$.
Non-differentiability at $\boldsymbol{x} = \boldsymbol{a}$: At $\boldsymbol{x} = \boldsymbol{a}$, the denominator becomes 0 and the gradient is undefined. Geometrically, the norm function has a "kink" at this point, so no unique tangent direction exists. In the context of optimization, instead of the gradient at this point, the subdifferential (set of subgradients) $\{\boldsymbol{g} : \|\boldsymbol{g}\| \leq 1\}$ is used.
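Away from the singular point $\boldsymbol{x} = \boldsymbol{a}$, the formula can be checked numerically; the helper and sample vectors are illustrative assumptions:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([1.0, 0.0, -1.0])
x = np.array([2.0, 1.5, 0.5])       # chosen away from a, where the norm is differentiable
grad = numerical_gradient(lambda v: np.linalg.norm(v - a), x)
expected = (x - a) / np.linalg.norm(x - a)
# expected is a unit vector: the gradient of the norm has length 1 off the singular point
```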
2.7 Derivative of the Squared 2-Norm
Proof
First, we express $\|\boldsymbol{x} - \boldsymbol{a}\|^2$ in inner product form.
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = (\boldsymbol{x} - \boldsymbol{a})^\top (\boldsymbol{x} - \boldsymbol{a}) \label{eq:2-7-1} \end{equation}
For notational convenience, define the vector-valued function $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{x} - \boldsymbol{a}$.
\begin{equation} \boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{x} - \boldsymbol{a} \label{eq:2-7-2} \end{equation}
Then \eqref{eq:2-7-1} can be written as:
\begin{equation} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{f}(\boldsymbol{x}) \label{eq:2-7-3} \end{equation}
Since $\boldsymbol{a}$ is a constant, the Jacobian matrix of $\boldsymbol{f}(\boldsymbol{x})$ is the identity matrix.
\begin{equation} \dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} = \boldsymbol{I} \label{eq:2-7-4} \end{equation}
From formula (2.3), $\dfrac{\partial}{\partial \boldsymbol{f}} (\boldsymbol{f}^\top \boldsymbol{f}) = 2\boldsymbol{f}$.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{f}} (\boldsymbol{f}^\top \boldsymbol{f}) = 2\boldsymbol{f} \label{eq:2-7-5} \end{equation}
Applying the chain rule (1.26):
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \cdot \dfrac{\partial}{\partial \boldsymbol{f}} (\boldsymbol{f}^\top \boldsymbol{f}) \label{eq:2-7-6} \end{equation}
Substituting \eqref{eq:2-7-4} and \eqref{eq:2-7-5} into \eqref{eq:2-7-6}:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = \boldsymbol{I} \cdot 2\boldsymbol{f}(\boldsymbol{x}) \label{eq:2-7-7} \end{equation}
The product of the identity matrix and a vector is the vector itself.
\begin{equation} \boldsymbol{I} \cdot 2\boldsymbol{f}(\boldsymbol{x}) = 2\boldsymbol{f}(\boldsymbol{x}) \label{eq:2-7-8} \end{equation}
Substituting the definition of $\boldsymbol{f}(\boldsymbol{x})$ from \eqref{eq:2-7-2} gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \|\boldsymbol{x} - \boldsymbol{a}\|^2 = 2(\boldsymbol{x} - \boldsymbol{a}) \label{eq:2-7-9} \end{equation}
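Unlike the norm itself, the squared norm is differentiable everywhere, which a finite-difference sketch (helper and sample vectors assumed for illustration) also reflects:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([1.0, 0.0, -1.0])
x = np.array([2.0, 1.5, 0.5])
grad = numerical_gradient(lambda v: np.sum((v - a) ** 2), x)
expected = 2 * (x - a)              # d||x - a||^2 / dx = 2(x - a)
```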
2.8 Derivative of the Inner Product (Dot Product)
Proof
We write the inner product $\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})$ in component form. Let $f_m(\boldsymbol{x})$ denote the $m$-th component of $\boldsymbol{f}(\boldsymbol{x})$, and $g_m(\boldsymbol{x})$ the $m$-th component of $\boldsymbol{g}(\boldsymbol{x})$.
\begin{equation} \boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x}) = \sum_{m=0}^{M-1} f_m(\boldsymbol{x}) g_m(\boldsymbol{x}) \label{eq:2-8-1} \end{equation}
Taking the partial derivative of this expression with respect to $x_n$ (the $n$-th component of $\boldsymbol{x}$):
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \dfrac{\partial}{\partial x_n} \sum_{m=0}^{M-1} f_m(\boldsymbol{x}) g_m(\boldsymbol{x}) \label{eq:2-8-2} \end{equation}
Since differentiation is a linear operator, it commutes with the summation.
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \sum_{m=0}^{M-1} \dfrac{\partial}{\partial x_n} (f_m(\boldsymbol{x}) g_m(\boldsymbol{x})) \label{eq:2-8-3} \end{equation}
Applying the product rule (Leibniz rule, 1.25) to each term $f_m(\boldsymbol{x}) g_m(\boldsymbol{x})$:
\begin{equation} \dfrac{\partial}{\partial x_n} (f_m(\boldsymbol{x}) g_m(\boldsymbol{x})) = \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) + f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} \label{eq:2-8-4} \end{equation}
Substituting \eqref{eq:2-8-4} into \eqref{eq:2-8-3}:
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \sum_{m=0}^{M-1} \left( \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) + f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} \right) \label{eq:2-8-5} \end{equation}
Splitting the sum into two parts:
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \sum_{m=0}^{M-1} \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) + \sum_{m=0}^{M-1} f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} \label{eq:2-8-6} \end{equation}
Interpreting the first term as a matrix product. $\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}}$ is an $N \times M$ matrix whose $(n, m)$ entry is $\dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n}$.
\begin{equation} \left(\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}}\right)_{nm} = \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} \label{eq:2-8-7} \end{equation}
The $n$-th component of the product of this matrix with $\boldsymbol{g}(\boldsymbol{x})$ equals the first sum.
\begin{equation} \left(\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x})\right)_n = \sum_{m=0}^{M-1} \dfrac{\partial f_m(\boldsymbol{x})}{\partial x_n} g_m(\boldsymbol{x}) \label{eq:2-8-8} \end{equation}
Similarly, the second term can be interpreted as:
\begin{equation} \left(\dfrac{\partial \boldsymbol{g}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x})\right)_n = \sum_{m=0}^{M-1} \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} f_m(\boldsymbol{x}) \label{eq:2-8-9} \end{equation}
Since scalar multiplication is commutative, $f_m(\boldsymbol{x}) \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} = \dfrac{\partial g_m(\boldsymbol{x})}{\partial x_n} f_m(\boldsymbol{x})$.
Replacing the first and second terms of \eqref{eq:2-8-6} with \eqref{eq:2-8-8} and \eqref{eq:2-8-9}:
\begin{equation} \dfrac{\partial}{\partial x_n} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \left(\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x})\right)_n + \left(\dfrac{\partial \boldsymbol{g}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x})\right)_n \label{eq:2-8-10} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ gives the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\boldsymbol{f}(\boldsymbol{x})^\top \boldsymbol{g}(\boldsymbol{x})) = \dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x}) + \dfrac{\partial \boldsymbol{g}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{f}(\boldsymbol{x}) \label{eq:2-8-11} \end{equation}
- $\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \in \mathbb{R}^{N \times M}$ ($N$-by-$M$ matrix)
- $\boldsymbol{g}(\boldsymbol{x}) \in \mathbb{R}^M$ ($M$-dimensional column vector)
- Product $\dfrac{\partial \boldsymbol{f}(\boldsymbol{x})}{\partial \boldsymbol{x}} \boldsymbol{g}(\boldsymbol{x}) \in \mathbb{R}^N$ ($N$-dimensional column vector)
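The shapes above can be exercised numerically with the linear choices $\boldsymbol{f}(\boldsymbol{x}) = \boldsymbol{A}\boldsymbol{x}$ and $\boldsymbol{g}(\boldsymbol{x}) = \boldsymbol{B}\boldsymbol{x}$, whose denominator-layout Jacobians are $\boldsymbol{A}^\top$ and $\boldsymbol{B}^\top$. The seed, shapes, and helper are assumptions for this sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

rng = np.random.default_rng(2)
M, N = 4, 3
A = rng.standard_normal((M, N))
B = rng.standard_normal((M, N))
x = rng.standard_normal(N)
# f(x) = A x, g(x) = B x; denominator-layout Jacobians are A.T and B.T (N x M)
grad = numerical_gradient(lambda v: (A @ v) @ (B @ v), x)
expected = A.T @ (B @ x) + B.T @ (A @ x)
```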
2.9 Product Rule for Scalar Functions
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of the product $f(\boldsymbol{x}) g(\boldsymbol{x})$ with respect to $x_n$.
Applying the standard product rule (Leibniz rule, 1.25):
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial x_n} = \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} \cdot g(\boldsymbol{x}) + f(\boldsymbol{x}) \cdot \dfrac{\partial g(\boldsymbol{x})}{\partial x_n} \label{eq:2-9-1} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} g(\boldsymbol{x}) + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} g(\boldsymbol{x}) + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} g(\boldsymbol{x}) + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-9-2} \end{equation}
Rewriting this as a sum of two vectors:
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} g(\boldsymbol{x}) \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} g(\boldsymbol{x}) \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} g(\boldsymbol{x}) \end{pmatrix} + \begin{pmatrix} \displaystyle f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-9-3} \end{equation}
Since $f(\boldsymbol{x})$ and $g(\boldsymbol{x})$ are scalars, they can be factored out as common factors from each component of the vectors.
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = g(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} + f(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-9-4} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial (f(\boldsymbol{x}) g(\boldsymbol{x}))}{\partial \boldsymbol{x}} = g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} + f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-9-5} \end{equation}
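The product rule can be illustrated with the two scalar functions already derived in this chapter, $f(\boldsymbol{x}) = \boldsymbol{a}^\top \boldsymbol{x}$ and $g(\boldsymbol{x}) = \boldsymbol{x}^\top \boldsymbol{x}$; the helper and sample values are assumptions for the sketch:

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference approximation of the gradient (denominator layout)."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h
        g[k] = (f(x + e) - f(x - e)) / (2 * h)
    return g

a = np.array([1.0, -0.5, 2.0])
x = np.array([0.4, 1.1, -0.3])
f = lambda v: a @ v                 # gradient: a
g = lambda v: v @ v                 # gradient: 2v
grad = numerical_gradient(lambda v: f(v) * g(v), x)
expected = g(x) * a + f(x) * 2 * x  # g * df/dx + f * dg/dx
```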
2.10 Sum and Difference Rule
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $f + g$ with respect to $x_n$.
By the linearity of differentiation (1.24), the derivative of a sum equals the sum of derivatives.
\begin{equation} \dfrac{\partial (f + g)}{\partial x_n} = \dfrac{\partial f}{\partial x_n} + \dfrac{\partial g}{\partial x_n} \label{eq:2-10-1} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial (f + g)}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f}{\partial x_0} + \dfrac{\partial g}{\partial x_0} \\[1em] \dfrac{\partial f}{\partial x_1} + \dfrac{\partial g}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f}{\partial x_{N-1}} + \dfrac{\partial g}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-10-2} \end{equation}
Rewriting this as a sum of two vectors:
\begin{equation} \dfrac{\partial (f + g)}{\partial \boldsymbol{x}} = \begin{pmatrix} \dfrac{\partial f}{\partial x_0} \\[1em] \dfrac{\partial f}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f}{\partial x_{N-1}} \end{pmatrix} + \begin{pmatrix} \dfrac{\partial g}{\partial x_0} \\[1em] \dfrac{\partial g}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial g}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-10-3} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial (f + g)}{\partial \boldsymbol{x}} = \dfrac{\partial f}{\partial \boldsymbol{x}} + \dfrac{\partial g}{\partial \boldsymbol{x}} \label{eq:2-10-4} \end{equation}
The same argument applies to the difference $f - g$. By the linearity of differentiation, the derivative of a difference equals the difference of derivatives.
\begin{equation} \dfrac{\partial (f - g)}{\partial \boldsymbol{x}} = \dfrac{\partial f}{\partial \boldsymbol{x}} - \dfrac{\partial g}{\partial \boldsymbol{x}} \label{eq:2-10-5} \end{equation}
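The same kind of finite-difference check, again with illustrative choices of $f$ and $g$, confirms both \eqref{eq:2-10-4} and \eqref{eq:2-10-5}.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choices: f(x) = x.x (gradient 2x), g(x) = sum(x) (gradient of ones).
f = lambda v: v @ v
g = lambda v: v.sum()
x = np.array([0.5, -1.0, 2.0])
grad_f, grad_g = 2 * x, np.ones_like(x)

sum_ok = np.allclose(num_grad(lambda v: f(v) + g(v), x), grad_f + grad_g)
diff_ok = np.allclose(num_grad(lambda v: f(v) - g(v), x), grad_f - grad_g)
print(sum_ok, diff_ok)  # True True
```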
2.11 Scalar Multiple Rule
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $c \cdot f$ with respect to $x_n$.
Since $c$ is a constant that does not depend on $x_n$, it can be factored out of the derivative. This is part of the linearity of differentiation (1.24).
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial x_n} = c \cdot \dfrac{\partial f}{\partial x_n} \label{eq:2-11-1} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial \boldsymbol{x}} = \begin{pmatrix} \displaystyle c \cdot \dfrac{\partial f}{\partial x_0} \\[1em] \displaystyle c \cdot \dfrac{\partial f}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle c \cdot \dfrac{\partial f}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-11-2} \end{equation}
Since $c$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial \boldsymbol{x}} = c \cdot \begin{pmatrix} \dfrac{\partial f}{\partial x_0} \\[1em] \dfrac{\partial f}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-11-3} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial (c \cdot f)}{\partial \boldsymbol{x}} = c \dfrac{\partial f}{\partial \boldsymbol{x}} \label{eq:2-11-4} \end{equation}
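A quick numerical check of \eqref{eq:2-11-4}; the function $f$ and the constant $c$ below are illustrative assumptions.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x with gradient 2x, scaled by c = 3.
c = 3.0
f = lambda v: v @ v
x = np.array([0.5, -1.0, 2.0])

analytic = c * 2 * x                      # c * grad f
numeric = num_grad(lambda v: c * f(v), x)
print(np.allclose(analytic, numeric))  # True
```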
2.12 Quotient Rule
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $f(\boldsymbol{x})/g(\boldsymbol{x})$ with respect to $x_n$.
Applying the standard quotient rule (1.28):
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{g(\boldsymbol{x}) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} - f(\boldsymbol{x}) \cdot \dfrac{\partial g(\boldsymbol{x})}{\partial x_n}}{g(\boldsymbol{x})^2} \label{eq:2-12-1} \end{equation}
Factoring out $\dfrac{1}{g(\boldsymbol{x})^2}$:
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_n} \right) \label{eq:2-12-2} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \begin{pmatrix} \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \right) \\[1em] \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_1} \right) \\[1em] \vdots \\[0.5em] \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \right) \end{pmatrix} \label{eq:2-12-3} \end{equation}
Since $f(\boldsymbol{x})$, $g(\boldsymbol{x})$, and $1/g(\boldsymbol{x})^2$ are scalars, they can be factored out of the vector.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} - f(\boldsymbol{x}) \begin{pmatrix} \dfrac{\partial g(\boldsymbol{x})}{\partial x_0} \\[1em] \vdots \\[0.5em] \dfrac{\partial g(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \right) \label{eq:2-12-4} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{f(\boldsymbol{x})}{g(\boldsymbol{x})} \right) = \dfrac{1}{g(\boldsymbol{x})^2} \left( g(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} - f(\boldsymbol{x}) \dfrac{\partial g(\boldsymbol{x})}{\partial \boldsymbol{x}} \right) \label{eq:2-12-5} \end{equation}
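The quotient rule \eqref{eq:2-12-5} can likewise be verified numerically, provided $g(\boldsymbol{x}) \neq 0$ at the test point; $f$ and $g$ below are illustrative choices.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choices: f(x) = x.x, g(x) = sum(x); g must be nonzero at x.
f = lambda v: v @ v
g = lambda v: v.sum()
x = np.array([0.5, -1.0, 2.0])            # g(x) = 1.5 != 0

grad_f, grad_g = 2 * x, np.ones_like(x)
analytic = (g(x) * grad_f - f(x) * grad_g) / g(x) ** 2
numeric = num_grad(lambda v: f(v) / g(v), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```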
2.13 Derivative of the Reciprocal
Proof
Consider the $n$-th component of the gradient vector. We take the partial derivative of $1/f(\boldsymbol{x})$ with respect to $x_n$.
Since $1/f(\boldsymbol{x}) = f(\boldsymbol{x})^{-1}$, we rewrite the reciprocal as a power so that the chain rule (1.26) can be applied.
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = \dfrac{\partial}{\partial x_n} (f(\boldsymbol{x})^{-1}) \label{eq:2-13-1} \end{equation}
By the chain rule (1.26), this becomes "derivative of the outer function" times "derivative of the inner function." Letting $t = f(\boldsymbol{x})$:
\begin{equation} \dfrac{\partial}{\partial x_n} (f(\boldsymbol{x})^{-1}) = \dfrac{d}{dt}(t^{-1}) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} \label{eq:2-13-2} \end{equation}
First, we compute $\dfrac{d}{dt}(t^{-1})$. Applying the power rule (1.19) $\dfrac{d}{dt}(t^n) = n t^{n-1}$ with $n = -1$:
\begin{equation} \dfrac{d}{dt}(t^{-1}) = (-1) \cdot t^{-1-1} = -t^{-2} \label{eq:2-13-3} \end{equation}
Rewriting this in fractional form:
\begin{equation} -t^{-2} = -\dfrac{1}{t^2} = -\dfrac{1}{f(\boldsymbol{x})^2} \label{eq:2-13-4} \end{equation}
Substituting \eqref{eq:2-13-4} into \eqref{eq:2-13-2}:
\begin{equation} \dfrac{\partial}{\partial x_n} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = -\dfrac{1}{f(\boldsymbol{x})^2} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_n} \label{eq:2-13-5} \end{equation}
Assembling this result for all $n = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = \begin{pmatrix} \displaystyle -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-13-6} \end{equation}
Since $-\dfrac{1}{f(\boldsymbol{x})^2}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = -\dfrac{1}{f(\boldsymbol{x})^2} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-13-7} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} \left( \dfrac{1}{f(\boldsymbol{x})} \right) = -\dfrac{1}{f(\boldsymbol{x})^2} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-13-8} \end{equation}
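A numerical sketch of \eqref{eq:2-13-8}, with an illustrative $f$ chosen to be strictly positive so the reciprocal is defined everywhere.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x + 1, strictly positive, with gradient 2x.
f = lambda v: v @ v + 1.0
x = np.array([0.5, -1.0, 2.0])

analytic = -(2 * x) / f(x) ** 2           # -grad f / f^2
numeric = num_grad(lambda v: 1.0 / f(v), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```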
2.14 Power Rule
Proof
Consider the $k$-th component of the gradient vector. We take the partial derivative of $f(\boldsymbol{x})^n$ with respect to $x_k$.
Applying the chain rule (1.26): this becomes "derivative of the outer function" times "derivative of the inner function." Letting $t = f(\boldsymbol{x})$:
\begin{equation} \dfrac{\partial}{\partial x_k} (f(\boldsymbol{x})^n) = \dfrac{d}{dt}(t^n) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-14-1} \end{equation}
First, we compute $\dfrac{d}{dt}(t^n)$. Applying the power rule (1.19):
\begin{equation} \dfrac{d}{dt}(t^n) = n \cdot t^{n-1} \label{eq:2-14-2} \end{equation}
Substituting $t = f(\boldsymbol{x})$ back, i.e., evaluating the derivative at $t = f(\boldsymbol{x})$:
\begin{equation} \left. \dfrac{d}{dt}(t^n) \right|_{t = f(\boldsymbol{x})} = n \cdot f(\boldsymbol{x})^{n-1} \label{eq:2-14-3} \end{equation}
Substituting \eqref{eq:2-14-3} into \eqref{eq:2-14-1}:
\begin{equation} \dfrac{\partial}{\partial x_k} (f(\boldsymbol{x})^n) = n f(\boldsymbol{x})^{n-1} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-14-4} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (f(\boldsymbol{x})^n) = \begin{pmatrix} \displaystyle n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-14-5} \end{equation}
Since $n f(\boldsymbol{x})^{n-1}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (f(\boldsymbol{x})^n) = n f(\boldsymbol{x})^{n-1} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-14-6} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (f(\boldsymbol{x})^n) = n f(\boldsymbol{x})^{n-1} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-14-7} \end{equation}
- When $n = 1/2$: we obtain the derivative of the square root $\sqrt{f(\boldsymbol{x})} = f(\boldsymbol{x})^{1/2}$, valid where $f(\boldsymbol{x}) > 0$. $\dfrac{\partial}{\partial \boldsymbol{x}} \sqrt{f(\boldsymbol{x})} = \dfrac{1}{2\sqrt{f(\boldsymbol{x})}} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}$
- When $n = -1$: we recover the derivative of the reciprocal $1/f(\boldsymbol{x}) = f(\boldsymbol{x})^{-1}$ from (2.13).
- When $n = 2$: we obtain the derivative of the square $f(\boldsymbol{x})^2$. $\dfrac{\partial}{\partial \boldsymbol{x}} f(\boldsymbol{x})^2 = 2f(\boldsymbol{x}) \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}}$
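The power rule \eqref{eq:2-14-7} can be checked the same way; the exponent $n = 3$ and the function $f$ below are illustrative assumptions.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x with gradient 2x, raised to the power n = 3.
n = 3
f = lambda v: v @ v
x = np.array([0.5, -1.0, 2.0])

analytic = n * f(x) ** (n - 1) * 2 * x    # n * f^(n-1) * grad f
numeric = num_grad(lambda v: f(v) ** n, x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```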
2.15 Derivative of the Exponential Function
Proof
Consider the $k$-th component of the gradient vector. We take the partial derivative of $e^{f(\boldsymbol{x})}$ with respect to $x_k$.
Applying the chain rule (1.26): this becomes "derivative of the outer function" times "derivative of the inner function."
\begin{equation} \dfrac{\partial}{\partial x_k} (e^{f(\boldsymbol{x})}) = \dfrac{d}{df}(e^f) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-15-1} \end{equation}
Computing $\dfrac{d}{df}(e^f)$: the derivative of the exponential function $e^f$ is $e^f$ itself (1.20).
\begin{equation} \dfrac{d}{df}(e^f) = e^f \label{eq:2-15-2} \end{equation}
Substituting \eqref{eq:2-15-2} into \eqref{eq:2-15-1}:
\begin{equation} \dfrac{\partial}{\partial x_k} (e^{f(\boldsymbol{x})}) = e^{f(\boldsymbol{x})} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-15-3} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (e^{f(\boldsymbol{x})}) = \begin{pmatrix} \displaystyle e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-15-4} \end{equation}
Since $e^{f(\boldsymbol{x})}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (e^{f(\boldsymbol{x})}) = e^{f(\boldsymbol{x})} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-15-5} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (e^{f(\boldsymbol{x})}) = e^{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-15-6} \end{equation}
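A numerical check of \eqref{eq:2-15-6} with an illustrative $f$; the test point is kept small so $e^{f(\boldsymbol{x})}$ stays moderate.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x with gradient 2x.
f = lambda v: v @ v
x = np.array([0.1, -0.2, 0.3])

analytic = np.exp(f(x)) * 2 * x           # e^f * grad f
numeric = num_grad(lambda v: np.exp(f(v)), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```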
2.16 Derivative of the Logarithmic Function
Proof
Consider the $k$-th component of the gradient vector. We take the partial derivative of $\log f(\boldsymbol{x})$ with respect to $x_k$.
Applying the chain rule (1.26): this becomes "derivative of the outer function" times "derivative of the inner function."
\begin{equation} \dfrac{\partial}{\partial x_k} (\log f(\boldsymbol{x})) = \dfrac{d}{df}(\log f) \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-16-1} \end{equation}
Computing $\dfrac{d}{df}(\log f)$: the derivative of the natural logarithm $\log f$ is $\dfrac{1}{f}$ (1.21).
\begin{equation} \dfrac{d}{df}(\log f) = \dfrac{1}{f} \label{eq:2-16-2} \end{equation}
Substituting \eqref{eq:2-16-2} into \eqref{eq:2-16-1}:
\begin{equation} \dfrac{\partial}{\partial x_k} (\log f(\boldsymbol{x})) = \dfrac{1}{f(\boldsymbol{x})} \cdot \dfrac{\partial f(\boldsymbol{x})}{\partial x_k} \label{eq:2-16-3} \end{equation}
Assembling this result for all $k = 0, 1, \ldots, N-1$ to form the gradient vector:
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\log f(\boldsymbol{x})) = \begin{pmatrix} \displaystyle \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \displaystyle \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \displaystyle \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-16-4} \end{equation}
Since $\dfrac{1}{f(\boldsymbol{x})}$ is common to all components, it can be factored out as a scalar.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\log f(\boldsymbol{x})) = \dfrac{1}{f(\boldsymbol{x})} \begin{pmatrix} \dfrac{\partial f(\boldsymbol{x})}{\partial x_0} \\[1em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_1} \\[1em] \vdots \\[0.5em] \dfrac{\partial f(\boldsymbol{x})}{\partial x_{N-1}} \end{pmatrix} \label{eq:2-16-5} \end{equation}
Using the definition of the gradient vector to simplify, we obtain the final result.
\begin{equation} \dfrac{\partial}{\partial \boldsymbol{x}} (\log f(\boldsymbol{x})) = \dfrac{1}{f(\boldsymbol{x})} \dfrac{\partial f(\boldsymbol{x})}{\partial \boldsymbol{x}} \label{eq:2-16-6} \end{equation}
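Finally, a numerical check of \eqref{eq:2-16-6}, with an illustrative $f$ kept strictly positive so the logarithm is defined.

```python
import numpy as np

def num_grad(h, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar function h.
    return np.array([(h(x + eps * e) - h(x - eps * e)) / (2 * eps)
                     for e in np.eye(x.size)])

# Illustrative choice: f(x) = x.x + 1, strictly positive, with gradient 2x.
f = lambda v: v @ v + 1.0
x = np.array([0.5, -1.0, 2.0])

analytic = (2 * x) / f(x)                 # grad f / f
numeric = num_grad(lambda v: np.log(f(v)), x)
print(np.allclose(analytic, numeric, rtol=1e-4))  # True
```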