Proofs Chapter 1: Scalar Derivatives of a Single Variable

1. Scalar Derivatives of a Single Variable

This chapter rigorously proves the differentiation formulas for scalar functions of a single variable, from the definition of the derivative through the basic formulas that form the foundation of matrix calculus. Many formulas in matrix calculus are derived by applying single-variable results component-wise, so a solid understanding of single-variable differentiation is an essential prerequisite.

Notation Convention for This Series
This proof series adopts the denominator layout for matrix differentiation in Chapter 2 and beyond. In the denominator layout, differentiating a scalar by a vector yields a column vector, and differentiating a vector by a scalar yields a row vector. See Layout Conventions for details.

Roadmap for This Chapter

This chapter develops the theory of differentiation in the following structure. The theorems and formulas are arranged so that each result is proved before it is used.

  • 1.1 Definition and Basic Concepts of Differentiation (1.1--1.3): Definition of the derivative, derivative function, differentiability and continuity
  • 1.2 Fundamental Theorems and Identities (1.4--1.10): Pascal's identity, binomial theorem, trigonometric identities and fundamental limits
  • 1.3 Fundamental Theorems of Linear Algebra (1.11--1.15): Properties of trace and determinant
  • 1.4 Derivatives of Basic Functions (1.16--1.23): Constant, power, exponential, and logarithmic functions
  • 1.5 Rules of Differentiation (1.24--1.29): Linearity, product, quotient, chain, and inverse function rules
  • 1.6 Derivatives of Trigonometric Functions (1.30--1.33): sin, cos, tan, etc.
  • 1.7 Derivatives of Inverse Trigonometric Functions (1.34--1.36): arcsin, arccos, arctan
  • 1.8 Derivatives of Hyperbolic Functions (1.37--1.39): sinh, cosh, tanh
  • 1.9 Other Important Differentiation Formulas (1.40--1.43): Absolute value, sigmoid, Softplus, Leibniz formula

1.1 Definition and Basic Concepts of Differentiation

The derivative of a function $f(x)$ at a point $x = a$ represents the instantaneous rate of change at that point. This is a fundamental concept in various fields, including velocity in physics (rate of change of position with respect to time) and marginal utility in economics (rate of change of utility with respect to consumption).

1.1 Definition of the Derivative at a Point

Definition: $\displaystyle f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h}$
Condition: The limit exists
Explanation

We explain this definition geometrically.

Consider the slope of the line (secant line) connecting the points $(a, f(a))$ and $(a+h, f(a+h))$.

\begin{equation}\text{Slope of secant} = \frac{f(a+h) - f(a)}{(a+h) - a} = \frac{f(a+h) - f(a)}{h} \label{eq:1-1-1}\end{equation}

As $h \to 0$, the point $(a+h, f(a+h))$ approaches the point $(a, f(a))$ along the curve. The secant line then approaches the tangent line.

\begin{equation}f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h} \label{eq:1-1-2}\end{equation}

Terminology: When the limit in $\eqref{eq:1-1-2}$ exists, $f$ is said to be differentiable at $a$, and the limit value $f'(a)$ is called the derivative at $a$.
Note: The derivative represents the slope of the tangent line. If $f'(a) > 0$, then $f$ is increasing at $a$; if $f'(a) < 0$, then $f$ is decreasing at $a$.
[Figure: the secant line through $(a, f(a))$ and $(a+h, f(a+h))$ on $y = f(x)$, approaching the tangent line at $x = a$ as $h \to 0$.]

Besides $f'(a)$, the derivative is also written as $\displaystyle \frac{df}{dx}\bigg|_{x=a}$.
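As a numerical sanity check (not part of the proof), the difference quotient can be evaluated for shrinking $h$; a minimal sketch with the illustrative choice $f(x) = x^2$ at $a = 3$, where $f'(3) = 6$:

```python
# Approximate f'(a) by the difference quotient (f(a+h) - f(a)) / h.
# The function f(x) = x^2 and the point a = 3 are illustrative choices;
# the exact derivative there is f'(3) = 6.

def difference_quotient(f, a, h):
    return (f(a + h) - f(a)) / h

f = lambda x: x ** 2
for h in (1e-1, 1e-3, 1e-6):
    print(h, difference_quotient(f, 3.0, h))  # approaches 6 as h shrinks
```

Note that pushing $h$ all the way to machine precision eventually degrades the approximation through floating-point cancellation; the limit is a mathematical notion, not a numerical recipe.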

1.2 Definition of the Derivative Function

Definition: $\displaystyle f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$
Condition: The limit exists at each point
Explanation

The derivative $f'(a)$ was the value at a specific point $a$. By replacing $a$ with a variable $x$, we define the derivative as a function of $x$.

\begin{equation}f'(x) = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \label{eq:1-2-1}\end{equation}

There are several notational conventions for the derivative function.

Leibniz notation: $\displaystyle \frac{df}{dx}$, $\displaystyle \frac{d}{dx}f(x)$

Lagrange notation: $f'(x)$

Newton notation: $\dot{f}$ (commonly used for time derivatives)

Higher-order derivatives are defined as follows.

\begin{equation}f''(x) = \frac{d^2f}{dx^2} = \frac{d}{dx}\left(\frac{df}{dx}\right) \label{eq:1-2-2}\end{equation}

\begin{equation}f^{(n)}(x) = \frac{d^n f}{dx^n} = \frac{d}{dx}\left(\frac{d^{n-1}f}{dx^{n-1}}\right) \label{eq:1-2-3}\end{equation}

Terminology: When the limit in $\eqref{eq:1-2-1}$ exists at each point, $f'(x)$ is called the derivative function of $f$.
Note: In matrix calculus, a scalar function $f$ is partially differentiated with respect to each component $X_{ij}$ of a matrix $\boldsymbol{X}$. This corresponds to applying single-variable differentiation to each component.

1.3 Differentiability and Continuity

Theorem: $f$ is differentiable at $a$ $\Rightarrow$ $f$ is continuous at $a$
Note: The converse does not hold in general (e.g., $f(x) = |x|$ is continuous but not differentiable at $x = 0$)
Proof

Premise: This proof uses basic properties of limits (limit laws for sums, products, and scalar multiples) as known results.

Assume that $f$ is differentiable at $a$. By the definition of differentiability, the limit

\begin{equation}f'(a) = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h} \label{eq:1-3-1}\end{equation}

exists.

To show continuity, we need to prove that $\lim_{h \to 0} f(a+h) = f(a)$.

For $h \neq 0$, we rewrite $f(a+h) - f(a)$ as follows.

\begin{equation}f(a+h) - f(a) = \frac{f(a+h) - f(a)}{h} \cdot h \label{eq:1-3-2}\end{equation}

Taking the limit as $h \to 0$ on both sides of $\eqref{eq:1-3-2}$, by the product rule for limits $\lim (AB) = (\lim A)(\lim B)$ (when both limits exist),

\begin{equation}\lim_{h \to 0} [f(a+h) - f(a)] = \lim_{h \to 0} \frac{f(a+h) - f(a)}{h} \cdot \lim_{h \to 0} h \label{eq:1-3-3}\end{equation}

By $\eqref{eq:1-3-1}$, the first factor converges to $f'(a)$ (a finite value), and the second factor converges to 0.

\begin{equation}\lim_{h \to 0} [f(a+h) - f(a)] = f'(a) \cdot 0 = 0 \label{eq:1-3-4}\end{equation}

From $\eqref{eq:1-3-4}$,

\begin{equation}\lim_{h \to 0} f(a+h) = f(a) \label{eq:1-3-5}\end{equation}

$\eqref{eq:1-3-5}$ means that $f$ is continuous at $a$.

Note: As a counterexample for the converse, consider $f(x) = |x|$. It is continuous at $x = 0$, but $\displaystyle \lim_{h \to 0^+} \frac{|h|}{h} = 1$ and $\displaystyle \lim_{h \to 0^-} \frac{|h|}{h} = -1$ differ, so it is not differentiable.
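The counterexample can also be observed numerically; a small sketch (the step size $10^{-9}$ is arbitrary):

```python
# One-sided difference quotients of f(x) = |x| at 0. They converge to
# different values (+1 from the right, -1 from the left), so |x| is not
# differentiable at 0 even though it is continuous there.

def quotient(h):
    return (abs(0 + h) - abs(0)) / h

print(quotient(1e-9))   # right-hand quotient: 1.0
print(quotient(-1e-9))  # left-hand quotient: -1.0
```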

1.2 Fundamental Theorems and Identities

This section proves fundamental theorems and identities needed for deriving differentiation formulas. These serve as the foundation referenced in subsequent proofs.

1.4 Pascal's Identity

Formula: $\displaystyle \binom{n}{k-1} + \binom{n}{k} = \binom{n+1}{k}$
Condition: $n \geq 0$, $1 \leq k \leq n$
Proof

We compute directly from the definition of binomial coefficients.

By the definition of binomial coefficients,

\begin{equation}\binom{n}{k-1} = \frac{n!}{(k-1)!(n-k+1)!}, \quad \binom{n}{k} = \frac{n!}{k!(n-k)!} \label{eq:1-4-1}\end{equation}

We compute the left-hand side. To find a common denominator, we use $k!(n-k+1)!$.

\begin{equation}\binom{n}{k-1} + \binom{n}{k} = \frac{n! \cdot k}{k!(n-k+1)!} + \frac{n! \cdot (n-k+1)}{k!(n-k+1)!} \label{eq:1-4-2}\end{equation}

Simplifying the numerator,

\begin{equation}\binom{n}{k-1} + \binom{n}{k} = \frac{n! (k + n - k + 1)}{k!(n-k+1)!} = \frac{n! (n + 1)}{k!(n-k+1)!} \label{eq:1-4-3}\end{equation}

Since $n! (n + 1) = (n + 1)!$,

\begin{equation}\binom{n}{k-1} + \binom{n}{k} = \frac{(n+1)!}{k!(n+1-k)!} = \binom{n+1}{k} \label{eq:1-4-4}\end{equation}

Note: Used in the proof of the binomial theorem 1.5 by induction, and in Leibniz's formula 1.43. It expresses the fact that each entry in Pascal's triangle is the sum of the two entries above it.
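Pascal's identity can be verified exhaustively for small $n$ with the standard library's `math.comb`; a quick check:

```python
# Check C(n, k-1) + C(n, k) == C(n+1, k) for all 1 <= k <= n, n < 20.
from math import comb

for n in range(20):
    for k in range(1, n + 1):
        assert comb(n, k - 1) + comb(n, k) == comb(n + 1, k)
print("Pascal's identity verified for n < 20")
```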

1.5 Binomial Theorem

Formula: $\displaystyle (x + y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k$
Condition: $n$ is a non-negative integer, $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ is the binomial coefficient
Proof

We prove this by mathematical induction.

Base case: For $n = 0$, the left-hand side is $(x + y)^0 = 1$, and the right-hand side is $\sum_{k=0}^{0} \binom{0}{0} x^0 y^0 = 1$, which agree.

Inductive step: Assume the formula holds for $n = m$.

\begin{equation}(x + y)^m = \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^k \label{eq:1-5-1}\end{equation}

Consider the case $n = m + 1$.

\begin{equation}(x + y)^{m+1} = (x + y)(x + y)^m = (x + y) \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^k \label{eq:1-5-2}\end{equation}

Expanding $\eqref{eq:1-5-2}$,

\begin{equation}(x + y)^{m+1} = \sum_{k=0}^{m} \binom{m}{k} x^{m+1-k} y^k + \sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^{k+1} \label{eq:1-5-3}\end{equation}

Substituting $j = k + 1$ in the second sum (so $k = j - 1$),

\begin{equation}\sum_{k=0}^{m} \binom{m}{k} x^{m-k} y^{k+1} = \sum_{j=1}^{m+1} \binom{m}{j-1} x^{m+1-j} y^j \label{eq:1-5-4}\end{equation}

Combining $\eqref{eq:1-5-3}$ and $\eqref{eq:1-5-4}$,

\begin{equation}(x + y)^{m+1} = \binom{m}{0} x^{m+1} + \sum_{k=1}^{m} \left[ \binom{m}{k} + \binom{m}{k-1} \right] x^{m+1-k} y^k + \binom{m}{m} y^{m+1} \label{eq:1-5-5}\end{equation}

Using Pascal's identity (1.4) $\binom{m}{k} + \binom{m}{k-1} = \binom{m+1}{k}$, along with $\binom{m}{0} = \binom{m+1}{0} = 1$ and $\binom{m}{m} = \binom{m+1}{m+1} = 1$,

\begin{equation}(x + y)^{m+1} = \sum_{k=0}^{m+1} \binom{m+1}{k} x^{m+1-k} y^k \label{eq:1-5-6}\end{equation}

By mathematical induction, the binomial theorem holds for all non-negative integers $n$.

Note: Used in the proof of the power rule 1.18 to expand $(x + h)^n$.
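The theorem can be spot-checked with exact integer arithmetic; a minimal sketch with illustrative values $x = 3$, $y = 5$:

```python
# Check (x + y)^n == sum_{k=0}^{n} C(n, k) x^(n-k) y^k for small n.
from math import comb

def binomial_sum(x, y, n):
    return sum(comb(n, k) * x ** (n - k) * y ** k for k in range(n + 1))

for n in range(10):
    assert binomial_sum(3, 5, n) == (3 + 5) ** n
print("binomial theorem verified for n < 10")
```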

1.6 Pythagorean Identity

Formula: $\sin^2 x + \cos^2 x = 1$
Condition: $x \in \mathbb{R}$
Proof

We prove this from the definition of the unit circle.

The unit circle is a circle of radius 1 centered at the origin, with the equation

\begin{equation}x^2 + y^2 = 1 \label{eq:1-6-1}\end{equation}

The coordinates of the point on the unit circle corresponding to angle $\theta$ are defined as $(\cos\theta, \sin\theta)$.

\begin{equation}(x, y) = (\cos\theta, \sin\theta) \label{eq:1-6-2}\end{equation}

Substituting $\eqref{eq:1-6-2}$ into $\eqref{eq:1-6-1}$,

\begin{equation}\cos^2\theta + \sin^2\theta = 1 \label{eq:1-6-3}\end{equation}

Renaming the variable, for all $x \in \mathbb{R}$,

\begin{equation}\sin^2 x + \cos^2 x = 1 \label{eq:1-6-4}\end{equation}

Note: Used in the derivative of the tangent function 1.32 in the form $\cos^2 x = 1 - \sin^2 x$. Also used in deriving $1 + \tan^2 x = \sec^2 x$.

1.7 Addition Formulas for Trigonometric Functions

Formulas: $\sin(x + y) = \sin x \cos y + \cos x \sin y$, $\cos(x + y) = \cos x \cos y - \sin x \sin y$
Condition: $x, y \in \mathbb{R}$
Proof

We prove this using the unit circle and rotation matrices.

The point on the unit circle corresponding to angle $\theta$ is $(\cos\theta, \sin\theta)$. This equals the point $(1, 0)$ rotated about the origin by angle $\theta$.

The rotation matrix for angle $\theta$ is

\begin{equation}R(\theta) = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \label{eq:1-7-1}\end{equation}

By the composition of rotations, a rotation by angle $x$ followed by a rotation by angle $y$ equals a rotation by angle $x + y$.

\begin{equation}R(x + y) = R(y) R(x) \label{eq:1-7-2}\end{equation}

Computing the right-hand side of $\eqref{eq:1-7-2}$,

\begin{equation}R(y) R(x) = \begin{pmatrix} \cos y & -\sin y \\ \sin y & \cos y \end{pmatrix} \begin{pmatrix} \cos x & -\sin x \\ \sin x & \cos x \end{pmatrix} \label{eq:1-7-3}\end{equation}

Expanding the matrix product,

\begin{equation}R(y) R(x) = \begin{pmatrix} \cos y \cos x - \sin y \sin x & -\cos y \sin x - \sin y \cos x \\ \sin y \cos x + \cos y \sin x & -\sin y \sin x + \cos y \cos x \end{pmatrix} \label{eq:1-7-4}\end{equation}

The left-hand side of $\eqref{eq:1-7-2}$ is

\begin{equation}R(x + y) = \begin{pmatrix} \cos(x + y) & -\sin(x + y) \\ \sin(x + y) & \cos(x + y) \end{pmatrix} \label{eq:1-7-5}\end{equation}

Comparing the components of $\eqref{eq:1-7-4}$ and $\eqref{eq:1-7-5}$,

\begin{equation}\cos(x + y) = \cos x \cos y - \sin x \sin y \label{eq:1-7-6}\end{equation}

\begin{equation}\sin(x + y) = \sin x \cos y + \cos x \sin y \label{eq:1-7-7}\end{equation}

Note: Used in the derivatives of sine and cosine 1.30, 1.31 to expand $\sin(x + h)$ and $\cos(x + h)$.
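The identity $R(x + y) = R(y)R(x)$, and with it both addition formulas, can be checked numerically; a sketch with arbitrary sample angles:

```python
# Verify R(x + y) == R(y) R(x) entrywise, which encodes both addition
# formulas; the angles x, y are arbitrary sample values.
from math import sin, cos

def R(t):
    return [[cos(t), -sin(t)], [sin(t), cos(t)]]

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

x, y = 0.7, 1.1
lhs, rhs = R(x + y), matmul(R(y), R(x))
assert all(abs(lhs[i][j] - rhs[i][j]) < 1e-12 for i in range(2) for j in range(2))
assert abs(sin(x + y) - (sin(x) * cos(y) + cos(x) * sin(y))) < 1e-12
assert abs(cos(x + y) - (cos(x) * cos(y) - sin(x) * sin(y))) < 1e-12
```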

1.8 Fundamental Limit of the Sine Function

Formula: $\displaystyle \lim_{x \to 0} \frac{\sin x}{x} = 1$
Condition: $x$ is in radians
Proof

We give a geometric proof using the unit circle. Consider the case $0 < x < \frac{\pi}{2}$.

On the unit circle, consider the arc, chord, and tangent corresponding to central angle $x$ (in radians). Let $O$ be the origin, $A = (1, 0)$ be a point on the unit circle, $B = (\cos x, \sin x)$ be the point corresponding to angle $x$, and $C = (1, \tan x)$ be the intersection of the tangent to the unit circle at $A$ with the extension of line $OB$.

We compare the areas of these figures.

\begin{equation}\text{Area of } \triangle OAB < \text{Area of sector } OAB < \text{Area of } \triangle OAC \label{eq:1-8-1}\end{equation}

Computing each area,

\begin{equation}\triangle OAB = \frac{1}{2} \cdot 1 \cdot \sin x = \frac{\sin x}{2} \label{eq:1-8-2}\end{equation}

\begin{equation}\text{Sector } OAB = \frac{1}{2} \cdot 1^2 \cdot x = \frac{x}{2} \label{eq:1-8-3}\end{equation}

\begin{equation}\triangle OAC = \frac{1}{2} \cdot 1 \cdot \tan x = \frac{\tan x}{2} \label{eq:1-8-4}\end{equation}

Substituting $\eqref{eq:1-8-2}$, $\eqref{eq:1-8-3}$, $\eqref{eq:1-8-4}$ into $\eqref{eq:1-8-1}$,

\begin{equation}\frac{\sin x}{2} < \frac{x}{2} < \frac{\tan x}{2} = \frac{\sin x}{2 \cos x} \label{eq:1-8-5}\end{equation}

Dividing throughout by $\sin x > 0$ (since $0 < x < \frac{\pi}{2}$) and taking reciprocals,

\begin{equation}1 > \frac{\sin x}{x} > \cos x \label{eq:1-8-6}\end{equation}

That is,

\begin{equation}\cos x < \frac{\sin x}{x} < 1 \label{eq:1-8-7}\end{equation}

As $x \to 0^+$, $\cos x \to 1$, so by the squeeze theorem,

\begin{equation}\lim_{x \to 0^+} \frac{\sin x}{x} = 1 \label{eq:1-8-8}\end{equation}

Since $\frac{\sin x}{x}$ is an even function (because $\sin(-x) = -\sin x$ implies $\frac{\sin(-x)}{-x} = \frac{\sin x}{x}$), the limit is the same as $x \to 0^-$.

\begin{equation}\lim_{x \to 0} \frac{\sin x}{x} = 1 \label{eq:1-8-9}\end{equation}

Note: This limit is essential in the proof of the derivative of the sine function 1.30.

1.9 Fundamental Limit of the Cosine Function

Formula: $\displaystyle \lim_{x \to 0} \frac{1 - \cos x}{x} = 0$
Condition: $x$ is in radians
Proof

We prove this using the half-angle formula and 1.8.

By the half-angle formula,

\begin{equation}1 - \cos x = 2\sin^2\frac{x}{2} \label{eq:1-9-1}\end{equation}

Using $\eqref{eq:1-9-1}$,

\begin{equation}\frac{1 - \cos x}{x} = \frac{2\sin^2\frac{x}{2}}{x} = \frac{2\sin^2\frac{x}{2}}{2 \cdot \frac{x}{2}} = \sin\frac{x}{2} \cdot \frac{\sin\frac{x}{2}}{\frac{x}{2}} \label{eq:1-9-2}\end{equation}

As $x \to 0$, we have $\frac{x}{2} \to 0$, so by 1.8, $\frac{\sin\frac{x}{2}}{\frac{x}{2}} \to 1$. Also $\sin\frac{x}{2} \to 0$, hence

\begin{equation}\lim_{x \to 0} \frac{1 - \cos x}{x} = 0 \cdot 1 = 0 \label{eq:1-9-3}\end{equation}

Note: Used in the proof of the derivative of the sine function in the form $\frac{\cos h - 1}{h} \to 0$.
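Both fundamental limits (1.8 and 1.9) can be observed numerically; a minimal sketch:

```python
# sin(x)/x approaches 1 and (1 - cos(x))/x approaches 0 as x -> 0.
from math import sin, cos

for x in (1e-1, 1e-3, 1e-6):
    print(x, sin(x) / x, (1 - cos(x)) / x)

assert abs(sin(1e-6) / 1e-6 - 1.0) < 1e-9
assert abs((1 - cos(1e-6)) / 1e-6) < 1e-6
```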

1.10 Hyperbolic Identity

Formula: $\cosh^2 x - \sinh^2 x = 1$
Condition: $\displaystyle \sinh x = \frac{e^x - e^{-x}}{2}$, $\displaystyle \cosh x = \frac{e^x + e^{-x}}{2}$
Proof

We compute directly from the definitions of the hyperbolic functions.

Computing $\cosh^2 x$,

\begin{equation}\cosh^2 x = \left( \frac{e^x + e^{-x}}{2} \right)^2 = \frac{e^{2x} + 2 + e^{-2x}}{4} \label{eq:1-10-1}\end{equation}

Computing $\sinh^2 x$,

\begin{equation}\sinh^2 x = \left( \frac{e^x - e^{-x}}{2} \right)^2 = \frac{e^{2x} - 2 + e^{-2x}}{4} \label{eq:1-10-2}\end{equation}

Subtracting $\eqref{eq:1-10-2}$ from $\eqref{eq:1-10-1}$,

\begin{equation}\cosh^2 x - \sinh^2 x = \frac{e^{2x} + 2 + e^{-2x}}{4} - \frac{e^{2x} - 2 + e^{-2x}}{4} = \frac{4}{4} = 1 \label{eq:1-10-3}\end{equation}

Note: Used in the derivative of hyperbolic tangent 1.39. This is the hyperbolic analogue of the Pythagorean identity.
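The identity can be checked directly from the exponential definitions; a sketch at a few arbitrary sample points:

```python
# Check cosh^2 x - sinh^2 x == 1 using the exponential definitions.
from math import exp

def sinh(x): return (exp(x) - exp(-x)) / 2
def cosh(x): return (exp(x) + exp(-x)) / 2

for x in (-2.0, 0.0, 0.5, 3.0):
    assert abs(cosh(x) ** 2 - sinh(x) ** 2 - 1.0) < 1e-9
print("hyperbolic identity verified")
```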

1.3 Fundamental Theorems of Linear Algebra

In matrix calculus, properties of the trace and determinant are frequently used. This section proves these basic properties.

1.11 Linearity of the Trace

Formula: $\text{tr}(\alpha \boldsymbol{A} + \beta \boldsymbol{B}) = \alpha \text{tr}(\boldsymbol{A}) + \beta \text{tr}(\boldsymbol{B})$
Condition: $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times n}$, $\alpha, \beta \in \mathbb{R}$
Proof

We prove this directly from the definition of the trace.

The trace is defined as the sum of the diagonal elements.

\begin{equation}\text{tr}(\boldsymbol{A}) = \sum_{i=0}^{n-1} A_{ii} \label{eq:1-11-1}\end{equation}

The $(i, i)$ entry of $\alpha \boldsymbol{A} + \beta \boldsymbol{B}$ is $\alpha A_{ii} + \beta B_{ii}$.

\begin{equation}\text{tr}(\alpha \boldsymbol{A} + \beta \boldsymbol{B}) = \sum_{i=0}^{n-1} (\alpha A_{ii} + \beta B_{ii}) \label{eq:1-11-2}\end{equation}

By the linearity of summation,

\begin{equation}\text{tr}(\alpha \boldsymbol{A} + \beta \boldsymbol{B}) = \alpha \sum_{i=0}^{n-1} A_{ii} + \beta \sum_{i=0}^{n-1} B_{ii} = \alpha \text{tr}(\boldsymbol{A}) + \beta \text{tr}(\boldsymbol{B}) \label{eq:1-11-3}\end{equation}

Note: Frequently used in matrix calculus computations involving the trace.
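Linearity of the trace is easy to confirm on small concrete matrices; a sketch with illustrative entries and scalars:

```python
# Check tr(aA + bB) == a tr(A) + b tr(B) on small concrete matrices;
# the entries and scalars are illustrative.

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
a, b = 2, -3

combo = [[a * A[i][j] + b * B[i][j] for j in range(2)] for i in range(2)]
assert trace(combo) == a * trace(A) + b * trace(B)
print(trace(combo))  # -29
```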

1.12 Cyclic Property of the Trace

Formula: $\text{tr}(\boldsymbol{ABC}) = \text{tr}(\boldsymbol{BCA}) = \text{tr}(\boldsymbol{CAB})$
Condition: Matrix sizes such that $\boldsymbol{ABC}$ is a square matrix
Proof

We first prove the two-matrix case $\text{tr}(\boldsymbol{AB}) = \text{tr}(\boldsymbol{BA})$, then extend to three matrices.

Let $\boldsymbol{A} \in \mathbb{R}^{m \times n}$ and $\boldsymbol{B} \in \mathbb{R}^{n \times m}$. By the definition of matrix multiplication,

\begin{equation}(\boldsymbol{AB})_{ij} = \sum_{k=0}^{n-1} A_{ik} B_{kj} \label{eq:1-12-1}\end{equation}

Computing the trace of $\boldsymbol{AB} \in \mathbb{R}^{m \times m}$,

\begin{equation}\text{tr}(\boldsymbol{AB}) = \sum_{i=0}^{m-1} (\boldsymbol{AB})_{ii} = \sum_{i=0}^{m-1} \sum_{k=0}^{n-1} A_{ik} B_{ki} \label{eq:1-12-2}\end{equation}

Similarly, computing the trace of $\boldsymbol{BA} \in \mathbb{R}^{n \times n}$,

\begin{equation}\text{tr}(\boldsymbol{BA}) = \sum_{k=0}^{n-1} (\boldsymbol{BA})_{kk} = \sum_{k=0}^{n-1} \sum_{i=0}^{m-1} B_{ki} A_{ik} \label{eq:1-12-3}\end{equation}

Comparing $\eqref{eq:1-12-2}$ and $\eqref{eq:1-12-3}$, by interchanging the order of summation,

\begin{equation}\text{tr}(\boldsymbol{AB}) = \sum_{i=0}^{m-1} \sum_{k=0}^{n-1} A_{ik} B_{ki} = \sum_{k=0}^{n-1} \sum_{i=0}^{m-1} A_{ik} B_{ki} = \sum_{k=0}^{n-1} \sum_{i=0}^{m-1} B_{ki} A_{ik} = \text{tr}(\boldsymbol{BA}) \label{eq:1-12-4}\end{equation}

We extend to three matrices. Setting $\boldsymbol{D} = \boldsymbol{AB}$,

\begin{equation}\text{tr}(\boldsymbol{ABC}) = \text{tr}(\boldsymbol{DC}) = \text{tr}(\boldsymbol{CD}) = \text{tr}(\boldsymbol{CAB}) \label{eq:1-12-5}\end{equation}

Similarly, setting $\boldsymbol{E} = \boldsymbol{BC}$,

\begin{equation}\text{tr}(\boldsymbol{ABC}) = \text{tr}(\boldsymbol{AE}) = \text{tr}(\boldsymbol{EA}) = \text{tr}(\boldsymbol{BCA}) \label{eq:1-12-6}\end{equation}

Note: This is the most frequently used property in matrix calculus computations involving the trace. Note that $\text{tr}(\boldsymbol{ABC}) \neq \text{tr}(\boldsymbol{ACB})$ in general (only cyclic permutations are valid).
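Both the cyclic equality and the failure of non-cyclic permutations can be demonstrated concretely; a sketch with illustrative matrices:

```python
# Check tr(ABC) == tr(BCA) == tr(CAB), and that the non-cyclic
# permutation tr(ACB) differs for these (illustrative) matrices.

def matmul(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(len(Q)))
             for j in range(len(Q[0]))] for i in range(len(P))]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

A = [[1, 2], [3, 4]]
B = [[0, 1], [0, 0]]
C = [[0, 0], [1, 0]]

t = trace(matmul(matmul(A, B), C))
assert t == trace(matmul(matmul(B, C), A))   # cyclic: equal
assert t == trace(matmul(matmul(C, A), B))   # cyclic: equal
assert t != trace(matmul(matmul(A, C), B))   # non-cyclic: differs here
```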

1.13 Trace and Transpose

Formula: $\text{tr}(\boldsymbol{A}^\top) = \text{tr}(\boldsymbol{A})$
Condition: $\boldsymbol{A} \in \mathbb{R}^{n \times n}$
Proof

The diagonal elements of the transpose are the same as those of the original matrix.

By the definition of the transpose, $(\boldsymbol{A}^\top)_{ij} = A_{ji}$. In particular, for diagonal elements,

\begin{equation}(\boldsymbol{A}^\top)_{ii} = A_{ii} \label{eq:1-13-1}\end{equation}

By the definition of the trace,

\begin{equation}\text{tr}(\boldsymbol{A}^\top) = \sum_{i=0}^{n-1} (\boldsymbol{A}^\top)_{ii} = \sum_{i=0}^{n-1} A_{ii} = \text{tr}(\boldsymbol{A}) \label{eq:1-13-2}\end{equation}

1.14 Determinant of a Product

Formula: $\det(\boldsymbol{AB}) = \det(\boldsymbol{A}) \det(\boldsymbol{B})$
Condition: $\boldsymbol{A}, \boldsymbol{B} \in \mathbb{R}^{n \times n}$
Proof

We prove this using the determinant of a block matrix.

Consider the following block matrix.

\begin{equation}\boldsymbol{M} = \begin{pmatrix} \boldsymbol{A} & \boldsymbol{O} \\ -\boldsymbol{I} & \boldsymbol{B} \end{pmatrix} \label{eq:1-14-1}\end{equation}

Right-multiply $\boldsymbol{M}$ by a unit upper block triangular matrix; this adds the first block column, multiplied on the right by $\boldsymbol{B}$, to the second block column. This elementary column operation does not change the determinant.

\begin{equation}\begin{pmatrix} \boldsymbol{A} & \boldsymbol{O} \\ -\boldsymbol{I} & \boldsymbol{B} \end{pmatrix} \begin{pmatrix} \boldsymbol{I} & \boldsymbol{B} \\ \boldsymbol{O} & \boldsymbol{I} \end{pmatrix} = \begin{pmatrix} \boldsymbol{A} & \boldsymbol{AB} \\ -\boldsymbol{I} & \boldsymbol{O} \end{pmatrix} \label{eq:1-14-2}\end{equation}

Further, left-multiply the second block row by $\boldsymbol{A}$ and add to the first block row.

\begin{equation}\begin{pmatrix} \boldsymbol{I} & \boldsymbol{A} \\ \boldsymbol{O} & \boldsymbol{I} \end{pmatrix} \begin{pmatrix} \boldsymbol{A} & \boldsymbol{AB} \\ -\boldsymbol{I} & \boldsymbol{O} \end{pmatrix} = \begin{pmatrix} \boldsymbol{O} & \boldsymbol{AB} \\ -\boldsymbol{I} & \boldsymbol{O} \end{pmatrix} \label{eq:1-14-3}\end{equation}

The determinant of a block triangular matrix equals the product of the determinants of the diagonal blocks.

\begin{equation}\det(\boldsymbol{M}) = \det\begin{pmatrix} \boldsymbol{A} & \boldsymbol{O} \\ -\boldsymbol{I} & \boldsymbol{B} \end{pmatrix} = \det(\boldsymbol{A}) \det(\boldsymbol{B}) \label{eq:1-14-4}\end{equation}

On the other hand, we compute the determinant of the matrix in $\eqref{eq:1-14-3}$. Swapping the two block rows requires $n$ row interchanges, each contributing a factor of $-1$:

\begin{equation}\det\begin{pmatrix} \boldsymbol{O} & \boldsymbol{AB} \\ -\boldsymbol{I} & \boldsymbol{O} \end{pmatrix} = (-1)^n \det\begin{pmatrix} -\boldsymbol{I} & \boldsymbol{O} \\ \boldsymbol{O} & \boldsymbol{AB} \end{pmatrix} = (-1)^n \cdot (-1)^n \det(\boldsymbol{AB}) = \det(\boldsymbol{AB}) \label{eq:1-14-5}\end{equation}

Since the block matrices we multiplied by are unit triangular (each with determinant 1), these operations leave the determinant unchanged. From $\eqref{eq:1-14-4}$ and $\eqref{eq:1-14-5}$,

\begin{equation}\det(\boldsymbol{A}) \det(\boldsymbol{B}) = \det(\boldsymbol{AB}) \label{eq:1-14-6}\end{equation}

Note: Frequently used in deriving differentiation formulas for determinants.
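The product rule for determinants can be confirmed with exact integer arithmetic on $2 \times 2$ matrices; a sketch with illustrative entries:

```python
# Check det(AB) == det(A) det(B) for 2x2 integer matrices.

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def matmul2(P, Q):
    return [[sum(P[i][k] * Q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

A = [[2, 1], [5, 3]]
B = [[1, 4], [2, 9]]
assert det2(matmul2(A, B)) == det2(A) * det2(B)
print(det2(A), det2(B), det2(matmul2(A, B)))  # 1 1 1
```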

1.15 Determinant of the Transpose

Formula: $\det(\boldsymbol{A}^\top) = \det(\boldsymbol{A})$
Condition: $\boldsymbol{A} \in \mathbb{R}^{n \times n}$
Proof

We prove this using the Leibniz formula for determinants (A.5).

The Leibniz formula for the determinant is

\begin{equation}\det(\boldsymbol{A}) = \sum_{\sigma \in S_n} \text{sgn}(\sigma) \prod_{i=0}^{n-1} A_{i, \sigma(i)} \label{eq:1-15-1}\end{equation}

where $S_n$ is the set of all permutations of $\{0, 1, \ldots, n-1\}$ and $\text{sgn}(\sigma)$ is the sign of the permutation $\sigma$.

Computing the determinant of the transpose $\boldsymbol{A}^\top$, since $(\boldsymbol{A}^\top)_{ij} = A_{ji}$,

\begin{equation}\det(\boldsymbol{A}^\top) = \sum_{\sigma \in S_n} \text{sgn}(\sigma) \prod_{i=0}^{n-1} (\boldsymbol{A}^\top)_{i, \sigma(i)} = \sum_{\sigma \in S_n} \text{sgn}(\sigma) \prod_{i=0}^{n-1} A_{\sigma(i), i} \label{eq:1-15-2}\end{equation}

We introduce the substitution $j = \sigma(i)$. Since $\sigma$ is a bijection, as $i$ ranges over $0$ to $n-1$, $j = \sigma(i)$ also takes each value from $0$ to $n-1$ exactly once, and $i = \sigma^{-1}(j)$ in terms of the inverse permutation $\sigma^{-1}$.

\begin{equation}\prod_{i=0}^{n-1} A_{\sigma(i), i} = \prod_{j=0}^{n-1} A_{j, \sigma^{-1}(j)} \label{eq:1-15-3}\end{equation}

As $\sigma$ ranges over $S_n$, so does $\sigma^{-1}$. Also, $\text{sgn}(\sigma^{-1}) = \text{sgn}(\sigma)$. Substituting $\tau = \sigma^{-1}$,

\begin{equation}\det(\boldsymbol{A}^\top) = \sum_{\tau \in S_n} \text{sgn}(\tau) \prod_{j=0}^{n-1} A_{j, \tau(j)} = \det(\boldsymbol{A}) \label{eq:1-15-4}\end{equation}

Note: A fundamental result showing that properties of the determinant with respect to rows and columns are symmetric.
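The Leibniz formula itself can be implemented directly and used to confirm $\det(\boldsymbol{A}^\top) = \det(\boldsymbol{A})$ on a small example; a sketch (the $3 \times 3$ matrix is illustrative, and the sign is computed by counting inversions):

```python
# Determinant via the Leibniz formula, then a check that
# det(A^T) == det(A). sgn(sigma) is computed by counting inversions.
from itertools import permutations

def sgn(perm):
    inv = sum(1 for i in range(len(perm))
                for j in range(i + 1, len(perm)) if perm[i] > perm[j])
    return -1 if inv % 2 else 1

def det(M):
    n = len(M)
    total = 0
    for sigma in permutations(range(n)):
        prod = 1
        for i in range(n):
            prod *= M[i][sigma[i]]
        total += sgn(sigma) * prod
    return total

A = [[2, 1, 0], [4, 3, 1], [0, 5, 2]]
At = [[A[j][i] for j in range(3)] for i in range(3)]
assert det(A) == det(At)
print(det(A))  # -6
```

This brute-force evaluation costs $O(n \cdot n!)$ and is only practical for tiny $n$; it serves purely to illustrate the formula.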

1.4 Derivatives of Basic Functions

Below, we derive the derivatives of basic functions directly from the definition. These results, combined with the rules of differentiation for composite functions, form the foundation for computing derivatives of more complex functions.

1.16 Derivative of a Constant Function

Formula: $\displaystyle\frac{d}{dx} c = 0$
Condition: $c$ is an arbitrary constant
Proof

Let $f(x) = c$ (a constant function). We compute using the definition of the derivative.

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \label{eq:1-16-1}\end{equation}

Substituting $f(x+h) = c$ and $f(x) = c$ into $\eqref{eq:1-16-1}$,

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{c - c}{h} = \lim_{h \to 0} \frac{0}{h} = \lim_{h \to 0} 0 = 0 \label{eq:1-16-2}\end{equation}

Note: Geometrically, the graph of a constant function is a horizontal line with slope 0.

1.17 Derivative of the Identity Function

Formula: $\displaystyle\frac{d}{dx} x = 1$
Proof

Let $f(x) = x$. We compute using the definition of the derivative.

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h} \label{eq:1-17-1}\end{equation}

Substituting $f(x+h) = x + h$ and $f(x) = x$ into $\eqref{eq:1-17-1}$,

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{(x+h) - x}{h} = \lim_{h \to 0} \frac{h}{h} = \lim_{h \to 0} 1 = 1 \label{eq:1-17-2}\end{equation}

Note: The graph of $y = x$ is a straight line with slope 1.

1.18 Derivative of the Power Function (Positive Integer)

Formula: $\displaystyle\frac{d}{dx} x^n = n x^{n-1}$
Condition: $n$ is a positive integer
Proof

Let $f(x) = x^n$ ($n$ is a positive integer). We compute using the definition of the derivative.

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{(x+h)^n - x^n}{h} \label{eq:1-18-1}\end{equation}

Using the binomial theorem (1.5) to expand $(x+h)^n$,

\begin{equation}(x+h)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} h^k = x^n + nx^{n-1}h + \binom{n}{2}x^{n-2}h^2 + \cdots + h^n \label{eq:1-18-2}\end{equation}

Substituting $\eqref{eq:1-18-2}$ into $\eqref{eq:1-18-1}$,

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{x^n + nx^{n-1}h + \binom{n}{2}x^{n-2}h^2 + \cdots + h^n - x^n}{h} \label{eq:1-18-3}\end{equation}

The $x^n$ terms cancel, and dividing each term in the numerator by $h$,

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \left[ nx^{n-1} + \binom{n}{2}x^{n-2}h + \cdots + h^{n-1} \right] \label{eq:1-18-4}\end{equation}

Taking the limit as $h \to 0$, all terms from the second onward contain positive powers of $h$ and converge to 0.

\begin{equation}\frac{d}{dx} x^n = nx^{n-1} \label{eq:1-18-5}\end{equation}

Note: In matrix calculus, this formula is indirectly used in derivatives such as $\text{tr}(\boldsymbol{X}^n)$.

1.19 Derivative of the Power Function (General Real Exponent)

Formula: $\displaystyle\frac{d}{dx} x^a = a x^{a-1}$
Condition: $a$ is any real number, $x > 0$
Proof

For $x > 0$, we can write $x^a = e^{a \ln x}$. We differentiate using this representation.

Remark: In analysis, the natural logarithm $\ln$ is defined as the inverse of the exponential function, and $x^a = e^{a \ln x}$ is adopted as the definition of general real powers. This reduces the derivative of the power function to derivatives of exponential and logarithmic functions.

Let $f(x) = x^a = e^{a \ln x}$. Applying the chain rule (1.26),

\begin{equation}\frac{df}{dx} = \frac{d}{dx} e^{a \ln x} \label{eq:1-19-1}\end{equation}

Setting $u = a \ln x$, we have $f = e^u$, so

\begin{equation}\frac{df}{dx} = \frac{de^u}{du} \cdot \frac{du}{dx} \label{eq:1-19-2}\end{equation}

Using $\displaystyle \frac{d}{du} e^u = e^u$ (1.20) and $\displaystyle \frac{d}{dx}(a \ln x) = \frac{a}{x}$ (1.21),

\begin{equation}\frac{df}{dx} = e^{a \ln x} \cdot \frac{a}{x} = x^a \cdot \frac{a}{x} = a x^{a-1} \label{eq:1-19-3}\end{equation}

Note: For $a = -1$, $\displaystyle \frac{d}{dx} x^{-1} = -x^{-2} = -\frac{1}{x^2}$; for $a = \frac{1}{2}$, $\displaystyle \frac{d}{dx} \sqrt{x} = \frac{1}{2\sqrt{x}}$.

1.20 Derivative of the Exponential Function

Formula: $\displaystyle\frac{d}{dx} e^x = e^x$
Proof

Let $f(x) = e^x$. We compute using the definition of the derivative.

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{e^{x+h} - e^x}{h} \label{eq:1-20-1}\end{equation}

Using the exponential law $e^{x+h} = e^x \cdot e^h$, we factor out $e^x$.

\begin{equation}\frac{df}{dx} = \lim_{h \to 0} \frac{e^x \cdot e^h - e^x}{h} = \lim_{h \to 0} e^x \cdot \frac{e^h - 1}{h} = e^x \cdot \lim_{h \to 0} \frac{e^h - 1}{h} \label{eq:1-20-2}\end{equation}

We compute the limit $\lim_{h \to 0} \frac{e^h - 1}{h}$. Here we use the Taylor expansion of $e^h$ as a known result (the convergence of the Taylor series and the validity of term-by-term operations are justified separately in analysis).

\begin{equation}e^h = 1 + h + \frac{h^2}{2!} + \frac{h^3}{3!} + \cdots \label{eq:1-20-3}\end{equation}

From $\eqref{eq:1-20-3}$,

\begin{equation}e^h - 1 = h + \frac{h^2}{2!} + \frac{h^3}{3!} + \cdots \label{eq:1-20-4}\end{equation}

Dividing both sides of $\eqref{eq:1-20-4}$ by $h$,

\begin{equation}\frac{e^h - 1}{h} = 1 + \frac{h}{2!} + \frac{h^2}{3!} + \cdots \label{eq:1-20-5}\end{equation}

Taking the limit as $h \to 0$,

\begin{equation}\lim_{h \to 0} \frac{e^h - 1}{h} = 1 \label{eq:1-20-6}\end{equation}

Substituting $\eqref{eq:1-20-6}$ into $\eqref{eq:1-20-2}$,

\begin{equation}\frac{d}{dx} e^x = e^x \cdot 1 = e^x \label{eq:1-20-7}\end{equation}

Note: $e^x$ is, up to a constant factor, the unique function that equals its own derivative; this property characterizes the exponential function. It is important for the derivatives of activation functions in neural networks.
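The key limit $\eqref{eq:1-20-6}$ and the resulting derivative can be observed numerically; a minimal sketch (the evaluation point $x = 1$ is arbitrary):

```python
# Numerical check that (e^h - 1)/h -> 1, and that the difference
# quotient of e^x approaches e^x itself at an arbitrary point x = 1.
from math import exp

for h in (1e-2, 1e-4, 1e-6):
    print(h, (exp(h) - 1) / h)

h = 1e-8
assert abs((exp(h) - 1) / h - 1.0) < 1e-6
assert abs((exp(1 + h) - exp(1)) / h - exp(1)) < 1e-5
```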

1.21 Derivative of the Natural Logarithm

Formula: $\displaystyle\frac{d}{dx} \ln x = \frac{1}{x}$
Condition: $x > 0$
Proof

Let $f(x) = \ln x$. We use the inverse function differentiation formula.

Setting $y = \ln x$, we have $x = e^y$.

By the inverse function rule (1.27),

\begin{equation}\frac{dy}{dx} = \frac{1}{\frac{dx}{dy}} \label{eq:1-21-1}\end{equation}

Differentiating $x = e^y$ with respect to $y$, by 1.20, $\displaystyle \frac{dx}{dy} = e^y$.

\begin{equation}\frac{dy}{dx} = \frac{1}{e^y} \label{eq:1-21-2}\end{equation}

Substituting $e^y = x$ (by the definition $y = \ln x$) into $\eqref{eq:1-21-2}$,

\begin{equation}\frac{d}{dx} \ln x = \frac{1}{x} \label{eq:1-21-3}\end{equation}

Note: Essential in matrix calculus for the derivative of $\log|\boldsymbol{A}|$.

1.22 Derivative of the General Exponential Function

Formula: $\displaystyle\frac{d}{dx} a^x = a^x \ln a$
Condition: $a > 0$, $a \neq 1$
Proof

We rewrite $a^x = e^{x \ln a}$.

\begin{equation}a^x = (e^{\ln a})^x = e^{x \ln a} \label{eq:1-22-1}\end{equation}

Applying the chain rule (1.26), setting $u = x \ln a$,

\begin{equation}\frac{d}{dx} a^x = \frac{d}{dx} e^u = \frac{de^u}{du} \cdot \frac{du}{dx} \label{eq:1-22-2}\end{equation}

Using $\displaystyle \frac{d}{du} e^u = e^u$ (1.20) and $\displaystyle \frac{d}{dx}(x \ln a) = \ln a$ ($\ln a$ is a constant),

\begin{equation}\frac{d}{dx} a^x = e^{x \ln a} \cdot \ln a = a^x \ln a \label{eq:1-22-3}\end{equation}

Note: When $a = e$, since $\ln e = 1$, we get $\displaystyle \frac{d}{dx} e^x = e^x$, consistent with 1.20.
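
The factor $\ln a$ can be verified numerically; the sketch below (with our helper `numderiv` and the arbitrary base $a = 3$) compares a difference quotient of $a^x$ against $a^x \ln a$:

```python
import math

a = 3.0  # any base with a > 0, a != 1

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx a^x = a^x ln a
for x in (-1.0, 0.0, 1.5):
    assert abs(numderiv(lambda t: a**t, x) - a**x * math.log(a)) < 1e-4
```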

1.23 Derivative of the General Logarithmic Function

Formula: $\displaystyle\frac{d}{dx} \log_a x = \frac{1}{x \ln a}$
Condition: $a > 0$, $a \neq 1$, $x > 0$
Proof

We express $\log_a x$ in terms of the natural logarithm using the change of base formula.

\begin{equation}\log_a x = \frac{\ln x}{\ln a} \label{eq:1-23-1}\end{equation}

Since $\ln a$ is a constant,

\begin{equation}\frac{d}{dx} \log_a x = \frac{1}{\ln a} \cdot \frac{d}{dx} \ln x \label{eq:1-23-2}\end{equation}

Substituting $\displaystyle \frac{d}{dx} \ln x = \frac{1}{x}$ from 1.21,

\begin{equation}\frac{d}{dx} \log_a x = \frac{1}{\ln a} \cdot \frac{1}{x} = \frac{1}{x \ln a} \label{eq:1-23-3}\end{equation}

Note: When $a = e$, since $\ln e = 1$, we get $\displaystyle \frac{d}{dx} \ln x = \frac{1}{x}$, consistent with 1.21.

1.5 Rules of Differentiation

The rules of differentiation allow us to differentiate complex functions as combinations of basic functions. These rules also hold (in suitably generalized forms) in matrix calculus.

1.24 Linearity (Sum and Scalar Multiple)

Formula: $\displaystyle\frac{d}{dx}[af(x) + bg(x)] = a\frac{df}{dx} + b\frac{dg}{dx}$
Condition: $a, b$ are constants, $f, g$ are differentiable functions
Proof

Let $h(x) = af(x) + bg(x)$. We compute using the definition of the derivative.

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \frac{h(x + \Delta x) - h(x)}{\Delta x} \label{eq:1-24-1}\end{equation}

Substituting $h(x + \Delta x) = af(x + \Delta x) + bg(x + \Delta x)$,

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \frac{af(x + \Delta x) + bg(x + \Delta x) - af(x) - bg(x)}{\Delta x} \label{eq:1-24-2}\end{equation}

Rearranging the terms,

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \left[ a \cdot \frac{f(x + \Delta x) - f(x)}{\Delta x} + b \cdot \frac{g(x + \Delta x) - g(x)}{\Delta x} \right] \label{eq:1-24-3}\end{equation}

By the linearity of limits,

\begin{equation}\frac{dh}{dx} = a \cdot \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} + b \cdot \lim_{\Delta x \to 0} \frac{g(x + \Delta x) - g(x)}{\Delta x} \label{eq:1-24-4}\end{equation}

By the definition of the derivative,

\begin{equation}\frac{d}{dx}[af(x) + bg(x)] = a\frac{df}{dx} + b\frac{dg}{dx} \label{eq:1-24-5}\end{equation}

Note: The differential operator $\displaystyle \frac{d}{dx}$ is a linear operator. The matrix derivative $\displaystyle \frac{\partial}{\partial \boldsymbol{X}}$ is similarly linear.
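
Linearity can be illustrated with any pair of differentiable functions; a minimal sketch with $f = \sin$, $g = \exp$ and arbitrary constants $a = 2$, $b = -3$ (names `numderiv` and `h_fn` are ours):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx [a f + b g] = a f' + b g'
a, b = 2.0, -3.0
h_fn = lambda x: a * math.sin(x) + b * math.exp(x)
for x in (0.0, 1.0):
    expected = a * math.cos(x) + b * math.exp(x)
    assert abs(numderiv(h_fn, x) - expected) < 1e-5
```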

1.25 Product Rule (Leibniz Rule)

Formula: $\displaystyle\frac{d}{dx}[f(x)g(x)] = f'(x)g(x) + f(x)g'(x)$
Condition: $f, g$ are differentiable functions
Proof

Let $h(x) = f(x)g(x)$. We compute using the definition of the derivative.

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x)g(x + \Delta x) - f(x)g(x)}{\Delta x} \label{eq:1-25-1}\end{equation}

We add and subtract $f(x + \Delta x)g(x)$ in the numerator (a net change of zero).

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \frac{f(x + \Delta x)g(x + \Delta x) - f(x + \Delta x)g(x) + f(x + \Delta x)g(x) - f(x)g(x)}{\Delta x} \label{eq:1-25-2}\end{equation}

Grouping the terms,

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \left[ f(x + \Delta x) \cdot \frac{g(x + \Delta x) - g(x)}{\Delta x} + g(x) \cdot \frac{f(x + \Delta x) - f(x)}{\Delta x} \right] \label{eq:1-25-3}\end{equation}

We apply the limit to each term. Since $f$ is differentiable, it is continuous (1.3), so $\lim_{\Delta x \to 0} f(x + \Delta x) = f(x)$.

\begin{equation}\frac{dh}{dx} = f(x) \cdot \lim_{\Delta x \to 0} \frac{g(x + \Delta x) - g(x)}{\Delta x} + g(x) \cdot \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x} \label{eq:1-25-4}\end{equation}

By the definition of the derivative,

\begin{equation}\frac{d}{dx}[f(x)g(x)] = f(x)g'(x) + g(x)f'(x) = f'(x)g(x) + f(x)g'(x) \label{eq:1-25-5}\end{equation}

Note: The same rule applies to the derivative of a matrix product $\displaystyle \frac{\partial}{\partial X_{ij}}(\boldsymbol{AB})_{kl}$. The generalization to $n$ factors is $\displaystyle \frac{d}{dx}[f_0 \cdots f_{n-1}] = \sum_{i=0}^{n-1} f_0 \cdots f_{i-1} f'_i f_{i+1} \cdots f_{n-1}$.
Source: G.W. Leibniz (1684) "Nova methodus pro maximis et minimis", Acta Eruditorum. Known as the "Leibniz rule."
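
The $n$-factor generalization can be checked directly. The sketch below (our own names `fs`, `dfs`, `product_derivative`) evaluates the sum over factors for $f_0 = x^2$, $f_1 = \sin x$, $f_2 = e^x$ and compares it with a finite-difference derivative of the product:

```python
import math

# factors f_0, f_1, f_2 and their derivatives
fs  = [lambda x: x**2,  math.sin, math.exp]
dfs = [lambda x: 2 * x, math.cos, math.exp]

def product_derivative(x):
    # sum_i f_0 ... f_{i-1} f'_i f_{i+1} ... f_{n-1}
    total = 0.0
    for i in range(len(fs)):
        term = dfs[i](x)
        for j in range(len(fs)):
            if j != i:
                term *= fs[j](x)
        total += term
    return total

def numderiv(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

prod = lambda x: x**2 * math.sin(x) * math.exp(x)
assert abs(product_derivative(1.0) - numderiv(prod, 1.0)) < 1e-4
```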

1.26 Chain Rule (Derivative of Composite Functions)

Formula: $\displaystyle\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$
Condition: $g$ is differentiable at $x$, $f$ is differentiable at $g(x)$
Proof

Let $h(x) = f(g(x))$. Substituting $u = g(x)$, we have $h = f(u)$.

We compute using the definition of the derivative.

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \frac{f(g(x + \Delta x)) - f(g(x))}{\Delta x} \label{eq:1-26-1}\end{equation}

Let $\Delta u = g(x + \Delta x) - g(x)$. Since $g$ is differentiable, it is continuous (1.3), so $\Delta u \to 0$ as $\Delta x \to 0$.

When $\Delta u \neq 0$, we multiply and divide by $\Delta u$. (At points where $\Delta u = 0$, the numerator of $\eqref{eq:1-26-1}$ also vanishes, so such points do not change the value of the limit; a fully rigorous proof replaces the difference quotient of $f$ by an auxiliary function that is continuous at $\Delta u = 0$.)

\begin{equation}\frac{dh}{dx} = \lim_{\Delta x \to 0} \frac{f(g(x) + \Delta u) - f(g(x))}{\Delta u} \cdot \frac{\Delta u}{\Delta x} \label{eq:1-26-2}\end{equation}

Setting $u = g(x)$, the first factor is

\begin{equation}\lim_{\Delta u \to 0} \frac{f(u + \Delta u) - f(u)}{\Delta u} = f'(u) = f'(g(x)) \label{eq:1-26-3}\end{equation}

The second factor is

\begin{equation}\lim_{\Delta x \to 0} \frac{\Delta u}{\Delta x} = \lim_{\Delta x \to 0} \frac{g(x + \Delta x) - g(x)}{\Delta x} = g'(x) \label{eq:1-26-4}\end{equation}

Substituting $\eqref{eq:1-26-3}$ and $\eqref{eq:1-26-4}$ into $\eqref{eq:1-26-2}$,

\begin{equation}\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x) \label{eq:1-26-5}\end{equation}

In Leibniz notation,

\begin{equation}\frac{dh}{dx} = \frac{df}{du} \cdot \frac{du}{dx} \label{eq:1-26-6}\end{equation}

Note: The chain rule is one of the most important rules in matrix calculus. The derivative of a matrix function composition $f(\boldsymbol{U}(\boldsymbol{X}))$ is expressed in trace form as $\displaystyle \text{tr}\left[\left(\frac{\partial f}{\partial \boldsymbol{U}}\right)^\top \frac{\partial \boldsymbol{U}}{\partial X_{ij}}\right]$.
Source: Introduced by G.W. Leibniz (1684) "Nova methodus pro maximis et minimis" along with differential notation. A rigorous proof was given by A.L. Cauchy (1821) "Cours d'analyse."
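
A numerical illustration of the rule for a concrete composition, $\sin(x^2)$ (a sketch; `numderiv` and `comp` are our own names):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# chain rule: d/dx sin(x^2) = cos(x^2) * 2x
comp = lambda x: math.sin(x**2)
for x in (0.3, 1.2):
    assert abs(numderiv(comp, x) - math.cos(x**2) * 2 * x) < 1e-5
```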

1.27 Inverse Function Differentiation

Formula: $\displaystyle\frac{dx}{dy} = \frac{1}{\frac{dy}{dx}}$
Condition: $y = f(x)$ is strictly monotonic and differentiable, $f'(x) \neq 0$
Proof

Let $y = f(x)$, and denote the inverse function by $x = f^{-1}(y)$.

By definition, $f(f^{-1}(y)) = y$. Differentiating both sides with respect to $y$,

\begin{equation}\frac{d}{dy} f(f^{-1}(y)) = \frac{d}{dy} y = 1 \label{eq:1-27-1}\end{equation}

Applying the chain rule (1.26) to the left-hand side, setting $u = f^{-1}(y)$,

\begin{equation}\frac{df}{du} \cdot \frac{du}{dy} = 1 \label{eq:1-27-2}\end{equation}

Since $u = f^{-1}(y) = x$, we have $\displaystyle \frac{df}{du} = \frac{dy}{dx} = f'(x)$.

\begin{equation}f'(x) \cdot \frac{dx}{dy} = 1 \label{eq:1-27-3}\end{equation}

When $f'(x) \neq 0$, solving $\eqref{eq:1-27-3}$,

\begin{equation}\frac{dx}{dy} = \frac{1}{f'(x)} = \frac{1}{\frac{dy}{dx}} \label{eq:1-27-4}\end{equation}

Note: The matrix analogue of this one-variable result is the inverse matrix derivative formula $\displaystyle \frac{d\boldsymbol{A}^{-1}}{dt} = -\boldsymbol{A}^{-1} \frac{d\boldsymbol{A}}{dt} \boldsymbol{A}^{-1}$.
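
The reciprocal relationship can be checked on a concrete monotone pair, $f(x) = x^3$ with inverse $f^{-1}(y) = y^{1/3}$, at a point where $f' \neq 0$ (a sketch; `numderiv`, `f`, `finv` are our own names):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# y = f(x) = x^3, inverse x = y^(1/3); check dx/dy = 1 / (dy/dx) at x = 2
f    = lambda x: x**3
finv = lambda y: y ** (1.0 / 3.0)
x0 = 2.0
y0 = f(x0)  # y0 = 8
assert abs(numderiv(finv, y0) - 1.0 / numderiv(f, x0)) < 1e-6
```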

1.28 Quotient Rule

Formula: $\displaystyle\frac{d}{dx}\frac{f(x)}{g(x)} = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2}$
Condition: $f, g$ are differentiable, $g(x) \neq 0$
Proof

Write $h(x) = \frac{f(x)}{g(x)} = f(x) \cdot [g(x)]^{-1}$ and apply the product rule.

By the product rule (1.25),

\begin{equation}\frac{dh}{dx} = f'(x) \cdot [g(x)]^{-1} + f(x) \cdot \frac{d}{dx}[g(x)]^{-1} \label{eq:1-28-1}\end{equation}

We find the derivative of $[g(x)]^{-1}$. Setting $u = g(x)$, by the chain rule (1.26),

\begin{equation}\frac{d}{dx}[g(x)]^{-1} = \frac{d}{dx} u^{-1} = \frac{d(u^{-1})}{du} \cdot \frac{du}{dx} = (-u^{-2}) \cdot g'(x) = -\frac{g'(x)}{[g(x)]^2} \label{eq:1-28-2}\end{equation}

Substituting $\eqref{eq:1-28-2}$ into $\eqref{eq:1-28-1}$,

\begin{equation}\frac{dh}{dx} = \frac{f'(x)}{g(x)} + f(x) \cdot \left(-\frac{g'(x)}{[g(x)]^2}\right) \label{eq:1-28-3}\end{equation}

Combining over a common denominator,

\begin{equation}\frac{dh}{dx} = \frac{f'(x) \cdot g(x)}{[g(x)]^2} - \frac{f(x) \cdot g'(x)}{[g(x)]^2} = \frac{f'(x)g(x) - f(x)g'(x)}{[g(x)]^2} \label{eq:1-28-4}\end{equation}

Note: Mnemonic: "derivative of numerator times denominator, minus numerator times derivative of denominator, all over denominator squared."
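
The rule can be exercised on an arbitrary quotient, here $e^x / (1 + x^2)$ (a sketch; `numderiv` and the function names are ours):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# quotient rule for f/g with f = e^x, g = 1 + x^2 (g never vanishes)
f, df = math.exp, math.exp
g  = lambda x: 1 + x**2
dg = lambda x: 2 * x
q  = lambda x: f(x) / g(x)
for x in (0.0, 1.0):
    expected = (df(x) * g(x) - f(x) * dg(x)) / g(x)**2
    assert abs(numderiv(q, x) - expected) < 1e-5
```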

1.29 Logarithmic Differentiation

Formula: $\displaystyle\frac{d}{dx}[f(x)]^{g(x)} = [f(x)]^{g(x)} \left[ g'(x) \ln f(x) + g(x) \frac{f'(x)}{f(x)} \right]$
Condition: $f(x) > 0$, $f, g$ are differentiable
Proof

Let $h(x) = [f(x)]^{g(x)}$. Since $f(x) > 0$, take the natural logarithm of both sides.

\begin{equation}\ln h(x) = g(x) \ln f(x) \label{eq:1-29-1}\end{equation}

We differentiate both sides of $\eqref{eq:1-29-1}$ with respect to $x$. By the chain rule, the left-hand side is

\begin{equation}\frac{d}{dx} \ln h(x) = \frac{1}{h(x)} \cdot h'(x) = \frac{h'(x)}{h(x)} \label{eq:1-29-2}\end{equation}

The right-hand side, by the product rule, is

\begin{equation}\frac{d}{dx}[g(x) \ln f(x)] = g'(x) \ln f(x) + g(x) \cdot \frac{f'(x)}{f(x)} \label{eq:1-29-3}\end{equation}

From $\eqref{eq:1-29-2}$ and $\eqref{eq:1-29-3}$,

\begin{equation}\frac{h'(x)}{h(x)} = g'(x) \ln f(x) + g(x) \frac{f'(x)}{f(x)} \label{eq:1-29-4}\end{equation}

Multiplying both sides by $h(x) = [f(x)]^{g(x)}$,

\begin{equation}h'(x) = [f(x)]^{g(x)} \left[ g'(x) \ln f(x) + g(x) \frac{f'(x)}{f(x)} \right] \label{eq:1-29-5}\end{equation}

Note: Special cases: when $g(x) = n$ (constant), $\displaystyle \frac{d}{dx}[f(x)]^n = n[f(x)]^{n-1} f'(x)$. When $f(x) = x$ and $g(x) = x$, $\displaystyle \frac{d}{dx} x^x = x^x (\ln x + 1)$.
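
The special case $x^x$ makes a convenient numerical check of the formula (a sketch; `numderiv` and `fx` are our own names):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx x^x = x^x (ln x + 1) for x > 0
fx = lambda x: x**x
for x in (0.5, 2.0):
    assert abs(numderiv(fx, x) - x**x * (math.log(x) + 1.0)) < 1e-4
```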

1.6 Derivatives of Trigonometric Functions

We derive the differentiation formulas for trigonometric functions and their inverses. These are extensively used in Fourier analysis and signal processing, and are also needed in matrix calculus when differentiating functions involving trigonometric functions.

1.30 Derivative of the Sine Function

Formula: $\displaystyle\frac{d}{dx} \sin x = \cos x$
Proof

We compute using the definition of the derivative.

\begin{equation}\frac{d}{dx} \sin x = \lim_{h \to 0} \frac{\sin(x+h) - \sin x}{h} \label{eq:1-30-1}\end{equation}

Using the addition formula (1.7) $\sin(x+h) = \sin x \cos h + \cos x \sin h$,

\begin{equation}\frac{d}{dx} \sin x = \lim_{h \to 0} \frac{\sin x \cos h + \cos x \sin h - \sin x}{h} \label{eq:1-30-2}\end{equation}

Rearranging the terms,

\begin{equation}\frac{d}{dx} \sin x = \lim_{h \to 0} \left[ \sin x \cdot \frac{\cos h - 1}{h} + \cos x \cdot \frac{\sin h}{h} \right] \label{eq:1-30-3}\end{equation}

Using the fundamental limits (1.8, 1.9),

\begin{equation}\lim_{h \to 0} \frac{\sin h}{h} = 1 \label{eq:1-30-4}\end{equation}

\begin{equation}\lim_{h \to 0} \frac{\cos h - 1}{h} = 0 \label{eq:1-30-5}\end{equation}

Substituting $\eqref{eq:1-30-4}$ and $\eqref{eq:1-30-5}$ into $\eqref{eq:1-30-3}$,

\begin{equation}\frac{d}{dx} \sin x = \sin x \cdot 0 + \cos x \cdot 1 = \cos x \label{eq:1-30-7}\end{equation}

Note: For the proofs of the fundamental limits, see 1.8 and 1.9.
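
A numerical spot check of the result (a sketch independent of the proof; `numderiv` is our own helper):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx sin x = cos x at several points
for x in (0.0, math.pi / 4, 1.0):
    assert abs(numderiv(math.sin, x) - math.cos(x)) < 1e-6
```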

1.31 Derivative of the Cosine Function

Formula: $\displaystyle\frac{d}{dx} \cos x = -\sin x$
Proof

We use the identity $\cos x = \sin\left(\frac{\pi}{2} - x\right)$.

Applying the chain rule (1.26), setting $u = \displaystyle \frac{\pi}{2} - x$,

\begin{equation}\frac{d}{dx} \cos x = \frac{d}{dx} \sin u = \frac{d(\sin u)}{du} \cdot \frac{du}{dx} \label{eq:1-31-1}\end{equation}

By 1.30, $\displaystyle \frac{d(\sin u)}{du} = \cos u$. Also, $\displaystyle \frac{du}{dx} = -1$.

\begin{equation}\frac{d}{dx} \cos x = \cos u \cdot (-1) = -\cos\left(\frac{\pi}{2} - x\right) = -\sin x \label{eq:1-31-2}\end{equation}

The last equality uses $\cos\left(\frac{\pi}{2} - x\right) = \sin x$.

1.32 Derivative of the Tangent Function

Formula: $\displaystyle\frac{d}{dx} \tan x = \sec^2 x = \frac{1}{\cos^2 x}$
Condition: $\cos x \neq 0$
Proof

We apply the quotient rule (1.28) to $\tan x = \frac{\sin x}{\cos x}$.

\begin{equation}\frac{d}{dx} \tan x = \frac{(\sin x)' \cos x - \sin x (\cos x)'}{\cos^2 x} \label{eq:1-32-1}\end{equation}

Substituting $(\sin x)' = \cos x$ and $(\cos x)' = -\sin x$ from 1.30 and 1.31,

\begin{equation}\frac{d}{dx} \tan x = \frac{\cos x \cdot \cos x - \sin x \cdot (-\sin x)}{\cos^2 x} = \frac{\cos^2 x + \sin^2 x}{\cos^2 x} \label{eq:1-32-2}\end{equation}

By the Pythagorean identity (1.6) $\cos^2 x + \sin^2 x = 1$,

\begin{equation}\frac{d}{dx} \tan x = \frac{1}{\cos^2 x} = \sec^2 x \label{eq:1-32-3}\end{equation}

1.33 Derivatives of Other Trigonometric Functions

Formulas:
$\displaystyle \frac{d}{dx} \cot x = -\csc^2 x$
$\displaystyle \frac{d}{dx} \sec x = \sec x \tan x$
$\displaystyle \frac{d}{dx} \csc x = -\csc x \cot x$
Proof

Derivative of $\cot x$:

Applying the quotient rule to $\cot x = \frac{\cos x}{\sin x}$,

\begin{equation}\frac{d}{dx} \cot x = \frac{-\sin x \cdot \sin x - \cos x \cdot \cos x}{\sin^2 x} = \frac{-(\sin^2 x + \cos^2 x)}{\sin^2 x} = -\frac{1}{\sin^2 x} = -\csc^2 x \label{eq:1-33-1}\end{equation}

Derivative of $\sec x$:

Applying the chain rule to $\sec x = \frac{1}{\cos x} = (\cos x)^{-1}$,

\begin{equation}\frac{d}{dx} \sec x = -(\cos x)^{-2} \cdot (-\sin x) = \frac{\sin x}{\cos^2 x} = \frac{1}{\cos x} \cdot \frac{\sin x}{\cos x} = \sec x \tan x \label{eq:1-33-2}\end{equation}

Derivative of $\csc x$:

Applying the chain rule to $\csc x = \frac{1}{\sin x} = (\sin x)^{-1}$,

\begin{equation}\frac{d}{dx} \csc x = -(\sin x)^{-2} \cdot \cos x = -\frac{\cos x}{\sin^2 x} = -\frac{1}{\sin x} \cdot \frac{\cos x}{\sin x} = -\csc x \cot x \label{eq:1-33-3}\end{equation}

1.7 Derivatives of Inverse Trigonometric Functions

1.34 Derivative of the Arcsine Function

Formula: $\displaystyle\frac{d}{dx} \arcsin x = \frac{1}{\sqrt{1 - x^2}}$
Condition: $-1 < x < 1$
Proof

Let $y = \arcsin x$, so $x = \sin y$ with $-\frac{\pi}{2} \leq y \leq \frac{\pi}{2}$.

Applying the inverse function rule (1.27),

\begin{equation}\frac{dy}{dx} = \frac{1}{\frac{dx}{dy}} = \frac{1}{\cos y} \label{eq:1-34-1}\end{equation}

Expressing $\cos y$ in terms of $x$. From $\sin^2 y + \cos^2 y = 1$,

\begin{equation}\cos y = \pm\sqrt{1 - \sin^2 y} = \pm\sqrt{1 - x^2} \label{eq:1-34-2}\end{equation}

Since $-1 < x < 1$ implies $-\frac{\pi}{2} < y < \frac{\pi}{2}$, we have $\cos y > 0$, so we take the positive square root.

\begin{equation}\cos y = \sqrt{1 - x^2} \label{eq:1-34-3}\end{equation}

Substituting $\eqref{eq:1-34-3}$ into $\eqref{eq:1-34-1}$,

\begin{equation}\frac{d}{dx} \arcsin x = \frac{1}{\sqrt{1 - x^2}} \label{eq:1-34-4}\end{equation}

1.35 Derivative of the Arccosine Function

Formula: $\displaystyle\frac{d}{dx} \arccos x = -\frac{1}{\sqrt{1 - x^2}}$
Condition: $-1 < x < 1$
Proof

Let $y = \arccos x$, so $x = \cos y$ with $0 \leq y \leq \pi$.

Applying the inverse function rule,

\begin{equation}\frac{dy}{dx} = \frac{1}{\frac{dx}{dy}} = \frac{1}{-\sin y} \label{eq:1-35-1}\end{equation}

Expressing $\sin y$ in terms of $x$. From $\sin^2 y + \cos^2 y = 1$,

\begin{equation}\sin y = \pm\sqrt{1 - \cos^2 y} = \pm\sqrt{1 - x^2} \label{eq:1-35-2}\end{equation}

Since $-1 < x < 1$ implies $0 < y < \pi$, we have $\sin y > 0$, so we take the positive square root.

\begin{equation}\sin y = \sqrt{1 - x^2} \label{eq:1-35-3}\end{equation}

Substituting $\eqref{eq:1-35-3}$ into $\eqref{eq:1-35-1}$,

\begin{equation}\frac{d}{dx} \arccos x = -\frac{1}{\sqrt{1 - x^2}} \label{eq:1-35-4}\end{equation}

Note: Since $\displaystyle \arcsin x + \arccos x = \frac{\pi}{2}$, we can verify that $\displaystyle \frac{d}{dx} \arccos x = -\frac{d}{dx} \arcsin x$.

1.36 Derivative of the Arctangent Function

Formula: $\displaystyle\frac{d}{dx} \arctan x = \frac{1}{1 + x^2}$
Proof

Let $y = \arctan x$, so $x = \tan y$ with $-\frac{\pi}{2} < y < \frac{\pi}{2}$.

Applying the inverse function rule,

\begin{equation}\frac{dy}{dx} = \frac{1}{\frac{dx}{dy}} = \frac{1}{\sec^2 y} = \cos^2 y \label{eq:1-36-1}\end{equation}

Expressing $\cos^2 y$ in terms of $x$. From $\sec^2 y = 1 + \tan^2 y$,

\begin{equation}\cos^2 y = \frac{1}{\sec^2 y} = \frac{1}{1 + \tan^2 y} = \frac{1}{1 + x^2} \label{eq:1-36-2}\end{equation}

Substituting $\eqref{eq:1-36-2}$ into $\eqref{eq:1-36-1}$,

\begin{equation}\frac{d}{dx} \arctan x = \frac{1}{1 + x^2} \label{eq:1-36-3}\end{equation}

Note: This result implies $\displaystyle \int \frac{1}{1+x^2} dx = \arctan x + C$, which is frequently used in integration.
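
The formula holds on all of $\mathbb{R}$, which a quick numerical sketch confirms (`numderiv` is our own helper):

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx arctan x = 1 / (1 + x^2), valid for every real x
for x in (-2.0, 0.0, 3.0):
    assert abs(numderiv(math.atan, x) - 1.0 / (1.0 + x**2)) < 1e-6
```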

1.8 Derivatives of Hyperbolic Functions

1.37 Derivative of the Hyperbolic Sine

Formula: $\displaystyle\frac{d}{dx} \sinh x = \cosh x$
Proof

We differentiate using the definition $\sinh x = \frac{e^x - e^{-x}}{2}$.

\begin{equation}\frac{d}{dx} \sinh x = \frac{d}{dx} \frac{e^x - e^{-x}}{2} = \frac{1}{2} \left( \frac{d}{dx} e^x - \frac{d}{dx} e^{-x} \right) \label{eq:1-37-1}\end{equation}

By 1.20 and the chain rule, $\displaystyle \frac{d}{dx} e^x = e^x$ and $\displaystyle \frac{d}{dx} e^{-x} = -e^{-x}$.

\begin{equation}\frac{d}{dx} \sinh x = \frac{1}{2} (e^x - (-e^{-x})) = \frac{e^x + e^{-x}}{2} = \cosh x \label{eq:1-37-2}\end{equation}

1.38 Derivative of the Hyperbolic Cosine

Formula: $\displaystyle\frac{d}{dx} \cosh x = \sinh x$
Proof

We differentiate using the definition $\cosh x = \frac{e^x + e^{-x}}{2}$.

\begin{equation}\frac{d}{dx} \cosh x = \frac{d}{dx} \frac{e^x + e^{-x}}{2} = \frac{1}{2} \left( \frac{d}{dx} e^x + \frac{d}{dx} e^{-x} \right) \label{eq:1-38-1}\end{equation}

\begin{equation}\frac{d}{dx} \cosh x = \frac{1}{2} (e^x + (-e^{-x})) = \frac{e^x - e^{-x}}{2} = \sinh x \label{eq:1-38-2}\end{equation}

Note: Unlike trigonometric functions, $(\cosh x)' = \sinh x$ has no negative sign. This corresponds to the hyperbolic identity $\cosh^2 x - \sinh^2 x = 1$ (1.10).

1.39 Derivative of the Hyperbolic Tangent

Formula: $\displaystyle\frac{d}{dx} \tanh x = \text{sech}^2 x = 1 - \tanh^2 x$
Proof

We apply the quotient rule to $\tanh x = \frac{\sinh x}{\cosh x}$.

\begin{equation}\frac{d}{dx} \tanh x = \frac{(\sinh x)' \cosh x - \sinh x (\cosh x)'}{\cosh^2 x} \label{eq:1-39-1}\end{equation}

Substituting $(\sinh x)' = \cosh x$ and $(\cosh x)' = \sinh x$ from 1.37 and 1.38,

\begin{equation}\frac{d}{dx} \tanh x = \frac{\cosh x \cdot \cosh x - \sinh x \cdot \sinh x}{\cosh^2 x} = \frac{\cosh^2 x - \sinh^2 x}{\cosh^2 x} \label{eq:1-39-2}\end{equation}

By the hyperbolic identity (1.10) $\cosh^2 x - \sinh^2 x = 1$,

\begin{equation}\frac{d}{dx} \tanh x = \frac{1}{\cosh^2 x} = \text{sech}^2 x \label{eq:1-39-3}\end{equation}

Also, $\text{sech}^2 x = 1 - \tanh^2 x$ holds (since $\frac{1}{\cosh^2 x} = \frac{\cosh^2 x - \sinh^2 x}{\cosh^2 x}$).

Note: $\tanh$ is used as an activation function in neural networks. The gradient $1 - \tanh^2 x$ can be computed from $\tanh x$ itself, making it efficient for backpropagation.
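
The backpropagation-friendly form of the gradient can be verified numerically: the sketch below (`numderiv` is our own helper) recovers $1 - \tanh^2 x$ from the value of $\tanh x$ alone.

```python
import math

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# the gradient 1 - tanh^2 x is computed from tanh x itself,
# which is exactly what backpropagation exploits
for x in (-1.0, 0.0, 2.0):
    t = math.tanh(x)
    assert abs(numderiv(math.tanh, x) - (1.0 - t * t)) < 1e-6
```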

1.9 Other Important Differentiation Formulas

1.40 Derivative of the Absolute Value Function

Formula: $\displaystyle\frac{d}{dx} |x| = \text{sgn}(x) = \begin{cases} 1 & (x > 0) \\ -1 & (x < 0) \end{cases}$
Condition: $x \neq 0$ (not differentiable at $x = 0$)
Proof

We can write $|x| = \sqrt{x^2}$. Applying the chain rule with $u = x^2$, we have $|x| = u^{1/2}$.

\begin{equation}\frac{d}{dx}|x| = \frac{d(u^{1/2})}{du} \cdot \frac{du}{dx} = \frac{1}{2}u^{-1/2} \cdot 2x = \frac{x}{\sqrt{x^2}} = \frac{x}{|x|} \label{eq:1-40-1}\end{equation}

For $x > 0$, $\frac{x}{|x|} = \frac{x}{x} = 1$; for $x < 0$, $\frac{x}{|x|} = \frac{x}{-x} = -1$.

\begin{equation}\frac{d}{dx}|x| = \text{sgn}(x) = \begin{cases} 1 & (x > 0) \\ -1 & (x < 0) \end{cases} \label{eq:1-40-2}\end{equation}

Note: At $x = 0$, the left and right limits do not agree, so the function is not differentiable. Used in the subgradient of L1 regularization $\|\boldsymbol{w}\|_1$ in machine learning.

1.41 Derivative of the Sigmoid Function

Formula: $\displaystyle\frac{d}{dx} \sigma(x) = \sigma(x)(1 - \sigma(x))$
Condition: $\displaystyle \sigma(x) = \frac{1}{1 + e^{-x}}$ (sigmoid function)
Proof

We differentiate $\sigma(x) = \frac{1}{1 + e^{-x}} = (1 + e^{-x})^{-1}$ using the chain rule.

Setting $u = 1 + e^{-x}$, we have $\sigma = u^{-1}$.

\begin{equation}\frac{d\sigma}{dx} = \frac{d(u^{-1})}{du} \cdot \frac{du}{dx} = (-u^{-2}) \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2} \label{eq:1-41-1}\end{equation}

We express this in terms of $\sigma(x)$. Since $\sigma = \frac{1}{1 + e^{-x}}$,

\begin{equation}1 - \sigma = 1 - \frac{1}{1 + e^{-x}} = \frac{e^{-x}}{1 + e^{-x}} \label{eq:1-41-2}\end{equation}

Therefore,

\begin{equation}\sigma(1 - \sigma) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{e^{-x}}{(1 + e^{-x})^2} \label{eq:1-41-3}\end{equation}

Comparing $\eqref{eq:1-41-1}$ and $\eqref{eq:1-41-3}$,

\begin{equation}\frac{d\sigma}{dx} = \sigma(1 - \sigma) \label{eq:1-41-4}\end{equation}

Note: The sigmoid function is used as an activation function in neural networks and in logistic regression. Since the gradient can be computed from $\sigma$ itself, it is efficient for backpropagation. The derivative attains its maximum value $\sigma'(0) = 0.25$ at $x = 0$.
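
Both the identity $\sigma' = \sigma(1 - \sigma)$ and the maximum slope at $x = 0$ are easy to confirm numerically (a sketch; `sigmoid` and `numderiv` are our own names):

```python
import math

def sigmoid(x):
    # sigma(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + math.exp(-x))

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# sigma'(x) = sigma(x) (1 - sigma(x)); maximum slope 0.25 at x = 0
for x in (-2.0, 0.0, 2.0):
    s = sigmoid(x)
    assert abs(numderiv(sigmoid, x) - s * (1.0 - s)) < 1e-6
assert abs(numderiv(sigmoid, 0.0) - 0.25) < 1e-6
```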

1.42 Derivative of the Softplus Function

Formula: $\displaystyle\frac{d}{dx} \ln(1 + e^x) = \sigma(x) = \frac{1}{1 + e^{-x}}$
Proof

We differentiate $f(x) = \ln(1 + e^x)$ (the Softplus function) using the chain rule.

Setting $u = 1 + e^x$, we have $f = \ln u$.

\begin{equation}\frac{df}{dx} = \frac{d(\ln u)}{du} \cdot \frac{du}{dx} = \frac{1}{u} \cdot e^x = \frac{e^x}{1 + e^x} \label{eq:1-42-1}\end{equation}

Multiplying numerator and denominator by $e^{-x}$,

\begin{equation}\frac{df}{dx} = \frac{e^x \cdot e^{-x}}{(1 + e^x) \cdot e^{-x}} = \frac{1}{e^{-x} + 1} = \frac{1}{1 + e^{-x}} = \sigma(x) \label{eq:1-42-2}\end{equation}

Note: Softplus is a smooth approximation of ReLU $\max(0, x)$. The relation $\displaystyle \frac{d}{dx} \text{softplus}(x) = \sigma(x)$ means that Softplus is the "integral of the sigmoid."
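
The "integral of the sigmoid" relationship can be checked directly (a sketch; `softplus`, `sigmoid`, and `numderiv` are our own names, with `math.log1p` used for numerical stability):

```python
import math

def softplus(x):
    # ln(1 + e^x), written with log1p for accuracy near 0
    return math.log1p(math.exp(x))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def numderiv(f, x, h=1e-6):
    # symmetric difference quotient approximating f'(x)
    return (f(x + h) - f(x - h)) / (2 * h)

# d/dx softplus(x) = sigma(x)
for x in (-3.0, 0.0, 3.0):
    assert abs(numderiv(softplus, x) - sigmoid(x)) < 1e-6
```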

1.43 Leibniz Formula (Product of Higher-Order Derivatives)

Formula: $\displaystyle (fg)^{(n)} = \sum_{k=0}^{n} \binom{n}{k} f^{(k)} g^{(n-k)}$
Condition: $f, g$ are $n$ times differentiable
Proof

We prove this by mathematical induction.

Base case ($n = 1$):

\begin{equation}(fg)' = f'g + fg' = \binom{1}{0}f^{(0)}g^{(1)} + \binom{1}{1}f^{(1)}g^{(0)} \label{eq:1-43-1}\end{equation}

This agrees with the product rule (1.25).

Inductive step:

Assume the formula holds for $n = m$.

\begin{equation}(fg)^{(m)} = \sum_{k=0}^{m} \binom{m}{k} f^{(k)} g^{(m-k)} \label{eq:1-43-2}\end{equation}

We show the case $n = m + 1$. Differentiating both sides of $\eqref{eq:1-43-2}$,

\begin{equation}(fg)^{(m+1)} = \sum_{k=0}^{m} \binom{m}{k} \left( f^{(k+1)} g^{(m-k)} + f^{(k)} g^{(m-k+1)} \right) \label{eq:1-43-3}\end{equation}

Shifting the index $k \to k - 1$ in the first sum of $\eqref{eq:1-43-3}$ gives $\displaystyle \sum_{k=1}^{m+1} \binom{m}{k-1} f^{(k)} g^{(m+1-k)}$; combining this with the second sum and applying Pascal's identity (1.4) $\binom{m}{k-1} + \binom{m}{k} = \binom{m+1}{k}$,

\begin{equation}(fg)^{(m+1)} = \sum_{k=0}^{m+1} \binom{m+1}{k} f^{(k)} g^{(m+1-k)} \label{eq:1-43-4}\end{equation}

Note: This formula is the differentiation analogue of the binomial theorem and is used in Taylor expansion computations.
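
The formula can be verified on a concrete pair where all derivatives are known in closed form: $f = x^2$, $g = x^3$, $n = 3$, so that $(fg)^{(3)} = (x^5)^{(3)} = 60x^2$. The sketch below (our own names `f_derivs`, `g_derivs`, `leibniz`) evaluates the binomial sum:

```python
import math

# f = x^2 and g = x^3 with their successive derivatives f^{(k)}, g^{(k)}
f_derivs = [lambda x: x**2, lambda x: 2 * x,    lambda x: 2.0,   lambda x: 0.0]
g_derivs = [lambda x: x**3, lambda x: 3 * x**2, lambda x: 6 * x, lambda x: 6.0]

def leibniz(n, x):
    # sum_{k=0}^{n} C(n, k) f^{(k)}(x) g^{(n-k)}(x)
    return sum(math.comb(n, k) * f_derivs[k](x) * g_derivs[n - k](x)
               for k in range(n + 1))

# (x^5)^{(3)} = 60 x^2
for x in (0.0, 1.0, 2.0):
    assert abs(leibniz(3, x) - 60 * x**2) < 1e-9
```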

References

  • Petersen, K. B., & Pedersen, M. S. (2012). The Matrix Cookbook. Technical University of Denmark.
  • Magnus, J. R., & Neudecker, H. (1999). Matrix Differential Calculus with Applications in Statistics and Econometrics (Revised ed.). Wiley.
  • Matrix calculus - Wikipedia