Automatic Differentiation and Optimization

Practical applications of vector/matrix calculus

Introduction

Deep-learning frameworks such as PyTorch, JAX, and TensorFlow provide a feature called automatic differentiation. Thanks to it, gradients of complicated neural networks can be obtained without hand-computing derivatives — simply writing the forward pass is enough.

AD is a powerful convenience, but if we do not understand what happens inside, we can run into unexpected behavior or write needlessly inefficient code. Building on the knowledge in the vector/matrix calculus formula collection and tensor-calculus primer, this page explains how AD works and how it connects to optimization.

Fundamentals of automatic differentiation

Let us first pin down what automatic differentiation (AD) actually is.

Contrast with numerical and symbolic differentiation

There are three standard approaches to computing derivatives.

Method | Idea | Drawback
Numerical | Approximate $\displaystyle\dfrac{f(x+h) - f(x)}{h}$ | Round-off vs. truncation trade-off; expensive in many variables
Symbolic | Differentiate formulas algebraically | Expression swell; hard to apply to control flow
Automatic | View the function as a composition; apply the chain rule mechanically | Requires a computational graph

AD avoids the approximation error of numerical methods and the expression swell of symbolic methods. By treating a function as a composition of elementary operations and stitching local derivatives with the chain rule, AD computes derivatives both efficiently and exactly.
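
As a quick illustration (a minimal sketch assuming PyTorch, with a toy function chosen purely for demonstration), compare a forward-difference approximation with the exact derivative returned by AD:

import torch

f = lambda t: torch.sin(t ** 2)
x = torch.tensor(1.5, requires_grad=True)

# Automatic differentiation: exact up to floating-point rounding
f(x).backward()
exact = x.grad                               # d/dx sin(x^2) = 2x cos(x^2)

# Forward difference: truncation error O(h), plus round-off if h is too small
h = 1e-5
approx = (f(x.detach() + h) - f(x.detach())) / h

print(exact.item(), approx.item())           # close, but not identical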

The essence of AD

The core idea can be stated simply.

Principle of automatic differentiation
View the function as a chain of compositions and stitch local derivatives together with the chain rule. Instead of explicitly constructing the Jacobian (or a higher-order tensor, such as the fourth-order Jacobian of a fully connected layer), compute only the required action (the product with a vector).

The word action is key. In matrix calculus one routinely speaks of the Jacobian $\boldsymbol{J}$, but AD never materializes this matrix. Instead it computes only the result of applying the Jacobian to a vector.
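
A minimal NumPy sketch (an elementwise square chosen purely for illustration) makes the point: the Jacobian of $f(\boldsymbol{x}) = \boldsymbol{x}^2$ is an $n \times n$ diagonal matrix, but its action on a vector needs only $O(n)$ work and never forms the matrix.

import numpy as np

n = 5
x = np.random.randn(n)
w = np.random.randn(n)                 # output cotangent vector

# Explicit Jacobian of the elementwise square: an n x n diagonal matrix
J = np.diag(2 * x)
vjp_explicit = J.T @ w                 # O(n^2) memory and time

# The same action without ever forming J
vjp_action = 2 * x * w                 # O(n)

print(np.allclose(vjp_explicit, vjp_action))   # True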

Theory vs. implementation: tensor calculus vs. vectorization

Written out mathematically, deep-learning computations often involve higher-order tensors. For a fully connected layer $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W}$ (with $\boldsymbol{X}$ of size $m \times n$, $\boldsymbol{W}$ of size $n \times p$, $\boldsymbol{Y}$ of size $m \times p$), the Jacobian of $\boldsymbol{Y}$ with respect to $\boldsymbol{W}$ is a fourth-order tensor (of shape $m \times p \times n \times p$).

Real frameworks (PyTorch, JAX, TensorFlow, ...) never build such tensors explicitly. They treat each entry of a tensor as a vector component and compute with vector-matrix operations instead.

Aspect | Theory (mathematical description) | Implementation (inside the framework)
Representation of derivatives | Fourth-order (or higher) tensor | Vectorized matrix/vector operations
Shape of the gradient | $\displaystyle\dfrac{\partial L}{\partial \boldsymbol{W}}$ has the same shape as $\boldsymbol{W}$ | Same (restored by reshape)
Memory footprint | $O(n^4)$ (explicit tensor) | $O(n^2)$ (element-wise backprop)

Why vectorize?

  1. Compute efficiency. GPUs and TPUs are highly optimized for matrix multiplication (GEMM). Reducing tensor contractions to matrix multiplies is much faster than implementing them directly.
  2. Memory efficiency. Materializing a fourth-order Jacobian costs $O(n^4)$ memory, whereas element-wise backprop needs only $O(n^2)$.
  3. Implementation simplicity. If each primitive operation comes with a local gradient rule, the chain rule wires them up automatically — no need to assemble the full tensor Jacobian.

Example: gradient of a fully connected layer

Consider $\boldsymbol{Y} = \boldsymbol{X}\boldsymbol{W}$. The Jacobian $\partial \boldsymbol{Y} / \partial \boldsymbol{W}$ is a fourth-order tensor in theory, but in practice the gradient of the loss is computed with matrix multiplications as shown below. Following standard ML notation we abbreviate the gradient of the loss $L$ with respect to a tensor $\boldsymbol{A}$ by $\mathtt{dA} := \partial L / \partial \boldsymbol{A}$ (distinct from the differential form $d\boldsymbol{A}$ in mathematics). Here $\mathtt{@}$ denotes matrix multiplication and $\mathtt{.T}$ denotes transpose (Python / NumPy / PyTorch conventions).

# Suppose the upstream gradient is dY = ∂L/∂Y (same shape as Y)
dW = X.T @ dY   # ∂L/∂W = Xᵀ (∂L/∂Y)
dX = dY @ W.T   # ∂L/∂X = (∂L/∂Y) Wᵀ

The result matches the one obtained via tensor calculus. In other words, frameworks compute exactly the gradient that tensor calculus predicts, with the same shape, using efficient vectorized matrix operations.
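
The identity can be checked directly against a framework's AD (a small sketch with arbitrary example shapes, assuming PyTorch):

import torch

m, n, p = 4, 3, 2
X = torch.randn(m, n, requires_grad=True)
W = torch.randn(n, p, requires_grad=True)
Y = X @ W
dY = torch.randn(m, p)                               # stand-in for the upstream gradient ∂L/∂Y

dX_auto, dW_auto = torch.autograd.grad(Y, (X, W), grad_outputs=dY)

print(torch.allclose(dW_auto, X.detach().T @ dY))    # True
print(torch.allclose(dX_auto, dY @ W.detach().T))    # True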

In short, deep learning is tensor calculus mathematically, but implementation shortcuts exploit the structure. Understanding the theory tells you why those shortcuts are correct.

VJP and JVP

Automatic differentiation has two modes. Understanding both, and when each applies, is essential.

JVP (Jacobian-vector product) — forward mode

For a function $\boldsymbol{f}: \mathbb{R}^n \to \mathbb{R}^m$ and an input direction $\boldsymbol{v} \in \mathbb{R}^n$, JVP computes $$ \text{JVP}(\boldsymbol{v}) = \boldsymbol{J} \boldsymbol{v} = D\boldsymbol{f}(\boldsymbol{x})[\boldsymbol{v}] $$ where $\boldsymbol{J}$ is the Jacobian.

Intuitively, it tells you how the function changes along the direction $\boldsymbol{v}$. It is efficient when the input dimension $n$ is small: a single JVP recovers one column of the Jacobian.
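
For instance, with JAX (a toy $\mathbb{R}^3 \to \mathbb{R}^2$ function chosen only for illustration), one call to jax.jvp along a basis direction returns one column of the Jacobian:

import jax
import jax.numpy as jnp

def f(x):                                   # f: R^3 -> R^2
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
e0 = jnp.array([1.0, 0.0, 0.0])             # input direction = first basis vector

y, col0 = jax.jvp(f, (x,), (e0,))           # col0 = J e0, the first column of J
print(col0)                                 # [2., 0.]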

VJP (vector-Jacobian product) — reverse mode

Given an output cotangent (adjoint) vector $\boldsymbol{w} \in \mathbb{R}^m$, VJP computes $$ \text{VJP}(\boldsymbol{w}) = \boldsymbol{J}^\top \boldsymbol{w} = D\boldsymbol{f}(\boldsymbol{x})^*[\boldsymbol{w}] $$

Intuitively it pulls back the sensitivity information from the output to the input. It is efficient when the output dimension $m$ is small: a single VJP yields one row of the Jacobian.
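
The reverse-mode counterpart, with the same toy function: one call to jax.vjp with a basis cotangent returns one row of the Jacobian.

import jax
import jax.numpy as jnp

def f(x):                                   # f: R^3 -> R^2, same toy function as above
    return jnp.stack([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
w = jnp.array([1.0, 0.0])                   # cotangent selecting the first output

y, vjp_fn = jax.vjp(f, x)
(row0,) = vjp_fn(w)                         # row0 = Jᵀ w, the first row of J
print(row0)                                 # [2., 1., 0.]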

Why VJP dominates in deep learning

In training, the final output is a scalar loss, whereas the input (parameters) can be millions or billions of dimensions.

  • JVP (forward): need as many evaluations as the input dimension → millions.
  • VJP (reverse): need as many evaluations as the output dimension → one.

Hence deep learning uses VJP (reverse mode) almost exclusively. That is exactly the mathematical identity of backpropagation.

Connection to matrix calculus

Let us make the bridge between AD and matrix calculus explicit.

Agreement with the denominator layout

As explained in the tensor-calculus primer, in the denominator layout the chain rule expresses naturally as a tensor contraction. VJP performs exactly this contraction, so it aligns perfectly with the denominator-layout perspective.

Concept in matrix calculus | Its counterpart in AD
Jacobian $\boldsymbol{J}$ | Never materialized
Gradient $\nabla f$ | Result of VJP$(1)$
Chain rule as contraction | Sequential application of VJP
Gradient shape = input shape | Standard API of every framework

PyTorch's .backward() and JAX's grad() return gradients that match the shape of the input precisely because of the denominator-layout convention.
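
A short check (a minimal sketch assuming PyTorch) shows the convention in action: the gradient populated by .backward() has exactly the shape of the parameter.

import torch

W = torch.randn(3, 5, requires_grad=True)   # parameter of shape (3, 5)
x = torch.randn(5)
loss = (W @ x).sum()                        # scalar loss

loss.backward()                             # one reverse pass = one VJP with w = 1
print(W.grad.shape)                         # torch.Size([3, 5]), same shape as W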

Why transposes "disappear"

Matrix-calculus textbooks are full of $\boldsymbol{J}^\top$. In AD implementations, however, explicit transposes are rare: VJP does not build the matrix and then multiply by its transpose; it directly computes the result that the transposed multiplication would produce.

Transposes are often a source of confusion when learning matrix calculus. From the AD perspective, they are merely a side effect of matrix representation; the essence is the action of differentiation, which is what VJP/JVP compute.

Connection to optimization algorithms

Gradients produced by AD feed directly into optimization algorithms. Let us organize the main families.

A unified template

Many optimizers can be written in a single form,

$$ \boldsymbol{x}_{k+1} = \boldsymbol{x}_k - \alpha_k \boldsymbol{P}_k \boldsymbol{g}_k $$

where $\boldsymbol{g}_k = \nabla f(\boldsymbol{x}_k)$ is the gradient, $\boldsymbol{P}_k$ is a preconditioner, and $\alpha_k$ is a step size. Differences between algorithms amount to different designs of $\boldsymbol{P}_k$.
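
A minimal sketch of the template (the helper names optimize and precondition are hypothetical, and the quadratic objective is a toy example): plugging in the identity for $\boldsymbol{P}_k$ recovers plain gradient descent.

import numpy as np

def optimize(grad_f, x0, precondition, alpha=0.05, steps=200):
    """Iterate x_{k+1} = x_k - alpha * P_k g_k, where `precondition` applies P_k."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        x = x - alpha * precondition(g)
    return x

H = np.diag([1.0, 10.0])                      # toy quadratic f(x) = 0.5 xᵀ H x
grad_f = lambda x: H @ x

x_star = optimize(grad_f, [5.0, 5.0], precondition=lambda g: g)   # P_k = I: plain gradient descent
print(x_star)                                 # approaches the minimizer [0, 0]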

Major algorithms

Algorithm | Preconditioner $\boldsymbol{P}_k$ | Derivative info required | Characteristics
SGD | $\boldsymbol{I}$ (identity) | Gradient only | Simple, robust to noise
Momentum | $\boldsymbol{I}$, applied to an exponential moving average of past gradients | Gradient only | Damps oscillations
Adam | $\text{diag}\bigl(1/(\sqrt{\boldsymbol{v}_k} + \epsilon)\bigr)$ | Gradient only | Per-parameter adaptive step
L-BFGS | Low-rank approximation of the inverse Hessian | Gradients (curvature built from their differences) | Superlinear convergence (for full BFGS); fast in practice
Newton's method | $\boldsymbol{H}^{-1}$ | Gradient and Hessian | Quadratic convergence; impractical at scale

Why Adam is so popular

Adam tracks first- and second-moment estimates of the gradient (mean and uncentered variance) and adapts the per-coordinate step size. The second moment can be loosely viewed as a diagonal curvature approximation, so Adam implicitly uses second-order-like information.

It only needs the gradient, computable with a single VJP, and it automatically absorbs per-parameter scale differences. Hence its near-universal use in deep learning.
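
For reference, one Adam step written out (a minimal NumPy sketch of the standard update; the function name adam_step is just for illustration):

import numpy as np

def adam_step(x, g, m, v, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: moment estimates, bias correction, per-coordinate step."""
    m = beta1 * m + (1 - beta1) * g            # first moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * g ** 2       # second moment (uncentered variance)
    m_hat = m / (1 - beta1 ** t)               # bias correction for the zero init
    v_hat = v / (1 - beta2 ** t)
    x = x - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# usage inside a training loop: x, m, v = adam_step(x, grad(x), m, v, t) with t = 1, 2, ...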

Hessian-vector product (HVP)

Applying Newton's method directly requires the Hessian $\boldsymbol{H}$ and its inverse, which is infeasible at scale. But one can access the Hessian purely through its action on vectors.

$$ \text{HVP}(\boldsymbol{v}) = \boldsymbol{H} \boldsymbol{v} = \nabla(\nabla f \cdot \boldsymbol{v}) $$

This is computed as the gradient of the inner product between the gradient and a fixed vector $\boldsymbol{v}$ (held constant during the outer differentiation). Combining VJP and JVP (or applying reverse mode twice), one evaluates Hessian-vector products without ever materializing $\boldsymbol{H}$. Coupled with conjugate gradients, this yields an approximate (Hessian-free) Newton's method.
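
A minimal PyTorch sketch of this trick on a toy function: the intermediate gradient is kept on the graph with create_graph=True, and differentiating the scalar $\nabla f \cdot \boldsymbol{v}$ yields $\boldsymbol{H}\boldsymbol{v}$ without forming $\boldsymbol{H}$.

import torch

def f(x):                                      # toy scalar function of x ∈ R^3
    return (x ** 4).sum() + (x[0] * x[1]) ** 2

x = torch.randn(3, requires_grad=True)
v = torch.randn(3)                             # direction; treated as a constant

g = torch.autograd.grad(f(x), x, create_graph=True)[0]   # ∇f(x), still differentiable
hvp = torch.autograd.grad(g @ v, x)[0]                    # ∇(∇f · v) = H v

print(hvp)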

Numerical stability

Even mathematically correct formulas can behave unexpectedly in finite-precision floating-point. A baseline understanding of numerical stability is indispensable.

Properties of floating-point arithmetic

  • Constant relative accuracy. Both large and small numbers are represented with comparable relative precision.
  • Catastrophic cancellation. Subtracting nearly equal numbers loses significant digits.
  • Overflow/underflow. Values outside the representable range become infinity or zero.

Typical issues in deep learning

Exploding / vanishing gradients

In deep networks, the chain rule multiplies many matrix actions together. Per-layer amplification or attenuation accumulates, and gradients can explode or vanish.
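
The effect is easy to reproduce (a NumPy sketch with random stand-ins for per-layer Jacobians; the scale factors 0.9 and 1.1 are arbitrary): a per-layer gain slightly below or above 1 shrinks or blows up the backpropagated vector exponentially with depth.

import numpy as np

rng = np.random.default_rng(0)
depth, width = 100, 100

for scale in (0.9, 1.1):                        # per-layer amplification factor
    g = np.ones(width)                          # stand-in for the upstream gradient
    for _ in range(depth):
        # random stand-in for one layer's Jacobian, normalized so its typical gain is ~scale
        J = scale * rng.standard_normal((width, width)) / np.sqrt(width)
        g = J.T @ g
    print(scale, np.linalg.norm(g))             # tiny for 0.9, enormous for 1.1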

Ill-conditioning

A large condition number of the Hessian (ratio of largest to smallest eigenvalue) makes optimization hard. It corresponds to elongated contour ellipses of the loss and slows down gradient descent.

Techniques for stable computation

The log-sum-exp trick

Naively computing $\log \displaystyle\sum_i e^{x_i}$ overflows for large $x_i$. The following rewrite is stable.

$$ \log \displaystyle\sum_i e^{x_i} = m + \log \displaystyle\sum_i e^{x_i - m} \quad \text{where} \quad m = \max_i x_i $$
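
A direct comparison (NumPy sketch):

import numpy as np

def logsumexp(x):
    """Numerically stable log(sum(exp(x))) via the max-shift rewrite above."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(x))))    # naive version: exp overflows, result is inf
print(logsumexp(x))                 # stable: ≈ 1002.41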

Fused softmax + cross-entropy

Computing softmax and cross-entropy separately is prone to intermediate numerical issues. Fusing them improves stability. PyTorch's CrossEntropyLoss internally performs this fusion.
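
A sketch of the difference (extreme logits chosen to expose the issue; assuming PyTorch): the separate softmax-then-log path underflows the target probability to zero and returns inf, whereas the fused computation stays finite.

import torch
import torch.nn.functional as F

logits = torch.tensor([[-100.0, 100.0]])       # the true class gets a very low score
target = torch.tensor([0])

# Separate path: the target probability underflows to 0, so the log blows up
probs = torch.softmax(logits, dim=1)
separate = -torch.log(probs[0, target[0]])

# Fused log-softmax + NLL inside cross_entropy stays finite (loss = 200 here)
fused = F.cross_entropy(logits, target)

print(separate.item(), fused.item())           # inf vs 200.0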

The role of normalization

Normalization layers such as Batch Normalization and Layer Normalization are not mere tricks to speed up training; they play an essential role from the standpoint of numerical stability and optimization.

For instance, Batch Normalization first standardizes its input using the mini-batch mean $\mu$ and variance $\sigma^2$, $$\hat{x} = \dfrac{x - \mu}{\sqrt{\sigma^2 + \epsilon}},$$ and then applies a learnable scale $\gamma$ and shift $\beta$ to produce $y = \gamma \hat{x} + \beta$. The three subsections below analyze why this operation helps optimization.
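
As a concrete reference, here is a minimal sketch of that standardization (PyTorch, training-mode batch statistics only; the running averages and other machinery of a real BatchNorm layer are omitted):

import torch

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch, then apply scale and shift."""
    mu = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)          # biased variance, as in the BN formula
    x_hat = (x - mu) / torch.sqrt(var + eps)
    return gamma * x_hat + beta

x = 50.0 * torch.randn(32, 8) + 7.0             # badly scaled activations
y = batch_norm(x, gamma=torch.ones(8), beta=torch.zeros(8))
print(y.mean(dim=0), y.std(dim=0))              # ≈ 0 and ≈ 1 per feature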

Why normalization is needed

1. Improved conditioning

When the scale of inputs varies across layers, loss contours become distorted and optimization suffers. Normalizing outputs per layer improves the conditioning of the Hessian and makes gradient descent more efficient.

2. Control of the gradient dynamic range

Normalization keeps the magnitude of the gradients flowing through each layer in a reasonable range, mitigating both explosion and vanishing. In the language of the chain rule of AD, the scales of individual factors are kept under control.

3. Smoother loss landscape

Normalization is known to smooth the loss landscape. On a smoother loss surface, gradients point to better directions and larger learning rates remain stable.

What the $\epsilon$ in Adam really does

Adam's update adds a small $\epsilon$ (typically around $10^{-8}$) in the denominator.

$$ \boldsymbol{x} \leftarrow \boldsymbol{x} - \alpha \dfrac{\boldsymbol{m}}{\sqrt{\boldsymbol{v}} + \epsilon} $$

Intuitively, it is a safety device that prevents updates from exploding when the second-moment estimate $v_i$ of the gradient is tiny. More mathematically, it caps the largest entry of the preconditioner $\text{diag}\bigl(1/(\sqrt{\boldsymbol{v}_k} + \epsilon)\bigr)$ at $1/\epsilon$, which in turn bounds the per-parameter effective step size. It is not merely a guard against division by zero but a load-bearing component of optimization stability.

Summary

  • AD is the mechanical application of the chain rule on a computational graph.
  • In deep learning, because the output is scalar (the loss), reverse mode (VJP) is the efficient choice.
  • VJP aligns with the denominator-layout chain rule, and it explains why gradients have the same shape as inputs.
  • Optimization algorithms differ in how they precondition the gradient; that is the unifying view.
  • Numerical stability — via algebraic reformulation and normalization — is essential in practice.
  • Normalization supports optimization by improving conditioning, controlling gradient scale, and smoothing the loss surface.
