Applications to Machine Learning

Notation Convention
All formulas on this page use the denominator layout convention. See Layout Conventions for details.

Table of Contents

  • Activation Functions (6.9-6.10)
  • Fully Connected Layers (17.1-17.3)
  • Normalization Layers (17.4-17.7)
  • Attention Mechanisms (17.8-17.12)
  • Convolution & Pooling (17.13-17.17)
  • Regularization (17.18-17.19)
  • VAE (17.20-17.22)
  • 18.1 Gradient of SVD Backpropagation
  • 18.2 Gradient of Singular Values
  • 18.3 Definition of the Fisher Information Matrix
  • 18.4 Hessian Representation of the Fisher Information Matrix
  • 18.5 Natural Gradient
  • 18.13 Policy Gradient Theorem
  • 18.14 Policy Gradient with Baseline
  • 18.18 Gradient of Skip-gram (Negative Sampling)
  • 18.19 Gradient of GloVe
  • 18.15 InfoNCE Loss Function
  • 18.18 Gradient of Cholesky Decomposition
  • 18.13 Sinkhorn Distance
  • 18.16 Gaussian Processes
  • 18.23 Belief Propagation
  • 18.26 Dictionary Learning & LASSO

Computer Vision

3.14 Differentiation of the Homography Matrix

Formula: $\displaystyle\frac{\partial \boldsymbol{p}'}{\partial \boldsymbol{H}} = \displaystyle\frac{1}{w'}\begin{pmatrix} \boldsymbol{p}^\top & \boldsymbol{0}^\top & -x'\boldsymbol{p}^\top \\ \boldsymbol{0}^\top & \boldsymbol{p}^\top & -y'\boldsymbol{p}^\top \end{pmatrix}$
Conditions: $\boldsymbol{p}' = \pi(\boldsymbol{H}\boldsymbol{p})$, $\pi$: perspective normalization (division by the third homogeneous coordinate), $\boldsymbol{p} = (x, y, 1)^\top$: homogeneous point, $w' = \boldsymbol{h}_3^\top \boldsymbol{p}$ with $\boldsymbol{h}_3^\top$ the third row of $\boldsymbol{H}$
Explanation

Jacobian of a transformed point with respect to the entries of a 2D projective transformation (homography) $\boldsymbol{H} \in \mathbb{R}^{3 \times 3}$, flattened row-wise as $(\boldsymbol{h}_1^\top, \boldsymbol{h}_2^\top, \boldsymbol{h}_3^\top)$. Used in image registration, panorama stitching, and augmented reality.

The homogeneous coordinates $\tilde{\boldsymbol{p}} = \boldsymbol{H}\boldsymbol{p}$ are normalized to obtain image coordinates $\boldsymbol{p}' = (x', y')^\top = (\tilde{p}_1/\tilde{p}_3, \tilde{p}_2/\tilde{p}_3)^\top$.
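As a sanity check, here is a minimal NumPy sketch (the function names are illustrative, not from any library) that builds this $2 \times 9$ Jacobian and verifies it against central differences:

```python
import numpy as np

def project(H, p):
    """Perspective projection pi(Hp) of a homogeneous point p = (x, y, 1)."""
    q = H @ p
    return q[:2] / q[2]

def homography_point_jacobian(H, p):
    """2x9 Jacobian of p' = pi(Hp) w.r.t. the row-wise flattened entries of H."""
    q = H @ p
    w = q[2]                          # w' = h3^T p
    x, y = q[0] / w, q[1] / w         # image coordinates (x', y')
    z = np.zeros(3)
    row_x = np.concatenate([p, z, -x * p])   # dx'/d(h1, h2, h3)
    row_y = np.concatenate([z, p, -y * p])   # dy'/d(h1, h2, h3)
    return np.stack([row_x, row_y]) / w

# Verify against central differences at a typical homography and point.
H = np.array([[1.0, 0.1, 5.0],
              [-0.2, 1.1, 3.0],
              [1e-3, 2e-3, 1.0]])
p = np.array([10.0, 20.0, 1.0])
J = homography_point_jacobian(H, p)

eps = 1e-6
J_num = np.zeros((2, 9))
for k in range(9):
    d = np.zeros(9)
    d[k] = eps
    dH = d.reshape(3, 3)              # row-major reshape matches J's row-wise layout
    J_num[:, k] = (project(H + dH, p) - project(H - dH, p)) / (2 * eps)

assert np.allclose(J, J_num, rtol=1e-6, atol=1e-8)
```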

Medical Image Reconstruction

12.13 Gradient of Tikhonov Regularization

Formula: $\displaystyle\frac{\partial J}{\partial \boldsymbol{x}} = 2\boldsymbol{A}^\top(\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}) + 2\lambda\boldsymbol{L}^\top\boldsymbol{L}\boldsymbol{x}$
Conditions: $J(\boldsymbol{x}) = \|\boldsymbol{A}\boldsymbol{x} - \boldsymbol{y}\|^2 + \lambda\|\boldsymbol{L}\boldsymbol{x}\|^2$, $\boldsymbol{A}$: forward model matrix, $\boldsymbol{L}$: regularization matrix
Explanation

Tikhonov regularization stabilizes the ill-posed inverse problem in CT/MRI image reconstruction. Setting the gradient to zero yields the closed-form Tikhonov solution $\boldsymbol{x}^* = (\boldsymbol{A}^\top\boldsymbol{A} + \lambda\boldsymbol{L}^\top\boldsymbol{L})^{-1}\boldsymbol{A}^\top\boldsymbol{y}$.
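A minimal NumPy sketch, assuming a random stand-in for the forward model, that checks the gradient formula vanishes at the closed-form solution:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 20
A = rng.standard_normal((m, n))     # stand-in for the forward model
y = rng.standard_normal(m)          # measured data
L = np.eye(n)                       # identity regularizer (classical Tikhonov)
lam = 0.1

def grad_J(x):
    """Gradient of J(x) = ||Ax - y||^2 + lam * ||Lx||^2."""
    return 2 * A.T @ (A @ x - y) + 2 * lam * (L.T @ L @ x)

# Closed-form minimizer: the gradient vanishes at x*.
x_star = np.linalg.solve(A.T @ A + lam * L.T @ L, A.T @ y)
assert np.allclose(grad_J(x_star), 0, atol=1e-9)
```

In an actual CT setting, $\boldsymbol{A}$ would be the discretized projection (Radon) operator and $\boldsymbol{L}$ typically a finite-difference matrix penalizing roughness.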

12.15 Subgradient of Total Variation (TV) Regularization

Formula: $\displaystyle\frac{\partial}{\partial \boldsymbol{x}}\text{TV}(\boldsymbol{x}) = -\text{div}\left(\displaystyle\frac{\nabla \boldsymbol{x}}{|\nabla \boldsymbol{x}|}\right)$
Conditions: $\text{TV}(\boldsymbol{x}) = \|\nabla \boldsymbol{x}\|_1$ (the gradient magnitude $|\nabla \boldsymbol{x}|$ summed over all pixels); non-differentiable where $|\nabla \boldsymbol{x}| = 0$
Explanation

Total variation regularization removes noise while preserving edges, making it well-suited for medical imaging. In practice, the non-differentiability at $|\nabla \boldsymbol{x}| = 0$ is handled by replacing the denominator with the smoothed magnitude $|\nabla \boldsymbol{x}| + \epsilon$, where $\epsilon > 0$ is a small constant.
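A minimal NumPy sketch of this smoothed subgradient on a 2-D image, assuming forward differences with replicate (Neumann) boundaries and the matching discrete divergence (names are illustrative):

```python
import numpy as np

def tv_gradient(x, eps=1e-3):
    """Smoothed TV subgradient -div(grad x / (|grad x| + eps)) of a 2-D image."""
    # Forward-difference gradient; replicating the last row/column makes the
    # difference zero at the far boundary.
    gx = np.diff(x, axis=0, append=x[-1:, :])
    gy = np.diff(x, axis=1, append=x[:, -1:])
    mag = np.sqrt(gx**2 + gy**2) + eps      # smoothed gradient magnitude
    px, py = gx / mag, gy / mag             # normalized gradient field
    # Backward-difference divergence (the negative adjoint of the gradient).
    return -(np.diff(px, axis=0, prepend=0) + np.diff(py, axis=1, prepend=0))

# One explicit TV-flow step on a random image (illustrative only; a real
# denoiser would also include a data-fidelity term).
rng = np.random.default_rng(0)
img = rng.standard_normal((64, 64))
img_smoother = img - 0.1 * tv_gradient(img)
```

Pairing the forward-difference gradient with the backward-difference divergence keeps the two operators adjoint to each other, so the discrete subgradient is consistent with the continuous formula above.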

Detail Pages

For detailed formulas and proofs in each area, please see the following pages: