Matrix Arithmetic Algorithms

Overview

Matrix<T> is the dynamic-size matrix of sangi; StaticMatrix<T,M,N> is the compile-time fixed-size variant. Both are header-only via math/core/matrix.hpp. This article addresses internal layout and asymptotic behavior of arithmetic operations (addition, multiplication, transpose, elementwise product, Kronecker product, trace). Decompositions, inverses, and determinants are covered in Determinant and Inverse.

For usage details, see the Matrix API Reference.

Storage Layout: Row-Major and Column-Major

Row-Major Layout

sangi Matrix<T> uses row-major storage. Element $a_{ij}$ of an $M \times N$ matrix sits at linear offset $i \cdot N + j$.

$$\mathrm{offset}(i, j) = i \cdot N + j \quad (0 \leq i < M, 0 \leq j < N)$$

This matches C/C++ two-dimensional arrays, the default for Python NumPy, and the tensor layout in PyTorch / TensorFlow. Fortran, LAPACK, and BLAS, by contrast, use column-major storage. The mismatch can be absorbed by transposition combined with the leading-dimension parameter:

$$A_{\mathrm{rowmajor}} = (A^\top)_{\mathrm{colmajor}}$$

That is, a matrix $A$ stored row-major is equivalent to $A^\top$ viewed as column-major. When calling BLAS dgemm, either pass CblasRowMajor or rewrite $C = AB$ as $C^\top = B^\top A^\top$ with appropriate leading dimensions.

Static StaticMatrix<T,M,N>

StaticMatrix<T,M,N> uses std::array<T, M*N> as storage. No heap allocation occurs, and because $M$ and $N$ are compile-time constants, all loops are eligible for full unrolling. For $4 \times 4$ matrices (SE(3) transforms) or $3 \times 3$ (rotations), it is markedly faster than dynamic Matrix<T>.

Leading Dimension and Views

View classes such as MatrixMap<T> carry a leading dimension (LD). When extracting the top-left $r \times c$ block of an $M \times N$ matrix, the sub-matrix's rows are separated by padding:

$$\mathrm{offset}(i, j) = i \cdot LD + j, \quad LD = N \geq c$$

Memory traversal across rows of the sub-view is therefore not contiguous; SIMD packet loads remain valid only within a row.

Addition and Transpose

Matrix addition $C = A \pm B$ involves $M \cdot N$ elementwise operations and uses the same expression-template + SIMD machinery as Vector arithmetic. Complexity is $O(MN)$; the operation is memory-bandwidth bound and benefits directly from four-accumulator unrolling.

Transpose

The transpose $B = A^\top$ swaps elements mathematically, but because rows and columns cross in memory, the naive implementation incurs cache misses on every access. sangi adopts a cache-tile transpose that operates on $b \times b$ blocks small enough to fit in L1:

$$B[i][j] = A[j][i], \quad (i, j) \in [bi, bi+b) \times [bj, bj+b)$$

The tile size $b$ is chosen based on L1 capacity (with L1 = 32 KiB and $T = $ double at 8 B, accounting for two matrices, $b = 32$ is a good baseline). Within a $b \times b$ tile, SIMD shuffle instructions (the _MM_TRANSPOSE4_PS family) apply.

An alternative is a lazy "transpose view" that defers the swap, but sangi materializes the transpose eagerly: downstream operations such as matrix multiplication achieve consistent throughput when given pre-transposed data.

Related article: Foundations of Vector Spaces

General N×N Multiplication

For $C = AB$ with $A \in \mathbb{R}^{M \times K}$, $B \in \mathbb{R}^{K \times N}$, $C \in \mathbb{R}^{M \times N}$:

$$c_{ij} = \sum_{k=0}^{K-1} a_{ik} b_{kj}$$

The naive three-loop implementation is $O(MNK)$. With both $A$ and $B$ in row-major, access to $b_{kj}$ has stride $N$, hurting cache locality. sangi addresses this via the following three-tier dispatch.

Size range	Algorithm	Complexity
Small ($N \lesssim 64$)	Naive ijk loop + SIMD	$O(N^3)$
Medium ($64 \lesssim N \lesssim 512$)	Block multiplication with inner micro-kernel	$O(N^3)$
Large ($N \gtrsim 1024$)	Strassen with classical block at recursion base	$O(N^{\log_2 7}) \approx O(N^{2.807})$

When a BLAS backend (OpenBLAS / Intel MKL) is available, medium and large sizes can dispatch to dgemm. Even without BLAS, the in-tree block implementation captures most of the memory-hierarchy benefit.

Cache Blocking

Cache-aware GEMM follows the Goto-Van de Geijn framework, which uses three block sizes $m_c, k_c, n_c$ matched to L1, L2, and L3 capacities. sangi does not hand-tune the innermost micro-kernel; it relies on the compiler's SIMD auto-vectorization and tunes only the outer block sizes through machine-specific benchmarks.

$$C_{ij} \mathrel{+}= A_{ip} \cdot B_{pj}$$

(where $A_{ip}$ is $m_c \times k_c$, $B_{pj}$ is $k_c \times n_c$, and $C_{ij}$ is $m_c \times n_c$.)

$k_c$: chosen so an $A$ panel fits in L1. For double, $k_c \approx 256$
$m_c$: an $A$ panel plus a $B$ slice fit in L2. Typically $m_c \approx 96$
$n_c$: a $B$ panel fits in L3. Typically $n_c \approx 4096$

Packing $B$ block by block makes inner-loop access contiguous. Since packing happens once per outer iteration, the cost is amortized when $k_c$ and $n_c$ are large enough.

When a BLAS backend is selected, the library performs its own tuning. The in-tree implementation defaults to $m_c = 96, k_c = 256, n_c = 4096$ for portability, with an auto-tuning hook planned for future releases.

Strassen's Algorithm

Strassen's algorithm (1969) recursively reduces a $2 \times 2$ block multiplication from 8 sub-products to 7, achieving asymptotic complexity $O(N^{\log_2 7}) \approx O(N^{2.807})$.

Partition $A$, $B$, $C$ into $2 \times 2$ blocks:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

Compute seven intermediate products:

$$\begin{aligned} M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) \\ M_2 &= (A_{21} + A_{22}) B_{11} \\ M_3 &= A_{11} (B_{12} - B_{22}) \\ M_4 &= A_{22} (B_{21} - B_{11}) \\ M_5 &= (A_{11} + A_{12}) B_{22} \\ M_6 &= (A_{21} - A_{11}) (B_{11} + B_{12}) \\ M_7 &= (A_{12} - A_{22}) (B_{21} + B_{22}) \end{aligned}$$

Reconstruct the result blocks by linear combination:

$$\begin{aligned} C_{11} &= M_1 + M_4 - M_5 + M_7 \\ C_{12} &= M_3 + M_5 \\ C_{21} &= M_2 + M_4 \\ C_{22} &= M_1 - M_2 + M_3 + M_6 \end{aligned}$$

Threshold and Numerical Stability

Strassen performs more additions and subtractions than the classical block method, so it is slower at small sizes. sangi uses a threshold of $N \approx 1024$ (for double) and falls back to the classical block routine at the recursion base.

The error bound of Strassen is looser than that of the classical algorithm, but in practice with IEEE 754 double precision and $N \leq 10^4$ the discrepancy is not detectable. Strassen is opt-in in sangi; it runs only on explicit request. For most workloads the classical block routine plus an available BLAS backend wins on constant factors.

Hadamard Product

The Hadamard (elementwise) product, denoted $\circ$:

$$(A \circ B)_{ij} = a_{ij} \cdot b_{ij}$$

Complexity is $O(MN)$ and uses the same expression-template + SIMD pipeline as matrix addition. Frequent appearances include gating in neural networks (LSTM/GRU forget gates) and elementwise products in probabilistic models.

$A \circ B$ should not be confused with the matrix product $AB$. The free function hadamard(A, B) is currently provided for the Tensor type only; the corresponding overload for Matrix is planned, and elementwise products on matrices are obtained today via expression templates or an explicit loop.

Kronecker Product

For $A \in \mathbb{R}^{M \times N}$ and $B \in \mathbb{R}^{P \times Q}$, the Kronecker product $A \otimes B$ produces an $MP \times NQ$ matrix:

$$(A \otimes B)_{(i p + k)(j q + l)} = a_{ij} \cdot b_{kl}$$

Complexity is $O(MNPQ)$, and the output size is the product of the inputs. Applications include tensor-product space representations, vectorization of the Sylvester equation $AX + XB = C$, and group representation theory.

The sangi implementation fills the output via a four-loop body. For each outer pair $(i, j)$ over $A$, a scaled copy of $B$ is written into the corresponding block, so the innermost loops ($k, l$ over $B$) write contiguous memory and vectorize on SIMD packets.

In practice, materializing $A \otimes B$ is often avoidable; the identity $\mathrm{vec}(BXA^\top) = (A \otimes B) \mathrm{vec}(X)$ allows evaluating only the action. The sangi Sylvester solver uses this implicit form.

Trace

$$\mathrm{tr}(A) = \sum_{i=0}^{N-1} a_{ii}$$

The sum of the diagonal of a square matrix; complexity is $O(N)$. Diagonal addressing uses offsets $i \cdot N + i$, i.e. stride $N + 1$, which precludes SIMD packet loads. Even for large $N$, a scalar loop is sufficient since the diagonal is sparse in memory.

The identity $\mathrm{tr}(AB) = \sum_{i,j} a_{ij} b_{ji}$ is heavily used. With trace(A * B) the expression template avoids materializing $AB$, but the fully fused form is exposed as a dedicated trace_product(A, B) function.

Dispatch for Unbalanced Shapes

When $M$, $K$, $N$ differ greatly in $A: M \times K$ and $B: K \times N$, memory-access patterns are skewed and a single block size cannot capture the optimization opportunity. Common cases:

Outer-product regime ($K \ll M, N$): small $K$, low-rank $C$; expressible directly as a sum of column vectors
Inner-product regime ($M, N \ll K$): small $C$, contraction dominated by the long $K$ axis; each element is a long dot product
Panel regime ($M \approx K, N \ll K$): tall panel times a slim matrix; analogous to BLAS Level 2.5

sangi dispatches on the dimension ratio $\rho = K / \min(M, N)$:

$\rho < 0.25$ → outer-product treatment (set of rank-1 updates)
$\rho > 4$ → inner-product treatment (long $k_c$-block reductions)
$0.25 \leq \rho \leq 4$ → standard block multiplication

Even extreme aspect ratios thereby retain reasonable throughput. BLAS expresses the same dispatch implicitly through dgemm's transA/transB and leading-dimension arguments, but the sangi in-tree implementation makes it explicit.

Design Decisions

Why Row-Major

Row-major aligns with C/C++ idioms and the default in Python NumPy / PyTorch, eliminating copies and transposes when exchanging data with external sources (images, tensors, Pandas DataFrames). LAPACK is column-major, but BLAS Level 3 routines accept CblasRowMajor, treating row-major natively. When column-major is required, a transpose view suffices and incurs asymptotically zero cost.

Why Strassen Is Off by Default

When a BLAS backend is available, for $N \leq 10^4$ the classical GEMM (hand-tuned micro-kernel plus parallelization) often outpaces Strassen. Each recursive level of Strassen incurs $O(N^2)$ additions and allocations, inflating the constant factor. sangi ships a portable implementation that runs without BLAS, but Strassen is enabled only on explicit request; the default path is the classical block routine, optionally backed by BLAS.

Parallelization

The outer blocks of matrix multiplication ($i, j$ block axes) admit OpenMP parallelization, but sangi leaves it disabled by default to avoid contending with process-wide parallelism (Python multiprocessing, TBB, external thread pools). Users opt into parallelism explicitly through a num_threads parameter.

References

Strassen, V. (1969). "Gaussian Elimination is not Optimal". Numerische Mathematik, 13, 354–356.
Goto, K. and van de Geijn, R. (2008). "Anatomy of High-Performance Matrix Multiplication". ACM Transactions on Mathematical Software, 34 (3), Article 12, 12:1–12:25.
Higham, N. J. Accuracy and Stability of Numerical Algorithms. 2nd ed. SIAM, 2002. (Error analysis of Strassen)