LU, Cholesky, LDL Factorizations and Pivoting
Overview
The fundamental route to solving $A\mathbf{x} = \mathbf{b}$ is Gaussian elimination together with its factored form, LU decomposition. For symmetric matrices the cost is halved by Cholesky or LDL factorization, and choosing the right factorization for the structure of $A$ (positive definite, indefinite, sparse, integer-valued) simultaneously optimizes flops, memory, and numerical stability.
This page covers the mathematical structure of LU, Cholesky, and LDL factorizations; the meaning of pivoting strategies; the Bareiss algorithm for exact integer determinants; blocked LU as a BLAS-3 kernel; Bunch-Kaufman block-diagonal pivoting; and incomplete Cholesky preconditioners, then explains how sangi dispatches to each implementation.
For API usage, see LinAlg API — LU, Cholesky, LDL, Gaussian elimination, and pivot strategies.
Gaussian Elimination and LU
Gaussian elimination operates on the augmented matrix $[A \mid \mathbf{b}]$: forward elimination reduces it to upper triangular form, and back substitution recovers the solution. Writing the forward elimination as a product of matrices yields LU decomposition.
At stage $k$ the Frobenius transformation
$$M_k = I - \mathbf{m}_k \mathbf{e}_k^\top, \quad m_{ik} = a_{ik}^{(k)} / a_{kk}^{(k)} \;\; (i > k)$$
zeros the subdiagonal entries of column $k$. Composing $n-1$ such steps gives
$$M_{n-1} \cdots M_1 A = U \quad\Longleftrightarrow\quad A = (M_1^{-1} \cdots M_{n-1}^{-1}) U = LU,$$
where each $M_k^{-1} = I + \mathbf{m}_k \mathbf{e}_k^\top$ contributes one column of the unit lower-triangular factor $L$. Adding pivoting yields $PA = LU$ with $P$ a permutation.
Gaussian elimination and LU factorization are two notations for the same computation; both cost $O(n^3)$ flops.
The advantage of returning $L$ and $U$ explicitly is reuse: when many right-hand sides share the same coefficient matrix,
$A\mathbf{x}_i = \mathbf{b}_i$ can be solved in $O(n^2)$ per right-hand side via triangular substitution.
sangi exposes both the factored form via lu_decomposition and the direct solve via gaussian_elimination.
See also: LU Decomposition / Numerical Analysis — LU
Pivoting Strategies
When $a_{kk}^{(k)}$ vanishes or is small, the multipliers $m_{ik}$ explode and cancellation destroys the result.
sangi offers three strategies via the PivotStrategy enumeration.
None
Uses the diagonal entry directly without swaps. Safe only for diagonally dominant or structurally non-singular matrices. Minimum overhead, but catastrophic on ill-conditioned inputs.
Partial Pivoting
At stage $k$, exchange row $k$ with the row containing the column-$k$ entry of largest magnitude on or below the diagonal:
$$p = \arg\max_{i \geq k} \, |a_{ik}^{(k)}|.$$
Per-stage cost is $O(n)$, total $O(n^2)$. In practice it suffices for almost all numerically reasonable matrices, and it is the default in sangi.
Full Pivoting
Search the entire remaining submatrix $A^{(k)}_{[k:n,k:n]}$ for the largest entry and swap both its row and column to position $(k,k)$:
$$(p, q) = \arg\max_{i,j \geq k} \, |a_{ij}^{(k)}|.$$
Column swaps require an additional permutation $Q$, with the solution recovered as $\mathbf{x} = Q (LU)^{-1} P \mathbf{b}$. Per-stage cost is $O((n-k)^2)$ and total $O(n^3)$ — roughly double partial pivoting. Useful as a safety net for nearly rank-deficient or pathologically ill-conditioned matrices.
Rook and Threshold Pivoting
Rook pivoting (named after the chess piece) picks an entry that is simultaneously the largest in its row and largest in its column. Its average cost approaches $O(n^2)$ while providing stability close to full pivoting. Threshold pivoting accepts the current candidate if its magnitude is at least $\tau$ times the maximum candidate ($0 < \tau \le 1$); this is the standard tool for controlling fill-in in sparse direct solvers. sangi uses Partial / Full for dense LU and threshold pivoting internally for sparse LU.
Rational-Type Specialization
When $T = $ Rational, the pivot is chosen by minimum height, where
$h(p/q) = \max(|p|, |q|)$.
Selecting low-height pivots prevents exponential growth of numerators and denominators during subsequent eliminations.
Since rational arithmetic is exact, the goal here is computational cost rather than stability.
Wilkinson's Growth Factor and Stability
The ratio of the maximum intermediate entry to the maximum input entry,
$$\rho_n(A) = \frac{\max_{i,j,k} \, |a_{ij}^{(k)}|}{\max_{i,j} \, |a_{ij}|},$$
is called the growth factor. Backward error analysis (Wilkinson 1961) shows that the computed solution $\hat{\mathbf{x}}$ is the exact solution of a perturbed system $(A + \Delta A) \hat{\mathbf{x}} = \mathbf{b}$ with $\|\Delta A\|_\infty$ bounded by $\rho_n(A) \cdot u \cdot \|A\|_\infty$ ($u$ being unit roundoff). Smaller growth therefore means smaller backward error.
- Partial pivoting: worst case $\rho_n \le 2^{n-1}$, but typically $\rho_n \le 10$. Exponential growth occurs only on artificially constructed Wilkinson matrices.
- Full pivoting: $\rho_n \le n^{1/2}(2 \cdot 3^{1/2} \cdots n^{1/(n-1)})^{1/2}$, polynomial in $n$.
- Symmetric positive definite (Cholesky): $\rho_n = 1$. No growth; pivoting is unnecessary.
Ill-conditioning is measured by $\kappa(A) = \|A\| \cdot \|A^{-1}\|$. When $\kappa(A)$ approaches $1/u$ even pivoted LU loses all
significant digits, and SVD-based regularization or iterative refinement becomes necessary;
see svd_solve with its rcond damping parameter.
See also: Condition Number / Wilkinson Growth Factor
Bareiss Algorithm (Integer Determinants)
Standard floating-point LU contaminates an integer determinant with roundoff, while rational LU keeps the result exact but causes exponential blow-up of numerator and denominator sizes. The Bareiss algorithm computes the determinant of an integer matrix with only integer arithmetic and no genuine division, based on Sylvester's determinant identity.
At stage $k$ the elimination update is
$$M_{ij}^{(k)} = \frac{M_{ij}^{(k-1)} M_{kk}^{(k-1)} - M_{ik}^{(k-1)} M_{kj}^{(k-1)}}{M_{k-1,k-1}^{(k-2)}}.$$
The numerator is a $2 \times 2$ minor; Sylvester's identity guarantees that it is exactly divisible by $M_{k-1,k-1}^{(k-2)}$, so this "division" is an integer division. The final pivot $M_{n-1,n-1}^{(n-2)}$ equals $\det(A)$.
Complexity is $O(n^3)$, but intermediate integers can grow large, so multi-precision integers (sangi's Int) are required.
By Hadamard's inequality $|\det A| \le \prod_i \|\mathbf{a}_i\|_2$, intermediate bit lengths are bounded by $O(n \log(nM))$
where $M$ is the largest input magnitude.
sangi's Matrix::determinant() dispatches to bareiss_determinant automatically when
numeric_traits<T>::is_integer is true. Floating-point types follow the LU path
(lu_determinant, product of diagonals times the sign of the permutation).
Blocked LU and BLAS-3 Optimization
Naive LU is dominated by Level-2 BLAS (axpy-like) operations whose flop-to-byte ratio is bounded by memory bandwidth.
Blocked LU partitions $A$ into $b \times b$ tiles and recasts each update step as a matrix multiply (Level-3 BLAS gemm),
maximizing cache reuse.
Partition
$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}.$$
Factor the $b \times b$ block $A_{11} = L_{11} U_{11}$ with the naive algorithm, then compute
$$L_{21} = A_{21} U_{11}^{-1}, \quad U_{12} = L_{11}^{-1} A_{12}, \quad A_{22}^{(1)} = A_{22} - L_{21} U_{12}.$$
Here $L_{21}$ and $U_{12}$ are obtained via triangular solves (trsm, Level-3), and the Schur complement update is a single
gemm. Each step has flop-to-memory ratio $O(1/b)$, far better than the unblocked version.
Recursion on the $(n-b) \times (n-b)$ Schur complement gives the full factorization.
Block size $b$ is chosen from L1/L2 cache sizes and arithmetic intensity; typical values are $b = 32 \sim 128$.
sangi invokes dgetrf through MKL/OpenBLAS when available, and uses an internal panel-update implementation otherwise.
See also: Strassen and Block Updates
Cholesky Factorization
If $A$ is symmetric and positive definite ($\mathbf{x}^\top A \mathbf{x} > 0$ for all $\mathbf{x} \neq 0$), there exists a unique lower-triangular $L$ with $A = LL^\top$.
The recursive formulas are
$$L_{kk} = \sqrt{A_{kk} - \sum_{j=1}^{k-1} L_{kj}^2}, \quad L_{ik} = \frac{1}{L_{kk}} \left(A_{ik} - \sum_{j=1}^{k-1} L_{ij} L_{kj}\right) \quad (i > k).$$
Symmetry halves the flop count to $n^3/3$, positive definiteness guarantees $L_{kk}^2 > 0$ so no pivoting is needed, and the growth factor $\rho_n = 1$ makes the factorization exceptionally stable.
Cholesky underlies least-squares normal equations $A^\top A \mathbf{x} = A^\top \mathbf{b}$, Kalman filter covariance updates, Gaussian-process log-likelihoods, and the diagonal blocks of KKT systems in quadratic programming — anywhere positive definiteness is structurally guaranteed.
A blocked variant uses Cholesky on the diagonal block, trsm off-diagonal, and syrk (symmetric rank-$k$ update)
for the Schur complement.
LDL Factorization and Bunch-Kaufman Pivoting
When symmetry holds but positive definiteness does not, the Cholesky square root fails. The remedy is $A = LDL^\top$ with unit lower-triangular $L$ and diagonal $D$:
$$D_{kk} = A_{kk} - \sum_{j=1}^{k-1} D_{jj} L_{kj}^2, \quad L_{ik} = \frac{1}{D_{kk}} \left(A_{ik} - \sum_{j=1}^{k-1} D_{jj} L_{ij} L_{kj}\right).$$
Eliminating the square root sidesteps the cost of $\sqrt{\,}$ and avoids failure on negative diagonal entries. Flop count is the same as Cholesky, $n^3/3$.
Bunch-Kaufman Pivoting
Plain LDL still fails when a pivot is small or zero. Bunch-Kaufman pivoting introduces a block-diagonal $D$ with $1 \times 1$ and $2 \times 2$ blocks:
$$D = \mathrm{blockdiag}(D_1, D_2, \ldots), \quad D_k \in \mathbb{R}^{1\times 1} \cup \mathbb{R}^{2\times 2}.$$
At each step the algorithm tests whether a symmetric row/column swap yields a safe $1 \times 1$ pivot, and otherwise groups two adjacent diagonal entries into a $2 \times 2$ block. The strategy preserves symmetry while handling indefinite or near-singular symmetric matrices.
Applications include saddle-point KKT systems in constrained optimization, indefinite Hessians in trust-region methods, and the block-diagonal saddle systems arising in symmetric interior-point methods.
See also: LDL Decomposition
Incomplete Cholesky as a Preconditioner
Iterative solvers (conjugate gradient, MINRES, etc.) on large sparse systems converge faster when accelerated by a preconditioner $M \approx A$. Incomplete Cholesky (IC) preserves the symmetric structure while limiting fill-in.
The level-zero variant $\mathrm{IC}(0)$ uses the sparsity pattern $\mathcal{S} = \{(i,j) : A_{ij} \neq 0\}$ and discards any fill-in produced by elimination:
$$\tilde{L}_{ik} = \begin{cases} (A_{ik} - \sum_{j<k} \tilde{L}_{ij} \tilde{L}_{kj}) / \tilde{L}_{kk} & (i,k) \in \mathcal{S} \\ 0 & \text{otherwise.} \end{cases}$$
The approximation $\tilde{M} = \tilde{L} \tilde{L}^\top$ is not $A$, but using $\tilde{M}^{-1} \mathbf{r}$ inside preconditioned CG (PCG) dramatically improves $\kappa(\tilde{M}^{-1} A)$. Variants $\mathrm{IC}(p)$ allow $p$ levels of fill-in, and modified IC adjusts the diagonal to match row sums.
Even for symmetric positive definite $A$, incomplete Cholesky can produce negative pivots; a small diagonal shift $\alpha I$ (Manteuffel shift) is a standard remedy.
Incomplete LU (ILU) is the asymmetric counterpart; sangi's sparse iterative solvers and their preconditioner choices are covered in the iterative-and-sparse page.
See also: Incomplete LU Preconditioner
sangi's Automatic Dispatch
sangi LinAlg factorization functions select their internal implementation based on the type and structure of the input.
Users typically only need to call the high-level lu_solve, cholesky_solve, ldl_solve wrappers.
| Input | Factorization used | Notes |
|---|---|---|
| Dense general square | LU + partial pivoting | Default; PivotStrategy::Full switches to full pivoting |
| Dense symmetric positive definite | Cholesky | Explicit cholesky_solve; MKL path uses dpotrf/dpotrs |
| Dense symmetric indefinite | LDL + Bunch-Kaufman | Explicit ldl_solve |
Integer determinant() | Bareiss | Division-free, all intermediates remain integer |
Matrix<Rational> | LU + minimum-height pivoting | Mitigates denominator blow-up |
| Sparse (CSR/CSC) | Simplicial Cholesky / sparse LU | See the sparse page |
By size, sangi uses naive LU for $n < 64$, blocked LU for $n \ge 64$ (MKL/OpenBLAS when available, internal fallback otherwise), and parallel panel factorization at $n \ge 1024$. Thresholds are empirically tuned and require no user action.
References
- Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations, 4th ed., Johns Hopkins University Press. Chapters 3–4.
- Wilkinson, J. H. (1961). "Error analysis of direct methods of matrix inversion". J. ACM, 8(3), 281–330.
- Bareiss, E. H. (1968). "Sylvester's identity and multistep integer-preserving Gaussian elimination". Math. Comp., 22(103), 565–578.
- Bunch, J. R. and Kaufman, L. (1977). "Some stable methods for calculating inertia and solving symmetric linear systems". Math. Comp., 31(137), 163–179.
- Higham, N. J. (2002). Accuracy and Stability of Numerical Algorithms, 2nd ed., SIAM.