QR Factorization — Householder, Givens, MGS, Column Pivoting

Overview

QR factorization $A = QR$ writes an $m \times n$ matrix ($m \geq n$) as the product of an orthogonal $Q \in \mathbb{R}^{m \times m}$ and an upper-triangular $R \in \mathbb{R}^{m \times n}$. Orthogonal transformations preserve the condition number and propagate roundoff minimally, making QR one of the most stable factorizations in numerical linear algebra.

Three classical algorithms construct it, each with its own niche:

  • Householder reflections — bulk triangularization with minimum flops on dense matrices
  • Givens rotations — local single-entry zeroing, ideal for sparse, banded, or Hessenberg structure
  • Gram-Schmidt orthogonalization — column-by-column, essential to Krylov methods like Arnoldi and GMRES

This page covers the mathematical structure of each, their stability behavior, BLAS-3 blocking via the WY representation, rank-revealing column pivoting, and applications to least squares and eigenvalue computation.

For API usage, see LinAlg API — QR.

Householder Reflections

A Householder reflection

$$H = I - 2 \frac{\mathbf{v} \mathbf{v}^\top}{\mathbf{v}^\top \mathbf{v}}$$

is an orthogonal symmetric matrix representing reflection through the hyperplane perpendicular to $\mathbf{v}$. Choosing $\mathbf{v} = \mathbf{x} \pm \|\mathbf{x}\|_2 \mathbf{e}_1$ yields $H\mathbf{x} = \mp \|\mathbf{x}\|_2 \mathbf{e}_1$, aligning $\mathbf{x}$ with the first axis in one shot. The sign is taken as $\mathrm{sign}(x_1)$ to avoid cancellation when forming $\mathbf{v}$ (Wilkinson's sign rule).

The triangularization algorithm is:

  1. Form a reflector $H_1$ for the first column $\mathbf{a}_1$ and apply $H_1 A$ to zero its subdiagonal entries
  2. Recurse on the trailing $(m-1) \times (n-1)$ submatrix with $H_2, H_3, \ldots, H_n$
  3. The composition $H_n \cdots H_2 H_1 A = R$ is upper triangular; $Q = H_1 H_2 \cdots H_n$ (each $H_i$ is symmetric)

Flop count is $2mn^2 - \tfrac{2}{3}n^3$, asymptotically optimal for dense QR. $Q$ is typically not formed explicitly; instead the reflection vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ are stored in the zeroed subdiagonal of $R$, and $Q\mathbf{x}$ or $Q^\top \mathbf{b}$ is evaluated on demand via matrix-vector products. Memory drops from $O(m^2)$ to $O(mn)$.

Blocked Householder and the WY Representation

Applying $H_k$ one at a time uses Level-2 BLAS (matrix-vector products). The WY representation aggregates $b$ consecutive reflections into a single Level-3 BLAS multiplication.

Write the cumulative product as

$$H_1 H_2 \cdots H_k = I - W_k T_k W_k^\top,$$

where $W_k \in \mathbb{R}^{m \times k}$ stacks the reflection vectors and $T_k \in \mathbb{R}^{k \times k}$ is upper triangular, updated by

$$T_k = \begin{pmatrix} T_{k-1} & -\tau_k T_{k-1} W_{k-1}^\top \mathbf{v}_k \\ 0 & \tau_k \end{pmatrix}, \quad \tau_k = 2 / (\mathbf{v}_k^\top \mathbf{v}_k).$$

Each panel of $b$ columns is factored independently, and the trailing submatrix is updated as $(I - WTW^\top) A_{[:, b{:}n]}$ via two $\mathrm{gemm}$ calls. Flop count is unchanged but memory traffic drops by a factor of $O(b)$, yielding 5-10x speedups when the panel fits in L2.

sangi routes through MKL/OpenBLAS dgeqrf when available, and uses an internal WY implementation otherwise.

Givens Rotations and Fast Givens

A Givens rotation

$$G(i, k, \theta) = \begin{pmatrix} \cdots & & & \\ & c & \cdots & s & \\ & \vdots & \ddots & \vdots & \\ & -s & \cdots & c & \\ & & & \cdots \end{pmatrix}, \quad c = \cos\theta, \, s = \sin\theta,$$

acts on rows $i$ and $k$ only. Choosing $c = x_i / r$, $s = x_k / r$, $r = \sqrt{x_i^2 + x_k^2}$ produces $G^\top \mathbf{x} = (r, 0)^\top$, zeroing one entry. A dense $m \times n$ QR needs roughly $mn/2$ Givens rotations, about twice the flops of Householder.

Fast Givens (Square-Root Free)

The square root $\sqrt{x_i^2 + x_k^2}$ costs several cycles even with hardware support. Fast Givens carries an external diagonal scaling so that rotations are rational matrices $M = \mathrm{diag}(d_1, d_2)^{-1} \tilde{M}$, avoiding the square root at the cost of tracking and finally absorbing the scaling factors as $A \cdot D^{1/2}$.

When Givens Wins

Givens dominates Householder on structured matrices:

  • Hessenberg matrices: only the subdiagonal needs zeroing, completed in $n-1$ Givens steps. The inner loop of the QR algorithm runs here.
  • Banded matrices: rotations preserve the bandwidth, whereas Householder generally introduces fill-in.
  • Sparse matrices: fill-in can be minimized by working in a fixed sparsity pattern.
  • Updating QR: adding or removing one row admits Sherman-Morrison-style updates via a chain of Givens rotations.

sangi uses Givens for eigenvalue computation (QR algorithm post-Hessenberg) and sparse QR, and Householder for dense QR.

Classical vs Modified Gram-Schmidt

Gram-Schmidt orthogonalization builds an orthonormal basis column by column. Parallelism within a column is limited, but the column-by-column nature is exactly what Krylov methods (Arnoldi, GMRES) require.

Classical Gram-Schmidt (CGS)

All projections of $\mathbf{a}_k$ onto previous $\mathbf{q}_1, \ldots, \mathbf{q}_{k-1}$ are computed in one batch:

$$r_{jk} = \mathbf{q}_j^\top \mathbf{a}_k, \quad \tilde{\mathbf{q}}_k = \mathbf{a}_k - \sum_{j=1}^{k-1} r_{jk} \mathbf{q}_j, \quad r_{kk} = \|\tilde{\mathbf{q}}_k\|_2, \quad \mathbf{q}_k = \tilde{\mathbf{q}}_k / r_{kk}.$$

The $r_{jk}$ inner products are independent and parallelizable. The drawback: roundoff in $\mathbf{q}_j$ distorts the projection magnitudes, and the computed orthogonality satisfies $|\mathbf{q}_i^\top \mathbf{q}_j| = O(u \kappa^2(A))$. For $\kappa(A) > 1/\sqrt{u}$ orthogonality is lost entirely.

Modified Gram-Schmidt (MGS)

Projections are subtracted sequentially, with the latest partially orthogonalized vector used at each step:

v = a_k
for j = 1 ... k-1:
    r_jk = q_j^T v
    v = v - r_jk * q_j
r_kk = ||v||, q_k = v / r_kk

Mathematically identical to CGS, but the orthogonality error drops to $O(u \kappa(A))$ — one power of $\kappa$ better. Parallelism is reduced, but stability tied to $\kappa$ rather than $\kappa^2$ is decisive in practice.

sangi's Arnoldi iteration and GMRES use MGS, with optional reorthogonalization (a second MGS pass) reducing the error all the way to $O(u)$ for numerically stressed problems.

Economy QR vs Full QR

For $m \geq n$, two storage forms exist:

  • Full QR: $Q \in \mathbb{R}^{m \times m}$ orthogonal, $R \in \mathbb{R}^{m \times n}$ upper triangular (the bottom $m-n$ rows of $R$ are zero).
  • Economy (thin) QR: $Q_1 \in \mathbb{R}^{m \times n}$ with orthonormal columns, $R_1 \in \mathbb{R}^{n \times n}$ upper triangular; $Q_1$ is the first $n$ columns of the full $Q$.

Most applications only need $Q_1$:

  • Least squares $\min \|A\mathbf{x} - \mathbf{b}\|_2$ reduces to the triangular solve $R_1 \mathbf{x} = Q_1^\top \mathbf{b}$.
  • The projection onto the column space of $A$ is $Q_1 Q_1^\top$ (not $Q Q^\top = I$).
  • Low-rank approximations and Krylov bases are built from $Q_1$.

When $m \gg n$ (10,000 rows, 100 columns), storing the full $Q$ costs $O(m^2)$; thin QR needs only $O(mn)$. sangi's qr_decomposition returns full QR, but the least-squares solver uses an implicit representation of $Q$ and computes only $Q^\top \mathbf{b}$ to save memory.

Column-Pivoted QR (Rank Revealing)

When $A$ is rank-deficient or numerically low-rank, permuting columns during QR exposes the rank as a step in the magnitudes of $|R_{kk}|$. The classical strategy is Businger-Golub column pivoting.

The Businger-Golub Procedure

At stage $k$, compute the 2-norms of each remaining column,

$$c_j^{(k)} = \|A_{[k:m, j]}^{(k)}\|_2, \quad j = k, \ldots, n,$$

swap the column of maximum $c_p^{(k)}$ into position $k$, then apply a Householder reflection. Recording all swaps in a permutation $\Pi$, the final factorization is

$$A \Pi = Q R, \quad |R_{11}| \ge |R_{22}| \ge \cdots \ge |R_{nn}|.$$

The numerical rank is the smallest $k$ for which $|R_{kk}| < \tau |R_{11}|$ ($\tau$ a problem- or precision-dependent threshold).

Incremental Norm Updates

Recomputing all column norms each stage would cost $O(mn)$ per stage. Businger-Golub maintains the norms incrementally,

$$c_j^{(k+1)} = \sqrt{(c_j^{(k)})^2 - r_{kj}^2},$$

and falls back to recomputation when cancellation is suspected, keeping the leading constant low.

Applications

Column-pivoted QR provides SVD-quality numerical rank estimates in $O(mn^2)$ rather than $O(mn \cdot \min(m,n))$, and is the workhorse behind rank-deficient least squares, subset selection, low-rank approximation, and TLS approximation. It commonly appears as the deterministic stage of randomized SVD (see the SVD page).

Applications

Least Squares

For overdetermined $\min \|A\mathbf{x} - \mathbf{b}\|_2$ with $m > n$, QR yields

$$\|A\mathbf{x} - \mathbf{b}\|_2^2 = \|R_1 \mathbf{x} - (Q^\top \mathbf{b})_{[1:n]}\|_2^2 + \|(Q^\top \mathbf{b})_{[n+1:m]}\|_2^2.$$

The second term is independent of $\mathbf{x}$, so the minimizer satisfies the triangular system $R_1 \mathbf{x} = (Q^\top \mathbf{b})_{[1:n]}$. The condition number is $\kappa(A)$ rather than $\kappa(A^\top A) = \kappa(A)^2$ from the normal equations, making QR the default in sangi's least-squares solver.

Arnoldi Iteration

For nonsymmetric $A$, the orthonormal basis of the Krylov subspace $\mathcal{K}_k(A, \mathbf{v}) = \mathrm{span}(\mathbf{v}, A\mathbf{v}, \ldots, A^{k-1}\mathbf{v})$ is built by MGS-orthogonalizing each new $A \mathbf{q}_j$ against previous basis vectors. The byproduct is an upper Hessenberg matrix $H_k$ representing the projection of $A$ onto $\mathcal{K}_k$, the foundation of GMRES and Krylov eigenvalue solvers.

The QR Algorithm (Eigenvalues)

Iterating $A_0 = A$,

$$A_k = Q_k R_k, \quad A_{k+1} = R_k Q_k,$$

causes $A_k$ to converge to (real) Schur form. The practical implementation reduces $A$ to Hessenberg form first and applies Francis double-shift QR with implicit shifts; Givens rotations make each iteration $O(n^2)$. See the eigenvalue page for details.

sangi's Choices

SettingMethodNotes
Dense QRHouseholder + WYMKL dgeqrf when available; internal otherwise
QR algorithm post-HessenbergGivens + implicit shift$O(n^2)$ per sweep; see eigenvalue page
Arnoldi / GMRES orthogonalizationMGS (with optional reorthogonalization)Two-pass MGS for high $\kappa$
Least squaresHouseholder QR with implicit $Q$Computes only $Q^\top \mathbf{b}$
Sparse QRGivens + column ordering (AMD/COLAMD)See the sparse page

References

  • Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations, 4th ed., Johns Hopkins University Press. Chapter 5.
  • Householder, A. S. (1958). "Unitary triangularization of a nonsymmetric matrix". J. ACM, 5(4), 339–342.
  • Businger, P. and Golub, G. H. (1965). "Linear least squares solutions by Householder transformations". Numer. Math., 7, 269–276.
  • Schreiber, R. and Van Loan, C. F. (1989). "A storage-efficient WY representation for products of Householder transformations". SIAM J. Sci. Stat. Comput., 10(1), 53–57.
  • Björck, Å. (1996). Numerical Methods for Least Squares Problems, SIAM.