Householder reflections or Givens rotations: which is better?

Householder reflections dominate for dense triangularization; their flop count is roughly half that of dense Givens. Givens rotations zero a single entry at a time and are preferred when the matrix is already close to triangular — Hessenberg, banded, or sparse — and only a few entries need to be eliminated. sangi uses Householder for dense QR and Givens for the QR algorithm, post-Hessenberg tridiagonalization, and sparse QR.

What is the difference between classical and modified Gram-Schmidt?

Classical Gram-Schmidt (CGS) subtracts all projections at once, so roundoff in the q_j accumulates and orthogonality is lost at rate O(u kappa^2(A)). Modified Gram-Schmidt (MGS) removes projections sequentially, refreshing the working vector each time, and limits the loss of orthogonality to O(u kappa(A)) — one order better. sangi's Arnoldi iteration uses MGS, optionally with one reorthogonalization pass for tough cases.

What is the WY representation?

A way to write the product of k Householder reflections H_1 H_2 ... H_k as a single I - W T W^T, so it can be applied to a trailing matrix with one Level-3 BLAS gemm call instead of k Level-2 calls. sangi's blocked QR uses WY to apply panel reflections to the trailing submatrix, dramatically improving cache efficiency.

When is column-pivoted QR useful?

For rank determination and rank-deficient least squares. The Businger-Golub strategy selects, at each step, the remaining column with the largest 2-norm as the next pivot. The resulting AP=QR has |R_kk| nearly monotonically decreasing, and the index where |R_kk| drops sharply reveals the numerical rank. It is much cheaper than SVD for rank estimation and underpins randomized SVD and low-rank approximation pipelines.

Economy QR vs full QR — what's the difference?

For an m-by-n matrix A with m >= n, full QR returns Q in R^(m x m) and R in R^(m x n). Economy (thin) QR returns Q1 in R^(m x n) and R1 in R^(n x n). Only the first n columns of Q are needed for least squares, normal equations, and low-rank approximation, so thin QR saves memory and flops when m is much larger than n. sangi's qr_decomposition returns full QR, but internally the least-squares solver uses an implicit representation of Q to avoid forming the full Q.

QR Factorization — Householder, Givens, MGS, Column Pivoting

Overview

QR factorization $A = QR$ writes an $m \times n$ matrix ($m \geq n$) as the product of an orthogonal $Q \in \mathbb{R}^{m \times m}$ and an upper-triangular $R \in \mathbb{R}^{m \times n}$. Orthogonal transformations preserve the condition number and propagate roundoff minimally, making QR one of the most stable factorizations in numerical linear algebra.

Three classical algorithms construct it, each with its own niche:

Householder reflections — bulk triangularization with minimum flops on dense matrices
Givens rotations — local single-entry zeroing, ideal for sparse, banded, or Hessenberg structure
Gram-Schmidt orthogonalization — column-by-column, essential to Krylov methods like Arnoldi and GMRES

This page covers the mathematical structure of each, their stability behavior, BLAS-3 blocking via the WY representation, rank-revealing column pivoting, and applications to least squares and eigenvalue computation.

For API usage, see LinAlg API — QR.

Householder Reflections

A Householder reflection

$$H = I - 2 \frac{\mathbf{v} \mathbf{v}^\top}{\mathbf{v}^\top \mathbf{v}}$$

is an orthogonal symmetric matrix representing reflection through the hyperplane perpendicular to $\mathbf{v}$. Choosing $\mathbf{v} = \mathbf{x} \pm \|\mathbf{x}\|_2 \mathbf{e}_1$ yields $H\mathbf{x} = \mp \|\mathbf{x}\|_2 \mathbf{e}_1$, aligning $\mathbf{x}$ with the first axis in one shot. The sign is taken as $\mathrm{sign}(x_1)$ to avoid cancellation when forming $\mathbf{v}$ (Wilkinson's sign rule).

The triangularization algorithm is:

Form a reflector $H_1$ for the first column $\mathbf{a}_1$ and apply $H_1 A$ to zero its subdiagonal entries
Recurse on the trailing $(m-1) \times (n-1)$ submatrix with $H_2, H_3, \ldots, H_n$
The composition $H_n \cdots H_2 H_1 A = R$ is upper triangular; $Q = H_1 H_2 \cdots H_n$ (each $H_i$ is symmetric)

Flop count is $2mn^2 - \tfrac{2}{3}n^3$, asymptotically optimal for dense QR. $Q$ is typically not formed explicitly; instead the reflection vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ are stored in the zeroed subdiagonal of $R$, and $Q\mathbf{x}$ or $Q^\top \mathbf{b}$ is evaluated on demand via matrix-vector products. Memory drops from $O(m^2)$ to $O(mn)$.

Blocked Householder and the WY Representation

Applying $H_k$ one at a time uses Level-2 BLAS (matrix-vector products). The WY representation aggregates $b$ consecutive reflections into a single Level-3 BLAS multiplication.

Write the cumulative product as

$$H_1 H_2 \cdots H_k = I - W_k T_k W_k^\top,$$

where $W_k \in \mathbb{R}^{m \times k}$ stacks the reflection vectors and $T_k \in \mathbb{R}^{k \times k}$ is upper triangular, updated by

$$T_k = \begin{pmatrix} T_{k-1} & -\tau_k T_{k-1} W_{k-1}^\top \mathbf{v}_k \\ 0 & \tau_k \end{pmatrix}, \quad \tau_k = 2 / (\mathbf{v}_k^\top \mathbf{v}_k).$$

Each panel of $b$ columns is factored independently, and the trailing submatrix is updated as $(I - WTW^\top) A_{[:, b{:}n]}$ via two $\mathrm{gemm}$ calls. Flop count is unchanged but memory traffic drops by a factor of $O(b)$, yielding 5-10x speedups when the panel fits in L2.

sangi routes through MKL/OpenBLAS dgeqrf when available, and uses an internal WY implementation otherwise.

Givens Rotations and Fast Givens

A Givens rotation

$$G(i, k, \theta) = \begin{pmatrix} \cdots & & & \\ & c & \cdots & s & \\ & \vdots & \ddots & \vdots & \\ & -s & \cdots & c & \\ & & & \cdots \end{pmatrix}, \quad c = \cos\theta, \, s = \sin\theta,$$

acts on rows $i$ and $k$ only. Choosing $c = x_i / r$, $s = x_k / r$, $r = \sqrt{x_i^2 + x_k^2}$ produces $G^\top \mathbf{x} = (r, 0)^\top$, zeroing one entry. A dense $m \times n$ QR needs roughly $mn/2$ Givens rotations, about twice the flops of Householder.

Fast Givens (Square-Root Free)

The square root $\sqrt{x_i^2 + x_k^2}$ costs several cycles even with hardware support. Fast Givens carries an external diagonal scaling so that rotations are rational matrices $M = \mathrm{diag}(d_1, d_2)^{-1} \tilde{M}$, avoiding the square root at the cost of tracking and finally absorbing the scaling factors as $A \cdot D^{1/2}$.

When Givens Wins

Givens dominates Householder on structured matrices:

Hessenberg matrices: only the subdiagonal needs zeroing, completed in $n-1$ Givens steps. The inner loop of the QR algorithm runs here.
Banded matrices: rotations preserve the bandwidth, whereas Householder generally introduces fill-in.
Sparse matrices: fill-in can be minimized by working in a fixed sparsity pattern.
Updating QR: adding or removing one row admits Sherman-Morrison-style updates via a chain of Givens rotations.

sangi uses Givens for eigenvalue computation (QR algorithm post-Hessenberg) and sparse QR, and Householder for dense QR.

Classical vs Modified Gram-Schmidt

Gram-Schmidt orthogonalization builds an orthonormal basis column by column. Parallelism within a column is limited, but the column-by-column nature is exactly what Krylov methods (Arnoldi, GMRES) require.

Classical Gram-Schmidt (CGS)

All projections of $\mathbf{a}_k$ onto previous $\mathbf{q}_1, \ldots, \mathbf{q}_{k-1}$ are computed in one batch:

$$r_{jk} = \mathbf{q}_j^\top \mathbf{a}_k, \quad \tilde{\mathbf{q}}_k = \mathbf{a}_k - \sum_{j=1}^{k-1} r_{jk} \mathbf{q}_j, \quad r_{kk} = \|\tilde{\mathbf{q}}_k\|_2, \quad \mathbf{q}_k = \tilde{\mathbf{q}}_k / r_{kk}.$$

The $r_{jk}$ inner products are independent and parallelizable. The drawback: roundoff in $\mathbf{q}_j$ distorts the projection magnitudes, and the computed orthogonality satisfies $|\mathbf{q}_i^\top \mathbf{q}_j| = O(u \kappa^2(A))$. For $\kappa(A) > 1/\sqrt{u}$ orthogonality is lost entirely.

Modified Gram-Schmidt (MGS)

Projections are subtracted sequentially, with the latest partially orthogonalized vector used at each step:

v = a_k
for j = 1 ... k-1:
    r_jk = q_j^T v
    v = v - r_jk * q_j
r_kk = ||v||, q_k = v / r_kk

Mathematically identical to CGS, but the orthogonality error drops to $O(u \kappa(A))$ — one power of $\kappa$ better. Parallelism is reduced, but stability tied to $\kappa$ rather than $\kappa^2$ is decisive in practice.

sangi's Arnoldi iteration and GMRES use MGS, with optional reorthogonalization (a second MGS pass) reducing the error all the way to $O(u)$ for numerically stressed problems.

Economy QR vs Full QR

For $m \geq n$, two storage forms exist:

Full QR: $Q \in \mathbb{R}^{m \times m}$ orthogonal, $R \in \mathbb{R}^{m \times n}$ upper triangular (the bottom $m-n$ rows of $R$ are zero).
Economy (thin) QR: $Q_1 \in \mathbb{R}^{m \times n}$ with orthonormal columns, $R_1 \in \mathbb{R}^{n \times n}$ upper triangular; $Q_1$ is the first $n$ columns of the full $Q$.

Most applications only need $Q_1$:

Least squares $\min \|A\mathbf{x} - \mathbf{b}\|_2$ reduces to the triangular solve $R_1 \mathbf{x} = Q_1^\top \mathbf{b}$.
The projection onto the column space of $A$ is $Q_1 Q_1^\top$ (not $Q Q^\top = I$).
Low-rank approximations and Krylov bases are built from $Q_1$.

When $m \gg n$ (10,000 rows, 100 columns), storing the full $Q$ costs $O(m^2)$; thin QR needs only $O(mn)$. sangi's qr_decomposition returns full QR, but the least-squares solver uses an implicit representation of $Q$ and computes only $Q^\top \mathbf{b}$ to save memory.

Column-Pivoted QR (Rank Revealing)

When $A$ is rank-deficient or numerically low-rank, permuting columns during QR exposes the rank as a step in the magnitudes of $|R_{kk}|$. The classical strategy is Businger-Golub column pivoting.

The Businger-Golub Procedure

At stage $k$, compute the 2-norms of each remaining column,

$$c_j^{(k)} = \|A_{[k:m, j]}^{(k)}\|_2, \quad j = k, \ldots, n,$$

swap the column of maximum $c_p^{(k)}$ into position $k$, then apply a Householder reflection. Recording all swaps in a permutation $\Pi$, the final factorization is

$$A \Pi = Q R, \quad |R_{11}| \ge |R_{22}| \ge \cdots \ge |R_{nn}|.$$

The numerical rank is the smallest $k$ for which $|R_{kk}| < \tau |R_{11}|$ ($\tau$ a problem- or precision-dependent threshold).

Incremental Norm Updates

Recomputing all column norms each stage would cost $O(mn)$ per stage. Businger-Golub maintains the norms incrementally,

$$c_j^{(k+1)} = \sqrt{(c_j^{(k)})^2 - r_{kj}^2},$$

and falls back to recomputation when cancellation is suspected, keeping the leading constant low.

Applications

Column-pivoted QR provides SVD-quality numerical rank estimates in $O(mn^2)$ rather than $O(mn \cdot \min(m,n))$, and is the workhorse behind rank-deficient least squares, subset selection, low-rank approximation, and TLS approximation. It commonly appears as the deterministic stage of randomized SVD (see the SVD page).

Applications

Least Squares

For overdetermined $\min \|A\mathbf{x} - \mathbf{b}\|_2$ with $m > n$, QR yields

$$\|A\mathbf{x} - \mathbf{b}\|_2^2 = \|R_1 \mathbf{x} - (Q^\top \mathbf{b})_{[1:n]}\|_2^2 + \|(Q^\top \mathbf{b})_{[n+1:m]}\|_2^2.$$

The second term is independent of $\mathbf{x}$, so the minimizer satisfies the triangular system $R_1 \mathbf{x} = (Q^\top \mathbf{b})_{[1:n]}$. The condition number is $\kappa(A)$ rather than $\kappa(A^\top A) = \kappa(A)^2$ from the normal equations, making QR the default in sangi's least-squares solver.

Arnoldi Iteration

For nonsymmetric $A$, the orthonormal basis of the Krylov subspace $\mathcal{K}_k(A, \mathbf{v}) = \mathrm{span}(\mathbf{v}, A\mathbf{v}, \ldots, A^{k-1}\mathbf{v})$ is built by MGS-orthogonalizing each new $A \mathbf{q}_j$ against previous basis vectors. The byproduct is an upper Hessenberg matrix $H_k$ representing the projection of $A$ onto $\mathcal{K}_k$, the foundation of GMRES and Krylov eigenvalue solvers.

The QR Algorithm (Eigenvalues)

Iterating $A_0 = A$,

$$A_k = Q_k R_k, \quad A_{k+1} = R_k Q_k,$$

causes $A_k$ to converge to (real) Schur form. The practical implementation reduces $A$ to Hessenberg form first and applies Francis double-shift QR with implicit shifts; Givens rotations make each iteration $O(n^2)$. See the eigenvalue page for details.

sangi's Choices

Setting	Method	Notes
Dense QR	Householder + WY	MKL `dgeqrf` when available; internal otherwise
QR algorithm post-Hessenberg	Givens + implicit shift	$O(n^2)$ per sweep; see eigenvalue page
Arnoldi / GMRES orthogonalization	MGS (with optional reorthogonalization)	Two-pass MGS for high $\kappa$
Least squares	Householder QR with implicit $Q$	Computes only $Q^\top \mathbf{b}$
Sparse QR	Givens + column ordering (AMD/COLAMD)	See the sparse page

References

Golub, G. H. and Van Loan, C. F. (2013). Matrix Computations, 4th ed., Johns Hopkins University Press. Chapter 5.
Householder, A. S. (1958). "Unitary triangularization of a nonsymmetric matrix". J. ACM, 5(4), 339–342.
Businger, P. and Golub, G. H. (1965). "Linear least squares solutions by Householder transformations". Numer. Math., 7, 269–276.
Schreiber, R. and Van Loan, C. F. (1989). "A storage-efficient WY representation for products of Householder transformations". SIAM J. Sci. Stat. Comput., 10(1), 53–57.
Björck, Å. (1996). Numerical Methods for Least Squares Problems, SIAM.