Vector Arithmetic Algorithms

Overview

The sangi classes Vector<T> and StaticVector<T,N> represent numerical vectors of element type $T$ and evaluate expressions through SIMD packet evaluation backed by expression templates. This article addresses the implementation perspective: internal layout, error management, vectorization, and asymptotic behavior.

For usage details, see the Vector API Reference.

Internal Representation

Dynamic Vector<T>

Vector<T> stores its elements in a contiguous one-dimensional array backed by std::vector<T>. The contiguous layout directly benefits from hardware prefetchers and SIMD load/store instructions.

Element stride: always $\mathrm{sizeof}(T)$ bytes (packed layout)
Alignment: determined by the standard allocator. SIMD loads use unaligned variants (_mm_loadu_*), so no alignment requirement is imposed on the user
Size management: size() and capacity() are separated; reserve permits pre-allocation

Static StaticVector<T,N>

StaticVector<T,N> stores its elements in a std::array<T,N>. No heap allocation occurs, which makes it a strictly faster choice for small vectors such as 3D coordinates or quaternions. Since $N$ is fixed at compile time, inner loops are typically fully unrolled.

Both classes inherit from the CRTP base VecExpr<Derived> and behave identically within expression-template evaluations.

Elementwise Arithmetic

Addition, subtraction, and scalar multiplication/division are all $O(n)$ elementwise operations. Without expression templates the statement c = a + b would form an intermediate a + b and then assign it to c in a second pass. sangi evaluates such expressions in a single loop and single pass.

$$c_i = \alpha a_i + \beta b_i - \gamma d_i \quad (i = 0, \ldots, n-1)$$

For the composite assignment c = alpha*a + beta*b - gamma*d, sangi loads one SIMD packet from each of a, b, d, combines them in registers, and stores the resulting packet into c. Because such code is memory-bandwidth bound, the speedup over the naive two-pass version is typically 4-5x.

BLAS Level-1 compatible fused routines axpy ($\mathbf{y} \mathrel{+}= \alpha \mathbf{x}$) and axpby ($\mathbf{y} = \alpha \mathbf{x} + \beta \mathbf{y}$) are likewise implemented as single loops in a form favorable to compiler auto-vectorization.

Dot Product

The dot product $\langle \mathbf{a}, \mathbf{b} \rangle = \sum_{i=0}^{n-1} a_i b_i$ is one of the most frequent reduction operations. A naive implementation suffers from loop-carried dependency that stalls the floating-point pipeline, so sangi adopts multi-accumulator unrolling.

$$s = \sum_{k=0}^{3} \left( \sum_{i \equiv k \pmod 4} a_i b_i \right)$$

Four independent SIMD accumulators run in parallel, and a horizontal reduction merges them at the end. Modern x86 FMA has latency 4-5 cycles and throughput 0.5 cycles, requiring at least four independent accumulators to saturate the unit. The sangi implementation reaches this theoretical limit.

Choice Regarding Kahan Correction

The sangi dot uses plain accumulation without Kahan correction. The rationale:

BLAS ddot is also uncorrected, so plain accumulation matches the de facto standard
If $\|\mathbf{a}\| \cdot \|\mathbf{b}\|$ would cause cancellation, the calling code should restructure the computation algorithmically
Kahan correction is 3-4x slower than plain accumulation, and paying that cost universally would not be appropriate for a general-purpose dot

Error analysis: the relative error of plain accumulation is $O(n \epsilon)$, while four-accumulator unrolling brings it down to roughly $O(\sqrt{n} \cdot \epsilon)$ ($\epsilon$ is machine epsilon). Users who require strict precision can lift $\mathbf{a}$ and $\mathbf{b}$ to Float (sangi arbitrary-precision floating point) and call dot there for an arbitrary-precision inner product.

Cross Product

The cross product is specific to three-dimensional vector spaces. sangi exposes it as a free function on StaticVector<T,3> only. It is not provided for dynamic Vector<T>: in linear algebra, an API that signals dimension mismatch through runtime exceptions tends to invite misuse.

$$\mathbf{a} \times \mathbf{b} = \begin{pmatrix} a_1 b_2 - a_2 b_1 \\ a_2 b_0 - a_0 b_2 \\ a_0 b_1 - a_1 b_0 \end{pmatrix}$$

SIMD vectorization offers little benefit (only three components) but the operation can be performed with one shuffle and two FMA instructions. The implementation is a fully unrolled three-line scalar form.

Seven-dimensional cross product (derived from octonions) and higher-dimensional Hodge duals are not provided. Use the wedge product through Matrix for such cases.

L1 / L2 / L-infinity Norms

All three norms are implemented internally as reductions, eligible for SIMD packet evaluation.

Function	Definition	Complexity
`norm_l1`	$\\|\mathbf{v}\\|_1 = \sum_i \|v_i\|$	$O(n)$
`norm` (= L2)	$\\|\mathbf{v}\\|_2 = \sqrt{\sum_i v_i^2}$	$O(n)$
`norm_linf`	$\\|\mathbf{v}\\|_\infty = \max_i \|v_i\|$	$O(n)$
`norm_lp`	$\\|\mathbf{v}\\|_p = (\sum_i \|v_i\|^p)^{1/p}$	$O(n)$

Overflow / Underflow Handling for the L2 Norm

A naive evaluation of $\sum v_i^2$ overflows when $|v_i|$ is large and underflows to zero when $|v_i|$ is small, even when $\|\mathbf{v}\|$ itself is representable. The classical remedy is the hypot-style scaling:

$$\|\mathbf{v}\| = m \cdot \sqrt{\sum_i (v_i / m)^2}, \quad m = \max_i |v_i|$$

This requires two passes (first the maximum, then the scaled sum) and roughly doubles the cost. sangi norm uses the naive implementation by default; when overflow resistance is required, the user should use the dedicated safe_norm alias (planned) or hypot directly. For typical workloads where $n$ is large and $|v_i| \approx 1$, the naive form is sufficient, and paying the double-pass cost universally is undesirable.

SIMD Reduction for the L-infinity Norm (T = double / float only)

When $T$ is double or float, $\max_i |v_i|$ vectorizes via lane-wise _mm256_max_pd-style instructions. Four accumulators are reduced in parallel, then merged with a horizontal max. The absolute value is computed by masking off the sign bit (and with $0x7\mathrm{FF}\ldots$), avoiding any branching. This relies on the IEEE 754 bit layout and does not apply to multi-precision types (Int / Float / Rational) or Complex<T>, which fall back to a scalar loop that calls abs() on each element (their internal representation is variable-width and does not fit in a 256-bit lane).

Normalization and Projection

Normalize

normalized(v) returns $\mathbf{v} / \|\mathbf{v}\|$. When $\|\mathbf{v}\| = 0$, sangi returns the zero vector rather than throwing, because intermediate zero vectors appear frequently in numerical pipelines and exception propagation would complicate control flow unnecessarily.

The in-place variant v.normalize() computes norm once and then multiplies the entire vector by its reciprocal, so the division is performed only once.

Projection

Projecting $\mathbf{a}$ onto the direction of $\mathbf{b}$:

$$\mathrm{proj}_{\mathbf{b}}(\mathbf{a}) = \frac{\langle \mathbf{a}, \mathbf{b} \rangle}{\langle \mathbf{b}, \mathbf{b} \rangle} \mathbf{b}$$

sangi does not provide a dedicated function. Instead the expression (dot(a,b) / dot(b,b)) * b goes through the expression-template machinery, producing no intermediate Vector and folding into a single scaled loop. Gram-Schmidt orthogonalization and Householder reflector construction make heavy internal use of this form.

When $\langle \mathbf{b}, \mathbf{b} \rangle$ is small, the quotient becomes numerically unstable. A more stable variant maintains a normalized unit vector $\hat{\mathbf{b}}$ and projects via $\langle \mathbf{a}, \hat{\mathbf{b}} \rangle \hat{\mathbf{b}}$. The QR path through Householder reflectors uses this latter formulation.

VectorMap and Strided Views

VectorMap<T> and ConstVectorMap<T> are zero-copy views over external memory regions. They allow a C array, a std::vector, or a Python numpy.ndarray buffer to participate in sangi expression templates without taking ownership.

Handling Strides

VectorMap supports strided access. For example, reading a column of an $M \times N$ matrix as a vector yields a stride of $M$ words between elements. A strided view behaves as follows:

No SIMD packet loads — elements are not contiguous, so the operation degenerates to a gather instruction or sequential loads
Participates in expression templates — accessed through the node's addressing function, leaving optimization to the compiler
Semantically equivalent to a copy — results match those of a packed copy, but performance is lower than contiguous access

A stride-1 view (contiguous access) receives the same optimization as a packed buffer. head, tail, and segment return such stride-1 VectorViews.

SIMD and Expression Templates

PacketTraits<T>

SIMD packet width is abstracted by PacketTraits<T>. Specializations currently exist for AVX2 and SSE2: for double the width is 4 (AVX2) / 2 (SSE2); for float it is 8 (AVX2) / 4 (SSE2); for Float (arbitrary precision) or Int it is 1 (no packet, scalar evaluation). For a vector of $n$ elements with packet width $W$, evaluation comprises $\lfloor n/W \rfloor$ packet iterations and a scalar tail of $n \bmod W$ elements. An AVX-512 specialization of PacketTraits is not yet implemented (future work), but the BLAS-like free functions (dot / axpy / scal in simd_backend) take a separate path that compiles to AVX-512 intrinsics when available.

The auto Pitfall

Because of expression templates, the type of auto v = a + b is not Vector<T> but an intermediate node such as VecBinOp<Plus, Vector<T>, Vector<T>>. That node holds references to a and b, so leaving the scope of $\mathbf{a}$ or $\mathbf{b}$ creates a dangling reference.

As the API reference warns, when passing a computed expression to another function or holding it across calls, bind it to a concrete type (Vector<T> v = a + b;) or use an explicit eval() to materialize.

Boundary Conditions in Packet Evaluation

Since SIMD load/store instructions are issued in their unaligned form, the user need not be concerned with memory alignment. Two boundary regimes are handled:

$n \geq W$: packet loop followed by scalar tail
$n < W$: scalar evaluation only (avoiding packet setup overhead)

When $n$ is extremely small (e.g., $n = 1$) and the call frequency is high, the cost of packet setup dominates parallel benefit, so the code dispatches to the scalar path.

Design Decisions

Why Dynamic and Static Are Not Unified

Eigen-style single template Matrix<T, Rows, Cols> covers dynamic and static cases with one type, but the header expansion is large and compile times suffer. sangi separates the dynamic Vector<T> and static StaticVector<T,N> into distinct classes so readers can follow overload resolution easily. Both inherit from the common CRTP base VecExpr, so free functions (dot, norm, etc.) are written once and shared.

Cross Is Restricted to 3D

The cross product on $\mathbb{R}^3$ arises from the imaginary part of the quaternion algebra, an algebraic structure with no natural extension to other dimensions (except a special case in seven dimensions). The proper generalization is the Hodge dual paired with the wedge product, which is treated through Matrix. Exposing cross(a,b) on dynamic Vector<T> with a runtime size check would invite misuse, so it is omitted.

Why Kahan Correction Is Not the Default

The plain dot matches the BLAS standard, and with four-accumulator unrolling the empirical error of $O(\sqrt{n}\epsilon)$ is sufficient for most applications. Kahan correction is 3-4x slower. For applications that genuinely require maximum-precision inner products, promoting the element type to Float (sangi arbitrary-precision floating point) is a cleaner approach than re-implementing Kahan with SIMD.

References

Higham, N. J. Accuracy and Stability of Numerical Algorithms. 2nd ed. SIAM, 2002. (Floating-point dot product error analysis)
Goldberg, D. (1991). "What Every Computer Scientist Should Know About Floating-Point Arithmetic". ACM Computing Surveys, 23 (1), 5–48.
Veldhuizen, T. (1995). "Expression Templates". C++ Report, 7 (5), 26–31.