What is Strassen's three-real-multiplication trick?

The naive complex product (a+bi)(c+di) uses four real multiplications and two real additions. Three real multiplications and five real additions suffice; sangi uses the symmetric Karatsuba form Re = ac - bd, Im = (a+b)(c+d) - ac - bd, which is mathematically equivalent to Strassen's k1, k2, k3 identity. When real multiplication is much more expensive than real addition (multiprecision types like Int and Float), this is a net speedup.

Why does Smith's complex division avoid overflow?

The naive formula (a+bi)/(c+di) = ((ac+bd)+(bc-ad)i) / (c^2+d^2) can overflow when computing the intermediate c^2+d^2. Smith's method branches on whether |c| >= |d| or not and normalizes by the larger magnitude, keeping every intermediate value bounded by the input magnitudes.

Why does abs(z) call std::hypot?

sqrt(a*a + b*b) can overflow or underflow at the a*a or b*b stage even when sqrt(a^2+b^2) itself is representable. std::hypot normalizes a and b internally so that |z| = sqrt(a^2 + b^2) is always computed in the representable range. Multiprecision types use the equivalent normalization built from internal abs() / normSq() and sqrt().

How is polar(r, theta) implemented?

polar(r, theta) builds Complex from r * cos(theta) + i * r * sin(theta). For the very common case theta = 2 * pi * k / N (FFT twiddle factors), a dedicated expI(omega) function skips the multiplication by r = 1.

Complex Arithmetic and Representation

Overview

Complex<T> is implemented as a simple two-member structure holding the real part re and the imaginary part im directly. The template parameter T accepts any type satisfying the ComplexScalar concept, including double, float, Int, Float, and Rational.

Arithmetic dispatch depends on the characteristics of T. For multiprecision types, where a real multiplication is far more expensive than a real addition, Strassen's three-real-multiplication trick is used. For floating-point types, Smith's method guards complex division against intermediate overflow.

For API usage, see Complex API Reference — Arithmetic Operators.

Internal Representation and Memory Layout

The class body consists of two public member variables only.

template <ComplexScalar T>
struct Complex {
    T re;
    T im;
    // ...
};

There are no virtual functions or extra metadata, so $\operatorname{sizeof}(\texttt{Complex}\langle T\rangle) = 2 \operatorname{sizeof}(T)$. The following static_assert checks verify the layout at compile time:

sizeof(Complex<T>) == 2 * sizeof(T)
offsetof(Complex<T>, re) == 0
offsetof(Complex<T>, im) == sizeof(T)

This layout guarantee allows an array of Complex<double> to be reinterpreted as an array of std::complex<double> or C99 _Complex double via reinterpret_cast. This is the requirement for zero-copy interoperation with C libraries such as FFTW.

Addition and Subtraction

Addition and subtraction are simple component-wise operations.

$$(a+bi) \pm (c+di) = (a \pm c) + (b \pm d) i$$

Each operation issues two real additions (or subtractions) in T. For T = double, the compiler may fold these into a single SSE / AVX packed-double instruction. For multiprecision types (Int, Float), the T add/sub is invoked directly and runs in linear time $O(\max(\text{words}(a), \text{words}(c)))$ for each part.

Complexity: two real additions or subtractions.

Scalar mixed operations $z \pm s$ act on the real part only: $$(a+bi) \pm s = (a \pm s) + b i$$ The implementation takes the path z.re += s so that no copy or assignment is performed on the imaginary part's T object.

Multiplication — Strassen's Three-Real-Multiplication Trick

Naive Four-Multiplication Form

The textbook formula uses four real multiplications and two real additions:

$$(a+bi)(c+di) = (ac - bd) + (ad + bc) i$$

For T = double / float, sangi uses this naive form. Floating-point multiplication has comparable throughput to floating-point addition, so spending two extra additions to save one multiplication is rarely advantageous.

Karatsuba Symmetric Form (Three Real Multiplications)

For multiprecision types (Int, Float, Rational), a real multiplication can cost tens to hundreds of times more than a real addition. Trading additions for multiplications becomes a net win, and sangi applies Karatsuba's algorithm to the degree-one polynomial $a + b X$ with $X = i$, $X^2 = -1$ in the following symmetric three-multiplication form:

$$\begin{aligned} ac &= a \cdot c, \\ bd &= b \cdot d, \\ (a+bi)(c+di) &= (ac - bd) \;+\; \bigl((a+b)(c+d) - ac - bd\bigr)\,i. \end{aligned}$$

Three real multiplications ($ac$, $bd$, $(a+b)(c+d)$) and five real additions reproduce the product. This form is mathematically equivalent to Strassen's $k_1 = (a+b)c$, $k_2 = a(d-c)$, $k_3 = b(c+d)$ identity; sangi prefers the symmetric layout because additions distribute evenly across the real and imaginary parts.

Identity Verification

Expanding confirms equality:

$$\begin{aligned} \mathrm{Re} &= ac - bd, \\ \mathrm{Im} &= (a+b)(c+d) - ac - bd = ac + ad + bc + bd - ac - bd = ad + bc. \end{aligned}$$

Complexity: three real multiplications and five real additions. With $M(n)$ for multiplication and $A(n)$ for addition, the total is $3 M(n) + 5 A(n)$. Since $A(n) = O(n)$ and $M(n)$ is super-linear above the Karatsuba threshold, this is smaller than $4 M(n) + 2 A(n)$.

Division — Smith's Method

Naive Formula and Overflow

The textbook formula is:

$$\frac{a+bi}{c+di} = \frac{(ac + bd) + (bc - ad) i}{c^2 + d^2}$$

For T = double, computing $c^2 + d^2$ directly can overflow when $|c|$, $|d|$ are large, or underflow when they are small. For instance, $c = d = 10^{200}$ yields $c^2 + d^2 = 2 \times 10^{400}$, beyond the range of double.

Smith's Method

When T is a floating-point type, sangi uses Smith's (1962) method. By normalizing with the larger of $|c|$ and $|d|$, every intermediate value is kept within the input magnitudes.

If $|c| \geq |d|$:

$$r = d / c, \quad t = c + d \, r, \quad \mathrm{re} = (a + b r) / t, \quad \mathrm{im} = (b - a r) / t.$$

If $|c| < |d|$:

$$r = c / d, \quad t = d + c \, r, \quad \mathrm{re} = (a r + b) / t, \quad \mathrm{im} = (b r - a) / t.$$

In either branch, $|r| \leq 1$ and $t$ stays on the scale of $\max(|c|, |d|)$. The total cost is one branch, four divisions, three multiplications, and three additions or subtractions.

Multiprecision Types

For T = Int / Float / Rational, the notion of overflow differs. Int is arbitrary precision, and Float has a vast exponent range, so $c^2 + d^2$ can be computed safely. In this case sangi uses the naive formula with $\mathrm{denom} = c^2 + d^2$. The implementation calls denom = normSq(w) and constructs the real part $(a c + b d) / \mathrm{denom}$ and the imaginary part $(b c - a d) / \mathrm{denom}$.

If $c + di = 0$, the result is NaN; no exception is thrown and the program does not abort.

Smith, R. L. (1962). "Algorithm 116: Complex division". Communications of the ACM, 5 (8), 435.

Absolute Value and Norm — hypot Safety

normSq

The squared norm is computed straightforwardly:

$$|z|^2 = a^2 + b^2$$

This is a common building block (e.g. in FFT) and is faster than the absolute value because no square root is taken.

abs

The absolute value $|z| = \sqrt{a^2 + b^2}$, computed naively as sqrt(a*a + b*b), can overflow or underflow at the $a^2$ or $b^2$ stage.

For floating-point T, sangi calls std::hypot(a, b). This function is required by IEEE 754 to compute $|z|$ without intermediate overflow. Typical implementations normalize by $m = \max(|a|, |b|)$ and return $m \sqrt{(a/m)^2 + (b/m)^2}$.

For multiprecision T, sangi implements the same normalization strategy by combining multiprecision abs() / normSq() and sqrt(). Because Float can address an enormous exponent range, ordinary scientific computation rarely overflows, but routing the denominator $c^2 + d^2$ through the same square-root path keeps the code path consistent.

Argument and Conjugation

arg

The argument is computed via atan2:

$$\arg(z) = \operatorname{atan2}(b, a) \in (-\pi, \pi]$$

The principal value is returned, with $\arg(0) = 0$ following the atan2(0, 0) convention. The branch cut lies on the negative real axis ($a < 0$, $b = 0$). $\arg \to \pi$ as $b \to 0^+$ and $\arg \to -\pi$ as $b \to 0^-$.

conj

Conjugation $\bar{z} = a - bi$ simply flips the sign of the imaginary part — $O(1)$.

$$|z|^2 = z \cdot \bar{z}$$

could be used to compute the squared norm, but sangi prefers the symmetric $a^2 + b^2$ form because it avoids catastrophic cancellation between real-part and imaginary-part contributions.

Polar Form and expI

polar(r, θ)

Builds a complex number from polar coordinates:

$$\operatorname{polar}(r, \theta) = r \cos\theta + i \, r \sin\theta$$

The real part $r \cos\theta$ and the imaginary part $r \sin\theta$ are computed and a Complex<T> is constructed directly. For multiprecision types, T::sin and T::cos are invoked, triggering precision-controlled series expansion and range reduction internally.

expI(ω) — FFT Twiddle Factors

For the common case $e^{i\omega} = \cos\omega + i \sin\omega$ in FFT and NTT, a dedicated function avoids the two multiplications by $r = 1$ that polar would perform.

$$\operatorname{expI}(\omega) = e^{i\omega} = \cos\omega + i \sin\omega$$

The twiddle factors $W_N^k = e^{-2\pi i k/N}$ used in FFT are obtained as:

for (int k = 0; k < N; ++k) {
    auto W_k = expI(-2.0 * M_PI * k / N);  // k-th N-th root of unity
}

Euler's Formula

At $\omega = \pi$, expI(M_PI) ≈ -1 + 0i. Floating-point rounding leaves a small imaginary residue, but multiprecision types reach $-1$ with high precision because T::pi(precision) and the high-precision trigonometric functions cooperate.

Design Decisions

Why the Multiplication Form Differs Between Floating-Point and Multiprecision

In floating-point arithmetic, multiplication and addition have similar throughput, and modern x86 / ARM CPUs may even fuse multiply-add into a single FMA instruction. In this setting Strassen's trade-off (one fewer multiplication, three more additions) yields no net benefit and may even be slightly slower. Therefore T = double / float uses the naive four-multiplication form.

For multiprecision types, a real multiplication is $10^1$ to $10^3$ times more expensive than a real addition, and reducing the multiplication count dominates the cost balance. In that regime Strassen's three-multiplication form always wins.

Branch Cost in Smith's Method

Smith's division branches once on $|c|$ vs $|d|$. This branch is well predicted in practice (typical inputs are heavily skewed to one side), and the resulting pipeline impact on modern CPUs is negligible compared with the catastrophic cost of overflow in the alternative.

NaN Safety

$z / 0$ returns IEEE 754 NaN rather than throwing. This matches the behaviour of std::complex and Python cmath. Multiprecision types also propagate NaN, so a zero divisor anywhere in a computation chain does not abort the program.

References

Smith, R. L. (1962). "Algorithm 116: Complex division". Communications of the ACM, 5 (8), 435.
Karatsuba, A. and Ofman, Yu. (1962). "Multiplication of Multidigit Numbers on Automata". Doklady Akademii Nauk SSSR, 145, 293–294.
Knuth, D. E. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. 3rd ed. Addison-Wesley, 1997.