NTT and Big-Integer Multiplication
Overview
The Number Theoretic Transform (NTT) runs DFT-style butterflies over a finite ring — either a prime field $\mathbb{F}_p$ or the Fermat ring $\mathbb{Z}/(B^F + 1)$ — instead of the complex numbers. All arithmetic stays in integers, so the transform is bit-exact, which matters for big-integer arithmetic, lattice cryptography (NTRU, Kyber), and error-correcting codes.
This article ties together the two places where sangi uses NTT: the bundled
Ntt class, and the Prime NTT (three-prime NTT + CRT) path that
Int multiplication switches into above 1,250 words (MSVC) / 3,000 words (gcc/clang), with a Fermat NTT (Schönhage-Strassen) fallback for the largest operands.
NTT versus FFT
The radix-2 butterfly $X[k] = E[k] + \omega_N^k O[k]$ uses the unit root $\omega_N = e^{-2\pi i / N} \in \mathbb{C}$. Replacing $\omega_N$ with an order-$N$ primitive root of unity in a finite field $\mathbb{F}_p$ preserves the same butterfly structure, yielding the discrete Fourier transform of the finite-field-valued sequence.
$$\omega_N \in \mathbb{F}_p, \quad \omega_N^N = 1, \quad \omega_N^k \neq 1 \;(0 < k < N).$$
Because every operation is an integer modulo $p$, the output is exact. The price is the divisibility condition $N \mid (p - 1)$, which only some primes satisfy.
See also: Number theoretic transform
NTT-friendly primes
The maximum transform length is set by the power-of-two factor in $p - 1$. Primes of the form $p = c \cdot 2^k + 1$ (proth-like, with small $c$) are called NTT-friendly.
The sangi Ntt class exposes three such primes that fit inside 64-bit integers:
| Function | Prime $p$ | Primitive root $g$ | Max NTT length |
|---|---|---|---|
prime1() | $30 \times 2^{47} + 1$ | 19 | $2^{47}$ |
prime2() | $345 \times 2^{47} + 1$ | 13 | $2^{47}$ |
prime3() | $420 \times 2^{47} + 1$ | 17 | $2^{47}$ |
All three are about $2^{52}$, so 64-bit multiplications fit in 128 bits. The cofactors $30 = 2 \cdot 3 \cdot 5$, $345 = 3 \cdot 5 \cdot 23$, and $420 = 2^2 \cdot 3 \cdot 5 \cdot 7$ keep the 5-smooth part of $p - 1$ large, so 5-smooth NTT applies directly.
If a single prime is too narrow for the convolution coefficients, run the NTT under all three primes and recombine via the Chinese remainder theorem (CRT). The effective bit width grows to roughly $3 \cdot 52 = 156$ bits.
Schönhage-Strassen and Fermat NTT
Schönhage and Strassen (1971) showed how to run the NTT over the ring $\mathbb{Z}/(B^F + 1)$ instead of a prime field. With $B = 2^{64}$ (the limb radix) and $F$ a splitting parameter, the value $2$ becomes (close to) an order-$2F$ primitive root.
$$\omega = 2 \in \mathbb{Z}/(B^F + 1), \qquad \omega^{2F} \equiv 1.$$
The crucial consequence is that the unit root is a bit shift: every twiddle multiplication is free. The constant in front of the $O(n \log n \log \log n)$ asymptotics is smaller than that of a regular NTT, making Schönhage-Strassen the fastest known integer multiplication for very large operands.
Integration into sangi Int
The sangi Int multiplication first dispatches to the Prime NTT path (three-prime NTT + CRT) at 1,250 words (about 24,000 digits) on Windows MSVC or 3,000 words on Linux gcc/clang, and falls back to the Fermat NTT (Schönhage-Strassen) for the largest operands. This section covers the latter.
- Split the integer into $M = 2^l$ pieces of $K$ words each.
- Embed each piece into $\mathbb{Z}/(B^F + 1)$ and use $\omega = B^{2F/M}$ as the NTT root.
- Twiddle multiplications collapse to bit shifts.
- The piecewise multiplications recurse into Karatsuba / Toom / smaller NTT.
The choice of $M$ and $F$ is optimised with a piecewise cost model; see Int multiplication — NTT for the details.
Polynomial and integer multiplication duality
The coefficients of the polynomial product of $a(x) = \sum a_i x^i$ and $b(x) = \sum b_j x^j$ form the linear convolution $c_k = \sum_{i+j=k} a_i b_j$. The same convolution computes the digit product of two integers in base $B$, modulo final carry propagation.
Multi-precision multiplication therefore decomposes into "polynomial multiplication + carry propagation". As long as the coefficients fit, the polynomial part can be evaluated bit-exactly by NTT, so integer and polynomial multiplication share the same computational core.
- Polynomial multiplication (sangi):
FFT::convolve,Ntt::polyMul. - Integer multiplication (sangi):
Int::operator*Prime NTT branch (≥ 1,250 words on MSVC / ≥ 3,000 words on gcc/clang), with Fermat NTT fallback at very large operand sizes. - Polynomial mod $p$ multiplication:
Ntt::polyMul(Montgomery inside).
5-smooth NTT
Classical NTT requires $N$ to be a power of two that divides $p - 1$. NTT-friendly primes whose cofactor contains small odd primes (3, 5) allow the same butterflies to run at $N = 2^a \cdot 3^b \cdot 5^c$ (5-smooth) using mixed radix-{2,3,5}.
The sangi Ntt::forward / Ntt::inverse handle 5-smooth sizes directly. Non-5-smooth
sizes go through Bluestein, which reduces back to a 5-smooth convolution. Less padding means higher average
throughput.
Ntt::nextSmoothSize(n) returns the smallest 5-smooth size at least $n$, and
Ntt::isSmooth(n) exposes the predicate.
Montgomery modular multiplication
Every butterfly performs a multiplication modulo $p$. A direct implementation needs a 64×64 → 128-bit multiplication plus a 128/64-bit division for reduction, and the division dominates.
Montgomery (1985) reshapes the multiplication. Values are kept in Montgomery form $\tilde{a} = a \cdot R \bmod p$ with $R = 2^{64}$, and products reduce via
$$\mathrm{Reduce}(t) = \frac{t + ((t \cdot p') \bmod R) \cdot p}{R} \bmod p, \qquad p' = -p^{-1} \bmod R.$$
Division becomes a shift and the modulo becomes a low-word extract, so the whole reduction costs two integer multiplications and a shift. AVX2 widens the integer multiplications to four-way SIMD, and splitting the high and low halves separately keeps the butterfly throughput-bound. The combined effect is roughly a 5x speedup on the inner loop.
sangi Ntt handles the conversion to and from Montgomery form automatically;
clients call polyMul with ordinary integers.
sangi Ntt class
Quick summary of the public API (full reference in FFT API — Ntt).
| Function | Purpose |
|---|---|
Ntt::forward(data, N, prime) | Forward NTT (in place) |
Ntt::inverse(data, N, prime) | Inverse NTT including $1/N$ scaling |
Ntt::pointwiseMul(r, a, b, N, prime) | Elementwise multiplication in NTT space |
Ntt::polyMul(r, a, an, b, bn, prime) | Polynomial multiplication mod $p$ (high-level) |
Ntt::nextSmoothSize(n) | Next 5-smooth size $\geq n$ |
Ntt::isSmooth(n) | 5-smooth predicate |
Ntt::isValidSize(N, prime) | Checks $N \mid (p - 1)$ |
When to use what
- Polynomial multiplication mod $p$ →
Ntt::polyMul(does padding and Montgomery internally). - Forward / pointwise / inverse manually → call
forward,pointwiseMul,inverse. - Large integer products → use the
Intoperators, which switch to Prime NTT automatically and fall back to Schönhage-Strassen at very large sizes.
Design decisions
Three primes
A single prime overflows when the convolution coefficients grow large. Running three NTT-friendly primes in
parallel and recombining with CRT extends the effective coefficient width to roughly 156 bits. The sangi
prime1/2/3 are sized to share the same $2^{47}$ maximum NTT length for this use case.
FFT class ntt / intt are deprecated
A legacy ntt, intt, and convolve_ntt live under
FFT<ModularInt<P>> for backward compatibility, but new code should call the
Ntt class.
Why NTT for integers instead of complex FFT
Complex FFT applied to big-integer multiplication starts to break down once the product of the coefficient
bit width and $\log_2 N$ exceeds the 53 mantissa bits of double: rounding error then corrupts the
result. NTT computes in a finite ring, so accuracy is guaranteed by the algorithm and no result verification
is needed.
References
- Pollard, J. M. (1971). "The Fast Fourier Transform in a Finite Field". Mathematics of Computation, 25, 365–374.
- Schönhage, A. and Strassen, V. (1971). "Schnelle Multiplikation großer Zahlen". Computing, 7, 281–292.
- Montgomery, P. L. (1985). "Modular Multiplication Without Trial Division". Mathematics of Computation, 44, 519–521.
- Crandall, R. and Fagin, B. (1994). "Discrete weighted transforms and large-integer arithmetic". Mathematics of Computation, 62, 305–324.
- Harvey, D. (2014). "Faster arithmetic for number-theoretic transforms". Journal of Symbolic Computation, 60, 113–119.