How does NTT differ from FFT?

FFT runs the same butterflies over the complex field and incurs floating-point rounding error. NTT replaces the complex unit root with a primitive root of unity inside a finite ring, so every operation stays in exact integer arithmetic and the convolution is bit-exact. NTT is the right choice when exactness matters, as in big-integer multiplication.

What is an NTT-friendly prime?

A prime p whose p − 1 contains a large power-of-two factor. Writing p − 1 = c · 2^k allows NTTs of length up to 2^k. The sangi primes prime1, prime2, and prime3 equal 30·2^47 + 1, 345·2^47 + 1, and 420·2^47 + 1 respectively, all close to 2^52 and supporting transforms up to length 2^47. Combining all three via CRT yields an effective coefficient width near 156 bits.

How does Schönhage-Strassen differ from a standard NTT?

Schönhage-Strassen runs the NTT over the ring Z/(B^F + 1) instead of a prime field. Inside this Fermat-like ring 2 is a (near-) primitive root, so every twiddle multiplication reduces to a bit shift. The sangi Int multiplication first switches into the Prime NTT path (three-prime NTT + CRT) at 1,250 words on Windows MSVC or 3,000 words on Linux gcc/clang, and falls back to Fermat NTT (Schönhage-Strassen) for the largest operands.

What is the point of 5-smooth NTT?

Classical NTT only runs fast at power-of-two sizes, so inputs have to be padded. A 5-smooth NTT (N = 2^a · 3^b · 5^c) needs less padding and reaches higher effective throughput. The sangi Ntt class handles 5-smooth sizes directly and falls back to Bluestein when the requested size is not 5-smooth.

What is Montgomery multiplication?

Montgomery (1985) showed how to compute a · b mod p with two integer multiplications and a shift instead of a division. sangi Ntt uses an AVX2 Montgomery NTT so that the modular multiplication on every butterfly stays off the critical path.

NTT and Big-Integer Multiplication

Overview

The Number Theoretic Transform (NTT) runs DFT-style butterflies over a finite ring — either a prime field $\mathbb{F}_p$ or the Fermat ring $\mathbb{Z}/(B^F + 1)$ — instead of the complex numbers. All arithmetic stays in integers, so the transform is bit-exact, which matters for big-integer arithmetic, lattice cryptography (NTRU, Kyber), and error-correcting codes.

This article ties together the two places where sangi uses NTT: the bundled Ntt class, and the Prime NTT (three-prime NTT + CRT) path that Int multiplication switches into above 1,250 words (MSVC) / 3,000 words (gcc/clang), with a Fermat NTT (Schönhage-Strassen) fallback for the largest operands.

NTT versus FFT

The radix-2 butterfly $X[k] = E[k] + \omega_N^k O[k]$ uses the unit root $\omega_N = e^{-2\pi i / N} \in \mathbb{C}$. Replacing $\omega_N$ with an order-$N$ primitive root of unity in a finite field $\mathbb{F}_p$ preserves the same butterfly structure, yielding the discrete Fourier transform of the finite-field-valued sequence.

$$\omega_N \in \mathbb{F}_p, \quad \omega_N^N = 1, \quad \omega_N^k \neq 1 \;(0 < k < N).$$

Because every operation is an integer modulo $p$, the output is exact. The price is the divisibility condition $N \mid (p - 1)$, which only some primes satisfy.

NTT-friendly primes

The maximum transform length is set by the power-of-two factor in $p - 1$. Primes of the form $p = c \cdot 2^k + 1$ (proth-like, with small $c$) are called NTT-friendly.

The sangi Ntt class exposes three such primes that fit inside 64-bit integers:

Function	Prime $p$	Primitive root $g$	Max NTT length
`prime1()`	$30 \times 2^{47} + 1$	19	$2^{47}$
`prime2()`	$345 \times 2^{47} + 1$	13	$2^{47}$
`prime3()`	$420 \times 2^{47} + 1$	17	$2^{47}$

All three are about $2^{52}$, so 64-bit multiplications fit in 128 bits. The cofactors $30 = 2 \cdot 3 \cdot 5$, $345 = 3 \cdot 5 \cdot 23$, and $420 = 2^2 \cdot 3 \cdot 5 \cdot 7$ keep the 5-smooth part of $p - 1$ large, so 5-smooth NTT applies directly.

If a single prime is too narrow for the convolution coefficients, run the NTT under all three primes and recombine via the Chinese remainder theorem (CRT). The effective bit width grows to roughly $3 \cdot 52 = 156$ bits.

Schönhage-Strassen and Fermat NTT

Schönhage and Strassen (1971) showed how to run the NTT over the ring $\mathbb{Z}/(B^F + 1)$ instead of a prime field. With $B = 2^{64}$ (the limb radix) and $F$ a splitting parameter, the value $2$ becomes (close to) an order-$2F$ primitive root.

$$\omega = 2 \in \mathbb{Z}/(B^F + 1), \qquad \omega^{2F} \equiv 1.$$

The crucial consequence is that the unit root is a bit shift: every twiddle multiplication is free. The constant in front of the $O(n \log n \log \log n)$ asymptotics is smaller than that of a regular NTT, making Schönhage-Strassen the fastest known integer multiplication for very large operands.

Integration into sangi Int

The sangi Int multiplication first dispatches to the Prime NTT path (three-prime NTT + CRT) at 1,250 words (about 24,000 digits) on Windows MSVC or 3,000 words on Linux gcc/clang, and falls back to the Fermat NTT (Schönhage-Strassen) for the largest operands. This section covers the latter.

Split the integer into $M = 2^l$ pieces of $K$ words each.
Embed each piece into $\mathbb{Z}/(B^F + 1)$ and use $\omega = B^{2F/M}$ as the NTT root.
Twiddle multiplications collapse to bit shifts.
The piecewise multiplications recurse into Karatsuba / Toom / smaller NTT.

The choice of $M$ and $F$ is optimised with a piecewise cost model; see Int multiplication — NTT for the details.

Polynomial and integer multiplication duality

The coefficients of the polynomial product of $a(x) = \sum a_i x^i$ and $b(x) = \sum b_j x^j$ form the linear convolution $c_k = \sum_{i+j=k} a_i b_j$. The same convolution computes the digit product of two integers in base $B$, modulo final carry propagation.

Multi-precision multiplication therefore decomposes into "polynomial multiplication + carry propagation". As long as the coefficients fit, the polynomial part can be evaluated bit-exactly by NTT, so integer and polynomial multiplication share the same computational core.

Polynomial multiplication (sangi): FFT::convolve, Ntt::polyMul.
Integer multiplication (sangi): Int::operator* Prime NTT branch (≥ 1,250 words on MSVC / ≥ 3,000 words on gcc/clang), with Fermat NTT fallback at very large operand sizes.
Polynomial mod $p$ multiplication: Ntt::polyMul (Montgomery inside).

5-smooth NTT

Classical NTT requires $N$ to be a power of two that divides $p - 1$. NTT-friendly primes whose cofactor contains small odd primes (3, 5) allow the same butterflies to run at $N = 2^a \cdot 3^b \cdot 5^c$ (5-smooth) using mixed radix-{2,3,5}.

The sangi Ntt::forward / Ntt::inverse handle 5-smooth sizes directly. Non-5-smooth sizes go through Bluestein, which reduces back to a 5-smooth convolution. Less padding means higher average throughput.

Ntt::nextSmoothSize(n) returns the smallest 5-smooth size at least $n$, and Ntt::isSmooth(n) exposes the predicate.

Montgomery modular multiplication

Every butterfly performs a multiplication modulo $p$. A direct implementation needs a 64×64 → 128-bit multiplication plus a 128/64-bit division for reduction, and the division dominates.

Montgomery (1985) reshapes the multiplication. Values are kept in Montgomery form $\tilde{a} = a \cdot R \bmod p$ with $R = 2^{64}$, and products reduce via

$$\mathrm{Reduce}(t) = \frac{t + ((t \cdot p') \bmod R) \cdot p}{R} \bmod p, \qquad p' = -p^{-1} \bmod R.$$

Division becomes a shift and the modulo becomes a low-word extract, so the whole reduction costs two integer multiplications and a shift. AVX2 widens the integer multiplications to four-way SIMD, and splitting the high and low halves separately keeps the butterfly throughput-bound. The combined effect is roughly a 5x speedup on the inner loop.

sangi Ntt handles the conversion to and from Montgomery form automatically; clients call polyMul with ordinary integers.

sangi Ntt class

Quick summary of the public API (full reference in FFT API — Ntt).

Function	Purpose
`Ntt::forward(data, N, prime)`	Forward NTT (in place)
`Ntt::inverse(data, N, prime)`	Inverse NTT including $1/N$ scaling
`Ntt::pointwiseMul(r, a, b, N, prime)`	Elementwise multiplication in NTT space
`Ntt::polyMul(r, a, an, b, bn, prime)`	Polynomial multiplication mod $p$ (high-level)
`Ntt::nextSmoothSize(n)`	Next 5-smooth size $\geq n$
`Ntt::isSmooth(n)`	5-smooth predicate
`Ntt::isValidSize(N, prime)`	Checks $N \mid (p - 1)$

When to use what

Polynomial multiplication mod $p$ → Ntt::polyMul (does padding and Montgomery internally).
Forward / pointwise / inverse manually → call forward, pointwiseMul, inverse.
Large integer products → use the Int operators, which switch to Prime NTT automatically and fall back to Schönhage-Strassen at very large sizes.

Design decisions

Three primes

A single prime overflows when the convolution coefficients grow large. Running three NTT-friendly primes in parallel and recombining with CRT extends the effective coefficient width to roughly 156 bits. The sangi prime1/2/3 are sized to share the same $2^{47}$ maximum NTT length for this use case.

FFT class `ntt` / `intt` are deprecated

A legacy ntt, intt, and convolve_ntt live under FFT<ModularInt<P>> for backward compatibility, but new code should call the Ntt class.

Why NTT for integers instead of complex FFT

Complex FFT applied to big-integer multiplication starts to break down once the product of the coefficient bit width and $\log_2 N$ exceeds the 53 mantissa bits of double: rounding error then corrupts the result. NTT computes in a finite ring, so accuracy is guaranteed by the algorithm and no result verification is needed.

References

Pollard, J. M. (1971). "The Fast Fourier Transform in a Finite Field". Mathematics of Computation, 25, 365–374.
Schönhage, A. and Strassen, V. (1971). "Schnelle Multiplikation großer Zahlen". Computing, 7, 281–292.
Montgomery, P. L. (1985). "Modular Multiplication Without Trial Division". Mathematics of Computation, 44, 519–521.
Crandall, R. and Fagin, B. (1994). "Discrete weighted transforms and large-integer arithmetic". Mathematics of Computation, 62, 305–324.
Harvey, D. (2014). "Faster arithmetic for number-theoretic transforms". Journal of Symbolic Computation, 60, 113–119.