Number-Theoretic Transform (NTT)

Q: What is the difference between NTT and FFT?

FFT uses the complex root of unity e^{-j 2π/N}, whereas NTT uses a primitive N-th root of unity in the finite field Z/pZ. NTT works entirely with integer modular arithmetic, so round-off error is zero. The Cooley-Tukey divide-and-conquer structure is identical.

Q: What is an NTT-friendly prime?

A prime p such that p-1 has a large power-of-two factor. A standard example is p = 998244353 = 119 · 2^23 + 1, which supports NTTs up to length 2^23. For CRYSTALS-Kyber, p = 3329 = 13 · 2^8 + 1 is used with N = 256.

A Discrete Fourier Transform over Finite Fields

Advanced

Published: 2026-03-29

1. Definition and History

The Number-Theoretic Transform (NTT) is a discrete Fourier transform defined over a finite field $\mathbb{Z}/p\mathbb{Z}$ (with $p$ prime) instead of the complex field $\mathbb{C}$. It preserves the mathematical structure of the FFT while operating entirely on integer modular arithmetic, so round-off error never arises.

Historical Timeline

Year	Researcher	Contribution
1971	Pollard	FFT construction over a finite field — the mathematical prototype of the NTT
1971	Schönhage & Strassen	Applied NTT to multiprecision integer multiplication, achieving $O(n \log n \log \log n)$
1990s	GMP, Magma, etc.	Computer algebra systems adopted NTT as the standard for high-end integer multiplication
2007	Lyubashevsky, Peikert, Regev	Ring-LWE problem — NTT accelerates polynomial-ring arithmetic for lattice cryptography
2019	Harvey & van der Hoeven	True $O(n \log n)$ integer multiplication algorithm (theoretical end-point)
2024	NIST	CRYSTALS-Kyber and Dilithium standardised for PQC — NTT becomes the cryptographic core

Contrast with FFT

The FFT uses the complex twiddle factor $e^{-j 2\pi / N}$ and therefore suffers from unavoidable floating-point round-off error. The NTT operates purely on integer modular arithmetic, so no round-off error is introduced. This is the decisive advantage that makes NTT indispensable for multiprecision integer multiplication and lattice-based cryptography.

Correspondence between FFT and NTT

Figure 1: NTT inherits the algorithmic structure of the FFT (Cooley-Tukey divide-and-conquer, butterfly operations) and only swaps the underlying field from the complex numbers to a finite field. Only the choice of $\omega$ differs.

2. Primitive Roots and Primitive $N$-th Roots

Primitive Root

For a prime $p$, the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^* = \{1, 2, \ldots, p-1\}$ is a cyclic group of order $p - 1$. An element $g$ that generates this group is called a primitive root:

$$\{g^0, g^1, g^2, \ldots, g^{p-2}\} = \{1, 2, \ldots, p-1\} \pmod p$$

The order of $g$ equals $p - 1$ (no smaller positive integer $k$ satisfies $g^k \equiv 1 \pmod p$).

Existence and Count

Existence: a primitive root exists for every prime $p$ (Gauss, 1801)
Number: $\varphi(p - 1)$ primitive roots, where $\varphi$ is Euler's totient
Algorithm: try $g = 2, 3, 5, \ldots$; for each prime factor $q$ of $p - 1$ verify that $g^{(p-1)/q} \not\equiv 1 \pmod p$. A primitive root is typically found within $O(\log p)$ trials

Primitive $N$-th Roots

When $N \mid (p - 1)$, the element $\omega = g^{(p-1)/N} \bmod p$ is a primitive $N$-th root of unity in $\mathbb{Z}/p\mathbb{Z}$:

$\omega^N \equiv 1 \pmod p$ ($N$-th root of unity)
$\omega^k \not\equiv 1 \pmod p$ for $0 < k < N$ (primitivity)
$\omega^{N/2} \equiv -1 \pmod p$ (the "halving" property used by the FFT butterfly)

This $\omega$ is the NTT analogue of $e^{-j 2\pi / N}$. It satisfies the same FFT properties:

Periodicity: $\omega^{k + N} = \omega^k$
Symmetry: $\omega^{k + N/2} = -\omega^k$
Halving: $\omega^2$ is a primitive $(N/2)$-th root of unity

These properties license the Cooley-Tukey divide-and-conquer structure verbatim.

Example: $p = 17, N = 4$

$p - 1 = 16$. A primitive root is $g = 3$ (verify: $3^1 = 3, 3^2 = 9, 3^4 = 81 \equiv 13, 3^8 \equiv 16 \equiv -1$, so the order is 16). For $N = 4$, $\omega = 3^{16/4} = 3^4 \equiv 13 \pmod{17}$.

Check: $13^2 = 169 \equiv 169 - 9 \cdot 17 = 16 \equiv -1 \pmod{17}$, and $13^4 \equiv (-1)^2 = 1 \pmod{17}$. ✓

3. NTT Formulation and Its Correspondence with FFT

Transform Definition

The NTT of an integer sequence $\{a[n]\}_{n=0}^{N-1}$ (with $a[n] \in \mathbb{Z}/p\mathbb{Z}$) is:

$$A[k] = \displaystyle\sum_{n=0}^{N-1} a[n] \cdot \omega^{nk} \bmod p, \quad k = 0, 1, \ldots, N-1$$

The inverse transform (INTT) is:

$$a[n] = N^{-1} \displaystyle\sum_{k=0}^{N-1} A[k] \cdot \omega^{-nk} \bmod p$$

Here $N^{-1}$ is the modular inverse of $N$ (computed as $N^{p-2} \bmod p$ via Fermat's little theorem), and $\omega^{-1} = \omega^{N-1} \bmod p$.

Matrix Form

The NTT is a linear transform and admits a matrix representation:

$$\mathbf{A} = \Omega_N \, \mathbf{a}, \quad (\Omega_N)_{k,n} = \omega^{kn} \bmod p$$

$\Omega_N$ is a symmetric Vandermonde-type matrix whose inverse is $\Omega_N^{-1} = N^{-1} \cdot \overline{\Omega_N}$ (with $(\overline{\Omega_N})_{k,n} = \omega^{-kn}$) — exactly mirroring the FFT.

Convolution Theorem

NTT Convolution Theorem

For the cyclic convolution $c[n] = \sum_k a[k] b[(n-k) \bmod N] \bmod p$ over $\mathbb{Z}_p$:

$$\mathrm{NTT}(c) = \mathrm{NTT}(a) \odot \mathrm{NTT}(b) \pmod p$$

(where $\odot$ denotes element-wise product). This is isomorphic to the FFT convolution theorem and underpins all fast-convolution applications of NTT.

Side-by-Side Properties

Property	FFT	NTT
Linearity	$\mathrm{F}(\alpha a + \beta b) = \alpha \mathrm{F}(a) + \beta \mathrm{F}(b)$	Same (mod $p$)
Shift	$\mathrm{F}(a[(n-m) \bmod N]) = \omega^{mk} A[k]$	Same (mod $p$)
Convolution theorem	$\mathrm{F}(a * b) = A \odot B$	Same (mod $p$)
Parseval	$\sum \|a[n]\|^2 = \tfrac{1}{N} \sum \|A[k]\|^2$	$\sum a[n]^2 \equiv N^{-1} \sum A[k] A[-k] \pmod p$

4. NTT-Friendly Primes

Running an NTT of length $N = 2^m$ requires $2^m \mid (p - 1)$ (so that $\omega = g^{(p-1)/N}$ is well-defined as an integer power). A prime $p$ in which $p - 1$ contains a large power-of-two factor is called NTT-friendly:

$$p = c \cdot 2^a + 1, \quad c \text{ small odd}, \quad a \text{ large}$$

Major NTT-Friendly Primes

Prime $p$	Factorisation of $p - 1$	Max $N$	Primitive root $g$	Typical use
$998244353$	$119 \cdot 2^{23}$	$2^{23} \approx 8.4 \cdot 10^6$	$3$	Competitive programming standard
$469762049$	$7 \cdot 2^{26}$	$2^{26} \approx 6.7 \cdot 10^7$	$3$	Arbitrary-mod convolution
$167772161$	$5 \cdot 2^{25}$	$2^{25} \approx 3.4 \cdot 10^7$	$3$	Arbitrary-mod convolution
$754974721$	$45 \cdot 2^{24}$	$2^{24} \approx 1.7 \cdot 10^7$	$11$	Auxiliary in 3-prime CRT
$2013265921$	$15 \cdot 2^{27}$	$2^{27} \approx 1.3 \cdot 10^8$	$31$	Large-scale convolution
$2281701377$	$17 \cdot 2^{27}$	$2^{27}$	$3$	Same as above
$3221225473$	$3 \cdot 2^{30}$	$2^{30} \approx 1.1 \cdot 10^9$	$5$	30-bit large-scale
$3329$	$2^8 \cdot 13$	$2^8 = 256$	$3$	CRYSTALS-Kyber
$8380417$	$2^{13} \cdot 1023$	$2^{13}$	$10$	CRYSTALS-Dilithium

Bit Width versus Use Case

Bit width of $p$	Representative	Use case
$\sim 12$ bit	$3329$ (Kyber)	Lattice cryptography, small coefficients
$\sim 23$ bit	$8380417$ (Dilithium)	Lattice signatures
$\sim 30$ bit	$998244353$	Fits 32-bit arithmetic; competitive programming
$\sim 31$ bit	$2013265921$, $2281701377$	Pair-up to recover 62-bit values
$\sim 62$ bit	Solinas primes, $2^{61} - 1$	Bigint multiplication using a single prime

2-Adic Valuation of $p - 1$

The exponent $a$ such that $p - 1 = c \cdot 2^a$ with $c$ odd is the 2-adic valuation $\nu_2(p - 1) = a$. This single number determines the maximum NTT length $N_{\max} = 2^a$. Searching for NTT-friendly primes amounts to generating primes with large $\nu_2(p - 1)$.

5. Fast Modular Arithmetic (Montgomery / Barrett)

The inner loop of an NTT is dominated by "multiply plus reduce mod $p$". A naive (a * b) % p is slow because integer division is slow on modern CPUs. Production implementations therefore replace divisions entirely.

Montgomery Multiplication

Introduced by Peter L. Montgomery in 1985. Choose $R = 2^k$ with $p < R$ (typically $k = 32$ or $64$) and keep every value in Montgomery form $\tilde{a} = a R \bmod p$. The Montgomery product is:

$$\mathrm{MontMul}(\tilde{a}, \tilde{b}) = \tilde{a} \tilde{b} R^{-1} \bmod p$$

executed without division (Montgomery reduction). Only a single shift by $\log_2 R$ replaces the division.

Barrett Reduction

Introduced by Paul Barrett in 1986. Pre-compute $\mu = \lfloor 2^{2k} / p \rfloor$ once; then for any $0 \le x < p^2$:

$$x \bmod p \approx x - \lfloor x \mu / 2^{2k} \rfloor \cdot p$$

The result is $x \bmod p$ or $x \bmod p + p$, so a final conditional subtraction is required. Unlike Montgomery, no pre/post conversion of inputs is needed.

Comparison

Item	Montgomery	Barrett
Input conversion	Required ($a \to aR$)	Not required
Output conversion	Required ($aR^{-1} \to a$)	Not required
Cost per multiply	2 mul + 1 add/sub	3 mul + 1 add/sub
Conditional branches	1 ($p$ subtraction check)	1-2
NTT inner loop	Montgomery wins (multiplications dominate)	Conversion overhead never amortised
One-off modular reduction	Conversion overhead	Barrett wins

Solinas Primes (Special-Form Primes)

Primes of the form $p = 2^a - c$ (small $c$) or $p = 2^a + 2^b - 1$ admit specialised modular reduction (Solinas, 1999):

$$x \bmod (2^a - c) \approx (x \bmod 2^a) + c \lfloor x / 2^a \rfloor$$

Combined with Montgomery in the NTT hot loop, this can eliminate all division-like instructions. Mersenne primes $2^{31} - 1$ and $2^{61} - 1$ are special cases of Solinas primes.

6. Algorithm

The NTT algorithm has exactly the same structure as the Cooley-Tukey FFT — only the complex multiplications are replaced by modular ones. An in-place iterative (decimation-in-time) version:

function ntt(a, N, p, omega):
    // Step 1: bit-reversal permutation
    a = bit_reverse_copy(a, N)
    // Step 2: layered butterflies
    for s = 1 to log2(N):
        m = 2^s
        wm = power(omega, N/m, p)        // m-th root = omega^{N/m}
        for k = 0 to N-1 step m:
            w = 1
            for r = 0 to m/2-1:
                t = (w * a[k + r + m/2]) mod p
                u = a[k + r]
                a[k + r]       = (u + t) mod p
                a[k + r + m/2] = (u - t + p) mod p
                w = (w * wm) mod p
    return a

Cost per Pass

Number of butterflies: $\tfrac{N}{2} \log_2 N$ (identical to FFT)
Per butterfly: 1 multiply + 2 add/sub (all mod $p$)
Total: $O(N \log N)$ modular multiplications

Montgomery-Form NTT

In practice every value is held in Montgomery form, and the inner-loop w * a[...] is executed via Montgomery multiplication. Powers of $\omega$ are pre-computed in Montgomery form:

// Pre-compute Montgomery-form roots
omega_mont = (omega * R) mod p
wm_table[s] = (omega^{N/(2^s)} * R) mod p  for s = 1..log2(N)

// Hot inner loop (no division anywhere)
for r = 0 to m/2-1:
    t = MontMul(w, a[k+r+m/2])
    a[k+r+m/2] = SubMod(a[k+r], t)
    a[k+r]     = AddMod(a[k+r], t)
    w = MontMul(w, wm_table[s])

Parallelism

SIMD: AVX-512 VPMULQ with Montgomery reduction supports eight-way parallel modular multiplication
GPU: warp-level implementations are published in cuFHE and elsewhere
Multi-core: the outer loops over butterfly stages parallelise cleanly with $O(\log N)$ synchronisation barriers

The inverse NTT is obtained by replacing $\omega$ with $\omega^{-1}$ and multiplying every output by $N^{-1} \bmod p$ — perfectly symmetric to the forward transform.

7. Negacyclic NTT and $\mathbb{Z}_p[x]/(x^N+1)$

The standard NTT computes a cyclic convolution (modulo $x^N - 1$). Lattice cryptography (Ring-LWE) instead requires an anti-cyclic (negacyclic) convolution modulo $x^N + 1$. The negacyclic NTT computes this directly.

The Anti-Cyclic Convolution

In the polynomial ring $\mathbb{Z}_p[x]/(x^N + 1)$ we have $x^N \equiv -1$, so shifting coefficients flips signs:

$$x^N \cdot a_n = -a_n$$

The ordinary cyclic convolution is therefore unsuitable for ring multiplication.

Construction of the Negacyclic NTT

Take a primitive $2N$-th root of unity $\psi$ (which requires $2N \mid (p - 1)$, so $\psi^N \equiv -1 \pmod p$) and "twist" the coefficients before transforming:

$$\tilde{a}_n = \psi^n \cdot a_n$$

Apply the standard length-$N$ NTT to $\tilde{a}$. After pointwise multiplication, an inverse twist $\psi^{-n}$ recovers the product in $\mathbb{Z}_p[x]/(x^N + 1)$.

The twist and untwist cost $O(N)$ each — negligible compared to the $O(N \log N)$ transform.

Implementation in CRYSTALS-Kyber

Kyber operates with $p = 3329$ and $N = 256$. Since $2N = 512 \mid 3328 = 2^8 \cdot 13$, the negacyclic NTT applies directly. Choosing $\psi$ as a primitive 512-th root of unity, all arithmetic in $\mathbb{Z}_{3329}[x]/(x^{256} + 1)$ runs through the NTT.

8. Polynomial Multiplication and CRT Composition

To compute the product $h(x) = f(x) \cdot g(x)$ of polynomials of degrees $n$ and $m$, the result has degree $n + m$ and $n + m + 1$ coefficients.

Linear-Convolution Polynomial Product via NTT

Choose a power of two $N \ge n + m + 1$
Zero-pad the coefficient arrays of $f, g$ to length $N$ (linear convolution = cyclic convolution after padding)
Compute $F = \mathrm{NTT}(f)$ and $G = \mathrm{NTT}(g)$
Pointwise: $H[k] = F[k] \cdot G[k] \bmod p$ ($O(N)$)
Recover the product coefficients $h = \mathrm{INTT}(H)$

The complexity is $O(N \log N)$ versus the direct cost $O(nm)$ — a dramatic speed-up for large polynomials.

When Coefficients Exceed the Modulus: CRT Composition

If the product coefficients $c_i$ can exceed $p$ — so that $c_i \bmod p$ collides — perform the NTT product in parallel under several NTT-friendly primes $p_1, p_2, p_3$ and recombine with the Chinese Remainder Theorem (CRT):

Run NTT products under $p_1 = 998244353$, $p_2 = 985661441$, $p_3 = 754974721$ (or similar)
For each coefficient $c_i$, obtain $c_i \bmod p_j$ ($j = 1, 2, 3$)
Recover $c_i \bmod (p_1 p_2 p_3)$ via the Garner algorithm (incremental CRT)

Capacity of the 3-Prime CRT

$p_1 p_2 p_3 \approx 7.6 \times 10^{26} \approx 2^{89}$ — about 89-bit coefficients can be recovered exactly. Even with $N = 2^{20}$ inputs each of magnitude $V = 2^{32}$, the worst-case coefficient bound $c_i \le N V^2$ fits comfortably.

Garner Algorithm

Efficient 3-prime CRT recovery:

// Input: x mod p1, x mod p2, x mod p3
// Output: x mod (p1*p2*p3)
inv_12 = modinv(p1, p2)
inv_123 = modinv(p1*p2 mod p3, p3)
v1 = x_mod_p1
v2 = (x_mod_p2 - v1) * inv_12 mod p2
v3 = ((x_mod_p3 - v1) - v2*p1) * inv_123 mod p3
return v1 + v2*p1 + v3*p1*p2

Implementation in sangi

The sangi library implements a 5-smooth NTT on sizes $N = 2^a 3^b 5^c$ with radix-3 and radix-5 butterflies. The set of usable sizes is much denser than the standard $N = 2^k$ alone, so the NTT engages for a wider class of inputs. The construction follows the mixed-radix Cooley-Tukey idea applied to NTT.

9. Multiprecision Integer Multiplication (Schönhage-Strassen)

To multiply $d$-digit integers $A, B$ with NTT:

Represent each integer as a coefficient sequence in base $B$ (e.g.\ $B = 2^{16}$ or $B = 10^4$), giving polynomials of length $L = \lceil d / \log_B \rceil$
NTT-multiply the polynomials to obtain the "digit" coefficients of $A \cdot B$ (which may exceed $B$)
Carry propagation: split each $c_i$ into $c_i \bmod B$ plus a carry $\lfloor c_i / B \rfloor$ into the next position

The complexity is $O(d \log d \log \log d)$ (Schönhage-Strassen), or $O(d \log d \cdot 2^{O(\log^* d)})$ (Fürer 2007), and finally $O(d \log d)$ (Harvey-van der Hoeven 2019). This is asymptotically faster than schoolbook $O(d^2)$, Karatsuba $O(d^{1.585})$, and Toom-Cook $O(d^{1.465})$.

The Schönhage-Strassen Algorithm

The classical 1971 NTT-based integer-multiplication algorithm:

Split: divide $A, B$ into $K$ blocks, treating them as polynomials of length $L = 2K$
Work modulo $2^M + 1$: each coefficient is $M$ bits wide, with $M = O(d^{1/2})$ a typical choice
Length-$L$ NTT: $\omega = 2^{M/L}$ becomes a primitive $L$-th root of unity
Coefficient products: products of $M$-bit numbers are computed recursively by the same algorithm ($\log \log d$ levels of recursion)
Inverse NTT plus carry propagation

NTT modulo $2^M + 1$

The Schönhage-Strassen variant works in the residue ring $\mathbb{Z}/(2^M + 1)\mathbb{Z}$ instead of $\mathbb{Z}_p$. $2$ becomes a primitive $2M$-th root of unity ($2^{2M} = (2^M)^2 \equiv (-1)^2 = 1$, plus primitivity), so $\omega = 2^{2M/L}$ multiplications reduce to bit shifts. This is the algorithm's signature speed-up.

Modern Library Implementations

Library	Multiplication strategy	NTT role
GMP	Karatsuba → Toom-3/4/6/8 → FFT-style (NTT)	Engaged above ~10⁴ digits
FLINT	NTT-centric (from medium sizes upward)	Both polynomials and integers
NTL	SS-Modular for big integers	Standard in cryptographic research
sangi	Karatsuba → Toom-3/4/6/8 → 5-smooth NTT	Engaged at $N \ge 1200$ words

Real implementations follow a hybrid strategy: schoolbook/Karatsuba for small inputs, Toom-Cook for medium, NTT for large. Crossover thresholds depend on hardware and implementation.

10. Lattice-Based and Post-Quantum Cryptography

The NIST-standardised post-quantum cryptography (PQC) schemes CRYSTALS-Kyber and CRYSTALS-Dilithium (2024) both rely on NTT for fast polynomial-ring arithmetic.

Ring-LWE and Polynomial Arithmetic

The Ring Learning With Errors (RLWE) problem is set in the polynomial ring $R = \mathbb{Z}_p[x]/(x^N + 1)$:

$$b = a \cdot s + e \pmod{p, x^N + 1}$$

Recovering the secret $s$ is computationally hard. With $a, s, e \in R$ as polynomials of degree $N - 1$ for fairly large $N$ ($256, 512, 1024$), the product $a \cdot s$ dominates the cost and is sped up to $O(N \log N)$ by the NTT.

Major PQC Schemes

Scheme	$p$	$N$	Purpose	Standardisation
CRYSTALS-Kyber (ML-KEM)	$3329$	$256$	Key encapsulation	NIST FIPS 203 (2024)
CRYSTALS-Dilithium (ML-DSA)	$8380417$	$256$	Signature	NIST FIPS 204 (2024)
Falcon	$12289$	$512, 1024$	Compact signature	NIST FIPS 206 (in progress)
NewHope (legacy)	$12289$	$1024$	Key exchange	Absorbed into Kyber

NTT in Fully Homomorphic Encryption (FHE)

BFV, BGV, and CKKS — all based on the same polynomial-ring structure — use NTT in encryption, decryption, and homomorphic multiplication. Standard libraries (Microsoft SEAL, IBM HElib, Lattigo) implement NTT pervasively.

Side-Channel Resistance and Constant-Time Implementation

NTT for cryptographic use must be implemented in constant time. Data-dependent branches or memory accesses leak secret values through timing attacks. The final conditional subtraction in Montgomery reduction is replaced by a masked branchless form (always subtract, then mask-select the right value).

11. Competitive Programming

"Convolution mod $p$" is a recurring competitive-programming theme, and the NTT is an essential technique.

Typical Problems

Polynomial multiplication: product of degree-$10^5$ polynomials mod $998244353$
Convolution counting: $c_k = \sum_{i+j=k} a_i b_j$
Subset-sum counts: generating functions for combinatorial enumeration
Formal power series: $\exp, \log, \sqrt{}, \mathrm{inv}$ truncated mod $x^n$ via Newton iteration + NTT

Arbitrary-Modulus Convolution

Problems often demand convolution under a modulus other than $998244353$ (e.g.\ $10^9 + 7$). The standard recipe is the three-prime NTT: run independent NTTs under $p_1, p_2, p_3$ and apply Garner to reconstruct the target. Total cost is roughly $3$–$4 \times$ a single NTT.

AtCoder Library Convolution

The AtCoder Library (ACL) C++ template atcoder::convolution<Mod> ships an NTT-based convolution that picks the optimal strategy automatically. It is fastest at $\mathrm{Mod} = 998244353$ but supports other moduli too.

12. NTT vs FFT Comparison

Item	FFT (complex)	NTT (integer)
Underlying field	$\mathbb{C}$ (floating-point)	$\mathbb{Z}/p\mathbb{Z}$ (finite field)
Twiddle factor	$e^{-j 2\pi / N}$	$g^{(p-1)/N} \bmod p$
Error	$O(\sqrt{N} \, \epsilon_{\mathrm{mach}})$	Exact (0)
Cost per multiply	4 fp mul + 2 fp add	1 int mul + mod reduce
SIMD parallelism	Excellent (FP units)	Good (integer units)
Constraint on $N$	Power-of-two is best	Requires $N \mid (p - 1)$
Upper bound on $N$	Practically unbounded	Limited by $\nu_2(p - 1)$
Coefficient domain	Continuous reals	$\{0, 1, \ldots, p - 1\}$
Primary use cases	Signal processing, PDEs, imaging	Bigint, cryptography, competitive programming
Hardware	FP SIMD (AVX/SSE)	Integer SIMD, PQC ASICs
Reference libraries	FFTW, cuFFT, MKL	FLINT, NTL, ACL

When to Choose Which

Continuous values, approximation acceptable: FFT (signal processing, PDE solvers)
Exact integer results required: NTT (bigint, cryptography, competitive programming)
Arbitrary $N$ with no 2-adic constraint: FFT plus Bluestein fall-back
Fixed $N$, speed paramount: NTT (Kyber/Dilithium)

13. Worked Examples

Example 1: Minimal NTT with $p = 5$, $N = 4$

$p = 5$, primitive root $g = 2$, $\omega = g^{(5-1)/4} = 2^1 = 2$ (verify $2^4 = 16 \equiv 1 \pmod 5$).

Transforming $a = [1, 2, 3, 4]$:

$$A[0] = (1 + 2 + 3 + 4) \bmod 5 = 10 \bmod 5 = 0$$ $$A[1] = (1 + 2 \cdot 2 + 3 \cdot 4 + 4 \cdot 3) \bmod 5 = 29 \bmod 5 = 4$$ $$A[2] = (1 + 2 \cdot 4 + 3 \cdot 1 + 4 \cdot 4) \bmod 5 = 28 \bmod 5 = 3$$ $$A[3] = (1 + 2 \cdot 3 + 3 \cdot 4 + 4 \cdot 2) \bmod 5 = 27 \bmod 5 = 2$$

Result: $A = [0, 4, 3, 2]$. Every operation is integer modular arithmetic — exact, no rounding.

Example 2: Polynomial Product mod $998244353$

Multiply $f(x) = 1 + 2x + 3x^2$ by $g(x) = 4 + 5x$ mod $998244353$. The combined degree is $3$, so choose $N = 4$.

Compute $\omega = 3^{(998244353 - 1) / 4} \bmod 998244353$ (32-bit integer arithmetic suffices).

$f = [1, 2, 3, 0]$, $g = [4, 5, 0, 0]$ (zero-padded)
$F = \mathrm{NTT}(f), G = \mathrm{NTT}(g)$
$H[k] = F[k] G[k] \bmod p$
$h = \mathrm{INTT}(H) = [4, 13, 22, 15]$

That is $f \cdot g = 4 + 13x + 22x^2 + 15x^3$. Exact, no error.

Example 3: Polynomial Multiplication in the Kyber Ring ($p = 3329, N = 256$)

Multiply two polynomials in $\mathbb{Z}_{3329}[x] / (x^{256} + 1)$ via the negacyclic NTT. Since $2N = 512 \mid 3328$, a primitive 512-th root of unity $\psi$ exists. The Kyber reference implementation does:

// Convert inputs a, b to Montgomery form
// negacyclic NTT pre-twist: a'[i] = psi^i * a[i]
// Standard 256-point NTT
// Pointwise product
// Inverse 256-point NTT
// post-twist: a[i] = psi^{-i} * a'[i]

A single Kyber key-exchange polynomial multiplication runs in $\sim 10 \, \mu\mathrm{s}$ on modern CPUs.

14. FAQ

Q1. What is the difference between NTT and FFT?

The FFT uses the complex twiddle factor $e^{-j 2\pi / N}$; the NTT uses a primitive $N$-th root of unity in $\mathbb{Z}/p\mathbb{Z}$ derived from a primitive root. NTT works entirely in integer modular arithmetic, so round-off is zero. The Cooley-Tukey divide-and-conquer is identical.

Q2. What is an NTT-friendly prime?

A prime $p$ such that $p - 1$ has a large power-of-two factor. A standard example is $998244353 = 119 \cdot 2^{23} + 1$, supporting NTTs up to length $2^{23}$. Kyber uses $p = 3329 = 13 \cdot 2^8 + 1$ with $N = 256$; Dilithium uses $p = 8380417 = 2^{13} \cdot 1023 + 1$.

Q3. Where is NTT used in practice?

Multiprecision integer multiplication (GMP, sangi), polynomial multiplication, competitive programming, lattice cryptography (CRYSTALS-Kyber/Dilithium), fully homomorphic encryption (BFV/CKKS), and error-correcting codes (Reed-Solomon) — wherever exact integer arithmetic is required.

Q4. How can I convolve modulo an arbitrary modulus (e.g.\ $10^9 + 7$)?

Run NTTs under three NTT-friendly primes (e.g.\ $998244353, 985661441, 754974721$) and recombine with the Garner algorithm. Convert the result to the target modulus at the end. The total cost is roughly $3$–$4 \times$ a single NTT.

Q5. What is the negacyclic NTT?

An NTT variant that computes multiplications in $\mathbb{Z}_p[x]/(x^N + 1)$. The coefficients are pre-twisted by powers of a primitive $2N$-th root of unity $\psi$, after which a standard NTT yields the anti-cyclic convolution. Essential for Ring-LWE based lattice cryptography.

Q6. What is the Schönhage-Strassen algorithm?

A 1971 NTT-based integer multiplication algorithm that works in $\mathbb{Z}/(2^M + 1)\mathbb{Z}$ where $\omega = 2$ is a primitive $2M$-th root of unity. Multiplications by $\omega$ become bit shifts. It was the asymptotically fastest known algorithm ($O(n \log n \log \log n)$) until Harvey-van der Hoeven achieved true $O(n \log n)$ in 2019.

Q7. Is Montgomery multiplication essential?

Modular multiplication dominates the NTT hot loop, so eliminating division via Montgomery (or Barrett) reduction is the modern standard. It is 2–3$\times$ faster than naive (a * b) % p. Solinas primes such as $2^{61} - 1$ enable even further speed-ups.

Q8. How is NTT implemented in sangi?

sangi uses a 5-smooth NTT over $N = 2^a 3^b 5^c$ sizes with radix-3 and radix-5 butterflies. The NTT switch-over for big-integer multiplication triggers at $N \ge 1200$ words. See sangi Int Multiplication theory and the sangi FFT API for details.

15. References

Original Papers and Key Results

J. M. Pollard, "The fast Fourier transform in a finite field," Math. Comp., 25(114), pp. 365–374, 1971. (Mathematical prototype of the NTT)
A. Schönhage & V. Strassen, "Schnelle Multiplikation großer Zahlen," Computing, 7, pp. 281–292, 1971. (NTT-based integer multiplication)
P. L. Montgomery, "Modular multiplication without trial division," Math. Comp., 44(170), pp. 519–521, 1985.
D. Harvey & J. van der Hoeven, "Integer multiplication in time $O(n \log n)$," Annals of Math., 193(2), pp. 563–617, 2021.
V. Lyubashevsky, C. Peikert & O. Regev, "On ideal lattices and learning with errors over rings," J. ACM, 60(6), 2013. (Ring-LWE)

Books

J. von zur Gathen & J. Gerhard, Modern Computer Algebra, 3rd ed., Cambridge University Press, 2013.
D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed., Addison-Wesley, 1997.
R. E. Blahut, Fast Algorithms for Signal Processing, Cambridge University Press, 2010.
D. J. Bernstein, J. Buchmann & E. Dahmen (eds.), Post-Quantum Cryptography, Springer, 2009.

Implementation in sangi (cross-links)

sangi — Integer multiplication algorithms — Karatsuba → Toom-3/4/6/8 → 5-smooth NTT hybrid strategy and thresholds
sangi — FFT API — C++ interface for the NTT/FFT modules underlying multiprecision multiplication

Related Chapters

Discrete Fourier and FFT (chapter hub)
Discrete Fourier Transform (DFT) — the transform computed by NTT
Fast Fourier Transform (FFT) — overview
Cooley-Tukey FFT — the algorithmic ancestor of NTT
Bluestein FFT — arbitrary-length DFT; finite-field analogue exists