Number-Theoretic Transform (NTT)
A Discrete Fourier Transform over Finite Fields
Advanced
Published: 2026-03-29
1. Definition and History
The Number-Theoretic Transform (NTT) is a discrete Fourier transform defined over a finite field $\mathbb{Z}/p\mathbb{Z}$ (with $p$ prime) instead of the complex field $\mathbb{C}$. It preserves the mathematical structure of the FFT while operating entirely on integer modular arithmetic, so round-off error never arises.
Historical Timeline
| Year | Researcher | Contribution |
|---|---|---|
| 1971 | Pollard | FFT construction over a finite field — the mathematical prototype of the NTT |
| 1971 | Schönhage & Strassen | Applied NTT to multiprecision integer multiplication, achieving $O(n \log n \log \log n)$ |
| 1990s | GMP, Magma, etc. | Computer algebra systems adopted NTT as the standard for high-end integer multiplication |
| 2007 | Lyubashevsky, Peikert, Regev | Ring-LWE problem — NTT accelerates polynomial-ring arithmetic for lattice cryptography |
| 2019 | Harvey & van der Hoeven | True $O(n \log n)$ integer multiplication algorithm (theoretical end-point) |
| 2024 | NIST | CRYSTALS-Kyber and Dilithium standardised for PQC — NTT becomes the cryptographic core |
Contrast with FFT
The FFT uses the complex twiddle factor $e^{-j 2\pi / N}$ and therefore suffers from unavoidable floating-point round-off error. The NTT operates purely on integer modular arithmetic, so no round-off error is introduced. This is the decisive advantage that makes NTT indispensable for multiprecision integer multiplication and lattice-based cryptography.
Correspondence between FFT and NTT
Figure 1: NTT inherits the algorithmic structure of the FFT (Cooley-Tukey divide-and-conquer, butterfly operations) and only swaps the underlying field from the complex numbers to a finite field. Only the choice of $\omega$ differs.
2. Primitive Roots and Primitive $N$-th Roots
Primitive Root
For a prime $p$, the multiplicative group $(\mathbb{Z}/p\mathbb{Z})^* = \{1, 2, \ldots, p-1\}$ is a cyclic group of order $p - 1$. An element $g$ that generates this group is called a primitive root:
$$\{g^0, g^1, g^2, \ldots, g^{p-2}\} = \{1, 2, \ldots, p-1\} \pmod p$$The order of $g$ equals $p - 1$ (no smaller positive integer $k$ satisfies $g^k \equiv 1 \pmod p$).
Existence and Count
- Existence: a primitive root exists for every prime $p$ (Gauss, 1801)
- Number: $\varphi(p - 1)$ primitive roots, where $\varphi$ is Euler's totient
- Algorithm: try $g = 2, 3, 5, \ldots$; for each prime factor $q$ of $p - 1$ verify that $g^{(p-1)/q} \not\equiv 1 \pmod p$. A primitive root is typically found within $O(\log p)$ trials
Primitive $N$-th Roots
When $N \mid (p - 1)$, the element $\omega = g^{(p-1)/N} \bmod p$ is a primitive $N$-th root of unity in $\mathbb{Z}/p\mathbb{Z}$:
- $\omega^N \equiv 1 \pmod p$ ($N$-th root of unity)
- $\omega^k \not\equiv 1 \pmod p$ for $0 < k < N$ (primitivity)
- $\omega^{N/2} \equiv -1 \pmod p$ (the "halving" property used by the FFT butterfly)
This $\omega$ is the NTT analogue of $e^{-j 2\pi / N}$. It satisfies the same FFT properties:
- Periodicity: $\omega^{k + N} = \omega^k$
- Symmetry: $\omega^{k + N/2} = -\omega^k$
- Halving: $\omega^2$ is a primitive $(N/2)$-th root of unity
These properties license the Cooley-Tukey divide-and-conquer structure verbatim.
Example: $p = 17, N = 4$
$p - 1 = 16$. A primitive root is $g = 3$ (verify: $3^1 = 3, 3^2 = 9, 3^4 = 81 \equiv 13, 3^8 \equiv 16 \equiv -1$, so the order is 16). For $N = 4$, $\omega = 3^{16/4} = 3^4 \equiv 13 \pmod{17}$.
Check: $13^2 = 169 \equiv 169 - 9 \cdot 17 = 16 \equiv -1 \pmod{17}$, and $13^4 \equiv (-1)^2 = 1 \pmod{17}$. ✓
3. NTT Formulation and Its Correspondence with FFT
Transform Definition
The NTT of an integer sequence $\{a[n]\}_{n=0}^{N-1}$ (with $a[n] \in \mathbb{Z}/p\mathbb{Z}$) is:
The inverse transform (INTT) is:
Here $N^{-1}$ is the modular inverse of $N$ (computed as $N^{p-2} \bmod p$ via Fermat's little theorem), and $\omega^{-1} = \omega^{N-1} \bmod p$.
Matrix Form
The NTT is a linear transform and admits a matrix representation:
$$\mathbf{A} = \Omega_N \, \mathbf{a}, \quad (\Omega_N)_{k,n} = \omega^{kn} \bmod p$$$\Omega_N$ is a symmetric Vandermonde-type matrix whose inverse is $\Omega_N^{-1} = N^{-1} \cdot \overline{\Omega_N}$ (with $(\overline{\Omega_N})_{k,n} = \omega^{-kn}$) — exactly mirroring the FFT.
Convolution Theorem
NTT Convolution Theorem
For the cyclic convolution $c[n] = \sum_k a[k] b[(n-k) \bmod N] \bmod p$ over $\mathbb{Z}_p$:
$$\mathrm{NTT}(c) = \mathrm{NTT}(a) \odot \mathrm{NTT}(b) \pmod p$$(where $\odot$ denotes element-wise product). This is isomorphic to the FFT convolution theorem and underpins all fast-convolution applications of NTT.
Side-by-Side Properties
| Property | FFT | NTT |
|---|---|---|
| Linearity | $\mathrm{F}(\alpha a + \beta b) = \alpha \mathrm{F}(a) + \beta \mathrm{F}(b)$ | Same (mod $p$) |
| Shift | $\mathrm{F}(a[(n-m) \bmod N]) = \omega^{mk} A[k]$ | Same (mod $p$) |
| Convolution theorem | $\mathrm{F}(a * b) = A \odot B$ | Same (mod $p$) |
| Parseval | $\sum |a[n]|^2 = \tfrac{1}{N} \sum |A[k]|^2$ | $\sum a[n]^2 \equiv N^{-1} \sum A[k] A[-k] \pmod p$ |
4. NTT-Friendly Primes
Running an NTT of length $N = 2^m$ requires $2^m \mid (p - 1)$ (so that $\omega = g^{(p-1)/N}$ is well-defined as an integer power). A prime $p$ in which $p - 1$ contains a large power-of-two factor is called NTT-friendly:
$$p = c \cdot 2^a + 1, \quad c \text{ small odd}, \quad a \text{ large}$$Major NTT-Friendly Primes
| Prime $p$ | Factorisation of $p - 1$ | Max $N$ | Primitive root $g$ | Typical use |
|---|---|---|---|---|
| $998244353$ | $119 \cdot 2^{23}$ | $2^{23} \approx 8.4 \cdot 10^6$ | $3$ | Competitive programming standard |
| $469762049$ | $7 \cdot 2^{26}$ | $2^{26} \approx 6.7 \cdot 10^7$ | $3$ | Arbitrary-mod convolution |
| $167772161$ | $5 \cdot 2^{25}$ | $2^{25} \approx 3.4 \cdot 10^7$ | $3$ | Arbitrary-mod convolution |
| $754974721$ | $45 \cdot 2^{24}$ | $2^{24} \approx 1.7 \cdot 10^7$ | $11$ | Auxiliary in 3-prime CRT |
| $2013265921$ | $15 \cdot 2^{27}$ | $2^{27} \approx 1.3 \cdot 10^8$ | $31$ | Large-scale convolution |
| $2281701377$ | $17 \cdot 2^{27}$ | $2^{27}$ | $3$ | Same as above |
| $3221225473$ | $3 \cdot 2^{30}$ | $2^{30} \approx 1.1 \cdot 10^9$ | $5$ | 30-bit large-scale |
| $3329$ | $2^8 \cdot 13$ | $2^8 = 256$ | $3$ | CRYSTALS-Kyber |
| $8380417$ | $2^{13} \cdot 1023$ | $2^{13}$ | $10$ | CRYSTALS-Dilithium |
Bit Width versus Use Case
| Bit width of $p$ | Representative | Use case |
|---|---|---|
| $\sim 12$ bit | $3329$ (Kyber) | Lattice cryptography, small coefficients |
| $\sim 23$ bit | $8380417$ (Dilithium) | Lattice signatures |
| $\sim 30$ bit | $998244353$ | Fits 32-bit arithmetic; competitive programming |
| $\sim 31$ bit | $2013265921$, $2281701377$ | Pair-up to recover 62-bit values |
| $\sim 62$ bit | Solinas primes, $2^{61} - 1$ | Bigint multiplication using a single prime |
2-Adic Valuation of $p - 1$
The exponent $a$ such that $p - 1 = c \cdot 2^a$ with $c$ odd is the 2-adic valuation $\nu_2(p - 1) = a$. This single number determines the maximum NTT length $N_{\max} = 2^a$. Searching for NTT-friendly primes amounts to generating primes with large $\nu_2(p - 1)$.
5. Fast Modular Arithmetic (Montgomery / Barrett)
The inner loop of an NTT is dominated by "multiply plus reduce mod $p$". A naive (a * b) % p is slow because integer division is slow on modern CPUs. Production implementations therefore replace divisions entirely.
Montgomery Multiplication
Introduced by Peter L. Montgomery in 1985. Choose $R = 2^k$ with $p < R$ (typically $k = 32$ or $64$) and keep every value in Montgomery form $\tilde{a} = a R \bmod p$. The Montgomery product is:
$$\mathrm{MontMul}(\tilde{a}, \tilde{b}) = \tilde{a} \tilde{b} R^{-1} \bmod p$$executed without division (Montgomery reduction). Only a single shift by $\log_2 R$ replaces the division.
Barrett Reduction
Introduced by Paul Barrett in 1986. Pre-compute $\mu = \lfloor 2^{2k} / p \rfloor$ once; then for any $0 \le x < p^2$:
$$x \bmod p \approx x - \lfloor x \mu / 2^{2k} \rfloor \cdot p$$The result is $x \bmod p$ or $x \bmod p + p$, so a final conditional subtraction is required. Unlike Montgomery, no pre/post conversion of inputs is needed.
Comparison
| Item | Montgomery | Barrett |
|---|---|---|
| Input conversion | Required ($a \to aR$) | Not required |
| Output conversion | Required ($aR^{-1} \to a$) | Not required |
| Cost per multiply | 2 mul + 1 add/sub | 3 mul + 1 add/sub |
| Conditional branches | 1 ($p$ subtraction check) | 1-2 |
| NTT inner loop | Montgomery wins (multiplications dominate) | Conversion overhead never amortised |
| One-off modular reduction | Conversion overhead | Barrett wins |
Solinas Primes (Special-Form Primes)
Primes of the form $p = 2^a - c$ (small $c$) or $p = 2^a + 2^b - 1$ admit specialised modular reduction (Solinas, 1999):
$$x \bmod (2^a - c) \approx (x \bmod 2^a) + c \lfloor x / 2^a \rfloor$$Combined with Montgomery in the NTT hot loop, this can eliminate all division-like instructions. Mersenne primes $2^{31} - 1$ and $2^{61} - 1$ are special cases of Solinas primes.
6. Algorithm
The NTT algorithm has exactly the same structure as the Cooley-Tukey FFT — only the complex multiplications are replaced by modular ones. An in-place iterative (decimation-in-time) version:
function ntt(a, N, p, omega):
// Step 1: bit-reversal permutation
a = bit_reverse_copy(a, N)
// Step 2: layered butterflies
for s = 1 to log2(N):
m = 2^s
wm = power(omega, N/m, p) // m-th root = omega^{N/m}
for k = 0 to N-1 step m:
w = 1
for r = 0 to m/2-1:
t = (w * a[k + r + m/2]) mod p
u = a[k + r]
a[k + r] = (u + t) mod p
a[k + r + m/2] = (u - t + p) mod p
w = (w * wm) mod p
return a
Cost per Pass
- Number of butterflies: $\tfrac{N}{2} \log_2 N$ (identical to FFT)
- Per butterfly: 1 multiply + 2 add/sub (all mod $p$)
- Total: $O(N \log N)$ modular multiplications
Montgomery-Form NTT
In practice every value is held in Montgomery form, and the inner-loop w * a[...] is executed via Montgomery multiplication. Powers of $\omega$ are pre-computed in Montgomery form:
// Pre-compute Montgomery-form roots
omega_mont = (omega * R) mod p
wm_table[s] = (omega^{N/(2^s)} * R) mod p for s = 1..log2(N)
// Hot inner loop (no division anywhere)
for r = 0 to m/2-1:
t = MontMul(w, a[k+r+m/2])
a[k+r+m/2] = SubMod(a[k+r], t)
a[k+r] = AddMod(a[k+r], t)
w = MontMul(w, wm_table[s])
Parallelism
- SIMD: AVX-512
VPMULQwith Montgomery reduction supports eight-way parallel modular multiplication - GPU: warp-level implementations are published in cuFHE and elsewhere
- Multi-core: the outer loops over butterfly stages parallelise cleanly with $O(\log N)$ synchronisation barriers
The inverse NTT is obtained by replacing $\omega$ with $\omega^{-1}$ and multiplying every output by $N^{-1} \bmod p$ — perfectly symmetric to the forward transform.
7. Negacyclic NTT and $\mathbb{Z}_p[x]/(x^N+1)$
The standard NTT computes a cyclic convolution (modulo $x^N - 1$). Lattice cryptography (Ring-LWE) instead requires an anti-cyclic (negacyclic) convolution modulo $x^N + 1$. The negacyclic NTT computes this directly.
The Anti-Cyclic Convolution
In the polynomial ring $\mathbb{Z}_p[x]/(x^N + 1)$ we have $x^N \equiv -1$, so shifting coefficients flips signs:
$$x^N \cdot a_n = -a_n$$The ordinary cyclic convolution is therefore unsuitable for ring multiplication.
Construction of the Negacyclic NTT
Take a primitive $2N$-th root of unity $\psi$ (which requires $2N \mid (p - 1)$, so $\psi^N \equiv -1 \pmod p$) and "twist" the coefficients before transforming:
$$\tilde{a}_n = \psi^n \cdot a_n$$Apply the standard length-$N$ NTT to $\tilde{a}$. After pointwise multiplication, an inverse twist $\psi^{-n}$ recovers the product in $\mathbb{Z}_p[x]/(x^N + 1)$.
The twist and untwist cost $O(N)$ each — negligible compared to the $O(N \log N)$ transform.
Implementation in CRYSTALS-Kyber
Kyber operates with $p = 3329$ and $N = 256$. Since $2N = 512 \mid 3328 = 2^8 \cdot 13$, the negacyclic NTT applies directly. Choosing $\psi$ as a primitive 512-th root of unity, all arithmetic in $\mathbb{Z}_{3329}[x]/(x^{256} + 1)$ runs through the NTT.
8. Polynomial Multiplication and CRT Composition
To compute the product $h(x) = f(x) \cdot g(x)$ of polynomials of degrees $n$ and $m$, the result has degree $n + m$ and $n + m + 1$ coefficients.
Linear-Convolution Polynomial Product via NTT
- Choose a power of two $N \ge n + m + 1$
- Zero-pad the coefficient arrays of $f, g$ to length $N$ (linear convolution = cyclic convolution after padding)
- Compute $F = \mathrm{NTT}(f)$ and $G = \mathrm{NTT}(g)$
- Pointwise: $H[k] = F[k] \cdot G[k] \bmod p$ ($O(N)$)
- Recover the product coefficients $h = \mathrm{INTT}(H)$
The complexity is $O(N \log N)$ versus the direct cost $O(nm)$ — a dramatic speed-up for large polynomials.
When Coefficients Exceed the Modulus: CRT Composition
If the product coefficients $c_i$ can exceed $p$ — so that $c_i \bmod p$ collides — perform the NTT product in parallel under several NTT-friendly primes $p_1, p_2, p_3$ and recombine with the Chinese Remainder Theorem (CRT):
- Run NTT products under $p_1 = 998244353$, $p_2 = 985661441$, $p_3 = 754974721$ (or similar)
- For each coefficient $c_i$, obtain $c_i \bmod p_j$ ($j = 1, 2, 3$)
- Recover $c_i \bmod (p_1 p_2 p_3)$ via the Garner algorithm (incremental CRT)
Capacity of the 3-Prime CRT
$p_1 p_2 p_3 \approx 7.6 \times 10^{26} \approx 2^{89}$ — about 89-bit coefficients can be recovered exactly. Even with $N = 2^{20}$ inputs each of magnitude $V = 2^{32}$, the worst-case coefficient bound $c_i \le N V^2$ fits comfortably.
Garner Algorithm
Efficient 3-prime CRT recovery:
// Input: x mod p1, x mod p2, x mod p3
// Output: x mod (p1*p2*p3)
inv_12 = modinv(p1, p2)
inv_123 = modinv(p1*p2 mod p3, p3)
v1 = x_mod_p1
v2 = (x_mod_p2 - v1) * inv_12 mod p2
v3 = ((x_mod_p3 - v1) - v2*p1) * inv_123 mod p3
return v1 + v2*p1 + v3*p1*p2
Implementation in sangi
The sangi library implements a 5-smooth NTT on sizes $N = 2^a 3^b 5^c$ with radix-3 and radix-5 butterflies. The set of usable sizes is much denser than the standard $N = 2^k$ alone, so the NTT engages for a wider class of inputs. The construction follows the mixed-radix Cooley-Tukey idea applied to NTT.
9. Multiprecision Integer Multiplication (Schönhage-Strassen)
To multiply $d$-digit integers $A, B$ with NTT:
- Represent each integer as a coefficient sequence in base $B$ (e.g.\ $B = 2^{16}$ or $B = 10^4$), giving polynomials of length $L = \lceil d / \log_B \rceil$
- NTT-multiply the polynomials to obtain the "digit" coefficients of $A \cdot B$ (which may exceed $B$)
- Carry propagation: split each $c_i$ into $c_i \bmod B$ plus a carry $\lfloor c_i / B \rfloor$ into the next position
The complexity is $O(d \log d \log \log d)$ (Schönhage-Strassen), or $O(d \log d \cdot 2^{O(\log^* d)})$ (Fürer 2007), and finally $O(d \log d)$ (Harvey-van der Hoeven 2019). This is asymptotically faster than schoolbook $O(d^2)$, Karatsuba $O(d^{1.585})$, and Toom-Cook $O(d^{1.465})$.
The Schönhage-Strassen Algorithm
The classical 1971 NTT-based integer-multiplication algorithm:
- Split: divide $A, B$ into $K$ blocks, treating them as polynomials of length $L = 2K$
- Work modulo $2^M + 1$: each coefficient is $M$ bits wide, with $M = O(d^{1/2})$ a typical choice
- Length-$L$ NTT: $\omega = 2^{M/L}$ becomes a primitive $L$-th root of unity
- Coefficient products: products of $M$-bit numbers are computed recursively by the same algorithm ($\log \log d$ levels of recursion)
- Inverse NTT plus carry propagation
NTT modulo $2^M + 1$
The Schönhage-Strassen variant works in the residue ring $\mathbb{Z}/(2^M + 1)\mathbb{Z}$ instead of $\mathbb{Z}_p$. $2$ becomes a primitive $2M$-th root of unity ($2^{2M} = (2^M)^2 \equiv (-1)^2 = 1$, plus primitivity), so $\omega = 2^{2M/L}$ multiplications reduce to bit shifts. This is the algorithm's signature speed-up.
Modern Library Implementations
| Library | Multiplication strategy | NTT role |
|---|---|---|
| GMP | Karatsuba → Toom-3/4/6/8 → FFT-style (NTT) | Engaged above ~10⁴ digits |
| FLINT | NTT-centric (from medium sizes upward) | Both polynomials and integers |
| NTL | SS-Modular for big integers | Standard in cryptographic research |
| sangi | Karatsuba → Toom-3/4/6/8 → 5-smooth NTT | Engaged at $N \ge 1200$ words |
Real implementations follow a hybrid strategy: schoolbook/Karatsuba for small inputs, Toom-Cook for medium, NTT for large. Crossover thresholds depend on hardware and implementation.
10. Lattice-Based and Post-Quantum Cryptography
The NIST-standardised post-quantum cryptography (PQC) schemes CRYSTALS-Kyber and CRYSTALS-Dilithium (2024) both rely on NTT for fast polynomial-ring arithmetic.
Ring-LWE and Polynomial Arithmetic
The Ring Learning With Errors (RLWE) problem is set in the polynomial ring $R = \mathbb{Z}_p[x]/(x^N + 1)$:
$$b = a \cdot s + e \pmod{p, x^N + 1}$$Recovering the secret $s$ is computationally hard. With $a, s, e \in R$ as polynomials of degree $N - 1$ for fairly large $N$ ($256, 512, 1024$), the product $a \cdot s$ dominates the cost and is sped up to $O(N \log N)$ by the NTT.
Major PQC Schemes
| Scheme | $p$ | $N$ | Purpose | Standardisation |
|---|---|---|---|---|
| CRYSTALS-Kyber (ML-KEM) | $3329$ | $256$ | Key encapsulation | NIST FIPS 203 (2024) |
| CRYSTALS-Dilithium (ML-DSA) | $8380417$ | $256$ | Signature | NIST FIPS 204 (2024) |
| Falcon | $12289$ | $512, 1024$ | Compact signature | NIST FIPS 206 (in progress) |
| NewHope (legacy) | $12289$ | $1024$ | Key exchange | Absorbed into Kyber |
NTT in Fully Homomorphic Encryption (FHE)
BFV, BGV, and CKKS — all based on the same polynomial-ring structure — use NTT in encryption, decryption, and homomorphic multiplication. Standard libraries (Microsoft SEAL, IBM HElib, Lattigo) implement NTT pervasively.
Side-Channel Resistance and Constant-Time Implementation
NTT for cryptographic use must be implemented in constant time. Data-dependent branches or memory accesses leak secret values through timing attacks. The final conditional subtraction in Montgomery reduction is replaced by a masked branchless form (always subtract, then mask-select the right value).
11. Competitive Programming
"Convolution mod $p$" is a recurring competitive-programming theme, and the NTT is an essential technique.
Typical Problems
- Polynomial multiplication: product of degree-$10^5$ polynomials mod $998244353$
- Convolution counting: $c_k = \sum_{i+j=k} a_i b_j$
- Subset-sum counts: generating functions for combinatorial enumeration
- Formal power series: $\exp, \log, \sqrt{}, \mathrm{inv}$ truncated mod $x^n$ via Newton iteration + NTT
Arbitrary-Modulus Convolution
Problems often demand convolution under a modulus other than $998244353$ (e.g.\ $10^9 + 7$). The standard recipe is the three-prime NTT: run independent NTTs under $p_1, p_2, p_3$ and apply Garner to reconstruct the target. Total cost is roughly $3$–$4 \times$ a single NTT.
AtCoder Library Convolution
The AtCoder Library (ACL) C++ template atcoder::convolution<Mod> ships an NTT-based convolution that picks the optimal strategy automatically. It is fastest at $\mathrm{Mod} = 998244353$ but supports other moduli too.
12. NTT vs FFT Comparison
| Item | FFT (complex) | NTT (integer) |
|---|---|---|
| Underlying field | $\mathbb{C}$ (floating-point) | $\mathbb{Z}/p\mathbb{Z}$ (finite field) |
| Twiddle factor | $e^{-j 2\pi / N}$ | $g^{(p-1)/N} \bmod p$ |
| Error | $O(\sqrt{N} \, \epsilon_{\mathrm{mach}})$ | Exact (0) |
| Cost per multiply | 4 fp mul + 2 fp add | 1 int mul + mod reduce |
| SIMD parallelism | Excellent (FP units) | Good (integer units) |
| Constraint on $N$ | Power-of-two is best | Requires $N \mid (p - 1)$ |
| Upper bound on $N$ | Practically unbounded | Limited by $\nu_2(p - 1)$ |
| Coefficient domain | Continuous reals | $\{0, 1, \ldots, p - 1\}$ |
| Primary use cases | Signal processing, PDEs, imaging | Bigint, cryptography, competitive programming |
| Hardware | FP SIMD (AVX/SSE) | Integer SIMD, PQC ASICs |
| Reference libraries | FFTW, cuFFT, MKL | FLINT, NTL, ACL |
When to Choose Which
- Continuous values, approximation acceptable: FFT (signal processing, PDE solvers)
- Exact integer results required: NTT (bigint, cryptography, competitive programming)
- Arbitrary $N$ with no 2-adic constraint: FFT plus Bluestein fall-back
- Fixed $N$, speed paramount: NTT (Kyber/Dilithium)
13. Worked Examples
Example 1: Minimal NTT with $p = 5$, $N = 4$
$p = 5$, primitive root $g = 2$, $\omega = g^{(5-1)/4} = 2^1 = 2$ (verify $2^4 = 16 \equiv 1 \pmod 5$).
Transforming $a = [1, 2, 3, 4]$:
$$A[0] = (1 + 2 + 3 + 4) \bmod 5 = 10 \bmod 5 = 0$$ $$A[1] = (1 + 2 \cdot 2 + 3 \cdot 4 + 4 \cdot 3) \bmod 5 = 29 \bmod 5 = 4$$ $$A[2] = (1 + 2 \cdot 4 + 3 \cdot 1 + 4 \cdot 4) \bmod 5 = 28 \bmod 5 = 3$$ $$A[3] = (1 + 2 \cdot 3 + 3 \cdot 4 + 4 \cdot 2) \bmod 5 = 27 \bmod 5 = 2$$Result: $A = [0, 4, 3, 2]$. Every operation is integer modular arithmetic — exact, no rounding.
Example 2: Polynomial Product mod $998244353$
Multiply $f(x) = 1 + 2x + 3x^2$ by $g(x) = 4 + 5x$ mod $998244353$. The combined degree is $3$, so choose $N = 4$.
Compute $\omega = 3^{(998244353 - 1) / 4} \bmod 998244353$ (32-bit integer arithmetic suffices).
- $f = [1, 2, 3, 0]$, $g = [4, 5, 0, 0]$ (zero-padded)
- $F = \mathrm{NTT}(f), G = \mathrm{NTT}(g)$
- $H[k] = F[k] G[k] \bmod p$
- $h = \mathrm{INTT}(H) = [4, 13, 22, 15]$
That is $f \cdot g = 4 + 13x + 22x^2 + 15x^3$. Exact, no error.
Example 3: Polynomial Multiplication in the Kyber Ring ($p = 3329, N = 256$)
Multiply two polynomials in $\mathbb{Z}_{3329}[x] / (x^{256} + 1)$ via the negacyclic NTT. Since $2N = 512 \mid 3328$, a primitive 512-th root of unity $\psi$ exists. The Kyber reference implementation does:
// Convert inputs a, b to Montgomery form
// negacyclic NTT pre-twist: a'[i] = psi^i * a[i]
// Standard 256-point NTT
// Pointwise product
// Inverse 256-point NTT
// post-twist: a[i] = psi^{-i} * a'[i]
A single Kyber key-exchange polynomial multiplication runs in $\sim 10 \, \mu\mathrm{s}$ on modern CPUs.
14. FAQ
Q1. What is the difference between NTT and FFT?
The FFT uses the complex twiddle factor $e^{-j 2\pi / N}$; the NTT uses a primitive $N$-th root of unity in $\mathbb{Z}/p\mathbb{Z}$ derived from a primitive root. NTT works entirely in integer modular arithmetic, so round-off is zero. The Cooley-Tukey divide-and-conquer is identical.
Q2. What is an NTT-friendly prime?
A prime $p$ such that $p - 1$ has a large power-of-two factor. A standard example is $998244353 = 119 \cdot 2^{23} + 1$, supporting NTTs up to length $2^{23}$. Kyber uses $p = 3329 = 13 \cdot 2^8 + 1$ with $N = 256$; Dilithium uses $p = 8380417 = 2^{13} \cdot 1023 + 1$.
Q3. Where is NTT used in practice?
Multiprecision integer multiplication (GMP, sangi), polynomial multiplication, competitive programming, lattice cryptography (CRYSTALS-Kyber/Dilithium), fully homomorphic encryption (BFV/CKKS), and error-correcting codes (Reed-Solomon) — wherever exact integer arithmetic is required.
Q4. How can I convolve modulo an arbitrary modulus (e.g.\ $10^9 + 7$)?
Run NTTs under three NTT-friendly primes (e.g.\ $998244353, 985661441, 754974721$) and recombine with the Garner algorithm. Convert the result to the target modulus at the end. The total cost is roughly $3$–$4 \times$ a single NTT.
Q5. What is the negacyclic NTT?
An NTT variant that computes multiplications in $\mathbb{Z}_p[x]/(x^N + 1)$. The coefficients are pre-twisted by powers of a primitive $2N$-th root of unity $\psi$, after which a standard NTT yields the anti-cyclic convolution. Essential for Ring-LWE based lattice cryptography.
Q6. What is the Schönhage-Strassen algorithm?
A 1971 NTT-based integer multiplication algorithm that works in $\mathbb{Z}/(2^M + 1)\mathbb{Z}$ where $\omega = 2$ is a primitive $2M$-th root of unity. Multiplications by $\omega$ become bit shifts. It was the asymptotically fastest known algorithm ($O(n \log n \log \log n)$) until Harvey-van der Hoeven achieved true $O(n \log n)$ in 2019.
Q7. Is Montgomery multiplication essential?
Modular multiplication dominates the NTT hot loop, so eliminating division via Montgomery (or Barrett) reduction is the modern standard. It is 2–3$\times$ faster than naive (a * b) % p. Solinas primes such as $2^{61} - 1$ enable even further speed-ups.
Q8. How is NTT implemented in sangi?
sangi uses a 5-smooth NTT over $N = 2^a 3^b 5^c$ sizes with radix-3 and radix-5 butterflies. The NTT switch-over for big-integer multiplication triggers at $N \ge 1200$ words. See sangi Int Multiplication theory and the sangi FFT API for details.
15. References
Original Papers and Key Results
- J. M. Pollard, "The fast Fourier transform in a finite field," Math. Comp., 25(114), pp. 365–374, 1971. (Mathematical prototype of the NTT)
- A. Schönhage & V. Strassen, "Schnelle Multiplikation großer Zahlen," Computing, 7, pp. 281–292, 1971. (NTT-based integer multiplication)
- P. L. Montgomery, "Modular multiplication without trial division," Math. Comp., 44(170), pp. 519–521, 1985.
- D. Harvey & J. van der Hoeven, "Integer multiplication in time $O(n \log n)$," Annals of Math., 193(2), pp. 563–617, 2021.
- V. Lyubashevsky, C. Peikert & O. Regev, "On ideal lattices and learning with errors over rings," J. ACM, 60(6), 2013. (Ring-LWE)
Books
- J. von zur Gathen & J. Gerhard, Modern Computer Algebra, 3rd ed., Cambridge University Press, 2013.
- D. E. Knuth, The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed., Addison-Wesley, 1997.
- R. E. Blahut, Fast Algorithms for Signal Processing, Cambridge University Press, 2010.
- D. J. Bernstein, J. Buchmann & E. Dahmen (eds.), Post-Quantum Cryptography, Springer, 2009.
Implementation in sangi (cross-links)
- sangi — Integer multiplication algorithms — Karatsuba → Toom-3/4/6/8 → 5-smooth NTT hybrid strategy and thresholds
- sangi — FFT API — C++ interface for the NTT/FFT modules underlying multiprecision multiplication
Related Chapters
- Discrete Fourier and FFT (chapter hub)
- Discrete Fourier Transform (DFT) — the transform computed by NTT
- Fast Fourier Transform (FFT) — overview
- Cooley-Tukey FFT — the algorithmic ancestor of NTT
- Bluestein FFT — arbitrary-length DFT; finite-field analogue exists