Chapter 4: Fast Multiplication Algorithms
Karatsuba, Toom-Cook, FFT/NTT, and Beyond
In the previous chapter we saw that schoolbook multiplication of multiprecision integers runs in $O(n^2)$ time. This chapter studies in detail the fast multiplication algorithms that dramatically improve on this complexity: the Karatsuba algorithm ($O(n^{1.585})$), Toom-Cook ($O(n^{1.465})$ for Toom-3), and FFT-based multiplication ($O(n \log n \log \log n)$ for Schönhage-Strassen), with mathematical proofs, concrete numeric examples, and pseudocode.
Multiplication is the central operation of multiprecision arithmetic; division, square root, and elementary functions all reduce to multiplication. Speeding up multiplication is therefore the key that determines the performance of every multiprecision operation.
1. Karatsuba Algorithm
In 1962 Anatolii Karatsuba discovered a landmark algorithm that overturned Kolmogorov's conjecture that $n$-digit multiplication necessarily requires $\Omega(n^2)$ time. The idea is simple and beautiful: reduce $n$-digit multiplication to three $n/2$-digit multiplications instead of four.
1.1 Principle
Principle of the Karatsuba algorithm
Split the $n$-limb integers $A$ and $C$ into upper and lower halves (with $m = \lceil n/2 \rceil$):
$$A = A_1 \cdot B^{m} + A_0, \quad C = C_1 \cdot B^{m} + C_0$$
Here $B$ is the radix (e.g. $B = 2^{32}$ or $B = 2^{64}$), and $A_0, A_1, C_0, C_1$ are integers of at most $m$ limbs. $A_0$ and $C_0$ are the lower $m$ limbs; $A_1$ and $C_1$ are the upper limbs.
1.2 Theorem and proof
Theorem (correctness of the Karatsuba algorithm)
$A \times C$ can be computed using only three multiplications:
$$A \times C = z_2 \cdot B^{2m} + z_1 \cdot B^{m} + z_0$$
where:
- $z_0 = A_0 \times C_0$ (product of the lower halves)
- $z_2 = A_1 \times C_1$ (product of the upper halves)
- $z_1 = (A_0 + A_1)(C_0 + C_1) - z_0 - z_2$ (cross term)
Proof
Step 1: direct expansion
Multiplying $A = A_1 B^m + A_0$ and $C = C_1 B^m + C_0$ directly gives:
$$\begin{align} A \times C &= (A_1 B^m + A_0)(C_1 B^m + C_0) \\ &= A_1 C_1 \cdot B^{2m} + (A_1 C_0 + A_0 C_1) \cdot B^m + A_0 C_0 \end{align}$$
This expression contains four multiplications: $A_1 C_1$, $A_1 C_0$, $A_0 C_1$, $A_0 C_0$.
Step 2: rewriting the middle term (Karatsuba's trick)
Expanding $(A_0 + A_1)(C_0 + C_1)$:
$$(A_0 + A_1)(C_0 + C_1) = A_0 C_0 + A_0 C_1 + A_1 C_0 + A_1 C_1$$
Therefore the cross term can be expressed using $z_0$ and $z_2$ that we have already computed:
$$A_1 C_0 + A_0 C_1 = (A_0 + A_1)(C_0 + C_1) - A_0 C_0 - A_1 C_1 = z_1$$
Step 3: conclusion
Setting $z_0 = A_0 C_0$, $z_2 = A_1 C_1$, and $z_1 = (A_0 + A_1)(C_0 + C_1) - z_0 - z_2$:
$$A \times C = z_2 \cdot B^{2m} + z_1 \cdot B^m + z_0$$
The required $n/2$-limb multiplications are only the three products $z_0$, $z_2$, and $(A_0 + A_1)(C_0 + C_1)$; the rest is additions/subtractions ($O(n)$) and shifts ($O(n)$). $\square$
Comparison with schoolbook: where does the saving come from?
Schoolbook multiplication needs four multiplications: $A_0 C_0$, $A_0 C_1$, $A_1 C_0$, $A_1 C_1$. The Karatsuba algorithm avoids computing $A_0 C_1$ and $A_1 C_0$ separately, instead obtaining their sum $A_0 C_1 + A_1 C_0$ from a single multiplication, reducing the multiplication count from 4 to 3.
The trade-off is a few extra additions and subtractions, but those cost only $O(n)$ — negligible next to a recursive half-size multiplication — so the overall scheme is faster.
1.3 Complexity analysis
Theorem (complexity of the Karatsuba algorithm)
Let $T(n)$ be the time of $n$-limb multiplication. The Karatsuba algorithm satisfies:
$$T(n) = 3T(\lceil n/2 \rceil) + O(n)$$
Hence $T(n) = O(n^{\log_2 3}) = O(n^{1.585\ldots})$.
Proof of the complexity
Apply the Master Theorem. For the recurrence $T(n) = aT(n/b) + f(n)$:
- $a = 3$ (number of subproblems = number of multiplications)
- $b = 2$ (subproblem size shrinkage)
- $f(n) = O(n)$ (cost of additions/subtractions and shifts)
The comparison exponent is $\log_b a = \log_2 3 \approx 1.585$. Since $f(n) = O(n) = O(n^{\log_2 3 - \epsilon})$ with $\epsilon \approx 0.585 > 0$, the recursion dominates the combining cost.
Case 1 of the Master Theorem yields:
$$T(n) = \Theta(n^{\log_2 3}) = O(n^{1.585\ldots})$$
Compared to the schoolbook $O(n^2)$, for example with $n = 1000$ limbs:
- Schoolbook: $1000^2 = 10^6$ limb multiplications
- Karatsuba: $1000^{\log_2 3} \approx 5.7 \times 10^4$ limb multiplications (roughly 18x fewer)
The gap widens as $n$ grows. $\square$
1.4 Pseudocode
Karatsuba pseudocode
function karatsuba(A[0..n-1], C[0..n-1]):
// Base case: small integers use schoolbook
if n <= THRESHOLD:
return schoolbook_multiply(A, C)
// Split
m = ceil(n / 2)
A0 = A[0..m-1] // lower m limbs
A1 = A[m..n-1] // upper limbs
C0 = C[0..m-1] // lower m limbs
C1 = C[m..n-1] // upper limbs
// Three recursive multiplications
z0 = karatsuba(A0, C0) // A0 * C0
z2 = karatsuba(A1, C1) // A1 * C1
z1 = karatsuba(A0 + A1, C0 + C1) // (A0+A1)(C0+C1)
z1 = z1 - z0 - z2 // cross term
// Combine (shift + add)
result = z0
result += z1 << (m * BITS_PER_LIMB) // z1 * B^m
result += z2 << (2*m * BITS_PER_LIMB) // z2 * B^(2m)
return result
THRESHOLD is the cutoff below which the algorithm falls back to schoolbook multiplication; tuned values typically range from roughly 10 to 80 limbs depending on the CPU. The optimal value depends on the architecture and cache size and is determined by benchmarking.
Implementation notes
- Carry in $A_0 + A_1$: $A_0$ and $A_1$ each fit in $m$ limbs, but their sum $A_0 + A_1$ may need $m + 1$ limbs. This extra limb must be handled correctly.
- Sign of $z_1$: with the additive formula above, $z_1 = (A_0 + A_1)(C_0 + C_1) - z_0 - z_2 = A_0 C_1 + A_1 C_0 \geq 0$ always holds, so no sign handling is needed, but the sums $A_0 + A_1$ and $C_0 + C_1$ may each occupy an extra limb. Many implementations (GMP among them) therefore use the subtractive variant $z_1 = z_0 + z_2 - (A_1 - A_0)(C_1 - C_0)$ built from absolute differences, and must then track the sign of $(A_1 - A_0)(C_1 - C_0)$ explicitly.
- Multiplications of unequal sizes: For $n$-limb $\times$ $m$-limb with $n \neq m$, either zero-pad the shorter one or use an asymmetric split.
- Memory management: Allocating temporary arrays at every recursive call inflates memory usage. GMP manages temporaries on a stack-like scratch area to avoid unnecessary allocations.
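The identity is easy to exercise at machine scale. The following self-contained C sketch (illustrative names; a real multiprecision version recurses on limb arrays as in the pseudocode above) performs one Karatsuba level on 64-bit operands split at $B = 2^{32}$, using the GCC/Clang unsigned __int128 extension, and checks the result against the built-in 128-bit product:

```c
#include <stdint.h>
#include <stdio.h>

/* One Karatsuba level for 64-bit operands, split at B = 2^32.
   Returns the 128-bit product in hi:lo. Illustrative only. */
static void karatsuba64(uint64_t a, uint64_t c, uint64_t *hi, uint64_t *lo) {
    uint64_t a1 = a >> 32, a0 = a & 0xffffffffu;
    uint64_t c1 = c >> 32, c0 = c & 0xffffffffu;
    uint64_t z0 = a0 * c0;               /* low  x low  (fits 64 bits) */
    uint64_t z2 = a1 * c1;               /* high x high (fits 64 bits) */
    /* (a0+a1)(c0+c1) needs up to 66 bits, so use __int128 for the
       single cross multiplication; z1 = a0*c1 + a1*c0 >= 0. */
    unsigned __int128 z1 = (unsigned __int128)(a0 + a1) * (c0 + c1) - z0 - z2;
    unsigned __int128 p  = ((unsigned __int128)z2 << 64) + (z1 << 32) + z0;
    *hi = (uint64_t)(p >> 64);
    *lo = (uint64_t)p;
}

int main(void) {
    uint64_t hi, lo;
    uint64_t a = 0x123456789abcdef0u, c = 0xfedcba9876543210u;
    karatsuba64(a, c, &hi, &lo);
    unsigned __int128 ref = (unsigned __int128)a * c;   /* reference */
    printf("karatsuba: %016llx%016llx\n", (unsigned long long)hi,
           (unsigned long long)lo);
    printf("reference: %016llx%016llx\n", (unsigned long long)(ref >> 64),
           (unsigned long long)ref);
    return 0;
}
```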
1.5 Concrete numeric example
Example 1: $1234 \times 5678$ via Karatsuba ($B = 100$)
Use radix $B = 100$ (split into pairs of digits).
$A = 1234 = 12 \times 100 + 34$, $C = 5678 = 56 \times 100 + 78$
- $A_1 = 12$, $A_0 = 34$
- $C_1 = 56$, $C_0 = 78$
Step 1: perform the three multiplications
- $z_0 = A_0 \times C_0 = 34 \times 78 = 2652$
- $z_2 = A_1 \times C_1 = 12 \times 56 = 672$
- $(A_0 + A_1)(C_0 + C_1) = (34 + 12)(78 + 56) = 46 \times 134 = 6164$
Step 2: compute the cross term
$z_1 = 6164 - 2652 - 672 = 2840$
Step 3: combine
$$\begin{align} A \times C &= z_2 \times B^2 + z_1 \times B + z_0 \\ &= 672 \times 10000 + 2840 \times 100 + 2652 \\ &= 6720000 + 284000 + 2652 \\ &= 7006652 \end{align}$$
Verification: $1234 \times 5678 = 7006652$ ✓
Example 2: applying it recursively ($B = 10$)
Apply Karatsuba to $A = 1234$, $C = 5678$ with $B = 10$, recursing two levels.
First-level split ($m = 2$, $B^m = 100$):
- $A = 12 \times 100 + 34$, $C = 56 \times 100 + 78$
Apply Karatsuba again to $z_0 = 34 \times 78$ ($m = 1$, $B^m = 10$):
- $34 = 3 \times 10 + 4$, $78 = 7 \times 10 + 8$
- $z_0' = 4 \times 8 = 32$
- $z_2' = 3 \times 7 = 21$
- $(4+3)(8+7) = 7 \times 15 = 105$
- $z_1' = 105 - 32 - 21 = 52$
- $34 \times 78 = 21 \times 100 + 52 \times 10 + 32 = 2100 + 520 + 32 = 2652$ ✓
Similarly, $z_2 = 12 \times 56 = 672$ and $(A_0+A_1)(C_0+C_1) = 46 \times 134 = 6164$ are computed recursively.
[Figure: visualization of the Karatsuba algorithm]
2. Toom-Cook Algorithm
The Toom-Cook algorithm is a natural generalization of Karatsuba: it represents integers as polynomials and performs multiplication via polynomial interpolation. While Karatsuba uses a 2-way split with 3 evaluation points, Toom-$k$ uses a $k$-way split with $2k - 1$ evaluation points.
2.1 Principle of Toom-3
Toom-3 (Toom-Cook-3)
Split the $n$-limb integers $A$ and $C$ into three pieces (with $m = \lceil n/3 \rceil$):
$$A = A_2 \cdot R^{2} + A_1 \cdot R + A_0, \quad C = C_2 \cdot R^{2} + C_1 \cdot R + C_0$$
where $R = B^m$ ($B$ is the radix). View these as polynomials:
$$p(x) = A_2 x^2 + A_1 x + A_0, \quad q(x) = C_2 x^2 + C_1 x + C_0$$
Then $A = p(R)$, $C = q(R)$, and $A \times C = p(R) \cdot q(R) = r(R)$.
2.2 Proof of correctness
Theorem (correctness of Toom-3)
The product polynomial $r(x) = p(x) \cdot q(x)$ has degree at most 4, hence five coefficients:
$$r(x) = r_4 x^4 + r_3 x^3 + r_2 x^2 + r_1 x + r_0$$
These five coefficients can be determined from values at five points using five multiplications.
Proof
Step 1: Evaluation
Evaluate $p(x)$ and $q(x)$ at the five points $x = 0, 1, -1, 2, \infty$:
| $x$ | $p(x)$ | $q(x)$ | Cost |
|---|---|---|---|
| $0$ | $A_0$ | $C_0$ | none |
| $1$ | $A_2 + A_1 + A_0$ | $C_2 + C_1 + C_0$ | 2 additions |
| $-1$ | $A_2 - A_1 + A_0$ | $C_2 - C_1 + C_0$ | 2 add/sub |
| $2$ | $4A_2 + 2A_1 + A_0$ | $4C_2 + 2C_1 + C_0$ | shift + add |
| $\infty$ | $A_2$ (leading coefficient) | $C_2$ (leading coefficient) | none |
Each evaluated value is computed using only additions/subtractions and shifts ($O(n)$).
Derivation: what does $r(\infty)$ mean?
Formally, define it as the limit:
$$\displaystyle r(\infty) := \lim_{x \to \infty} \dfrac{r(x)}{x^{\deg r}} = \lim_{x \to \infty} \dfrac{r_0 + r_1 x + \cdots + r_4 x^4}{x^4} = r_4.$$
Likewise $p(\infty) = A_2$ and $q(\infty) = C_2$. Thus $r(\infty) = p(\infty) \cdot q(\infty) = A_2 C_2$ is the product of the leading coefficients and requires no extra multiplication (the information is already on hand). This is the advantage of using "infinity" as an evaluation point.
Step 2: Pointwise multiplication
Compute $r(x_i) = p(x_i) \cdot q(x_i)$ at each point. These are the five multiplications (executed recursively):
- $r(0) = p(0) \cdot q(0) = A_0 C_0$
- $r(1) = p(1) \cdot q(1) = (A_0 + A_1 + A_2)(C_0 + C_1 + C_2)$
- $r(-1) = p(-1) \cdot q(-1) = (A_0 - A_1 + A_2)(C_0 - C_1 + C_2)$
- $r(2) = p(2) \cdot q(2) = (A_0 + 2A_1 + 4A_2)(C_0 + 2C_1 + 4C_2)$
- $r(\infty) = A_2 C_2$
Step 3: Interpolation
Recover the coefficients $r_0, r_1, r_2, r_3, r_4$ from the five values $r(0), r(1), r(-1), r(2), r(\infty)$. This is equivalent to applying the inverse of a Vandermonde matrix:
$$\begin{pmatrix} r(0) \\ r(1) \\ r(-1) \\ r(2) \\ r(\infty) \end{pmatrix} = \begin{pmatrix} 1 & 0 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 1 \\ 1 & -1 & 1 & -1 & 1 \\ 1 & 2 & 4 & 8 & 16 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} r_0 \\ r_1 \\ r_2 \\ r_3 \\ r_4 \end{pmatrix}$$
The inverse matrix is a constant (independent of input size), so coefficient recovery requires only additions/subtractions and divisions by small constants ($O(n)$).
Step 4: Recomposition
$$A \times C = r(R) = r_4 R^4 + r_3 R^3 + r_2 R^2 + r_1 R + r_0$$
Since $R = B^m$, this is performed using shifts and additions ($O(n)$). $\square$
2.3 Concrete interpolation formula
Toom-3 interpolation formula
The closed-form formulas to recover the coefficients from the evaluated values are:
$$\begin{align} r_0 &= r(0) \\ r_4 &= r(\infty) \\ t_1 &= \dfrac{r(1) - r(-1)}{2} \quad \text{(odd part)} \\ t_2 &= \dfrac{r(1) + r(-1)}{2} - r(0) \quad \text{(even part minus } r_0 \text{)} \\ r_3 &= \dfrac{r(2) - r(0) - 4t_2 - 2t_1}{6} - 2r_4 \\ r_1 &= t_1 - r_3 \\ r_2 &= t_2 - r_4 \end{align}$$
$t_1$ and $t_2$ are intermediate variables. Importantly, division by 2 is a bit shift, and division by 6 decomposes into divisions by 2 and 3. Division by 3 can be implemented as a multiplication by a precomputed inverse.
Derivation: where does the interpolation formula come from?
Substituting the evaluation points into $r(x) = r_0 + r_1 x + r_2 x^2 + r_3 x^3 + r_4 x^4$ gives a system of equations:
$$\displaystyle r(0) = r_0, \quad r(\infty) = r_4, \quad r(1) = r_0+r_1+r_2+r_3+r_4, \quad r(-1) = r_0-r_1+r_2-r_3+r_4.$$
Step 1 (parity separation): $r(1) \pm r(-1)$ separates the even-degree and odd-degree terms.
$$\dfrac{r(1) - r(-1)}{2} = r_1 + r_3 \;(\equiv t_1), \quad \dfrac{r(1) + r(-1)}{2} - r_0 = r_2 + r_4 \;(\equiv t_2).$$
Step 2 (extracting $r_3$): from $r(2) = r_0 + 2r_1 + 4r_2 + 8r_3 + 16r_4$, subtract $r_0$, $4t_2 = 4r_2 + 4r_4$, and $2t_1 = 2r_1 + 2r_3$ to leave $6r_3 + 12r_4$. Dividing by 6 leaves $r_3 + 2r_4$, and subtracting $2r_4$ gives $r_3$.
Step 3: $r_1 = t_1 - r_3$ and $r_2 = t_2 - r_4$ recover the rest. This is exactly the result of applying the inverse Vandermonde matrix. $\square$
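The interpolation sequence is easy to sanity-check numerically. The C sketch below (arbitrary small coefficients, illustrative only) runs evaluation, pointwise multiplication, and interpolation, then compares against the schoolbook coefficients of $p(x)q(x)$:

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* p(x) = a2 x^2 + a1 x + a0, q(x) = c2 x^2 + c1 x + c0 */
    int64_t a0 = 34, a1 = 27, a2 = 12, c0 = 78, c1 = 51, c2 = 56;

    /* Steps 1-2: evaluate at 0, 1, -1, 2, infinity and multiply pointwise */
    int64_t r_0   = a0 * c0;                                  /* r(0)   */
    int64_t r_1   = (a0 + a1 + a2) * (c0 + c1 + c2);          /* r(1)   */
    int64_t r_m1  = (a0 - a1 + a2) * (c0 - c1 + c2);          /* r(-1)  */
    int64_t r_2   = (a0 + 2*a1 + 4*a2) * (c0 + 2*c1 + 4*c2);  /* r(2)   */
    int64_t r_inf = a2 * c2;                                  /* r(inf) */

    /* Step 3: interpolation (note the -2*r4 term) */
    int64_t r0 = r_0, r4 = r_inf;
    int64_t t1 = (r_1 - r_m1) / 2;          /* r1 + r3; division is exact */
    int64_t t2 = (r_1 + r_m1) / 2 - r0;     /* r2 + r4 */
    int64_t r3 = (r_2 - r0 - 4*t2 - 2*t1) / 6 - 2*r4;
    int64_t r1 = t1 - r3, r2 = t2 - r4;

    /* compare against direct coefficient multiplication */
    printf("interpolated: %lld %lld %lld %lld %lld\n",
           (long long)r0, (long long)r1, (long long)r2,
           (long long)r3, (long long)r4);
    printf("schoolbook:   %lld %lld %lld %lld %lld\n",
           (long long)(a0*c0), (long long)(a0*c1 + a1*c0),
           (long long)(a0*c2 + a1*c1 + a2*c0),
           (long long)(a1*c2 + a2*c1), (long long)(a2*c2));
    return 0;
}
```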
2.4 Complexity analysis
Theorem (complexity of Toom-3)
From the recurrence $T(n) = 5T(n/3) + O(n)$, the Master Theorem gives:
$$T(n) = \Theta(n^{\log_3 5}) \approx O(n^{1.465})$$
This is asymptotically faster than Karatsuba's $O(n^{1.585})$.
2.5 General Toom-$k$
Comparison of general Toom-$k$
With a $k$-way split, the product polynomial has degree at most $2(k-1)$, so $2k - 1$ evaluation points are required.
| Method | Split $k$ | Multiplications $2k-1$ | Complexity | Exponent |
|---|---|---|---|---|
| Karatsuba (Toom-2) | 2 | 3 | $O(n^{\log_2 3})$ | $\approx 1.585$ |
| Toom-3 | 3 | 5 | $O(n^{\log_3 5})$ | $\approx 1.465$ |
| Toom-4 | 4 | 7 | $O(n^{\log_4 7})$ | $\approx 1.404$ |
| Toom-$k$ | $k$ | $2k-1$ | $O(n^{\log_k(2k-1)})$ | $\to 1$ |
As $k \to \infty$ the exponent approaches $1$, but the constant factor of the interpolation step (number of additions, subtractions, and small-integer divisions) grows as $O(k^2)$, so in practice only moderate splits pay off — commonly Toom-3 and Toom-4, with GMP additionally using Toom-6h and Toom-8h at intermediate sizes. Beyond that, FFT-based methods are more advantageous.
Why $2k - 1$ multiplications suffice
The product of two polynomials of degree $k - 1$ has degree $2(k - 1) = 2k - 2$. A polynomial of degree $2k - 2$ has $2k - 1$ coefficients and, by Lagrange's interpolation theorem, is uniquely determined by its values at $2k - 1$ points. Therefore, computing $2k - 1$ evaluation values (each via an $n/k$-limb multiplication) suffices to recover the coefficients. $\square$
3. FFT Multiplication
In Toom-Cook, increasing $k$ pushes the asymptotic exponent toward $1$, but the interpolation overhead grows. Using the Fast Fourier Transform (FFT), one can attain essentially $O(n \log n)$ time for multiplication.
3.1 Multiplication and convolution
Integer multiplication as polynomial multiplication
An $n$-limb integer $A = \displaystyle\sum_{i=0}^{n-1} a_i B^i$ can be formally identified with the polynomial $A(x) = \displaystyle\sum_{i=0}^{n-1} a_i x^i$. The product $P = A \times C$ of two integers is the polynomial product $P(x) = A(x) \cdot C(x)$ evaluated at $x = B$ (with carry propagation).
The coefficients of the product polynomial are given by the convolution:
$$p_k = \displaystyle\sum_{i+j=k} a_i c_j = \displaystyle\sum_{i=0}^{k} a_i c_{k-i}$$
Direct computation costs $O(n^2)$, but using FFT with the convolution theorem reduces it to $O(n \log n)$.
3.2 The convolution theorem
Convolution theorem
For length-$N$ sequences $\{a_i\}$ and $\{c_i\}$ with convolution $\{p_k\}$:
$$\mathcal{F}(a * c) = \mathcal{F}(a) \cdot \mathcal{F}(c)$$where $\mathcal{F}$ denotes the discrete Fourier transform (DFT) and the $\cdot$ on the right is pointwise multiplication. Hence:
$$a * c = \mathcal{F}^{-1}\bigl(\mathcal{F}(a) \cdot \mathcal{F}(c)\bigr)$$
Since the DFT is computed by FFT (Cooley-Tukey) in $O(N \log N)$, the entire convolution takes $O(N \log N)$.
3.3 The FFT multiplication algorithm
FFT multiplication procedure
- Zero-padding: extend the coefficient sequences of $A$ and $C$ to length $N \geq 2n$ (rounded up to a power of two), filling the high-order positions with zeros. This is required to obtain a linear convolution rather than a cyclic one.
- Forward transform: compute $\hat{A} = \text{FFT}(A)$ and $\hat{C} = \text{FFT}(C)$ (each in $O(N \log N)$).
- Pointwise product: compute $\hat{P}_k = \hat{A}_k \cdot \hat{C}_k$ for $k = 0, 1, \ldots, N-1$ (in $O(N)$).
- Inverse transform: $P = \text{IFFT}(\hat{P})$ recovers the convolution (in $O(N \log N)$).
- Carry propagation: when $p_k \geq B$, keep $p_k \bmod B$ and add $\lfloor p_k / B \rfloor$ to the higher position (in $O(N)$).
Example: conceptual flow of FFT multiplication ($A = (3, 2, 1)$, $C = (6, 5, 4)$, $B = 10$)
Multiply $A = 123$ and $C = 456$ in radix $B = 10$.
Step 1: zero-padding (extend to length $N = 8$)
- $a = (3, 2, 1, 0, 0, 0, 0, 0)$
- $c = (6, 5, 4, 0, 0, 0, 0, 0)$
Steps 2-3: FFT → pointwise product → IFFT
(The convolution yields the following coefficient sequence.)
- $p = (18, 27, 28, 13, 4, 0, 0, 0)$
Verification: $p_0 = 3 \times 6 = 18$, $p_1 = 3 \times 5 + 2 \times 6 = 27$, $p_2 = 3 \times 4 + 2 \times 5 + 1 \times 6 = 28$, $p_3 = 2 \times 4 + 1 \times 5 = 13$, $p_4 = 1 \times 4 = 4$.
Step 4: carry propagation ($B = 10$)
- $p_0 = 18$: keep $8$, carry $1$ → $(8, 28, 28, 13, 4)$
- $p_1 = 28$: keep $8$, carry $2$ → $(8, 8, 30, 13, 4)$
- $p_2 = 30$: keep $0$, carry $3$ → $(8, 8, 0, 16, 4)$
- $p_3 = 16$: keep $6$, carry $1$ → $(8, 8, 0, 6, 5)$
Result: $56088$ (most significant first: $5, 6, 0, 8, 8$). $123 \times 456 = 56088$ ✓
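The FFT only accelerates the convolution, so the rest of the pipeline can be verified with a direct $O(n^2)$ convolution in its place. A minimal C sketch reproducing the example above:

```c
#include <stdio.h>

int main(void) {
    int a[3] = {3, 2, 1}, c[3] = {6, 5, 4};   /* 123 and 456 in radix 10 */
    int p[8] = {0};
    /* convolution p_k = sum of a_i * c_j over i + j = k; this O(n^2)
       loop computes exactly the sequence the FFT path would produce */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            p[i + j] += a[i] * c[j];
    /* Step 4: carry propagation in radix B = 10 */
    for (int k = 0; k < 7; k++) {
        p[k + 1] += p[k] / 10;
        p[k] %= 10;
    }
    for (int k = 7; k >= 0; k--) printf("%d", p[k]);  /* prints 00056088 */
    printf("\n");
    return 0;
}
```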
3.4 Floating-point FFT issues and remedies
The rounding error problem
In a standard FFT using complex numbers, computations of $\cos$ and $\sin$ and the butterfly operations of the Fourier transform accumulate floating-point rounding errors. An $N$-point FFT incurs about $O(\log N)$ bits of error, so guaranteeing a correct result requires one of the following remedies:
- Splitting limbs: using full 64-bit limbs makes intermediate values huge, so each limb is split into smaller pieces of 16-20 bits before applying the FFT. This lowers the precision required, often making double-precision floating-point sufficient.
- Multiprecision floating-point FFT: perform the FFT using multiprecision floating-point (e.g. MPFR). Rounding error is fully controlled, but at a constant-factor overhead.
- Integer FFT (NTT): the most fundamental remedy, eliminating rounding error entirely. Detailed in the next subsection.
3.5 NTT (Number-Theoretic Transform)
NTT (Number-Theoretic Transform)
Instead of the complex field $\mathbb{C}$, perform the FFT over the finite field $\mathbb{Z}/p\mathbb{Z}$. With $p$ a prime and $g$ a primitive root of $\mathbb{Z}/p\mathbb{Z}$, define
$$\omega = g^{(p-1)/N} \bmod p$$which is an element of order $N$ satisfying $\omega^N \equiv 1 \pmod{p}$ and $\omega^k \not\equiv 1 \pmod{p}$ for $0 < k < N$. This $\omega$ plays the role of $e^{2\pi i/N}$ in the FFT.
Derivation: why does $\omega$ have order $N$?
We assume $N$ divides $p - 1$ (the necessary condition for an NTT-friendly prime).
Step 1: since $g$ is a primitive root, its order is $p - 1$, i.e. $g^{p-1} \equiv 1$ and $g^k \not\equiv 1$ for $0 < k < p - 1$.
Step 2: $\omega^N = (g^{(p-1)/N})^N = g^{p-1} \equiv 1 \pmod{p}$.
Step 3: assume $\omega^k \equiv 1$ for some $0 < k < N$. Then $g^{k(p-1)/N} \equiv 1$. Since $g$ has order $p - 1$, this implies $(p - 1) \mid k(p-1)/N$, hence $N \mid k$, contradicting $k < N$. $\square$
The property "$\omega^{N/2} \equiv -1$" needed by the FFT follows similarly: $\omega^{N/2} = g^{(p-1)/2}$ (for even $N$), and since $g$ is a primitive root, $g^{(p-1)/2}$ is a square root of $1$ other than $1$, hence $g^{(p-1)/2} \equiv -1 \pmod{p}$.
All operations are integer modular arithmetic, so no rounding error occurs at all.
Example: polynomial multiplication via NTT ($p = 17$, $N = 4$)
Use modulus $p = 17$ and $N = 4$ NTT points. Since $p - 1 = 16 = 4 \times 4$, $N = 4$ divides $p - 1$.
First find $\omega$. $g = 3$ is a primitive root of $17$ ($3^{16} \equiv 1 \pmod{17}$, $3^8 \equiv 16 \not\equiv 1$).
$$\omega = 3^{(17-1)/4} = 3^4 = 81 \equiv 81 - 4 \times 17 = 81 - 68 = 13 \pmod{17}$$
Check: $\omega^4 = 13^4 = 28561 = 1680 \times 17 + 1$, so $\omega^4 \equiv 1 \pmod{17}$ ✓
$\omega^2 = 13^2 = 169 = 9 \times 17 + 16 \equiv 16 \equiv -1 \pmod{17}$ ✓ (confirming order 4).
Compute the product of $A(x) = 2x + 3$ and $C(x) = 4x + 1$ via NTT.
$a = (3, 2, 0, 0)$, $c = (1, 4, 0, 0)$ (after zero-padding).
NTT (forward transform): $\hat{a}_k = \displaystyle\sum_{j=0}^{3} a_j \omega^{jk} \pmod{17}$
- $\hat{a}_0 = 3 + 2 + 0 + 0 = 5$
- $\hat{a}_1 = 3 + 2 \cdot 13 + 0 + 0 = 3 + 26 = 29 \equiv 12 \pmod{17}$
- $\hat{a}_2 = 3 + 2 \cdot 16 + 0 + 0 = 35 \equiv 1 \pmod{17}$
- $\hat{a}_3 = 3 + 2 \cdot 13^3 + 0 + 0 = 3 + 2 \cdot (13 \cdot 16) = 3 + 2 \cdot 208 \equiv 3 + 2 \cdot 4 = 11 \pmod{17}$
$\hat{a} = (5, 12, 1, 11)$. Likewise, from $\hat{c}_k = 1 + 4\omega^k$: $\hat{c} = (5, 2, 14, 0)$.
Pointwise product:
- $\hat{p}_0 = 5 \times 5 = 25 \equiv 8 \pmod{17}$
- $\hat{p}_1 = 12 \times 2 = 24 \equiv 7 \pmod{17}$
- $\hat{p}_2 = 1 \times 14 = 14 \pmod{17}$
- $\hat{p}_3 = 11 \times 0 = 0 \pmod{17}$
INTT (inverse transform): $p_k = N^{-1} \displaystyle\sum_{j=0}^{3} \hat{p}_j \omega^{-jk} \pmod{17}$
$N^{-1} = 4^{-1} \pmod{17}$. From $4 \times 13 = 52 = 3 \times 17 + 1$, $4^{-1} \equiv 13 \pmod{17}$.
$\omega^{-1} = 13^{-1} \pmod{17}$. From $13 \times 4 = 52 \equiv 1$, $\omega^{-1} \equiv 4 \pmod{17}$.
- $p_0 = 13 \times (8 + 7 + 14 + 0) = 13 \times 29 \equiv 13 \times 12 = 156 \equiv 3 \pmod{17}$
- $p_1 = 13 \times (8 + 7 \cdot 4 + 14 \cdot 4^2 + 0) = 13 \times 260 \equiv 13 \times 5 = 65 \equiv 14 \pmod{17}$
Computing $p_2$ and $p_3$ the same way gives $p = (3, 14, 8, 0)$.
$A(x) \cdot C(x) = 8x^2 + 14x + 3$ ($= (2x+3)(4x+1) = 8x^2 + 2x + 12x + 3 = 8x^2 + 14x + 3$) ✓
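The whole example fits in a few lines of C. The sketch below uses a naive $O(N^2)$ transform, which is enough to reproduce the numbers above (a real implementation would use the $O(N \log N)$ butterfly structure):

```c
#include <stdint.h>
#include <stdio.h>

#define P 17u
#define N 4

/* Naive O(N^2) transform: out[k] = sum_j in[j] * w^(j*k) mod P. */
static void ntt(const uint32_t *in, uint32_t *out, uint32_t w) {
    uint32_t wk = 1;                       /* w^k */
    for (int k = 0; k < N; k++) {
        uint32_t s = 0, wjk = 1;           /* (w^k)^j = w^(j*k) */
        for (int j = 0; j < N; j++) {
            s = (s + in[j] * wjk) % P;
            wjk = wjk * wk % P;
        }
        out[k] = s;
        wk = wk * w % P;
    }
}

int main(void) {
    uint32_t a[N] = {3, 2, 0, 0}, c[N] = {1, 4, 0, 0};
    uint32_t ha[N], hc[N], hp[N], p[N];
    ntt(a, ha, 13);                        /* forward: omega = 13          */
    ntt(c, hc, 13);                        /* ha=(5,12,1,11), hc=(5,2,14,0) */
    for (int k = 0; k < N; k++) hp[k] = ha[k] * hc[k] % P;
    ntt(hp, p, 4);                         /* inverse: omega^-1 = 4        */
    for (int k = 0; k < N; k++) p[k] = p[k] * 13 % P; /* scale by N^-1 = 13 */
    printf("p = (%u, %u, %u, %u)\n", p[0], p[1], p[2], p[3]); /* (3,14,8,0) */
    return 0;
}
```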
NTT-friendly primes
FFT requires sequence length $N = 2^k$, so primes whose $p - 1$ contains a large power of two are convenient. Common choices:
| Prime $p$ | Factorization | Maximum $N = 2^k$ | Use |
|---|---|---|---|
| $998244353$ | $119 \times 2^{23} + 1$ | $2^{23}$ | competitive programming |
| $2^{64} - 2^{32} + 1$ | $2^{32}(2^{32} - 1) + 1$ | $2^{32}$ | 64-bit environments |
A single prime restricts the value range, so practical multiprecision multiplication uses NTT modulo several primes and combines the results via the Chinese Remainder Theorem (CRT). For instance, three 30-bit NTT-friendly primes can recover convolution coefficients of up to about $3 \times 30 = 90$ bits.
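As a sketch of the combining step, Garner's form of the CRT for two residues looks as follows in C (the primes and the test value are illustrative; unsigned __int128 is a GCC/Clang extension):

```c
#include <stdint.h>
#include <stdio.h>

/* Modular exponentiation, used here only to obtain p1^-1 mod p2
   via Fermat's little theorem. */
static uint64_t powmod(uint64_t b, uint64_t e, uint64_t m) {
    unsigned __int128 r = 1, x = b % m;
    while (e) {
        if (e & 1) r = r * x % m;
        x = x * x % m;
        e >>= 1;
    }
    return (uint64_t)r;
}

/* Garner-style CRT for two primes: x < p1*p2 with
   x = r1 (mod p1) and x = r2 (mod p2). */
static unsigned __int128 crt2(uint64_t r1, uint64_t p1,
                              uint64_t r2, uint64_t p2) {
    uint64_t inv = powmod(p1 % p2, p2 - 2, p2);          /* p1^-1 mod p2 */
    uint64_t t = (uint64_t)((unsigned __int128)(r2 + p2 - r1 % p2)
                            % p2 * inv % p2);
    return (unsigned __int128)t * p1 + r1;
}

int main(void) {
    /* a convolution coefficient of ~2^50, seen modulo two NTT primes */
    uint64_t p1 = 998244353, p2 = 1004535809;
    unsigned __int128 truth = (unsigned __int128)1 << 50;
    unsigned __int128 x = crt2((uint64_t)(truth % p1), p1,
                               (uint64_t)(truth % p2), p2);
    printf("%s\n", x == truth ? "recovered" : "mismatch");
    return 0;
}
```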
3.6 The Schönhage-Strassen algorithm
Schönhage-Strassen algorithm (1971)
This is the landmark algorithm published in 1971 by A. Schönhage and V. Strassen. It performs the FFT over the integer ring $\mathbb{Z}/(2^N + 1)\mathbb{Z}$, called the Fermat number ring.
Key properties of this ring:
- $2^{2N} = (2^N)^2 \equiv (-1)^2 = 1 \pmod{2^N + 1}$
- Therefore $\omega = 2$ has order $2N$ and plays the role of a $2N$-th root of unity.
- Multiplication by $\omega = 2$ is realized purely as a bit shift.
- Reduction modulo $2^N + 1$ is performed quickly with bit operations and sign flipping.
The complexity is $O(n \log n \log \log n)$, and for nearly 50 years this was the asymptotically fastest known multiplication algorithm. GMP implements it as the workhorse for very large integer multiplications.
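To see concretely why multiplication by $\omega = 2$ costs only shifts, the following C sketch multiplies by $2^k$ modulo $2^{16} + 1$ using a shift, a mask, and a subtraction (the parameters are illustrative; real implementations work on limb arrays with much larger $N$):

```c
#include <stdint.h>
#include <stdio.h>

#define NBITS 16
#define MOD ((1u << NBITS) + 1)   /* 2^16 + 1 = 65537 */

/* x * 2^k mod (2^N + 1) with shifts only: since 2^N = -1 in this ring,
   the high part of the shifted value is subtracted instead of added. */
static uint32_t mul_pow2_mod(uint32_t x, unsigned k) {
    uint64_t y  = (uint64_t)x << k;
    uint32_t lo = (uint32_t)(y & ((1u << NBITS) - 1));
    uint32_t hi = (uint32_t)(y >> NBITS);
    return (lo + MOD - hi % MOD) % MOD;
}

int main(void) {
    uint32_t x = 12345;
    /* check the shift-based reduction against plain modular arithmetic */
    for (unsigned k = 0; k <= 20; k++) {
        uint64_t ref = ((uint64_t)x << k) % MOD;
        printf("k=%2u  shift=%5u  ref=%5llu\n",
               k, (unsigned)mul_pow2_mod(x, k), (unsigned long long)ref);
    }
    return 0;
}
```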
Why Schönhage-Strassen is $O(n \log n \log \log n)$
An $n$-bit multiplication is partitioned into roughly $\sqrt{n}$ blocks of $\sqrt{n}$ bits each, and an FFT of length about $\sqrt{n}$ is applied over the ring. Multiplications by the root of unity $\omega = 2$ inside the butterflies are mere shifts; the expensive part is the $\sqrt{n}$ pointwise products, each a multiplication of $O(\sqrt{n})$-bit numbers, to which the same algorithm is applied recursively. This gives the recurrence
$$T(n) = \sqrt{n} \cdot T(O(\sqrt{n})) + O(n \log n)$$
which solves to $T(n) = O(n \log n \log \log n)$. The $\log \log n$ factor is the recursion depth: $\log n$ halves at each level (since $\log \sqrt{n} = \tfrac{1}{2} \log n$), so the depth is $O(\log \log n)$.
3.7 The Harvey-van der Hoeven result (2019)
An $O(n \log n)$ multiplication algorithm
In 2019, David Harvey and Joris van der Hoeven announced an integer multiplication algorithm with complexity $O(n \log n)$ (published in Annals of Mathematics in 2021). This matches the conjectured lower bound $\Omega(n \log n)$ (each output bit may depend on every input bit); if that conjecture holds, the algorithm is asymptotically optimal.
The algorithm has a multilayered recursive structure and uses sophisticated algebraic techniques (Bluestein's chirp-z transform, multidimensional FFTs, etc.) to remove the $\log \log n$ factor. The constant factor, however, is so large that it is estimated to outperform Schönhage-Strassen only above $2^{2^{12}} \approx 10^{1233}$ bits, and no practical implementation has emerged.
4. Comparison and Algorithm Switching
4.1 Complexity comparison
| Method | Complexity | Range (rough GMP guideline) | Constant factor |
|---|---|---|---|
| Schoolbook | $O(n^2)$ | ∼ up to 10 limbs | very small |
| Karatsuba | $O(n^{1.585})$ | ∼ 10-100 limbs | small |
| Toom-3 | $O(n^{1.465})$ | ∼ 100-300 limbs | moderate |
| Toom-4 / Toom-6 / Toom-8 | $O(n^{1.404})$ and below | ∼ 300 to a few thousand limbs | somewhat large |
| Schönhage-Strassen (FFT) | $O(n \log n \log \log n)$ | several thousand limbs and up | large |
| Harvey-van der Hoeven | $O(n \log n)$ | $\gg 10^{1000}$ bits | astronomical |
4.2 Automatic algorithm switching
GMP's algorithm selection strategy
GMP automatically selects the optimal algorithm based on the input size. Crossover thresholds are tuned per CPU architecture by the tune program, which benchmarks each algorithm to find the size at which one becomes faster than the other.
Below is a typical GMP threshold setting for x86-64 with 64-bit limbs:
/* Example from mpn/x86_64/gmp-mparam.h */
#define MUL_TOOM22_THRESHOLD 20 /* schoolbook -> Karatsuba */
#define MUL_TOOM33_THRESHOLD 65 /* Karatsuba -> Toom-3 */
#define MUL_TOOM44_THRESHOLD 166 /* Toom-3 -> Toom-4 */
#define MUL_TOOM6H_THRESHOLD 222 /* -> Toom-6 */
#define MUL_TOOM8H_THRESHOLD 309 /* -> Toom-8 */
#define MUL_FFT_THRESHOLD 4736 /* -> FFT (Schönhage-Strassen) */
Even an asymptotically faster algorithm has a constant-factor overhead, so for small inputs simpler algorithms win. This idea of "switching by size" is a core principle in the design of practical multiprecision libraries.
4.3 Asymmetric multiplication
Multiplications between integers of different sizes
For $n$-limb $\times$ $m$-limb with $n \gg m$, simply zero-padding the shorter operand to the same size is inefficient. Better approaches:
- $n \approx m$: apply the algorithms above directly.
- $n \gg m$: split $A$ into blocks of $m$ limbs, multiply each block by $C$, and sum the partial results — a block decomposition that GMP's mpn_mul applies automatically.
Concretely, with $A = A_{k-1} B^{(k-1)m} + \cdots + A_1 B^m + A_0$,
$$A \times C = \displaystyle\sum_{i=0}^{k-1} (A_i \times C) B^{im}$$
Each $A_i \times C$ is an $m$-limb $\times$ $m$-limb multiplication (executed with the optimal algorithm for that size), and the results are added at the appropriate offsets.
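A sketch of this block decomposition using GMP's mpz layer (the function name is illustrative, and the splitting is done at bit granularity for brevity; GMP itself works on limb arrays in the mpn layer):

```c
#include <gmp.h>
#include <stdio.h>

/* Multiply a huge A by a much smaller C by processing A in blocks of
   the same bit length as C; each block product is then a balanced
   multiplication for which GMP picks the best algorithm. */
void mul_blockwise(mpz_t out, const mpz_t A, const mpz_t C) {
    mp_bitcnt_t w = mpz_sizeinbase(C, 2);      /* block width in bits */
    mpz_t rest, block, t;
    mpz_init_set(rest, A);
    mpz_inits(block, t, NULL);
    mpz_set_ui(out, 0);
    for (mp_bitcnt_t off = 0; mpz_sgn(rest) != 0; off += w) {
        mpz_fdiv_r_2exp(block, rest, w);       /* low w bits of A     */
        mpz_fdiv_q_2exp(rest, rest, w);        /* shift A down        */
        mpz_mul(t, block, C);                  /* balanced multiply   */
        mpz_mul_2exp(t, t, off);               /* shift into position */
        mpz_add(out, out, t);
    }
    mpz_clears(rest, block, t, NULL);
}

int main(void) {
    mpz_t A, C, r, ref;
    mpz_inits(A, C, r, ref, NULL);
    mpz_ui_pow_ui(A, 7, 5000);                 /* ~14000-bit A */
    mpz_ui_pow_ui(C, 3, 500);                  /* ~800-bit C   */
    mul_blockwise(r, A, C);
    mpz_mul(ref, A, C);
    printf("%s\n", mpz_cmp(r, ref) == 0 ? "match" : "mismatch");
    mpz_clears(A, C, r, ref, NULL);
    return 0;
}
```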
[Figure: visualization of FFT multiplication]
5. GPU Parallelism and Multiprecision Multiplication
A GPU (Graphics Processing Unit) is a processor specialized in massively parallel processing on thousands to tens of thousands of cores. The GPU has delivered revolutionary performance for deep learning and scientific computing — can it also be applied to multiprecision computation? This section examines the possibilities and challenges of GPU-based multiprecision multiplication, taking the GPU's architectural characteristics into account.
5.1 GPU architecture and the SIMT model
NVIDIA GPUs are based on the SIMT (Single Instruction, Multiple Threads) model. A warp of 32 threads executes the same instruction simultaneously, and thousands of threads on an SM (Streaming Multiprocessor) cooperate. This structure is best suited to problems with high data parallelism — where the same operation is applied to a large amount of data.
Key GPU parameters (NVIDIA A100):
- FP64 throughput: about 9.7 TFLOPS
- INT32 throughput: about 19.5 TOPS
- Memory bandwidth: 2,039 GB/s (HBM2e)
- SMs: 108, each with 64 CUDA cores
- Shared memory: up to 164 KB per SM
Two primary approaches exist for porting multiprecision computation to a GPU:
- One number = many threads: parallelize a single huge multiprecision number's operations across many threads. FFT/NTT multiplication is the canonical example.
- One number = one thread: each thread handles an independent moderate-precision multiprecision number (a few hundred to a few thousand bits). Monte Carlo methods and cryptographic operations are typical.
5.2 FFT/NTT multiplication on GPU
FFT is intrinsically data-parallel and a good fit for the GPU. The cuFFT library executes double-precision FFT very fast and can serve as the foundation for multiprecision multiplication.
For multiprecision integer multiplication, however, NTT (Number-Theoretic Transform) is more appropriate. NTT is an FFT over the finite field $\mathbb{F}_p$, with no floating-point rounding errors. The basic structure of an NTT implementation on the GPU is:
// Sketch of NTT on GPU (CUDA pseudocode); mulmod/addmod/submod are
// assumed modular-arithmetic helpers. One kernel launch per butterfly
// stage: __syncthreads() only synchronizes a single block, so the
// grid-wide barrier between stages is the kernel launch boundary itself.
__global__ void ntt_stage(uint64_t* a, int n, int len,
                          const uint64_t* w_table, uint64_t p) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n / 2) {
        // Cooley-Tukey butterfly on the pair (i, j) at distance len
        int i = (tid / len) * (2 * len) + (tid % len);
        int j = i + len;
        uint64_t u = a[i];
        uint64_t v = mulmod(a[j], w_table[...], p);  // stage-dependent twiddle
        a[i] = addmod(u, v, p);
        a[j] = submod(u, v, p);
    }
}
// Host side, one launch per stage:
// for (int len = 1; len < n; len <<= 1)
//     ntt_stage<<<blocks, threads>>>(d_a, n, len, d_w_table, p);
The performance of NTT multiplication on the GPU depends heavily on memory access patterns. At each butterfly stage, the access stride changes as 1, 2, 4, ..., so early stages have non-contiguous access and are prone to bank conflicts. Effective use of shared memory and careful data layout are key to performance.
5.3 Challenges of multiprecision computation on GPUs
GPU multiprecision computation faces several intrinsic challenges arising from the architecture.
(a) Sequential nature of carry propagation
The carry of multiprecision addition is intrinsically sequential. In the worst case, the carry propagates through all $n$ digits (e.g. $999\ldots9 + 1 = 1000\ldots0$). On a CPU this is a simple loop; in the GPU's parallel model it is a poor fit.
The main mitigations are:
- Redundant representation: leave slack in each limb to defer carry propagation; only the final normalization is sequential.
- Parallel prefix addition: Brent-Kung or Kogge-Stone algorithms determine all carries in $O(\log n)$ stages.
- Carry-lookahead principle: compute Generate/Propagate bits in parallel and resolve carries in $O(\log n)$.
Complexity of parallel prefix addition:
Carry propagation across $n$ digits can be resolved in $O(\log n)$ stages, but each stage needs $O(n)$ processors. The total work is $O(n \log n)$, larger than the $O(n)$ of the sequential version. With enough GPU threads it can still finish faster in wall-clock time, but for small limb counts (a few hundred digits or less) the overhead dominates.
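A sketch of the Kogge-Stone idea in plain C, where each inner loop stands in for one parallel step (one GPU thread per limb, names illustrative); the worst-case input shows the carry crossing all limbs in $\log_2 N$ scan rounds:

```c
#include <stdint.h>
#include <stdio.h>

#define N 8

int main(void) {
    uint64_t a[N], b[N], s[N], g[N], pr[N];
    /* worst case: 0xFF...F + 1, the carry must cross every limb */
    for (int i = 0; i < N; i++) { a[i] = UINT64_MAX; b[i] = 0; }
    b[0] = 1;

    for (int i = 0; i < N; i++) {          /* parallel: one thread per limb */
        s[i]  = a[i] + b[i];
        g[i]  = (s[i] < a[i]);             /* limb i generates a carry  */
        pr[i] = (s[i] == UINT64_MAX);      /* limb i propagates a carry */
    }
    for (int d = 1; d < N; d <<= 1) {      /* log2(N) scan rounds */
        uint64_t g2[N], p2[N];
        for (int i = 0; i < N; i++) {      /* parallel step */
            g2[i] = (i >= d) ? (g[i] | (pr[i] & g[i - d])) : g[i];
            p2[i] = (i >= d) ? (pr[i] & pr[i - d])         : pr[i];
        }
        for (int i = 0; i < N; i++) { g[i] = g2[i]; pr[i] = p2[i]; }
    }
    /* after the scan, g[i] is the carry out of limbs 0..i */
    for (int i = N - 1; i >= 1; i--) s[i] += g[i - 1];
    for (int i = N - 1; i >= 0; i--)
        printf("%016llx ", (unsigned long long)s[i]);
    printf("carry-out=%llu\n", (unsigned long long)g[N - 1]); /* zeros, 1 */
    return 0;
}
```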
(b) Warp divergence
In the SIMT model, performance drops when threads of a warp take different branches (warp divergence). In multiprecision computation, this penalty arises when number sizes are uneven or conditional branches occur. For example, GCD computation has a data-dependent termination condition, making GPU parallelization difficult.
(c) Memory bandwidth bottleneck
Multiprecision multiplication has relatively low arithmetic intensity. An $n$-limb multiplication performs $O(n \log n)$ operations over $O(n)$ memory accesses, giving an arithmetic intensity of about $O(\log n)$. Since GPU compute throughput far exceeds memory bandwidth (the A100 can issue roughly 5 double-precision FLOPs in the time it takes to move one byte), small to medium multiprecision numbers tend to be memory-bound.
5.4 Existing libraries and implementations
Several research projects and libraries exist for GPU multiprecision computation.
| Library | Approach | Target precision | Notes |
|---|---|---|---|
| CGBN | one number = warp | 32-32768 bits | From NVIDIA Research (NVlabs). Cooperative limb arithmetic within a warp. Optimized for cryptographic computation. |
| CAMPARY | extended double-double | up to a few hundred digits | Multiprecision is represented as an array of doubles. Uses the GPU's FP units directly. |
| CUMP | GMP-compatible | arbitrary precision | Mimics the GMP API on CUDA. A research prototype. |
| cuFFT + custom NTT | FFT/NTT | very large numbers ($10^6$ digits and up) | Specialized for very large multiplications. Used in some $\pi$-computation records. |
5.5 Performance characteristics and applicability
Where the GPU has an advantage over the CPU, and where it does not.
| Scenario | GPU advantage | Reason |
|---|---|---|
| Many independent operations of the same precision (cryptography, Monte Carlo) | very high | High data parallelism, perfect fit for SIMT. |
| A single multiplication of a very large number ($10^6$ digits and above) | very high | NTT butterflies parallelize massively. |
| A single operation on a medium-size number ($10^3$-$10^5$ digits) | limited | CPU-GPU transfer overhead dominates. |
| Algorithms with strong sequential dependencies (GCD, division iteration) | low | Hard to parallelize. Warp divergence too. |
| Branching-heavy precision control (adaptive precision) | low | Different precision per thread breaks the SIMT model. |
5.6 Future outlook for GPU multiprecision computation
At present, GPUs have not reached the stage of fully replacing CPU-based multiprecision libraries (GMP, MPFR). Nevertheless, the following technological advances steadily expand the GPU's domain of applicability.
- Unified Memory: reduces CPU-GPU transfer overhead, making medium-size multiprecision computations natural to offload.
- Warp-level primitives: instructions like __shfl_sync exchange data inside a warp, enabling fast carry propagation across registers.
- Tensor Core applications: research uses Tensor Cores (specialized for matrix multiplication) for partial-product computation in integer multiplication. INT8/INT4 Tensor Cores are available.
- Chiplets and multi-GPU: NVLink/NVSwitch high-speed inter-GPU communication allows distributing very large FFTs across multiple GPUs.
In conclusion, GPU multiprecision computation greatly outperforms CPUs in the niches of "many independent operations" or "a single very large multiplication". On the other hand, efficiently realizing all features of a general-purpose multiprecision library on the GPU remains difficult because of the intrinsic sequentiality of carry propagation and warp divergence. The most realistic approach is the CPU-GPU hybrid — the CPU handles sequential control and small-scale operations while large multiplications and large batches of independent operations are offloaded to the GPU.
6. Distributed Parallelism (MPI/OpenMP)
The previous section's GPU parallelism exploited intra-node parallelism. For computations of billions to trillions of digits (e.g. world-record $\pi$ runs), distributed parallelism across multiple nodes is essential. This section surveys the use of MPI (Message Passing Interface) and OpenMP in multiprecision computation.
6.1 Thread parallelism with OpenMP
OpenMP is a multithreading framework for shared-memory environments. Where OpenMP is effective in multiprecision computation:
| Target | Parallelization | Expected benefit |
|---|---|---|
| FFT/NTT butterflies | independent butterflies in each stage with #pragma omp parallel for | speedup approaching the core count (until memory-bound) |
| Karatsuba/Toom-Cook recursive split | #pragma omp task to run subproducts in parallel | good utilization of cores at shallow recursion levels |
| Independent branches in Binary Splitting | parallel tasks for the left and right subtrees | about 2-4x speedup for constant computation |
| Many independent modular operations | distribute the per-prime CRT computations across threads | nearly perfect parallel speedup |
void karatsuba_parallel(mpz_t result, const mpz_t a, const mpz_t b, int depth) {
size_t n = MAX(mpz_size(a), mpz_size(b));
if (n < KARATSUBA_THRESHOLD) {
mpz_mul(result, a, b); /* below the threshold, plain mpz_mul */
return;
}
/* Split: a = a1 * B + a0, b = b1 * B + b0 */
size_t half = n / 2;
mpz_t a0, a1, b0, b1, z0, z1, z2;
/* ... split processing ... */
if (depth < MAX_PARALLEL_DEPTH) {
    #pragma omp task shared(z0)
    karatsuba_parallel(z0, a0, b0, depth + 1); /* z0 = a0 * b0 */
    #pragma omp task shared(z2)
    karatsuba_parallel(z2, a1, b1, depth + 1); /* z2 = a1 * b1 */
    /* z1 = (a0+a1)(b0+b1): sum into temporaries, because a0/b0 are
       still being read by the two tasks above */
    mpz_t sa, sb;
    mpz_init(sa); mpz_init(sb);
    mpz_add(sa, a0, a1);
    mpz_add(sb, b0, b1);
    karatsuba_parallel(z1, sa, sb, depth + 1);
    #pragma omp taskwait
    mpz_sub(z1, z1, z0);    /* z1 -= z0 + z2 -> cross term */
    mpz_sub(z1, z1, z2);
    mpz_clear(sa); mpz_clear(sb);
} else {
/* fall back to sequential execution */
karatsuba_parallel(z0, a0, b0, depth + 1);
karatsuba_parallel(z2, a1, b1, depth + 1);
/* ... */
}
/* result = z2 * B^2 + z1 * B + z0 */
/* ... combine ... */
}
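Note that OpenMP tasks need an enclosing parallel region with a single producer thread; a typical call site looks like:

```c
#pragma omp parallel
#pragma omp single nowait
karatsuba_parallel(result, a, b, 0);
```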
6.2 Distributed-memory parallelism with MPI
For computations larger than a single machine's memory or to accelerate across many nodes, distributed-memory parallelism with MPI is required. The main MPI patterns for multiprecision multiplication:
- Distributed FFT/NTT: split the huge limb arrays across nodes and combine FFT butterflies with inter-node communication (MPI_Alltoall). Per-stage communication is $O(N/P)$ ($P$ being the number of processes), but all-to-all latency tends to dominate.
- CRT-based distributed multiplication: assign different NTT primes to different nodes, run NTT multiplication independently within each node, and combine via CRT. Communication is needed only at the final combine, giving very high communication efficiency (see the sketch after this list).
- Distributed Binary Splitting: assign high-level branches of the recursion tree to different nodes and combine the results in the final stage.
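A schematic of the CRT-based pattern in C with MPI, intended for three ranks (ntt_multiply_mod and crt_combine are hypothetical placeholders for the local per-prime work and the final reconstruction):

```c
#include <mpi.h>
#include <stdint.h>
#include <stdlib.h>

#define NLIMBS (1 << 16)

/* Hypothetical local workers: a per-prime NTT multiplication and the
   final Chinese-Remainder reconstruction. */
extern void ntt_multiply_mod(uint64_t *res, const uint64_t *a,
                             const uint64_t *b, int n, uint64_t prime);
extern void crt_combine(uint64_t *out, const uint64_t *residues,
                        const uint64_t *primes, int nprimes, int n);

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* run with mpirun -np 3 */

    static uint64_t a[NLIMBS], b[NLIMBS], res[NLIMBS], out[NLIMBS];
    uint64_t primes[3] = {998244353u, 1004535809u, 469762049u};
    /* ... every rank loads (or receives via MPI_Bcast) a and b ... */

    /* communication-free local work: multiply modulo this rank's prime */
    ntt_multiply_mod(res, a, b, NLIMBS, primes[rank % 3]);

    /* single gather at the end; contrast with the per-stage
       MPI_Alltoall of a distributed FFT */
    uint64_t *all = NULL;
    if (rank == 0) all = malloc((size_t)size * NLIMBS * sizeof *all);
    MPI_Gather(res, NLIMBS, MPI_UINT64_T,
               all, NLIMBS, MPI_UINT64_T, 0, MPI_COMM_WORLD);
    if (rank == 0) { crt_combine(out, all, primes, size, NLIMBS); free(all); }

    MPI_Finalize();
    return 0;
}
```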
6.3 y-cruncher's parallelization strategy
y-cruncher, which has set many world records in $\pi$ computation, is a practical model of parallelization for multiprecision computation. It uses the following multi-level parallelism.
- SIMD: AVX-512 vectorization of NTT butterflies
- Multithreading: core-level parallelism via OpenMP / a custom thread pool
- Disk swapping: data that does not fit in memory is paged to disk and processed using out-of-core algorithms
- NUMA optimization: data placement aware of memory affinity
Consider distributed NTT for an $N = 10^{12}$-digit (one trillion digit) multiplication on $P = 16$ nodes. At about 19 decimal digits per 64-bit limb, that is roughly $5 \times 10^{10}$ limbs of 8 bytes each — about $4 \times 10^{11}$ bytes (400 GB) in total.
The per-stage all-to-all is $O(N/P) = 25$ GB per node. Of the $\log_2 N \approx 40$ FFT stages, only $\log_2 P = 4$ require inter-node communication, so the total communication is about $100$ GB per node. On a 100 Gbps network this takes about $8$ seconds and may become the bottleneck.
By contrast, in CRT-based distribution each node runs its NTT multiplication independently and only the final combine needs $O(N)$ communication. Without all-to-all communication, the communication efficiency is much better.
Summary
This chapter studied the main algorithms for fast multiplication of multiprecision integers.
- Karatsuba algorithm (1962): 2-way split, 3 multiplications, $O(n^{1.585})$. The most fundamental fast multiplication, reducing $n$-digit multiplication to three $n/2$-digit multiplications. The idea of cutting from 4 to 3 multiplications relies on obtaining the cross term $A_0 C_1 + A_1 C_0$ from the single multiplication $(A_0 + A_1)(C_0 + C_1) - A_0 C_0 - A_1 C_1$.
- Toom-Cook algorithm: $k$-way split with polynomial interpolation, $O(n^{\log_k(2k-1)})$. A generalization of Karatsuba understandable in the framework of "evaluation → pointwise multiplication → interpolation → recomposition." Larger $k$ lowers the exponent, but the interpolation overhead grows; in practice the useful range ends around Toom-4 (with GMP additionally using Toom-6h/Toom-8h at intermediate sizes).
- FFT multiplication: by the convolution theorem, multiplication reduces to pointwise multiplication in the frequency domain, computing the convolution in $O(n \log n)$.
- NTT (number-theoretic transform): an FFT over a finite field that completely eliminates rounding error. Essential for integer multiplication.
- Schönhage-Strassen (1971): an FFT over the Fermat number ring achieving $O(n \log n \log \log n)$. Core of GMP's large multiplication.
- Harvey-van der Hoeven (2019): the theoretically optimal $O(n \log n)$, but with such a large constant factor that it is not practical.
- Algorithm switching: automatically picking the optimal algorithm based on input size is critically important in the design of practical multiprecision libraries.
- GPU parallelism: GPUs greatly outperform CPUs on large batches of independent multiprecision operations and on very large NTT multiplications, but generic application remains challenging due to sequential carry propagation and warp divergence. A CPU-GPU hybrid is the realistic answer.
- Distributed parallelism: at the trillion-digit scale, MPI-based distributed FFT/NTT and CRT-based distributed multiplication are essential. OpenMP thread parallelism is effective for the recursive splits of Karatsuba and for Binary Splitting; y-cruncher applies a multilayered combination of SIMD, multithreading, disk swapping, and NUMA awareness.
The next chapter builds on these multiplication algorithms to handle advanced multiprecision integer operations: GCD, modular arithmetic, modular exponentiation, and primality testing.
Implementation in calx
The algorithms in this chapter are available in the FFT module of calx.