Int Multiplication Algorithms
Overview
Multiplication of Int values automatically selects the optimal algorithm based on the operand word count.
The user never needs to choose an algorithm explicitly.
For small word counts the low-overhead Basecase is used,
and as the size grows the algorithm switches to asymptotically faster methods.
One word = 64 bits, corresponding to roughly 19 decimal digits. For example, a 100-digit number occupies about 6 words, so Basecase is used.
For usage details, see Int API Reference — Arithmetic Operators.
Algorithm Selection and Thresholds
calx determines its thresholds based on empirical benchmarks. The following table shows the algorithm selected for $a \times b$ based on $\min(\text{words}(a), \text{words}(b))$.
| Word count | Algorithm | Complexity |
|---|---|---|
| $< 22$ | Basecase (long multiplication) | $O(n^2)$ |
| $22$–$139$ | Karatsuba | $O(n^{1.585})$ |
| $140$–$199$ | Toom-Cook-3 | $O(n^{1.465})$ |
| $200$–$599$ | Toom-Cook-4 | $O(n^{1.404})$ |
| $600$–$1\,049$ | Toom-Cook-6 | $O(n^{1.338})$ |
| $\geq 1\,050$ | NTT (Schönhage–Strassen) | $O(n \log n \log \log n)$ |
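The table can be read as a simple threshold dispatch on the smaller operand's word count. The sketch below is illustrative only: the constant and function names are hypothetical, not calx internals; only the threshold values come from the table above.

```python
# Illustrative threshold dispatch for a x b, keyed on min(words(a), words(b)).
# Threshold values are from the table above; the names are hypothetical.
MUL_KARATSUBA_THRESHOLD = 22
MUL_TOOMCOOK3_THRESHOLD = 140
MUL_TOOMCOOK4_THRESHOLD = 200
MUL_TOOMCOOK6_THRESHOLD = 600
MUL_NTT_THRESHOLD = 1050

def select_algorithm(words_a: int, words_b: int) -> str:
    n = min(words_a, words_b)
    if n < MUL_KARATSUBA_THRESHOLD:
        return "basecase"
    if n < MUL_TOOMCOOK3_THRESHOLD:
        return "karatsuba"
    if n < MUL_TOOMCOOK4_THRESHOLD:
        return "toom3"
    if n < MUL_TOOMCOOK6_THRESHOLD:
        return "toom4"
    if n < MUL_NTT_THRESHOLD:
        return "toom6"
    return "ntt"
```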
The Toom-Cook-3 range of $140$–$199$ is narrow because
the calx Toom-Cook-4 implementation has a smaller constant factor than Toom-Cook-3.
Toom-Cook-3 splits the input into 3 parts and performs 5 multiplications,
while Toom-Cook-4 splits into 4 parts and performs 7 multiplications;
since $\log_4 7 \approx 1.404 < \log_3 5 \approx 1.465$, Toom-Cook-4 gains more per level of splitting.
Furthermore, calx uses `addmul_1`-based fixed-size evaluation and
`submul_1`-based interpolation to greatly reduce
the overhead of Toom-Cook-4.
The thresholds were determined by benchmarks on an AMD Ryzen Threadripper PRO 5995WX (Zen 3) with warm-up and cool-down periods to account for CPU cache state and clock frequency stabilization. Optimal thresholds may differ on other microarchitectures (Intel, ARM, etc.). In the future, calx plans to offer an auto-tuning mechanism that optimizes thresholds on the user's own hardware.
Basecase (Long Multiplication)
This is the same long multiplication algorithm taught in school, operating on 64-bit words. An $n$-word $\times$ $m$-word multiplication consists of $n \times m$ word multiplications and additions.
$$c = a \times b \quad \Rightarrow \quad c_k = \sum_{i+j=k} a_i \cdot b_j + \text{carry}$$
Complexity: $O(nm)$ (where $n$ and $m$ are the respective word counts)
The implementation uses the MULX instruction (BMI2)
to perform 64×64→128-bit multiplication in a single instruction.
The inner-loop `addmul_1` routine is optimized in hand-written MASM assembly,
using MULX + ADCX/ADOX to
parallelize the carry chain.
Below 22 words, the overhead of divide-and-conquer exceeds the $O(n^2)$ cost of Basecase, making Basecase the fastest choice.
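A word-oriented sketch of the schoolbook loop, with Python integers standing in for the 64-bit hardware multiply (the inner loop here is a behavioral model of the `addmul_1` step, not the assembly):

```python
MASK64 = (1 << 64) - 1

def basecase_mul(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook multiplication of little-endian 64-bit word arrays."""
    c = [0] * (len(a) + len(b))
    for j, bj in enumerate(b):
        carry = 0
        for i, ai in enumerate(a):
            t = c[i + j] + ai * bj + carry   # one 64x64 -> 128-bit step
            c[i + j] = t & MASK64
            carry = t >> 64
        c[j + len(a)] = carry                # carry-out of this row
    return c

def words_to_int(w: list[int]) -> int:
    return sum(x << (64 * i) for i, x in enumerate(w))
```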
Related article: Basic Operations on Arbitrary-Precision Integers
Karatsuba Algorithm
A divide-and-conquer method discovered by Karatsuba in 1962. It reduces an $n$-digit multiplication to 3 multiplications of $n/2$ digits plus additions and subtractions.
Split $a$ and $b$ into upper and lower halves:
$$a = a_1 \cdot B + a_0, \quad b = b_1 \cdot B + b_0 \quad (B = 2^{32n})$$
Ordinary long multiplication requires 4 half-size multiplications, but Karatsuba constructs the result from 3 products:
$$z_0 = a_0 \cdot b_0, \quad z_2 = a_1 \cdot b_1, \quad z_1 = (a_0 + a_1)(b_0 + b_1) - z_0 - z_2$$
$$a \cdot b = z_2 \cdot B^2 + z_1 \cdot B + z_0$$
Complexity: $O(n^{\log_2 3}) = O(n^{1.585})$
The trade-off is fewer multiplications (3 instead of 4) at the cost of more additions and subtractions. At small sizes the overhead outweighs the savings. calx switches to Karatsuba at 22 words or more.
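The recursion can be sketched in a few lines over Python integers. This splits at bit granularity rather than calx's word granularity, and uses an arbitrary cutoff in place of the 22-word threshold:

```python
def karatsuba(a: int, b: int, cutoff: int = 64) -> int:
    """Karatsuba multiplication of non-negative integers (illustrative sketch)."""
    if a.bit_length() <= cutoff or b.bit_length() <= cutoff:
        return a * b                        # basecase
    k = max(a.bit_length(), b.bit_length()) // 2
    a1, a0 = a >> k, a & ((1 << k) - 1)     # a = a1*B + a0, B = 2^k
    b1, b0 = b >> k, b & ((1 << k) - 1)
    z0 = karatsuba(a0, b0)
    z2 = karatsuba(a1, b1)
    z1 = karatsuba(a0 + a1, b0 + b1) - z0 - z2
    return (z2 << (2 * k)) + (z1 << k) + z0
```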
Related article: Fast Multiplication Algorithms
Toom-Cook-3
Splits the input into 3 parts and reconstructs the result from 5 partial products. This is a generalization of Karatsuba: for a $k$-way split, $2k-1$ multiplications suffice.
The inputs are viewed as polynomials $a = a_2 x^2 + a_1 x + a_0$, $b = b_2 x^2 + b_1 x + b_0$ ($x = B^{n/3}$), and multiplication is treated as polynomial multiplication. Values are evaluated at 5 points ($0, 1, -1, 2, \infty$), and the coefficients are recovered by interpolation from the partial products.
Complexity: $O(n^{\log_3 5}) = O(n^{1.465})$
The overhead of the interpolation step (divisions and additions/subtractions) is larger than Karatsuba's, so it is disadvantageous at smaller sizes. calx switches at 140 words or more.
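The evaluation/interpolation pipeline can be sketched over Python integers, using the same five points as above. This is an illustrative sketch (bit-granularity splits, arbitrary cutoff), not the calx word-array implementation; all interpolation divisions are exact:

```python
def toom3(a: int, b: int, cutoff: int = 256) -> int:
    """Toom-Cook-3 sketch: evaluation at 0, 1, -1, 2, inf."""
    if a < 0 or b < 0:                      # signs can arise from the -1 point
        sign = -1 if (a < 0) != (b < 0) else 1
        return sign * toom3(abs(a), abs(b), cutoff)
    if a.bit_length() <= cutoff or b.bit_length() <= cutoff:
        return a * b
    k = max(a.bit_length(), b.bit_length()) // 3 + 1
    mask = (1 << k) - 1
    a0, a1, a2 = a & mask, (a >> k) & mask, a >> (2 * k)
    b0, b1, b2 = b & mask, (b >> k) & mask, b >> (2 * k)
    # evaluate both polynomials at the 5 points and multiply pointwise
    v0   = toom3(a0, b0, cutoff)                                   # x = 0
    v1   = toom3(a0 + a1 + a2, b0 + b1 + b2, cutoff)               # x = 1
    vm1  = toom3(a0 - a1 + a2, b0 - b1 + b2, cutoff)               # x = -1
    v2   = toom3(a0 + 2*a1 + 4*a2, b0 + 2*b1 + 4*b2, cutoff)       # x = 2
    vinf = toom3(a2, b2, cutoff)                                   # x = inf
    # interpolate c(x) = c4 x^4 + ... + c0 (every division is exact)
    c0, c4 = v0, vinf
    c2 = (v1 + vm1) // 2 - c0 - c4
    t  = (v1 - vm1) // 2                    # = c1 + c3
    c3 = ((v2 - c0 - 4*c2 - 16*c4) // 2 - t) // 3
    c1 = t - c3
    return c0 + (c1 << k) + (c2 << (2*k)) + (c3 << (3*k)) + (c4 << (4*k))
```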
Related article: Fast Multiplication Algorithms — Toom-Cook Method
Toom-Cook-4
Splits the input into 4 parts and reconstructs the result from 7 partial products.
The inputs are represented as degree-3 polynomials, and values are evaluated at 7 points. The evaluation points are $0, 1, -1, 2, -2, 3, \infty$.
Complexity: $O(n^{\log_4 7}) = O(n^{1.404})$
Implementation Optimizations
The calx Toom-Cook-4 implementation incorporates the following optimizations:
- `addmul_1`-based evaluation: Polynomial values at each evaluation point are computed using fixed $k$-limb `addmul_1`-based batch processing, reducing temporary buffer usage and `normalized_size` calls.
- `submul_1`-based interpolation: Divisions in the interpolation step are reduced to `submul_1` and exact division by small constants such as `divexact_by3` and `divexact_by5`, completely avoiding general-purpose division.
These optimizations give Toom-Cook-4 a large constant-factor advantage over Toom-Cook-3, which is why the Toom-Cook-3 range of $140$–$199$ words is so narrow. Toom-Cook-4 is used in the $200$–$599$ word range.
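The two limb-level primitives can be modeled in a few lines. This is a behavioral model of the word-array operations only, not the assembly; the `lo`/`hi` decomposition mirrors the 64×64→128-bit hardware multiply:

```python
MASK64 = (1 << 64) - 1

def addmul_1(rp: list[int], up: list[int], v: int) -> int:
    """rp[0:len(up)] += up * v; returns the carry-out word."""
    carry = 0
    for i, u in enumerate(up):
        t = rp[i] + u * v + carry
        rp[i] = t & MASK64
        carry = t >> 64
    return carry

def submul_1(rp: list[int], up: list[int], v: int) -> int:
    """rp[0:len(up)] -= up * v; returns the borrow-out word."""
    borrow = 0
    for i, u in enumerate(up):
        p = u * v + borrow
        lo, hi = p & MASK64, p >> 64
        if rp[i] < lo:                  # subtracting lo underflows this limb
            hi += 1
        rp[i] = (rp[i] - lo) & MASK64
        borrow = hi
    return borrow

def words_to_int(w: list[int]) -> int:
    return sum(x << (64 * i) for i, x in enumerate(w))
```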
Related article: Fast Multiplication Algorithms — Toom-Cook Method
Toom-Cook-6
Splits the input into 6 parts and reconstructs the result from 11 partial products ($2k - 1 = 11$ for a $k = 6$-way split). The polynomials are evaluated at 11 points, e.g. $0, \pm1, \pm2, \pm3, \pm4, 5, \infty$.
Complexity: $O(n^{\log_6 11}) = O(n^{1.338})$
Asymptotically faster than Toom-Cook-4 ($O(n^{1.404})$), it is used in the $600$–$1{,}049$ word range.
Like Toom-Cook-4, it employs `addmul_1`-based evaluation and exact division by small constants for interpolation.
Above $1{,}050$ words NTT surpasses Toom-Cook-6, so its effective range is relatively narrow.
NTT (Schönhage–Strassen)
A convolution-based multiplication using the Number Theoretic Transform (NTT). Published by Schönhage and Strassen in 1971. It realizes integer multiplication as "polynomial multiplication → pointwise products in the transform domain via the convolution theorem → inverse transform."
Ordinary FFT operates over complex numbers and thus incurs rounding errors, but NTT operates over the residue ring $\mathbb{Z}/(B^F+1)$, so no rounding errors occur at all. This is why NTT is well-suited for integer multiplication.
Complexity: $O(n \log n \log \log n)$
Residue Ring $\mathbb{Z}/(B^F+1)$
The NTT in calx operates over a residue ring modulo the Fermat-style number $B^F + 1$ ($B = 2^{64}$). The input is split into $M = 2^l$ pieces, each of $K$ words, with arithmetic performed in a ring of $F$ words. $\omega = B^{2F/M}$ is used as a primitive $M$-th root of unity, so twiddle factor multiplication requires only bit shifts by powers of 2.
Piecewise Cost Model
NTT performance depends heavily on the choice of transform length $M$ and ring size $F$. calx uses a piecewise cost model to select optimal parameters:
- Pointwise multiplication cost: The actual cost of the multiplication algorithm (Basecase/Karatsuba/Toom-Cook) used for a given $F$ is estimated piecewise.
- Butterfly cost: The cost of $M \cdot l$ butterfly operations (addition, subtraction, and bit shift mod $B^F+1$).
- The best parameters are searched around an initial estimate of $l$ (where $M \approx \sqrt{N}$).
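The search can be sketched schematically. Both cost functions below are stand-ins (calx uses measured piecewise costs, and its ring-size rule is not public in this section), so only the shape of the search is meaningful:

```python
import math

def ntt_params(n_words: int, search_radius: int = 2):
    """Pick (l, M, F) minimizing a toy cost model around M ~ sqrt(N)."""

    def pointwise_cost(f):        # stand-in for the measured piecewise mul cost
        return f ** 1.5

    def butterfly_cost(m, l, f):  # M*l butterflies, each O(F) add/sub/shift work
        return m * l * f

    l0 = max(1, round(math.log2(max(2, n_words)) / 2))   # M = 2^l ~ sqrt(N)
    best = None
    for l in range(max(1, l0 - search_radius), l0 + search_radius + 1):
        m = 1 << l
        k = -(-n_words // m)              # ceil: words per piece
        f = 2 * k + 1                     # toy ring size large enough for products
        cost = m * pointwise_cost(f) + butterfly_cost(m, l, f)
        if best is None or cost < best[0]:
            best = (cost, l, m, f)
    return best[1:]
```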
Other Implementation Details
- `thread_local` buffers: Working memory is kept in `thread_local` storage to avoid memory allocation/deallocation costs on every call.
- Quantization of $F$: When $M > 128$, $F$ is rounded up to a multiple of $M/128$ so that twiddle factor bit shifts land on integer word boundaries.
The switch to NTT occurs at $1{,}050$ words or more (approximately $20{,}000$ decimal digits). In the $1{,}050$–$1{,}249$ word range, Double FFT is tried first; at $1{,}250$ words or more, Prime NTT is used directly. For numbers with hundreds of thousands of digits or more, NTT vastly outperforms all other algorithms.
Related article: Fast Multiplication Algorithms — FFT Multiplication
Unbalanced Multiplication
When the operands differ greatly in size (e.g., 1,000 words × 100 words), the larger operand is split into blocks of the same size as the smaller one, and the products of each block are summed — an iterative block multiplication approach.
$$a \cdot b = \sum_{i=0}^{k-1} a_i \cdot b \cdot B^{i \cdot m} \quad (a_i \text{ is an } m\text{-word block})$$
calx switches to unbalanced multiplication when $\text{words}(a) \geq 2 \cdot \text{words}(b)$. Each $a_i \cdot b$ is a balanced multiplication for which the optimal algorithm is applied. This ensures efficient processing even for unbalanced inputs.
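The block decomposition can be sketched directly over Python integers, with a `mul` parameter standing in for the balanced-multiplication dispatch (plain `*` here):

```python
WORD_BITS = 64

def unbalanced_mul(a: int, b: int, mul=lambda x, y: x * y) -> int:
    """Multiply by splitting the larger operand a into blocks the size of b."""
    m = max(1, -(-b.bit_length() // WORD_BITS))   # words(b), rounded up
    block = WORD_BITS * m
    mask = (1 << block) - 1
    result, i = 0, 0
    while a:
        a_i = a & mask                            # next m-word block of a
        result += mul(a_i, b) << (i * block)      # each a_i * b is balanced
        a >>= block
        i += 1
    return result
```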
Related article: Unbalanced Toom-Cook Multiplication
Squaring Optimizations
Squaring ($a \times a$) differs from general multiplication in that both operands are identical, allowing symmetry to be exploited.
Basecase Squaring
Since $a_i \cdot a_j$ and $a_j \cdot a_i$ are equal, each cross term with $i \neq j$ only needs to be computed once and doubled. This reduces the number of multiplications to roughly $n^2/2$, yielding approximately 30% speedup.
$$a^2 = \sum_i a_i^2 \cdot B^{2i} + 2 \sum_{i < j} a_i \cdot a_j \cdot B^{i+j}$$
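The formula above translates into a three-pass word-array sketch: accumulate each cross term $a_i a_j$ ($i < j$) once, double the whole accumulator, then add the diagonal squares $a_i^2$. Python ints model the 64-bit words; this is an illustrative sketch, not the calx routine:

```python
MASK64 = (1 << 64) - 1

def basecase_sqr(a: list[int]) -> list[int]:
    """Square a little-endian 64-bit word array, computing each cross term once."""
    n = len(a)
    c = [0] * (2 * n)
    # off-diagonal terms a_i * a_j (i < j), each accumulated once ...
    for i in range(n):
        carry = 0
        for j in range(i + 1, n):
            t = c[i + j] + a[i] * a[j] + carry
            c[i + j] = t & MASK64
            carry = t >> 64
        c[i + n] = carry
    # ... then the accumulator is doubled ...
    carry = 0
    for k in range(2 * n):
        t = (c[k] << 1) + carry
        c[k] = t & MASK64
        carry = t >> 64
    # ... and the diagonal squares a_i^2 are added at position 2i
    carry = 0
    for i in range(n):
        sq = a[i] * a[i]
        t = c[2 * i] + (sq & MASK64) + carry
        c[2 * i] = t & MASK64
        t = c[2 * i + 1] + (sq >> 64) + (t >> 64)
        c[2 * i + 1] = t & MASK64
        carry = t >> 64
    return c

def words_to_int(w: list[int]) -> int:
    return sum(x << (64 * i) for i, x in enumerate(w))
```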
Squaring in Fast Algorithms
Karatsuba, Toom-Cook, and NTT all have dedicated squaring variants. In Karatsuba squaring, all three partial products in the recursive calls are themselves squarings, so the symmetry advantage propagates recursively.
Squaring Thresholds
Because squaring is faster than general multiplication, the algorithm-switching thresholds differ.
In calx, `SQR_TOOMCOOK3_THRESHOLD` = 200,
which is higher than the general multiplication threshold (140), giving Karatsuba (sqr) a wider range.
Since there is no dedicated Toom-Cook-4 squaring implementation,
Toom-Cook-3 (sqr) transitions directly to NTT (sqr).
Karatsuba (sqr) continues to be used for squaring in the $140$–$199$ word range because
the symmetry reduction of Basecase squaring ($n^2/2$ multiplications) propagates into the Karatsuba recursion,
making Karatsuba (sqr) still faster than Toom-Cook-3 (sqr) in this range.
| Word count | General multiplication | Squaring |
|---|---|---|
| $< 22$ | Basecase | Basecase (sqr) |
| $22$–$139$ | Karatsuba | Karatsuba (sqr) |
| $140$–$199$ | Toom-Cook-3 | Karatsuba (sqr) |
| $200$–$2{,}999$ | Toom-Cook-4 / Toom-Cook-6 / NTT | Toom-Cook-3 (sqr) |
| $\geq 3{,}000$ | NTT | NTT (sqr) |
Related article: Dedicated Squaring Optimizations
Design Decisions
Why Six Tiers
GMP implements Toom-Cook-6, 7, 8, and even higher-order splits, but calx uses Basecase plus four divide-and-conquer levels (Karatsuba, Toom-3, Toom-4, Toom-6) and NTT, for a total of six tiers. Toom-Cook-5 is skipped because its interpolation matrix is complex yet offers little advantage over Toom-Cook-6.
- Toom-Cook-6 is 5--10% faster than Toom-Cook-4 in the $600$–$1{,}049$ word range
- Beyond the NTT threshold ($1{,}050$ words $\approx$ $20{,}000$ digits), NTT is overwhelmingly fast
- Toom-Cook-7 and 8 would have even narrower effective ranges, making the implementation effort not worthwhile
Why NTT Instead of FFT
Floating-point FFT-based multiplication requires complex rounding-error management, and as digit counts grow, accumulated errors make correctness guarantees difficult. NTT operates over the residue ring $\mathbb{Z}/(B^F+1)$, incurring no rounding errors, so the correctness of results is mathematically guaranteed. This property is essential for arbitrary-precision integer multiplication.
References
- Karatsuba, A. and Ofman, Yu. (1962). "Multiplication of Multidigit Numbers on Automata". Doklady Akademii Nauk SSSR, 145, 293–294.
- Toom, A. L. (1963). "The Complexity of a Scheme of Functional Elements Simulating the Multiplication of Integers". Doklady Akademii Nauk SSSR, 150 (3), 496–498.
- Schönhage, A. and Strassen, V. (1971). "Schnelle Multiplikation großer Zahlen". Computing, 7, 281–292.
- Knuth, D. E. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. 3rd ed. Addison-Wesley, 1997.