Int Multiplication Algorithms
Overview
Multiplication of Int values automatically selects the optimal algorithm based on the operand word count.
The user never needs to choose an algorithm explicitly.
For small word counts the low-overhead Basecase is used,
and as the size grows the algorithm switches to asymptotically faster methods.
One word = 64 bits, corresponding to roughly 19 decimal digits. For example, a 100-digit number occupies about 6 words, so Basecase is used.
For usage details, see Int API Reference — Arithmetic Operators.
Algorithm Selection and Thresholds
calx determines its thresholds based on empirical benchmarks. The following table shows the algorithm selected for $a \times b$ based on $\min(\text{words}(a), \text{words}(b))$.
| Word count | Algorithm | Complexity |
|---|---|---|
| $< 22$ | Basecase (long multiplication) | $O(n^2)$ |
| $22$–$139$ | Karatsuba | $O(n^{1.585})$ |
| $140$–$199$ | Toom-Cook-3 | $O(n^{1.465})$ |
| $200$–$599$ | Toom-Cook-4 | $O(n^{1.404})$ |
| $600$–$1\,049$ | Toom-Cook-6 | $O(n^{1.338})$ |
| $\geq 1\,050$ | NTT (Schönhage–Strassen) | $O(n \log n \log \log n)$ |
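The table can be read as a simple threshold dispatch on the smaller operand's word count. The sketch below is illustrative only: the constant and function names are hypothetical, not calx internals; only the threshold values come from the table above.

```python
# Illustrative threshold dispatch for a x b, keyed on min(words(a), words(b)).
# Threshold values are from the table above; the names are hypothetical.
MUL_KARATSUBA_THRESHOLD = 22
MUL_TOOMCOOK3_THRESHOLD = 140
MUL_TOOMCOOK4_THRESHOLD = 200
MUL_TOOMCOOK6_THRESHOLD = 600
MUL_NTT_THRESHOLD = 1050

def select_algorithm(words_a: int, words_b: int) -> str:
    n = min(words_a, words_b)
    if n < MUL_KARATSUBA_THRESHOLD:
        return "basecase"
    if n < MUL_TOOMCOOK3_THRESHOLD:
        return "karatsuba"
    if n < MUL_TOOMCOOK4_THRESHOLD:
        return "toom3"
    if n < MUL_TOOMCOOK6_THRESHOLD:
        return "toom4"
    if n < MUL_NTT_THRESHOLD:
        return "toom6"
    return "ntt"
```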
The Toom-Cook-3 range of $140$–$199$ is narrow because
the calx Toom-Cook-4 implementation has a smaller constant factor than Toom-Cook-3.
Toom-Cook-3 splits the input into 3 parts and performs 5 multiplications,
while Toom-Cook-4 splits into 4 parts and performs 7 multiplications;
since $\log_4 7 \approx 1.404 < \log_3 5 \approx 1.465$, Toom-Cook-4 gains more per level of splitting.
Furthermore, calx uses `addmul_1`-based fixed-size evaluation and
`submul_1`-based interpolation to greatly reduce
the overhead of Toom-Cook-4.
The thresholds were determined by benchmarks on an AMD Ryzen Threadripper PRO 5995WX (Zen 3) with warm-up and cool-down periods to account for CPU cache state and clock frequency stabilization. Optimal thresholds may differ on other microarchitectures (Intel, ARM, etc.). In the future, calx plans to offer an auto-tuning mechanism that optimizes thresholds on the user's own hardware.
Basecase (Long Multiplication)
This is the same long multiplication algorithm taught in school, operating on 64-bit words. An $n$-word $\times$ $m$-word multiplication consists of $n \times m$ word multiplications and additions.
$$c = a \times b \quad \Rightarrow \quad c_k = \sum_{i+j=k} a_i \cdot b_j + \text{carry}$$
Complexity: $O(nm)$ (where $n$ and $m$ are the respective word counts)
The implementation uses the MULX instruction (BMI2)
to perform 64×64→128-bit multiplication in a single instruction.
The inner-loop `addmul_1` routine is optimized in hand-written MASM assembly,
using MULX + ADCX/ADOX to
parallelize the carry chain.
Below 22 words, the overhead of divide-and-conquer exceeds the $O(n^2)$ cost of Basecase, making Basecase the fastest choice.
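A word-oriented sketch of the schoolbook loop, with Python integers standing in for the 64-bit hardware multiply (the inner loop here is a behavioral model of the `addmul_1` step, not the assembly):

```python
MASK64 = (1 << 64) - 1

def basecase_mul(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook multiplication of little-endian 64-bit word arrays."""
    c = [0] * (len(a) + len(b))
    for j, bj in enumerate(b):
        carry = 0
        for i, ai in enumerate(a):
            t = c[i + j] + ai * bj + carry   # one 64x64 -> 128-bit step
            c[i + j] = t & MASK64
            carry = t >> 64
        c[j + len(a)] = carry                # carry-out of this row
    return c

def words_to_int(w: list[int]) -> int:
    return sum(x << (64 * i) for i, x in enumerate(w))
```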
Related article: Basic Operations on Arbitrary-Precision Integers
Karatsuba Algorithm
A divide-and-conquer method discovered by Karatsuba in 1962. It reduces an $n$-digit multiplication to 3 multiplications of $n/2$ digits plus additions and subtractions.
Split $a$ and $b$ into upper and lower halves:
$$a = a_1 \cdot B + a_0, \quad b = b_1 \cdot B + b_0 \quad (B = 2^{32n})$$
Ordinary long multiplication requires 4 half-size multiplications, but Karatsuba constructs the result from 3 products:
$$z_0 = a_0 \cdot b_0, \quad z_2 = a_1 \cdot b_1, \quad z_1 = (a_0 + a_1)(b_0 + b_1) - z_0 - z_2$$
$$a \cdot b = z_2 \cdot B^2 + z_1 \cdot B + z_0$$
Complexity: $O(n^{\log_2 3}) = O(n^{1.585})$
The trade-off is fewer multiplications (3 instead of 4) at the cost of more additions and subtractions. At small sizes the overhead outweighs the savings. calx switches to Karatsuba at 22 words or more.
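The recursion can be sketched in a few lines over Python integers. This splits at bit granularity rather than calx's word granularity, and uses an arbitrary cutoff in place of the 22-word threshold:

```python
def karatsuba(a: int, b: int, cutoff: int = 64) -> int:
    """Karatsuba multiplication of non-negative integers (illustrative sketch)."""
    if a.bit_length() <= cutoff or b.bit_length() <= cutoff:
        return a * b                        # basecase
    k = max(a.bit_length(), b.bit_length()) // 2
    a1, a0 = a >> k, a & ((1 << k) - 1)     # a = a1*B + a0, B = 2^k
    b1, b0 = b >> k, b & ((1 << k) - 1)
    z0 = karatsuba(a0, b0)
    z2 = karatsuba(a1, b1)
    z1 = karatsuba(a0 + a1, b0 + b1) - z0 - z2
    return (z2 << (2 * k)) + (z1 << k) + z0
```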
Related article: Fast Multiplication Algorithms
Toom-Cook-3
Splits the input into 3 parts and reconstructs the result from 5 partial products. This is a generalization of Karatsuba: for a $k$-way split, $2k-1$ multiplications suffice.
The inputs are viewed as polynomials $a = a_2 x^2 + a_1 x + a_0$, $b = b_2 x^2 + b_1 x + b_0$ ($x = B^{n/3}$), and multiplication is treated as polynomial multiplication. Values are evaluated at 5 points ($0, 1, -1, 2, \infty$), and the coefficients are recovered by interpolation from the partial products.
Complexity: $O(n^{\log_3 5}) = O(n^{1.465})$
The overhead of the interpolation step (divisions and additions/subtractions) is larger than Karatsuba's, so it is disadvantageous at smaller sizes. calx switches at 140 words or more.
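The evaluation/interpolation pipeline can be sketched over Python integers, using the same five points as above. This is an illustrative sketch (bit-granularity splits, arbitrary cutoff), not the calx word-array implementation; all interpolation divisions are exact:

```python
def toom3(a: int, b: int, cutoff: int = 256) -> int:
    """Toom-Cook-3 sketch: evaluation at 0, 1, -1, 2, inf."""
    if a < 0 or b < 0:                      # signs can arise from the -1 point
        sign = -1 if (a < 0) != (b < 0) else 1
        return sign * toom3(abs(a), abs(b), cutoff)
    if a.bit_length() <= cutoff or b.bit_length() <= cutoff:
        return a * b
    k = max(a.bit_length(), b.bit_length()) // 3 + 1
    mask = (1 << k) - 1
    a0, a1, a2 = a & mask, (a >> k) & mask, a >> (2 * k)
    b0, b1, b2 = b & mask, (b >> k) & mask, b >> (2 * k)
    # evaluate both polynomials at the 5 points and multiply pointwise
    v0   = toom3(a0, b0, cutoff)                                   # x = 0
    v1   = toom3(a0 + a1 + a2, b0 + b1 + b2, cutoff)               # x = 1
    vm1  = toom3(a0 - a1 + a2, b0 - b1 + b2, cutoff)               # x = -1
    v2   = toom3(a0 + 2*a1 + 4*a2, b0 + 2*b1 + 4*b2, cutoff)       # x = 2
    vinf = toom3(a2, b2, cutoff)                                   # x = inf
    # interpolate c(x) = c4 x^4 + ... + c0 (every division is exact)
    c0, c4 = v0, vinf
    c2 = (v1 + vm1) // 2 - c0 - c4
    t  = (v1 - vm1) // 2                    # = c1 + c3
    c3 = ((v2 - c0 - 4*c2 - 16*c4) // 2 - t) // 3
    c1 = t - c3
    return c0 + (c1 << k) + (c2 << (2*k)) + (c3 << (3*k)) + (c4 << (4*k))
```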
Related article: Fast Multiplication Algorithms — Toom-Cook Method
Toom-Cook-4
Splits the input into 4 parts and reconstructs the result from 7 partial products.
The inputs are represented as degree-3 polynomials, and values are evaluated at 7 points. The evaluation points are $0, 1, -1, 2, -2, 3, \infty$.
Complexity: $O(n^{\log_4 7}) = O(n^{1.404})$
Implementation Optimizations
The calx Toom-Cook-4 implementation incorporates the following optimizations:
- `addmul_1`-based evaluation: Polynomial values at each evaluation point are computed using fixed $k$-limb `addmul_1`-based batch processing, reducing temporary buffer usage and `normalized_size` calls.
- `submul_1`-based interpolation: Divisions in the interpolation step are reduced to `submul_1` and exact division by small constants such as `divexact_by3` and `divexact_by5`, completely avoiding general-purpose division.
These optimizations give Toom-Cook-4 a large constant-factor advantage over Toom-Cook-3, which is why the Toom-Cook-3 range of $140$–$199$ words is so narrow. Toom-Cook-4 is used in the $200$–$599$ word range.
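The two limb-level primitives can be modeled in a few lines. This is a behavioral model of the word-array operations only, not the assembly; the `lo`/`hi` decomposition mirrors the 64×64→128-bit hardware multiply:

```python
MASK64 = (1 << 64) - 1

def addmul_1(rp: list[int], up: list[int], v: int) -> int:
    """rp[0:len(up)] += up * v; returns the carry-out word."""
    carry = 0
    for i, u in enumerate(up):
        t = rp[i] + u * v + carry
        rp[i] = t & MASK64
        carry = t >> 64
    return carry

def submul_1(rp: list[int], up: list[int], v: int) -> int:
    """rp[0:len(up)] -= up * v; returns the borrow-out word."""
    borrow = 0
    for i, u in enumerate(up):
        p = u * v + borrow
        lo, hi = p & MASK64, p >> 64
        if rp[i] < lo:                  # subtracting lo underflows this limb
            hi += 1
        rp[i] = (rp[i] - lo) & MASK64
        borrow = hi
    return borrow

def words_to_int(w: list[int]) -> int:
    return sum(x << (64 * i) for i, x in enumerate(w))
```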
Related article: Fast Multiplication Algorithms — Toom-Cook Method
Toom-Cook-6
Splits the input into 6 parts and reconstructs the result from 11 partial products ($2k - 1 = 11$ for a $k = 6$-way split). The polynomials are evaluated at 11 points, e.g. $0, \pm1, \pm2, \pm3, \pm4, 5, \infty$.
Complexity: $O(n^{\log_6 11}) = O(n^{1.338})$
Asymptotically faster than Toom-Cook-4 ($O(n^{1.404})$), it is used in the $600$–$1{,}049$ word range.
Like Toom-Cook-4, it employs `addmul_1`-based evaluation and exact division by small constants for interpolation.
Above $1{,}050$ words NTT surpasses Toom-Cook-6, so its effective range is relatively narrow.
NTT (Schönhage–Strassen)
A convolution-based multiplication using the Number Theoretic Transform (NTT). Published by Schönhage and Strassen in 1971. It realizes integer multiplication as "polynomial multiplication → pointwise products in the transform domain via the convolution theorem → inverse transform."
Ordinary FFT operates over complex numbers and thus incurs rounding errors, but NTT operates over the residue ring $\mathbb{Z}/(B^F+1)$, so no rounding errors occur at all. This is why NTT is well-suited for integer multiplication.
Complexity: $O(n \log n \log \log n)$
Residue Ring $\mathbb{Z}/(B^F+1)$
The NTT in calx operates over a residue ring modulo the Fermat-style number $B^F + 1$ ($B = 2^{64}$). The input is split into $M = 2^l$ pieces, each of $K$ words, with arithmetic performed in a ring of $F$ words. $\omega = B^{2F/M}$ is used as a primitive $M$-th root of unity, so twiddle factor multiplication requires only bit shifts by powers of 2.
Piecewise Cost Model
NTT performance depends heavily on the choice of transform length $M$ and ring size $F$. calx uses a piecewise cost model to select optimal parameters:
- Pointwise multiplication cost: The actual cost of the multiplication algorithm (Basecase/Karatsuba/Toom-Cook) used for a given $F$ is estimated piecewise.
- Butterfly cost: The cost of $M \cdot l$ butterfly operations (addition, subtraction, and bit shift mod $B^F+1$).
- The best parameters are searched around an initial estimate of $l$ (where $M \approx \sqrt{N}$).
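The search can be sketched schematically. Both cost functions below are stand-ins (calx uses measured piecewise costs, and its ring-size rule is not public in this section), so only the shape of the search is meaningful:

```python
import math

def ntt_params(n_words: int, search_radius: int = 2):
    """Pick (l, M, F) minimizing a toy cost model around M ~ sqrt(N)."""

    def pointwise_cost(f):        # stand-in for the measured piecewise mul cost
        return f ** 1.5

    def butterfly_cost(m, l, f):  # M*l butterflies, each O(F) add/sub/shift work
        return m * l * f

    l0 = max(1, round(math.log2(max(2, n_words)) / 2))   # M = 2^l ~ sqrt(N)
    best = None
    for l in range(max(1, l0 - search_radius), l0 + search_radius + 1):
        m = 1 << l
        k = -(-n_words // m)              # ceil: words per piece
        f = 2 * k + 1                     # toy ring size large enough for products
        cost = m * pointwise_cost(f) + butterfly_cost(m, l, f)
        if best is None or cost < best[0]:
            best = (cost, l, m, f)
    return best[1:]
```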
Other Implementation Details
- `thread_local` buffers: Working memory is kept in `thread_local` storage to avoid memory allocation/deallocation costs on every call.
- Quantization of $F$: When $M > 128$, $F$ is rounded up to a multiple of $M/128$ so that twiddle factor bit shifts land on integer word boundaries.
The switch to NTT occurs at $1{,}050$ words or more (approximately $20{,}000$ decimal digits). In the $1{,}050$–$1{,}249$ word range, Double FFT is tried first; at $1{,}250$ words or more, Prime NTT is used directly. For numbers with hundreds of thousands of digits or more, NTT vastly outperforms all other algorithms.
Related article: Fast Multiplication Algorithms — FFT Multiplication
Unbalanced Multiplication
When the operands differ greatly in size (e.g., 1,000 words × 100 words), the larger operand is split into blocks of the same size as the smaller one, and the products of each block are summed — an iterative block multiplication approach.
$$a \cdot b = \sum_{i=0}^{k-1} a_i \cdot b \cdot B^{i \cdot m} \quad (a_i \text{ is an } m\text{-word block})$$
calx switches to unbalanced multiplication when $\text{words}(a) \geq 2 \cdot \text{words}(b)$. Each $a_i \cdot b$ is a balanced multiplication for which the optimal algorithm is applied. This ensures efficient processing even for unbalanced inputs.
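The block decomposition can be sketched directly over Python integers, with a `mul` parameter standing in for the balanced-multiplication dispatch (plain `*` here):

```python
WORD_BITS = 64

def unbalanced_mul(a: int, b: int, mul=lambda x, y: x * y) -> int:
    """Multiply by splitting the larger operand a into blocks the size of b."""
    m = max(1, -(-b.bit_length() // WORD_BITS))   # words(b), rounded up
    block = WORD_BITS * m
    mask = (1 << block) - 1
    result, i = 0, 0
    while a:
        a_i = a & mask                            # next m-word block of a
        result += mul(a_i, b) << (i * block)      # each a_i * b is balanced
        a >>= block
        i += 1
    return result
```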
Related article: Unbalanced Toom-Cook Multiplication
Squaring Optimizations
Squaring ($a \times a$) differs from general multiplication in that both operands are identical, allowing symmetry to be exploited.
Basecase Squaring
Since $a_i \cdot a_j$ and $a_j \cdot a_i$ are equal, each cross term with $i \neq j$ only needs to be computed once and doubled. This reduces the number of multiplications to roughly $n^2/2$, yielding approximately 30% speedup.
$$a^2 = \sum_i a_i^2 \cdot B^{2i} + 2 \sum_{i < j} a_i \cdot a_j \cdot B^{i+j}$$
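The formula above translates into a three-pass word-array sketch: accumulate each cross term $a_i a_j$ ($i < j$) once, double the whole accumulator, then add the diagonal squares $a_i^2$. Python ints model the 64-bit words; this is an illustrative sketch, not the calx routine:

```python
MASK64 = (1 << 64) - 1

def basecase_sqr(a: list[int]) -> list[int]:
    """Square a little-endian 64-bit word array, computing each cross term once."""
    n = len(a)
    c = [0] * (2 * n)
    # off-diagonal terms a_i * a_j (i < j), each accumulated once ...
    for i in range(n):
        carry = 0
        for j in range(i + 1, n):
            t = c[i + j] + a[i] * a[j] + carry
            c[i + j] = t & MASK64
            carry = t >> 64
        c[i + n] = carry
    # ... then the accumulator is doubled ...
    carry = 0
    for k in range(2 * n):
        t = (c[k] << 1) + carry
        c[k] = t & MASK64
        carry = t >> 64
    # ... and the diagonal squares a_i^2 are added at position 2i
    carry = 0
    for i in range(n):
        sq = a[i] * a[i]
        t = c[2 * i] + (sq & MASK64) + carry
        c[2 * i] = t & MASK64
        t = c[2 * i + 1] + (sq >> 64) + (t >> 64)
        c[2 * i + 1] = t & MASK64
        carry = t >> 64
    return c

def words_to_int(w: list[int]) -> int:
    return sum(x << (64 * i) for i, x in enumerate(w))
```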
Squaring in Fast Algorithms
Karatsuba, Toom-Cook, and NTT all have dedicated squaring variants. In Karatsuba squaring, all three partial products in the recursive calls are themselves squarings, so the symmetry advantage propagates recursively.
Squaring Thresholds
Because squaring is faster than general multiplication, the algorithm-switching thresholds differ.
In calx, `SQR_TOOMCOOK3_THRESHOLD` = 200,
which is higher than the general multiplication threshold (140), giving Karatsuba (sqr) a wider range.
Since there is no dedicated Toom-Cook-4 squaring implementation,
Toom-Cook-3 (sqr) transitions directly to NTT (sqr).
Karatsuba (sqr) continues to be used for squaring in the $140$–$199$ word range because
the symmetry reduction of Basecase squaring ($n^2/2$ multiplications) propagates into the Karatsuba recursion,
making Karatsuba (sqr) still faster than Toom-Cook-3 (sqr) in this range.
| Word count | General multiplication | Squaring |
|---|---|---|
| $< 22$ | Basecase | Basecase (sqr) |
| $22$–$139$ | Karatsuba | Karatsuba (sqr) |
| $140$–$199$ | Toom-Cook-3 | Karatsuba (sqr) |
| $200$–$2{,}999$ | Toom-Cook-4 / Toom-Cook-6 / NTT | Toom-Cook-3 (sqr) |
| $\geq 3{,}000$ | NTT | NTT (sqr) |
Related article: Dedicated Squaring Optimizations
Design Decisions
Why Six Tiers
GMP implements Toom-Cook-6, 7, 8, and even higher-order splits, but calx uses Basecase plus four divide-and-conquer levels (Karatsuba, Toom-3, Toom-4, Toom-6) and NTT, for a total of six tiers. Toom-Cook-5 is skipped because its interpolation matrix is complex yet offers little advantage over Toom-Cook-6.
- Toom-Cook-6 is 5--10% faster than Toom-Cook-4 in the $600$–$1{,}049$ word range
- Beyond the NTT threshold ($1{,}050$ words $\approx$ $20{,}000$ digits), NTT is overwhelmingly fast
- Toom-Cook-7 and 8 would have even narrower effective ranges, making the implementation effort not worthwhile
Why NTT Instead of FFT
Floating-point FFT-based multiplication requires complex rounding-error management, and as digit counts grow, accumulated errors make correctness guarantees difficult. NTT operates over the residue ring $\mathbb{Z}/(B^F+1)$, incurring no rounding errors, so the correctness of results is mathematically guaranteed. This property is essential for arbitrary-precision integer multiplication.
References
- Karatsuba, A. and Ofman, Yu. (1962). "Multiplication of Multidigit Numbers on Automata". Doklady Akademii Nauk SSSR, 145, 293–294.
- Toom, A. L. (1963). "The Complexity of a Scheme of Functional Elements Simulating the Multiplication of Integers". Doklady Akademii Nauk SSSR, 150 (3), 496–498.
- Schönhage, A. and Strassen, V. (1971). "Schnelle Multiplikation großer Zahlen". Computing, 7, 281–292.
- Knuth, D. E. The Art of Computer Programming, Volume 2: Seminumerical Algorithms. 3rd ed. Addison-Wesley, 1997.