Chapter 5: Fast Division Algorithms
Knuth D, Burnikel-Ziegler, and Newton Iteration
Multiprecision integer division is intrinsically harder than multiplication. Multiplication is determined by sums of products at all positions and parallelizes well; division has a sequential structure of "guess each quotient digit and correct if wrong." Even highly tuned implementations typically have division running 2-5 times slower than multiplication.
This chapter starts from naive schoolbook division (Knuth Algorithm D), proceeds to divide-and-conquer (Burnikel-Ziegler), then Newton reciprocal iteration which reduces division to multiplication, and finally the special case of Exact Division where the remainder is known to be zero. We study the calx implementation throughout and quantify which algorithm is fastest at each size.
5.1 Schoolbook Division (Knuth Algorithm D)
Let $A = (a_{n-1}, \ldots, a_0)$ be an $n$-word dividend and $B = (b_{m-1}, \ldots, b_0)$ an $m$-word divisor. We compute the quotient $Q$ and remainder $R$ such that $A = QB + R$ with $0 \le R < B$. Algorithm D from Knuth's The Art of Computer Programming Vol. 2, §4.3.1 is the foundation of all fast division algorithms.
Overview of Algorithm D
- Normalize: left-shift $A$ and $B$ equally so that the most significant bit of $b_{m-1}$ becomes 1.
- Iterate for $j = n-m, n-m-1, \ldots, 0$:
- Guess a trial quotient $\hat{q}$ from the top two words.
- Compute $A \leftarrow A - \hat{q} \cdot B \cdot \beta^j$ where $\beta = 2^{64}$.
- If the result is negative, set $\hat{q} \leftarrow \hat{q} - 1$ and $A \leftarrow A + B \cdot \beta^j$ to correct.
- Record $q_j \leftarrow \hat{q}$.
- Denormalize: right-shift the remainder back to the original scale.
Complexity is $O(nm)$ word multiplications. For balanced division with $n = m$ it is $O(n^2)$, identical in substance to long division as taught in school.
Why normalize?
When the leading bit of $b_{m-1}$ is 1, the trial quotient $\hat{q}$ overestimates the true digit by at most 2, i.e. $\hat{q} - q \in \{0, 1, 2\}$ (Knuth's Theorem B). Therefore at most two corrections are needed, and in practice a correction fires with probability only about $2/\beta \approx 10^{-19}$. Without normalization, the error can be much larger and corrections fire often, degrading performance.
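The normalization count itself is cheap to compute. A minimal sketch (normalize_shift is an illustrative name, not calx's API):

```cpp
#include <cstdint>
#include <cassert>

// Sketch of the normalization step: count how far the divisor's top word
// must be shifted left so its most significant bit becomes 1. The same
// shift is applied to the dividend, and undone on the remainder at the end.
inline int normalize_shift(uint64_t d_msw) {
    int s = 0;
    while (!(d_msw >> 63)) {  // top bit not yet set
        d_msw <<= 1;
        ++s;
    }
    return s;  // in [0, 63] for nonzero d_msw
}
```

In a real implementation this loop would be a single count-leading-zeros instruction (LZCNT / BSR on x86-64).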
In calx, this step is implemented by div_basecase / sbpi1_div_qr in MpnOps.hpp. The basic structure:
```cpp
// Sketch of div_basecase in MpnOps.hpp
inline void div_basecase(uint64_t* q, uint64_t* a, size_t an,
                         const uint64_t* b, size_t bn) {
    const uint64_t d1 = b[bn - 1]; // MSB = 1 (already normalized)
    const uint64_t d0 = b[bn - 2];
    // Pre-compute dinv = floor((beta^2 - 1) / d1) - beta (Möller-Granlund)
    // Loop: at each j, do 3-by-2 quotient estimation + addmul_1 subtraction + correction
}
```
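The pre-computed reciprocal mentioned in the comment can be written down directly. A minimal sketch (reciprocal_word is an illustrative name, not calx's code; assumes GCC/Clang's unsigned __int128):

```cpp
#include <cstdint>
#include <cassert>

// Compute dinv = floor((beta^2 - 1) / d1) - beta for a normalized d1
// (top bit set), with beta = 2^64.
inline uint64_t reciprocal_word(uint64_t d1) {
    unsigned __int128 num = ~(unsigned __int128)0;  // beta^2 - 1 = 2^128 - 1
    unsigned __int128 q = num / d1;                 // lies in (beta, 2*beta),
    return (uint64_t)q;                             // so low word == q - beta
}
```

A production routine would avoid the 128-bit division with a few word-level steps, but the value computed is the same.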
5.2 Quotient Estimation — 2-by-1 and 3-by-2
The inner loop of Algorithm D reduces to "guess $\hat{q}$ from the top 3 words of $A$ and top 1-2 words of $B$." Since most of the running time is spent inside this tiny operation, optimizing it dictates overall performance.
2-by-1 estimation
The simplest approach takes the quotient of the top two dividend words by the top divisor word:

$$\hat{q} = \min\left( \left\lfloor \frac{a_{j+m} \beta + a_{j+m-1}}{b_{m-1}} \right\rfloor,\; \beta - 1 \right)$$

This is one hardware DIV instruction. The x86-64 DIV r64 performs a 128-by-64-bit division in roughly 20-30 cycles.
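For concreteness, the 2-by-1 step can be expressed portably (div_2by1 is an illustrative name; assumes GCC/Clang's unsigned __int128 — MSVC exposes the same operation as _udiv128):

```cpp
#include <cstdint>
#include <cassert>

// 128-by-64-bit division, assuming the quotient fits in 64 bits (hi < d).
// A compiler that can prove hi < d may lower this to a single DIV r64.
inline uint64_t div_2by1(uint64_t hi, uint64_t lo, uint64_t d, uint64_t* rem) {
    unsigned __int128 n = ((unsigned __int128)hi << 64) | lo;
    *rem = (uint64_t)(n % d);
    return (uint64_t)(n / d);
}
```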
3-by-2 estimation (Möller-Granlund)
More accurate is to use three dividend words and two divisor words. Möller and Granlund (2011) showed how to eliminate the hardware division entirely: with a pre-computed reciprocal $\text{dinv} = \lfloor (\beta^2 - 1)/d_1 \rfloor - \beta$, the estimate reduces to one widening multiply plus a few adds (MULX + add), taking just 2-3 cycles. The pre-computed $\text{dinv}$ can be reused as long as the divisor is fixed, which is a huge win for long divisions.
```cpp
// calx: 3-by-2 quotient estimation in MpnOps.hpp
inline void udiv_qrnnd_3by2(uint64_t& q, uint64_t& r1, uint64_t& r0,
                            uint64_t u2, uint64_t u1, uint64_t u0,
                            uint64_t d1, uint64_t d0, uint64_t dinv) {
    uint64_t p_hi;
    uint64_t p_lo = _umul128(u2, dinv, &p_hi);  // widening multiply: dinv * u2
    uint64_t q0 = p_lo + u1;                    // low word of the estimate
    uint64_t q1 = p_hi + u2 + (q0 < p_lo);      // high word, with carry-in
    // Followed by at most 2 conditional corrections of the quotient
}
```
Bound on the estimation error
The value $\text{dinv} = \lfloor (\beta^2 - 1)/d_1 \rfloor - \beta$ can be seen as a $\beta$-adic reciprocal of $d_1$ in the form $1 + \text{dinv}/\beta$. Multiplying it against a three-word dividend keeps the deviation from the true quotient within a narrow carry-sized band. For a normalized $B$, the 3-by-2 estimation error $\hat{q} - q$ stays in $\{0, 1\}$ only (Möller-Granlund, Theorem 1), which is tighter than the $\{0, 1, 2\}$ of 2-by-1 estimation, giving a simpler correction loop.
5.3 Burnikel-Ziegler Recursive Division
Algorithm D requires $O(n^2)$ word multiplications. When $n$ reaches a few hundred words, even a fast multiplier underneath cannot save the $O(n^2)$ outer cost. Burnikel and Ziegler (1998) brought division down to $O(M(n) \log n)$ via divide-and-conquer, where $M(n)$ is the cost of $n$-word multiplication (Karatsuba or better).
The Div-2n-by-n pattern
Assume $A$ has $2n$ words, $B$ has $n$ words, and the quotient of $A / B$ fits in $n$ words. Split $B$ into two $n/2$-word halves and $A$ into four $n/2$-word pieces:

$$B = B_1 \beta^{n/2} + B_0, \qquad A = A_3 \beta^{3n/2} + A_2 \beta^{n} + A_1 \beta^{n/2} + A_0$$
Then the quotient can be found by solving two smaller recursions:
- Upper half $Q_1$: divide the top $3n/2$ words of $A$ by $B$ (recursive Div-$3n/2$-by-$n$).
- Lower half $Q_0$: form an intermediate remainder and recurse on the lower half of $A$.
- Combine: $Q = Q_1 \beta^{n/2} + Q_0$.
Each step involves a half-size division plus one $n/2 \times n$ multiplication on the result. If the multiplication cost is $M(n)$, the division cost $D(n)$ satisfies

$$D(n) = 2\,D(n/2) + O(M(n))$$
By the master theorem, $D(n) = O(M(n) \log n)$. With $M(n) = O(n^{1.465})$ (Toom-3) or $O(n \log n)$ (FFT), division is only $\log n$ times slower than multiplication.
The calx implementation
```cpp
// MpnOps.hpp: dcpi1_div_qr_n (divide-and-conquer, pre-inverted)
inline uint64_t dcpi1_div_qr_n(uint64_t* qp, uint64_t* np,
                               const uint64_t* dp, size_t n,
                               uint64_t dinv, uint64_t* scratch) {
    if (n < DC_DIV_QR_THRESHOLD) {
        return sbpi1_div_qr(qp, np, 2 * n, dp, n, dinv);  // base: schoolbook
    }
    size_t lo = n / 2, hi = n - lo;
    // Upper half: 2hi-by-hi recursion
    uint64_t qh = dcpi1_div_qr_n(qp + lo, np + 2 * lo, dp + lo, hi, dinv, scratch);
    // Correction: subtract Q1 * B_low, then recurse on lower half
    // ...
}
```
calx uses DC_DIV_QR_THRESHOLD = 20 as the base-case size and
BZ_THRESHOLD = 64 as the switchover between schoolbook and BZ.
5.4 Newton Reciprocal Iteration
Split division $A / B$ into two steps:
- Compute a reciprocal approximation $X \approx 1/B$.
- Take $A \cdot X$ as the quotient, then correct any error.
With this approach, division reduces to multiplication. If multiplication $M(n)$ is $O(n \log n)$ with FFT, so is division. Newton iteration is the engine of this strategy.
Newton iteration formula
Apply Newton's method to $f(x) = 1/x - B$, whose root is $x = 1/B$:

$$x_{k+1} = x_k - \frac{f(x_k)}{f'(x_k)} = x_k (2 - B x_k)$$
When $x_k$ is close to $1/B$, we have $B x_k \approx 1$ and $2 - B x_k \approx 1$, and the error of $x_{k+1}$ is proportional to the square of the error of $x_k$ — quadratic convergence. Starting from a double-precision initial guess (48 bits of accuracy), reaching $b$ bits takes only $\lceil \log_2(b / 48) \rceil$ iterations.
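The quadratic convergence is easy to observe in plain double arithmetic. A toy sketch (not the calx Float code; newton_recip is an illustrative name):

```cpp
#include <cassert>
#include <cmath>

// Reciprocal iteration x <- x(2 - b*x) in double precision: starting from a
// crude guess, each step roughly doubles the number of correct bits until we
// hit double's ~53-bit limit.
inline double newton_recip(double b, double x0, int iters) {
    double x = x0;
    for (int i = 0; i < iters; ++i)
        x = x * (2.0 - b * x);  // relative error squares each step
    return x;
}
```

With b = 3 and the crude guess x0 = 0.3 (relative error 0.1), the error sequence is roughly 1e-1 → 1e-2 → 1e-4 → 1e-8 → 1e-16, so a handful of iterations reach full double precision.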
Cost management per iteration
If $x_k$ has $p_k$ bits of precision at step $k$, the product $B x_k$ need only be computed at $p_k$ bits. With $p_k$ doubling each step up to the final precision $b$, the total cost sums as a geometric series:

$$\sum_k M(p_k) \le M(b)\left(1 + \tfrac{1}{2} + \tfrac{1}{4} + \cdots\right) < 2\,M(b)$$

(using $M(b/2) \le M(b)/2$ for superlinear $M$). That is, the reciprocal costs only about twice a single full-precision multiplication. This is why Newton iteration keeps division in the same complexity class as multiplication.
The calx implementation for Float division
calx switches to Newton iteration for Float division when the divisor has 64 words (~4096 bits) or more. The initial guess is via double precision, then each iteration doubles the precision.
```cpp
// Float.cpp: initial approximation for Newton division
int64_t b_bl = static_cast<int64_t>(b_work.mantissa_.bitLength());
int64_t exp_adj = b_work.exponent_ + b_bl - 1;  // b_norm in [1, 2)
b_work.exponent_ = -(b_bl - 1);
double b_d = b_work.toDouble();                 // 53-bit approximation
Float x(1.0 / b_d);
x.exponent_ -= exp_adj;                         // Rescale to original magnitude
x.effective_bits_ = 64;                         // Record achieved precision

// Newton iteration body (b_work is the normalized divisor)
int correct_bits = 48;
while (correct_bits < target_bits) {
    Float bx = b_work * x;                      // ~ 1
    Float residual = one - bx;                  // ~ 2^{-correct_bits}
    x = x + x * residual;                       // x_{k+1} = x_k (2 - B x_k)
    correct_bits = std::min(2 * correct_bits, target_bits);
}
```
Precision tracking (effective_bits)
Tracking precision in Newton iteration is subtle. The target precision $p_k$ at step $k$ is not necessarily the effective precision. Rounding direction, error propagation, and cancellation in subtraction all affect it.
In calx each Float carries an effective_bits_ field, updated at each iteration by min(2 * correct_bits, step_bits). The final quotient includes a single error-aware correction: the raw product $Q_0 = A \cdot X$ is refined to $Q = Q_0 + (A - B Q_0) \cdot X$. This is a fused correction that saves one final $M(P, P)$ multiplication while still squaring the error.
5.5 Exact Division
Division where the remainder is known in advance to be zero can be treated specially. Examples: polynomial division $A = QB$, binomial coefficient integrality $\binom{n}{k} = n! / (k!(n-k)!)$, Toom-Cook interpolation with fixed small divisors (/3, /5, /11 etc.) — all are exact divisions.
Jebelean's exact division
Normal division computes the quotient from the top down. For exact division, computing the quotient from the bottom up gives the same answer. Jebelean (1993) observed that, using the $p$-adic inverse $B^{-1} \bmod \beta$ (for odd $B$), the lowest quotient word falls out of a single modular multiplication:

$$q_0 = a_0 \cdot (B^{-1} \bmod \beta) \bmod \beta, \qquad A \leftarrow (A - q_0 B)/\beta$$

Iterating $n$ times gives the quotient. Each step is a 1-word multiplication plus addmul_1 — natively supported by hardware. No Möller-Granlund reciprocal estimation and no trial-quotient corrections are needed; the whole division runs in $O(n)$ word operations.
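The whole scheme fits in a few lines for a single odd word divisor. A sketch (inv_mod_beta and divexact_1 are illustrative names, not calx's API; assumes GCC/Clang's unsigned __int128):

```cpp
#include <cstdint>
#include <cstddef>
#include <cassert>

// Inverse of an odd d modulo 2^64 via integer Newton iteration: the seed d
// is correct to 3 bits (d*d ≡ 1 mod 8 for odd d), and each step doubles the
// number of correct bits, so 5 steps reach 64 bits.
inline uint64_t inv_mod_beta(uint64_t d) {
    uint64_t x = d;
    for (int i = 0; i < 5; ++i)
        x *= 2 - d * x;          // x <- x(2 - d*x) mod 2^64
    return x;
}

// Divide the n-word little-endian number a in place by odd d, low to high.
// Assumes the division is exact (remainder zero).
inline void divexact_1(uint64_t* a, size_t n, uint64_t d) {
    uint64_t dinv = inv_mod_beta(d);
    uint64_t borrow = 0;
    for (size_t i = 0; i < n; ++i) {
        uint64_t sub_borrow = (a[i] < borrow);
        uint64_t t = a[i] - borrow;
        uint64_t q = t * dinv;                    // q_i = t * d^{-1} mod 2^64
        a[i] = q;
        unsigned __int128 p = (unsigned __int128)q * d;
        borrow = (uint64_t)(p >> 64) + sub_borrow; // high half propagates up
    }
}
```

Note there is no trial-and-error: each quotient word is computed once and is final.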
Applications
- Toom-Cook interpolation: dividing by small fixed integers when applying the Vandermonde inverse to recover coefficients (Chapter 8). Toom-8 performs 10-20 such small-prime divisions like /11, /17, /31.
- Binomial coefficients: updating $\binom{n}{k}$ incrementally as $\binom{n}{k} = \binom{n}{k-1} \cdot (n-k+1) / k$ — each step is exact division.
- Binary Splitting: factoring out common divisors when combining rational sums (Chapter 10).
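The binomial update above can be checked with ordinary machine integers. A toy sketch for small $n$ and $k$ where no overflow occurs:

```cpp
#include <cstdint>
#include <cassert>

// Incremental binomial coefficient: C(n,k) = C(n,k-1) * (n-k+1) / k.
// The division is exact at every step, because c * (n-i+1) equals
// C(n,i) * i, which is divisible by i.
inline uint64_t binomial(uint64_t n, uint64_t k) {
    uint64_t c = 1;
    for (uint64_t i = 1; i <= k; ++i)
        c = c * (n - i + 1) / i;   // exact division, no remainder check needed
    return c;
}
```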
The power of divexact_by_odd
Inside Toom-8, calx makes heavy use of fixed-divisor exact divisions like
divexact_by_odd(c, 3). The inverse $d^{-1} \bmod \beta$ of an odd $d$ is
a compile-time constant, so the whole operation collapses into an addmul_1 loop.
The optimization that landed on 2026-04-19 improved 524K-bit division by 11% (684 μs → 609 μs).
As the name divexact_by_odd suggests, this optimization requires the
divisor to be odd. For an even divisor, first right-shift out the
powers of 2 and apply the optimization only to the remaining odd factor.
5.6 Division by Special Constants
Division by powers of 2
$A / 2^k$ is a right shift. If $k$ is aligned to a word boundary
(a multiple of 64), it is a pointer adjustment only.
If not aligned, each word can be shifted cross-boundary with the SHRD instruction.
Linear time $O(n)$, but with no branches and no correction logic.
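A portable sketch of the non-aligned case (rshift is an illustrative name; the two-word combine per output word is exactly what SHRD implements):

```cpp
#include <cstdint>
#include <cstddef>
#include <cassert>

// Right-shift an n-word little-endian number by k bits, 0 < k < 64, n >= 1.
// Each output word combines two adjacent input words — one SHRD on x86-64.
inline void rshift(uint64_t* r, const uint64_t* a, size_t n, unsigned k) {
    for (size_t i = 0; i + 1 < n; ++i)
        r[i] = (a[i] >> k) | (a[i + 1] << (64 - k));
    r[n - 1] = a[n - 1] >> k;  // top word has no neighbor above it
}
```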
Division by powers of 10
Decimal output and significant-digit handling often need $A / 10^k$. Since $10 = 2 \cdot 5$, one computes $A / 10^k = (A / 2^k) / 5^k$ — right shift, then exact division by the fixed divisor $5^k$.
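A single-word illustration of the split (divexact_1000 is an illustrative name, not calx's API):

```cpp
#include <cstdint>
#include <cassert>

// Exact division by 1000 = 2^3 * 5^3: shift out the power of 2, then divide
// by 125 via its inverse modulo 2^64, computed by integer Newton iteration.
// The input must be an exact multiple of 1000.
inline uint64_t divexact_1000(uint64_t a) {
    uint64_t inv = 125;              // inverse of 125 mod 8 (3 correct bits)
    for (int i = 0; i < 5; ++i)
        inv *= 2 - 125 * inv;        // doubles correct bits; 5 steps -> 64 bits
    return (a >> 3) * inv;           // / 2^3, then exact / 5^3
}
```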
Division by small constants
For small fixed $d \in \{3, 5, 7, \ldots\}$, use Möller-Granlund's divide-by-invariant-integer method: pre-compute a scaled reciprocal $\mu \approx \beta \cdot 2^{\ell} / d$ once, so that $\lfloor a/d \rfloor$ falls out of the high half of the product $a \cdot \mu$. Since $\mu$ is pre-computed, the runtime cost is one MULH (high-half multiply) plus a shift. At most one correction is required, and the whole thing is several times faster than a word-wise divide.
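For a concrete instance of the invariant-integer idea, division by 3 with the classic round-up magic constant (a sketch, not calx's code; assumes GCC/Clang's unsigned __int128):

```cpp
#include <cstdint>
#include <cassert>

// The constant m = (2^65 + 1)/3 satisfies floor(a*m / 2^65) == a/3 for every
// 64-bit a: a*m/2^65 = a/3 + a/(3*2^65), and the added term is below 1/6,
// too small to push the fractional part of a/3 (at most 2/3) past 1.
inline uint64_t div3(uint64_t a) {
    const uint64_t m = 0xAAAAAAAAAAAAAAABULL;   // (2^65 + 1) / 3
    unsigned __int128 p = (unsigned __int128)a * m;
    return (uint64_t)(p >> 65);                 // MULH + shift, no DIV
}
```

This is the same transformation compilers apply automatically to `a / 3` with a constant divisor; multiword small-constant division chains the same trick across words.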
5.7 Algorithm Selection Thresholds — The calx Design
Fast algorithms have a common property: large constant factors but better asymptotic behavior. On small inputs, the naive algorithm wins. Therefore we dispatch by size. The calx division dispatch:
calx division thresholds (as of 2026-04)
| Divisor size $b_n$ (words) | Algorithm | Complexity |
|---|---|---|
| $b_n < 64$ | Knuth D + Möller-Granlund 3-by-2 (div_basecase) | $O(n \cdot b_n)$ |
| $64 \le b_n < 3000$ (Int) | Burnikel-Ziegler (dcpi1_div_qr_n) | $O(M(n) \log n)$ |
| $b_n \ge 3000$ or unbalanced | $\mu$-div (reciprocal + Barrett) | $O(M(n))$ |
| Float, $b_n \ge 64$ | Newton reciprocal iteration | $O(M(n))$ |
| Exact div (fixed small divisor) | Jebelean / divexact_by_odd | $O(n)$ |
Why does Int use $\mu$-div while Float uses Newton? Int returns an integer quotient and must compute the remainder $R$ simultaneously, so Barrett-style $\mu$-div fits. Float needs only the quotient with continuous (rather than discrete) precision management, where Newton is more natural.
Thresholds are empirical
The exact crossover "when to switch to BZ" depends on the CPU cache size, the multiplication routine constants, and the compiler. The calx value of 64 was measured on Zen 3 + GCC/MSVC Release; other CPUs may prefer different values. Too small a threshold, and recursion overhead dominates at sizes where schoolbook is still faster; too large, and the asymptotically worse $O(n^2)$ schoolbook keeps running at sizes where BZ would already win. Production libraries typically rely on automatic tuning to select platform-specific thresholds.
5.8 Measurements
We measure calx division at representative sizes to see how the size-range behavior plays out. Measurement environment: Zen 3 (Ryzen 5 5600X), MSVC Release x64, single thread.
Division performance (balanced $n / n$)
| Size | calx (μs) | Dominant multiplication layer |
|---|---|---|
| $\le 65$ K bits | < 100 | schoolbook + BZ + Karatsuba |
| 262 K bits | 598 | BZ + Toom-4 / Toom-6 |
| 524 K bits | 1834 | Newton + Toom-8 + NTT |
| 1 M bits | 3850 | Newton + 5-smooth NTT |
Measured after the 2026-04-19 Toom-8 optimization + divexact_by_odd folding.
Size-range characteristics
Division speed is mostly determined not by the division logic itself but by the multiplications called internally. Newton-based $\mu$-div invokes many unbalanced multiplications (dividend $\gg$ divisor) during reciprocal computation.
- NTT range ($n \ge 1200$ words): 5-smooth NTT shines via its $O(n \log n)$ asymptotics.
- Karatsuba range ($22 \le n < 140$ words): classical Karatsuba at $O(n^{1.585})$ is the winner.
- Toom-6 / Toom-8 range (600-1200 words): higher-order Toom-Cook dominates; the Toom-8 pre-pad evaluation pays off.
- Unbalanced 2:1 in invert_approx (513-1025 words): Toom-4,2 specialization matches the unbalanced piece sizes.
In other words, "making division faster" really means "making the multiplications inside division faster." The reason 524K-bit division improved was the introduction of 5-smooth NTT (Chapter 9) and the Toom-8 optimization of 2026-04-19.
Measurement caveats
Benchmarks are sensitive to CPU thermal state, memory bandwidth, and compiler version. Thermal throttling during back-to-back runs can cost several percent. The calx benchmark suite inserts 30-second cooldowns between sizes and repeats measurements within the same session to cancel out absolute-time drift.
5.9 Summary
- Knuth D and 3-by-2 estimation: fastest for small sizes via schoolbook + pre-computed Möller-Granlund reciprocal. Normalization ensures the estimation error stays in $\{0, 1\}$.
- Burnikel-Ziegler: divide-and-conquer gives $O(M(n) \log n)$. calx switches to BZ at 64 words.
- Newton reciprocal iteration: reduces division to multiplication, $O(M(n))$. Quadratic convergence means a handful of iterations from a double-precision guess suffice. Getting effective_bits wrong corrupts the quotient.
- Exact Division: Jebelean's $p$-adic variant at $O(n)$ when the remainder is known zero. Heavily used in Toom-Cook interpolation.
- Performance is dominated by multiplication: division speed is mostly determined by the multiplications called internally. The 524K-bit improvement came from enhancements to NTT and Toom-Cook, not to the division logic.
In the next chapter we look at a special case of division called modular arithmetic ($a \bmod m$). The Barrett reduction and Montgomery multiplication that appear in RSA and NTT hot loops are techniques for refactoring the division of this chapter so as to "not divide at all".
References
- Knuth, D.E. The Art of Computer Programming, Vol. 2: Seminumerical Algorithms, 3rd ed., Addison-Wesley, 1997. §4.3.1 (Algorithm D).
- Burnikel, C., Ziegler, J. "Fast Recursive Division", MPI-I-98-1-022, Max-Planck-Institut für Informatik, 1998.
- Möller, N., Granlund, T. "Improved Division by Invariant Integers", IEEE Trans. Computers, 60(2), pp. 165-175, 2011.
- Brent, R.P., Zimmermann, P. Modern Computer Arithmetic, Cambridge University Press, 2010. Chapter 1.
- Jebelean, T. "An Algorithm for Exact Division", J. Symbolic Computation, 15(2), pp. 169-180, 1993.
- Hasselström, K. "Fast Division of Large Integers — A Comparison of Algorithms", Master's thesis, KTH, 2003.
Related Chapters
- Chapter 4: Fast Multiplication Algorithms — the multiplication that division reduces to
- Chapter 6: Modular Arithmetic — avoiding division via modular reduction
- Chapter 7: Integer GCD Algorithms — Euclidean algorithm and its improvements
- Chapter 8: Unbalanced Toom-Cook Multiplication — unbalanced multiplication's impact on division performance
- Chapter 9: NTT Implementation Comparison — NTT powering large multiplications
- [Multiprecision Float] Chapter 10: Binary Splitting and Hypergeometric Series — the single large-integer division at the end