Float Exponential and Logarithm Algorithms

Overview

The exponential function $\exp(x)$ and the logarithmic function $\log(x)$ are mutual inverses and form the foundation of the calx Float mathematical function suite. Many transcendental functions including trigonometric, hyperbolic, and power functions reduce to these two, so their efficiency determines overall performance.

Both functions employ multi-level dispatch based on input precision: low-precision inputs use fast paths with fixed-point arithmetic (Q128/Q256), while high-precision inputs use asymptotically optimal algorithms.

Internal computation uses RawFloat. RawFloat is a lightweight structure holding only an mpn array (d), word count (nw), and exponent (exp), eliminating the memory allocation overhead associated with constructing and destroying Float objects. For exp/log, where many intermediate values are generated inside Taylor series loops, this design is essential for performance.
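
As a rough sketch (the field names d, nw, and exp come from the description above; the concrete types and layout here are illustrative assumptions, not the library's actual definition), a RawFloat-style value amounts to little more than:

```cpp
#include <cstdint>
#include <cstddef>

// Simplified sketch of a RawFloat-style value: a bare limb array plus a
// word count and a binary exponent. Unlike a full Float object it owns no
// heap storage of its own, so Taylor-loop temporaries cause no allocations.
struct RawFloatSketch {
    uint64_t*   d;    // mantissa limbs (mpn-style, least significant word first)
    std::size_t nw;   // number of 64-bit words currently in use
    int64_t     exp;  // binary exponent: value = mantissa * 2^exp
};
```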

Note that the mathematical constants $\log 2$ and $\log 10$ are computed using the log function, but the low-precision path of log itself is implemented with an atanh series that does not depend on these constants, thereby avoiding circular dependencies.

Algorithm for exp

$\exp(x)$ is dispatched across 4 levels based on precision:

| Precision | Function | Method |
| --- | --- | --- |
| $\leq 64$ bits | exp_small | Argument reduction $z = x - k \ln 2$ + Q128 fixed-point Taylor |
| $\leq 128$ bits | exp_medium | Argument reduction + Q256 fixed-point Taylor |
| General | expDoubling | $K$ halvings + Taylor + $K$ squarings |
| Large precision | exp_newton | Newton iteration $y_{n+1} = y_n (1 + x - \log y_n)$ |

In practice, expDoubling is used almost exclusively. exp_newton requires log computation within exp, so it is used only when called indirectly from the Halley iteration on the log side.

exp_small / exp_medium

Argument reduction shrinks $x$ to the range $[0, \ln 2)$:

$$x = k \ln 2 + z, \quad k = \lfloor x / \ln 2 \rfloor, \quad z \in [0, \ln 2)$$

$$\exp(x) = 2^k \cdot \exp(z)$$

$\exp(z)$ is computed via a Q128 (64-bit precision) or Q256 (128-bit precision) fixed-point Taylor series. Since $z < \ln 2 \approx 0.693$ is small, convergence requires only a few terms. Multiplication by $2^k$ is realized as an exponent addition, so it is essentially free.
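
A minimal double-precision sketch of this reduction (the actual exp_small / exp_medium paths work in Q128/Q256 fixed point, not in double; the term count here is an illustrative choice):

```cpp
#include <cmath>
#include <cstdio>

// Split x = k*ln2 + z with z in [0, ln 2), sum a short Taylor series for
// exp(z), then scale by 2^k via the exponent only.
double exp_reduced(double x) {
    const double ln2 = 0.6931471805599453;
    int    k = (int)std::floor(x / ln2);    // k = floor(x / ln 2)
    double z = x - k * ln2;                 // z in [0, ln 2)

    double term = 1.0, sum = 1.0;
    for (int n = 1; n <= 17; ++n) {         // few terms suffice since z < 0.7
        term *= z / n;
        sum  += term;
    }
    return std::ldexp(sum, k);              // multiply by 2^k: exponent add only
}

int main() {
    std::printf("%.17g %.17g\n", exp_reduced(3.75), std::exp(3.75));
}
```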

Argument Halving and Squaring Reconstruction

expDoubling is the primary algorithm for computing exp at arbitrary precision, combining argument halving with squaring reconstruction.

Argument Halving

$x$ is halved $K$ times to obtain $z = x / 2^K$. $|z|$ becomes sufficiently small for the Taylor series to converge quickly:

$$\exp(z) = \sum_{k=0}^{N} \frac{z^k}{k!} + O(z^{N+1})$$

The halving count $K$ is chosen based on precision $p$. Increasing $K$ reduces the number of Taylor terms $N$ but increases the number of squarings for reconstruction. The optimal $K$ is approximately $K \approx \sqrt{p / \log_2 p}$.

Squaring Reconstruction

$\exp(x)$ is reconstructed from $\exp(z)$:

$$\exp(x) = \exp(z \cdot 2^K) = \exp(z)^{2^K}$$

This requires $K$ squarings. In the RawFloat squaring reconstruction, mpn::square is used, exploiting symmetry for faster computation than general multiplication.

Need for Guard Bits

In squaring reconstruction, rounding error can be amplified by up to a factor of 2 at each step. After $K$ squarings, the error may grow by up to $2^K$, so $K + \alpha$ guard bits are added for computation, where $\alpha$ provides margin for Taylor series truncation error.
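
The overall flow can be illustrated in double precision (the real expDoubling operates on RawFloat at $p + K + \alpha$ bits; the fixed $K$ and term count below are illustrative assumptions):

```cpp
#include <cmath>
#include <cstdio>

// expDoubling flow: halve the argument K times, evaluate a short Taylor
// series on the tiny value z, then square the result K times. The real
// code carries K + alpha guard bits because each squaring can roughly
// double the accumulated rounding error.
double exp_doubling(double x, int K) {
    double z = std::ldexp(x, -K);            // z = x / 2^K

    double term = 1.0, sum = 1.0;            // Taylor series for exp(z)
    for (int n = 1; n <= 8; ++n) {           // few terms: |z| is tiny
        term *= z / n;
        sum  += term;
    }
    for (int i = 0; i < K; ++i)              // reconstruct: exp(x) = exp(z)^(2^K)
        sum *= sum;
    return sum;
}

int main() {
    std::printf("%.17g %.17g\n", exp_doubling(1.0, 16), std::exp(1.0));
}
```

With $K = 16$ the printed result typically disagrees with std::exp(1.0) in the last few digits; that discrepancy is exactly the rounding-error amplification the guard bits are there to absorb.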

Paterson-Stockmeyer Method

Evaluating the Taylor series with Horner's method requires $N$ multiprecision multiplications. The Paterson-Stockmeyer method partitions the polynomial into blocks of roughly $\sqrt{N}$ terms, reducing the number of full (non-scalar) multiprecision multiplications to $O(\sqrt{N})$.

Let $s = \lceil \sqrt{N} \rceil$ and rewrite $p(z) = \sum_{k=0}^{N} a_k z^k$ as:

$$p(z) = \sum_{i=0}^{s-1} \left( \sum_{j=0}^{s-1} a_{is+j} \, z^j \right) (z^s)^i$$

  1. Precompute $z, z^2, \ldots, z^s$ ($s-1$ multiplications)
  2. Evaluate each inner polynomial as a linear combination of the precomputed powers (only scalar multiplications by the coefficients)
  3. Evaluate the outer polynomial via Horner's method in $z^s$ ($s-1$ multiplications)

The total of $O(\sqrt{N})$ multiprecision multiplications yields a significant performance improvement over Horner's method for large-precision computation.
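
A double-precision sketch of the scheme (the test polynomial and degree are arbitrary illustrations; the real code evaluates the exp Taylor series on RawFloat values):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Paterson-Stockmeyer evaluation of p(z) = sum a_k z^k. Powers z^1..z^s are
// precomputed; each inner block is a plain linear combination of those
// powers (the cheap "scalar" multiplies), and only the outer Horner loop
// in z^s uses full multiplications.
double ps_eval(const std::vector<double>& a, double z) {
    const std::size_t N = a.size() - 1;                // degree
    const std::size_t s = (std::size_t)std::ceil(std::sqrt((double)N));

    std::vector<double> pw(s + 1, 1.0);                // pw[j] = z^j
    for (std::size_t j = 1; j <= s; ++j) pw[j] = pw[j - 1] * z;

    const std::size_t blocks = (N + s) / s;            // ceil((N+1)/s)
    double result = 0.0;
    for (std::size_t i = blocks; i-- > 0; ) {          // outer Horner in z^s
        double inner = 0.0;
        for (std::size_t j = 0; j < s; ++j) {          // inner linear combination
            std::size_t k = i * s + j;
            if (k <= N) inner += a[k] * pw[j];
        }
        result = result * pw[s] + inner;
    }
    return result;
}

int main() {
    // exp(0.5) via its degree-12 Taylor polynomial, evaluated with the sketch
    std::vector<double> a(13);
    a[0] = 1.0;
    for (int k = 1; k <= 12; ++k) a[k] = a[k - 1] / k;
    std::printf("%.17g %.17g\n", ps_eval(a, 0.5), std::exp(0.5));
}
```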

Algorithm for log

$\log(x)$ is dispatched across 5 levels based on precision:

| Precision | Function | Method |
| --- | --- | --- |
| $\leq 64$ bits | log_newton | double (std::log) initial approximation + Halley iteration (Q128 exp) |
| 65 -- 230 bits | log_medium | Q256 fixed-point atanh series |
| 231 -- 3000 bits | log_newton | Q256 initial approximation + Halley iteration |
| 3001 -- 6000 bits | log_multiprime | Prime-power decomposition + $\sum \log(p_i)$ |
| $> 6000$ bits | log_core | AGM method |

Argument Decomposition

The input $x > 0$ is decomposed into mantissa and exponent:

$$x = f \cdot 2^k, \quad f \in [1, 2)$$

$$\log(x) = \log(f) + k \cdot \log 2$$

Since $f \in [1, 2)$, we have $\log(f) \in [0, \ln 2)$, which is small, guaranteeing series convergence. $k \cdot \log 2$ is computed using the constant cache.
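
For example, $x = 10$ decomposes as

$$10 = 1.25 \cdot 2^3, \qquad \log 10 = \log 1.25 + 3 \log 2,$$

leaving only $\log 1.25$ for the series to handle.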

Low-Precision Path via atanh Series

$\log(f)$ is computed using the atanh series:

$$\log(f) = 2 \operatorname{atanh}\!\left(\frac{f-1}{f+1}\right) = 2 \sum_{k=0}^{\infty} \frac{1}{2k+1} \left(\frac{f-1}{f+1}\right)^{2k+1}$$

Since $t = (f-1)/(f+1) \in [0, 1/3)$, we have $t^2 < 1/9$, which is small, so the series converges rapidly. It is computed directly in Q128 (64-bit) / Q256 (128-bit) fixed-point arithmetic.

This path does not use constants such as $\log 2$, so it can be safely called during constant cache initialization.
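
A double-precision sketch of this low-precision path, with a hard-coded $\ln 2$ standing in for a compile-time fixed-point constant (the real code runs in Q128/Q256 arithmetic):

```cpp
#include <cmath>
#include <cstdio>

// Decompose x = f * 2^k with f in [1, 2), then evaluate
// log(f) = 2 * atanh((f-1)/(f+1)) as a series in t^2 < 1/9.
double log_atanh(double x) {
    int k;
    double f = std::frexp(x, &k);        // f in [0.5, 1), x = f * 2^k
    f *= 2.0; k -= 1;                    // renormalize to f in [1, 2)

    double t  = (f - 1.0) / (f + 1.0);   // t in [0, 1/3)
    double t2 = t * t, term = t, sum = 0.0;
    for (int n = 0; n < 17; ++n) {       // sum_{n} t^(2n+1) / (2n+1)
        sum  += term / (2 * n + 1);
        term *= t2;
    }
    const double ln2 = 0.6931471805599453;
    return 2.0 * sum + k * ln2;          // log(x) = log(f) + k * log 2
}

int main() {
    std::printf("%.17g %.17g\n", log_atanh(10.0), std::log(10.0));
}
```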

Halley Iteration for log

In the 231 -- 3000 bit precision range, Halley iteration is used to compute $\log(f)$. Although the function is named log_newton, the actual iteration formula is Halley's method.

Iteration Formula

$$y_{n+1} = y_n + 2 \cdot \frac{f - \exp(y_n)}{f + \exp(y_n)}$$

This is Halley's iteration applied to $g(y) = \exp(y) - f = 0$, exhibiting cubic convergence. That is, the number of correct bits approximately triples at each step:

$$53 \to 150 \to 450 \to 1350 \to \cdots$$
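
A small double-precision demonstration of the Halley step (the target value and deliberately poor seed are arbitrary; the real iteration runs on RawFloat with exp dispatched per step):

```cpp
#include <cmath>
#include <cstdio>

// Halley step for log: y <- y + 2*(f - exp(y)) / (f + exp(y)),
// the cubic-convergence iteration for g(y) = exp(y) - f = 0.
int main() {
    const double f = 1.7;                 // target: log(1.7)
    double y = 0.5;                       // crude initial guess
    for (int step = 1; step <= 3; ++step) {
        double e = std::exp(y);
        y += 2.0 * (f - e) / (f + e);     // one Halley update
        std::printf("step %d: y = %.17g, error = %.3g\n",
                    step, y, std::fabs(y - std::log(f)));
    }
}
```

The error drops from about $3 \times 10^{-2}$ to about $10^{-6}$ in a single step and hits the double-precision floor on the next, illustrating the cubic convergence.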

Initial Approximation

The initial approximation is chosen based on the precision range:

  • $\leq 450$ bit: std::log(double) provides a $\sim$53-bit initial approximation, reaching target precision in 1--2 Halley iterations
  • $> 450$ bit: Q256 atanh (log_medium) provides a $\sim$240-bit initial approximation, reaching target precision in 1 Halley iteration

Calling exp

Each Halley step requires computing $\exp(y_n)$. Since the precision of $y_n$ increases as the iteration progresses, the corresponding exp dispatch changes accordingly:

  • Step 1 ($\sim$53 bit): exp_small (Q128 Taylor)
  • Step 2 ($\sim$150 bit): exp_medium (Q256 Taylor) or expDoubling
  • Step 3 ($\sim$450 bit): expDoubling

The exp computation at each step is performed at the current step's precision, not the target precision, so early steps are very inexpensive.

AGM-Based log

For precision exceeding 6000 bits, the Arithmetic-Geometric Mean (AGM) method is used. Between 3001 and 6000 bits, the prime-power decomposition method (log_multiprime) serves as an intermediate tier.

Definition of AGM

For $a_0, b_0 > 0$, define the iteration:

$$a_{n+1} = \frac{a_n + b_n}{2}, \quad b_{n+1} = \sqrt{a_n \cdot b_n}$$

$a_n$ and $b_n$ converge quadratically to a common limit $M(a_0, b_0)$. Each step requires one multiplication and one square root, costing $O(M(p))$ at $p$-bit precision, with $O(\log p)$ steps total, yielding an overall complexity of $O(M(p) \log p)$.
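
A double-precision sketch showing the quadratic convergence (the starting pair is arbitrary):

```cpp
#include <cmath>
#include <cstdio>

// AGM iteration: a and b converge quadratically to a common limit; the
// number of agreeing digits roughly doubles each step.
int main() {
    double a = 1.0, b = 0.01;             // arbitrary positive starting pair
    for (int step = 1; step <= 6; ++step) {
        double an = 0.5 * (a + b);        // arithmetic mean
        b = std::sqrt(a * b);             // geometric mean
        a = an;
        std::printf("step %d: a = %.17g, |a - b| = %.3g\n",
                    step, a, std::fabs(a - b));
    }
}
```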

Application to Logarithm

The following formula, which becomes exact in the limit of large $s$, reduces $\log(x)$ to the AGM:

$$\log(x) = \frac{\pi}{2 \cdot \operatorname{AGM}(1,\, 4/s)} - m \cdot \log 2$$

Here $s = x \cdot 2^m$, where $m$ is a parameter that makes $s$ large enough for the formula to hold at the target precision: the approximation error decreases roughly as $1/s^2$, so $m$ is chosen so that $s \gtrsim 2^{p/2}$ for a target precision of $p$ bits. The more extreme starting ratio of the AGM values $(1, 4/s)$ costs only a few extra iterations, growing logarithmically in $m$.

This formula requires $\pi$ and $\log 2$, but these are held in thread_local caches, so recomputation at the same precision does not occur.
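
A double-precision sketch of the reduction (the choice $m = 40$ and the convergence tolerance are illustrative assumptions; the real implementation derives $m$ from the target precision):

```cpp
#include <cmath>
#include <cstdio>

// AGM-based log:  log(x) ~= pi / (2 * AGM(1, 4/s)) - m * log 2,  s = x * 2^m,
// with m chosen so the approximation error (roughly ~ 1/s^2) is negligible.
double agm(double a, double b) {
    for (int i = 0; i < 60 && std::fabs(a - b) > 1e-17 * a; ++i) {
        double an = 0.5 * (a + b);
        b = std::sqrt(a * b);
        a = an;
    }
    return a;
}

double log_agm(double x) {
    const double pi  = 3.141592653589793;
    const double ln2 = 0.6931471805599453;
    const int m = 40;                         // push s = x * 2^m up to ~2^40
    double s = std::ldexp(x, m);
    return pi / (2.0 * agm(1.0, 4.0 / s)) - m * ln2;
}

int main() {
    std::printf("%.17g %.17g\n", log_agm(10.0), std::log(10.0));
}
```

In double precision the result typically agrees with std::log to around 14 digits; the mild cancellation in the final subtraction is one reason the high-precision code works with guard bits.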

Threshold Settings

Halley iteration has cubic convergence, but each step requires calling exp, making the per-step cost high. The AGM method consists only of multiplication and square root operations per step, with asymptotic complexity $O(M(p) \log p)$.

Based on empirical benchmarks, the following thresholds have been established:

  • 3000 bits ($\sim$900 digits): crossover between Halley iteration and the prime-power decomposition method (log_multiprime). In the 250--900 digit range, Halley is 30--40% faster than the prime-power decomposition method
  • 6000 bits ($\sim$1800 digits): crossover between the prime-power decomposition method and AGM

Computing log2 and log10

$\log_2(x)$ and $\log_{10}(x)$ can be reduced to $\log(x)$ via the change-of-base formula, but direct computation is faster at low precision.

| Precision | Method |
| --- | --- |
| $\leq 64$ bits | Q128 dedicated path (log2_small / log10_small) |
| 65 -- 128 bits | Q256 atanh (log2_medium / log10_medium) |
| 129 -- 2000 bits | Halley iteration (log2_newton / log10_newton) |
| $> 2000$ bits | $\log(x) / \log(2)$ or $\log(x) / \log(10)$ (using the constant cache) |

The low-precision paths hold $\log_2 e$ and $\log_{10} e$ as compile-time constants in Q128/Q256 fixed point and multiply the atanh-series result by them. This eliminates dependence on the constant cache and avoids circular calls during cache initialization.

The high-precision path computes $\log(x)$ first, then divides by $\log 2$ or $\log 10$ held in the thread_local cache. Once computed, these constants are not recomputed at the same or lower precision.
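
A double-precision sketch of the low-precision idea, multiplying the atanh-series result by a hard-coded $\log_2 e$ standing in for the Q128/Q256 compile-time constant:

```cpp
#include <cmath>
#include <cstdio>

// Direct log2: natural log of the mantissa via the atanh series, then a
// multiply by a precomputed log2(e) instead of a division by log(2).
// The exponent k contributes exactly, since log2(2^k) = k.
double log2_direct(double x) {
    const double log2e = 1.4426950408889634;   // compile-time-style constant
    int k;
    double f = std::frexp(x, &k) * 2.0;        // f in [1, 2)
    k -= 1;

    double t = (f - 1.0) / (f + 1.0), t2 = t * t, term = t, sum = 0.0;
    for (int n = 0; n < 17; ++n) {             // ln(f) = 2 * atanh(t)
        sum  += term / (2 * n + 1);
        term *= t2;
    }
    return 2.0 * sum * log2e + k;              // log2(x) = ln(f)*log2(e) + k
}

int main() {
    std::printf("%.17g %.17g\n", log2_direct(10.0), std::log2(10.0));
}
```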

Design Decisions

Direct mpn Computation via RawFloat

Inside the Taylor series loop, multiplication, division, and addition occur at every iteration. Using Float objects would require constructing and destroying the Int mantissa at each operation, making heap allocation dominant. RawFloat consists only of d (mpn pointer), nw (word count), and exp (exponent), and by calling mpn functions directly, it achieves zero-allocation Taylor loops.

NTT Cache (rf_mul_cached)

In Halley iteration, $\exp(y_n)$ appears in both the numerator and denominator. rf_mul_cached caches the NTT forward transform of an operand, so the second and subsequent multiplications involving that same operand skip recomputing its forward transform. This reduces the multiplication cost per Halley iteration step.

Precision Tracking via effective_bits_

Float internally maintains effective_bits_ to track the effective precision of each value. For example, computing exp at 1000-bit precision on a 64-bit precision input still yields only 64-bit precision. Thanks to effective_bits_, low-precision dispatch paths are selected for low-precision inputs, avoiding unnecessary high-precision computation.

References

  • Brent, R. P. and Zimmermann, P. (2010). Modern Computer Arithmetic. Cambridge University Press. Chapter 4: Elementary and Special Function Evaluation.
  • Paterson, M. S. and Stockmeyer, L. J. (1973). "On the Number of Nonscalar Multiplications Necessary to Evaluate Polynomials". SIAM Journal on Computing, 2(1), 60--66.
  • Borwein, J. M. and Borwein, P. B. (1987). Pi and the AGM. Wiley.
  • Muller, J.-M. et al. (2018). Handbook of Floating-Point Arithmetic. 2nd ed. Birkhäuser. Chapter 11: Evaluating Floating-Point Elementary Functions.

Related article: Float Square Root Algorithm (integer reduction, correct rounding, low-precision fast path)