Float Arithmetic Algorithms
Overview
Float is a class that represents arbitrary-precision floating-point numbers. Internally, it consists of three components:
$$\text{value} = \text{mantissa} \times 2^{\text{exponent}}$$
- mantissa (Int): The significand, stored as an arbitrary-precision integer.
- exponent (int64_t): The exponent, representing the power-of-two scale factor.
- sign: The sign bit, managed independently of the mantissa.
Precision is specified in decimal digits. The default precision is default_precision_ = 53 bits (approximately 16 decimal digits, matching a double's significand),
managed as a thread_local variable.
Internally, the precisionToBits function converts decimal digits to bits:
$$\text{bits} = \lceil \text{precision} \times 3.32192809488736 \rceil = \lceil \text{precision} \times \log_2 10 \rceil$$
The result of each operation is rounded to the target precision by setResultPrecision().
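The digit-to-bit conversion above can be sketched in a few lines. This is an illustrative standalone version, not the calx `precisionToBits` itself; the function name and use of `std::ceil` are assumptions:

```cpp
#include <cstdint>
#include <cmath>

// Illustrative sketch: convert a precision in decimal digits to bits,
// rounding up so the binary mantissa can represent every value with
// that many decimal digits. log2(10) ≈ 3.32192809488736.
inline int64_t precision_to_bits(int64_t decimal_digits) {
    return static_cast<int64_t>(std::ceil(decimal_digits * 3.32192809488736));
}
```

For example, 16 decimal digits map to 54 bits, slightly more than a double's 53-bit significand.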
The rounding mode is specified via the RoundingMode enumeration, which provides four modes:
- ToNearest (default): Round-To-Nearest-Even
- Truncation: Truncation toward zero
- Up: Round toward positive infinity
- Down: Round toward negative infinity
The effective_bits_ field tracks the actual effective precision of each value.
If the input has low precision (e.g., a 53-bit value converted from double),
effectiveComputePrecision automatically reduces the computation precision,
avoiding unnecessarily high-precision computation.
For usage details, see the Float API Reference.
Addition and Subtraction
Floating-point addition and subtraction involve aligning the mantissas of two operands with different exponents
and performing integer-level addition or subtraction.
Same-sign addition is delegated to addUnsigned, and opposite-sign addition (effectively subtraction) to subtractUnsigned.
Exponent Alignment
Consider the addition of two operands $a = m_a \times 2^{e_a}$ and $b = m_b \times 2^{e_b}$ ($e_a \geq e_b$). The exponent difference $\Delta e = e_a - e_b$ is computed, and $m_a$ is left-shifted by $\Delta e$ bits so that both mantissas share the smaller exponent $e_b$:
$$a + b = (m_a \cdot 2^{\Delta e} + m_b) \times 2^{e_b}$$
In the implementation, rather than shifting the smaller mantissa, the larger mantissa is left-shifted to preserve precision. The computation precision is determined by $\max(\text{lhs\_bits}, \text{rhs\_bits}) + \text{guard bits}$.
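The alignment step can be sketched with 64-bit toy mantissas (the real code operates on arbitrary-precision Int mantissas, and the function name here is illustrative):

```cpp
#include <cstdint>
#include <utility>

// Toy sketch of exponent alignment: given a = ma * 2^ea and
// b = mb * 2^eb, left-shift the mantissa with the larger exponent by
// delta = ea - eb so both share the smaller exponent, then add the
// mantissas as integers. Assumes the shifted value fits in 64 bits.
std::pair<uint64_t, int64_t> add_aligned(uint64_t ma, int64_t ea,
                                         uint64_t mb, int64_t eb) {
    if (ea < eb) { std::swap(ma, mb); std::swap(ea, eb); }
    int64_t delta = ea - eb;            // exponent difference
    uint64_t sum = (ma << delta) + mb;  // align to 2^eb, then integer add
    return {sum, eb};                   // result = sum * 2^eb
}
```

For instance, adding $1.5 = 3 \times 2^{-1}$ and $0.25 = 1 \times 2^{-2}$ yields $(3 \cdot 2 + 1) \times 2^{-2} = 7 \times 2^{-2} = 1.75$.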
mpn-Level Addition and Subtraction
After alignment, the mantissas are processed by mpn-level add / sub operations.
Word-by-word carry/borrow propagation enables efficient addition and subtraction of arbitrary length.
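A minimal sketch of such word-by-word carry propagation, using a `std::vector` of 64-bit words (least-significant word first) rather than the raw limb arrays an mpn layer would use:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Sketch of mpn-style multi-word addition: add corresponding words and
// propagate the carry. Words are stored least-significant first.
std::vector<uint64_t> mpn_add(const std::vector<uint64_t>& a,
                              const std::vector<uint64_t>& b) {
    std::vector<uint64_t> r;
    uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        uint64_t x = i < a.size() ? a[i] : 0;
        uint64_t y = i < b.size() ? b[i] : 0;
        uint64_t s = x + y;    // may wrap modulo 2^64
        uint64_t c1 = s < x;   // carry out of x + y
        uint64_t t = s + carry;
        uint64_t c2 = t < s;   // carry out of adding the incoming carry
        r.push_back(t);
        carry = c1 | c2;
    }
    return r;
}
```

Note how adding 1 to a word holding $2^{64}-1$ wraps to 0 and pushes a carry into the next word.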
Optimization via rvalue References
operator+ and operator- provide rvalue reference overloads.
When a temporary object is passed, its internal buffer is reused,
reducing the cost of heap allocation.
Catastrophic Cancellation
When subtracting numbers of nearly equal magnitude, the upper bits cancel out,
causing a severe loss of significant digits.
subtractUnsigned detects the number of leading zeros in the result
and reduces effective_bits_ accordingly.
This information is referenced by subsequent operations to avoid computation beyond the lost precision.
Multiplication
Float multiplication consists of three steps: integer multiplication of the mantissas, addition of the exponents, and determination of the sign.
Mantissa Multiplication
Mantissa multiplication reduces directly to Int multiplication.
Int multiplication automatically selects
Basecase / Karatsuba / Toom-Cook-3 / Toom-Cook-4 / NTT based on the operand word count
(see Int Multiplication Algorithms).
Exponent and Sign
$$\text{result.exponent} = \text{lhs.exponent} + \text{rhs.exponent}$$
The sign is determined by XOR (positive if same sign, negative if different signs).
Precision Limiting
Multiplication of two $n$-bit mantissas produces a result of up to $2n$ bits,
but lower bits beyond the target precision are unnecessary.
setResultPrecision rounds the result to the target precision
and discards the excess bits.
Small Buffer Optimization (SBO)
When the resulting mantissa fits in 2 words (128 bits) or fewer,
Int's SBO keeps the mantissa on the stack,
eliminating heap allocation.
For double-precision (53-bit) multiplication,
the result is 106 bits $\leq$ 128 bits, benefiting from SBO.
Division
Float division obtains a quotient at the target precision by
left-shifting the dividend's mantissa sufficiently and then performing Int division.
Basic Procedure
To compute $a \div b$, the dividend $a$'s mantissa $m_a$ is left-shifted by the target precision plus guard bits:
$$q = \lfloor m_a \cdot 2^{\text{shift}} \mathbin{/} m_b \rfloor$$
The exponent is adjusted as follows:
$$\text{result.exponent} = e_a - e_b - \text{shift}$$
This technique reduces floating-point division to integer division.
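The reduction can be sketched with 64-bit toy mantissas (illustrative names; the real code works on Int mantissas and chooses the shift from the target precision plus guard bits):

```cpp
#include <cstdint>

struct QF { uint64_t q; int64_t e; };  // value = q * 2^e

// Sketch: left-shift the dividend mantissa, take the integer quotient,
// and subtract the shift from the exponent. Assumes ma << shift fits
// in 64 bits.
QF div_float(uint64_t ma, int64_t ea, uint64_t mb, int64_t eb, int shift) {
    uint64_t q = (ma << shift) / mb;   // q = floor(ma * 2^shift / mb)
    return { q, ea - eb - shift };     // result = q * 2^(ea - eb - shift)
}
```

Dividing 1 by 3 with an 8-bit shift, for instance, yields $\lfloor 256 / 3 \rfloor = 85$ at exponent $-8$, i.e. $85/256 \approx 0.332$.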
Int division automatically selects Schoolbook (div_basecase) for fewer than 64 words
and Burnikel-Ziegler for 64 words or more
(see Int Division Algorithms).
Single-Limb Divisor Fast Path
When the divisor's mantissa fits in a single limb (64 bits),
divmod_1 bypasses the general division routine.
divmod_1 simply iterates single-word divisions,
avoiding the recursive overhead of Burnikel-Ziegler and the allocation of temporary buffers.
This path is frequently taken for integer literals and small divisors originating from double.
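The shape of a divmod_1-style loop is easy to show in isolation. This sketch (not the calx routine) divides a multi-word number by one limb, propagating the remainder from the most significant word down; it uses the GCC/Clang `unsigned __int128` extension for the two-word-by-one-word divide:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Sketch of single-limb division: n holds words least-significant
// first; the quotient replaces n in place and the remainder is
// returned. No recursion, no temporary buffers.
uint64_t divmod_1(std::vector<uint64_t>& n, uint64_t d) {
    unsigned __int128 rem = 0;
    for (std::size_t i = n.size(); i-- > 0; ) {
        unsigned __int128 cur = (rem << 64) | n[i];
        n[i] = (uint64_t)(cur / d);   // quotient word in place
        rem  = cur % d;               // carry remainder downward
    }
    return (uint64_t)rem;
}
```

Dividing $2^{64}$ by 3, for example, produces quotient 6148914691236517205 with remainder 1.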
Relationship with Newton Reciprocal Iteration
The calx Float division does not directly use Newton reciprocal iteration.
Newton reciprocal iteration (mu_div_qr) is used indirectly within Burnikel-Ziegler
for large-precision Int division.
Specifically, for unbalanced division where the word-count ratio between dividend and divisor is large,
Newton's method computes an approximate reciprocal of the divisor and reduces the problem to multiplication
via a dynamic threshold scheme.
3-Argument FloatOps API
FloatOps is a class that provides the four basic arithmetic operations on Float in a 3-argument form.
While the usual operator+ and similar operators construct and return a new object,
the 3-argument versions write the result directly into an existing buffer,
eliminating the overhead of object construction and heap allocation.
```cpp
// Standard operator version
Float c = a * b;        // temporary object construction + copy/move

// FloatOps 3-argument version
FloatOps::mul(a, b, c); // writes directly into c's buffer
```
FloatOps::mul
The 3-argument multiplication writes the mpn multiplication result
directly into result.mantissa_'s word array.
With the standard operator*, the multiplication result is first stored in a temporary stack buffer
and then copied to the result object.
The 3-argument version completely eliminates this copy.
FloatOps::add
The 3-argument addition directly implements the general case including exponent differences ($\text{exp\_diff} \neq 0$).
It uses a stack buffer of BUF_LIMIT = 128 words,
completely avoiding heap allocation for mantissas of 128 words (approximately $2{,}400$ digits) or fewer.
FloatOps::div
The 3-argument division reuses the result buffer and also includes
the divmod_1 fast path for single-limb divisors.
The improvement from the 3-argument API is most pronounced for low-precision operations, because the cost of object construction and heap allocation is relatively large compared to the operation itself.
Rounding and Precision
setResultPrecision
setResultPrecision() is called at the end of every arithmetic operation
to round the mantissa to the target number of bits. Specifically:
- If the current mantissa bit count exceeds the target, the excess bits are right-shifted
- The discarded bit pattern is examined and the rounding direction is determined according to the rounding mode
- If necessary, 1 is added to the mantissa (round-up)
- The exponent is adjusted by adding the shift amount
Round-To-Nearest-Even
The default rounding mode is Round-To-Nearest-Even. When the discarded bits are exactly at the midpoint, the result is rounded in the direction that makes the least significant bit even. This is identical to the IEEE 754 default rounding mode and minimizes statistical rounding bias.
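The tie-breaking rule can be sketched on a 64-bit mantissa. This is an illustrative standalone function, not the calx rounding code:

```cpp
#include <cstdint>

// Sketch of round-to-nearest-even when dropping the low k bits
// (assumes 1 <= k < 64): round up if the dropped bits exceed the
// halfway point, or are exactly halfway and the kept LSB is odd.
uint64_t round_nearest_even(uint64_t m, int k) {
    if (k <= 0) return m;
    uint64_t kept    = m >> k;
    uint64_t dropped = m & ((uint64_t(1) << k) - 1);
    uint64_t half    = uint64_t(1) << (k - 1);
    if (dropped > half || (dropped == half && (kept & 1)))
        ++kept;  // round up; exact ties land on the even value
    return kept;
}
```

Dropping 2 bits, the tie 0b1010 rounds down to 0b10 (already even) while the tie 0b1110 rounds up to 0b100, so ties alternate direction instead of always rounding up.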
Round-Up Optimization
When rounding up is needed, writing mantissa_ + Int(1) incurs
the construction of a temporary Int(1) object and a multi-precision addition.
calx uses IntOps::addDelta(mantissa_, 1) instead,
performing only an immediate addition to the least significant word followed by carry propagation.
This completely eliminates the cost of temporary object construction.
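An addDelta-style increment amounts to adding an immediate to the lowest word and rippling the carry. This is a sketch of the idea, not the IntOps API, which may differ in signature and representation:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Sketch: add a small immediate to the least significant word and
// propagate the carry in place; no temporary big-integer object.
// Words are stored least-significant first.
void add_delta(std::vector<uint64_t>& words, uint64_t delta) {
    uint64_t carry = delta;
    for (std::size_t i = 0; i < words.size() && carry; ++i) {
        words[i] += carry;
        carry = words[i] < carry;  // wrapped => carry out
    }
    if (carry) words.push_back(carry);
}
```

In the common case the loop touches exactly one word and exits immediately, which is why this beats constructing an `Int(1)` and running a full multi-precision add.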
normalize
normalize() removes leading zeros from the mantissa and adjusts the exponent accordingly.
It is called when upper bits cancel out during addition/subtraction or when leading zeros appear after rounding.
The clz (count leading zeros) instruction is used to quickly detect the number of leading zero words and bits.
Precision Tracking via effective_bits_
effective_bits_ represents the actual effective precision held by each Float value.
For example, a value converted from double has effective_bits_ = 53.
effectiveComputePrecision computes the required computation precision
from the effective_bits_ of the two operands:
$$\text{compute\_precision} = \min(\text{target\_precision},\; \max(\text{lhs.effective\_bits\_},\; \text{rhs.effective\_bits\_}) + \text{guard})$$
This means that even if 100,000-digit precision is set,
computation proceeds at 53-bit precision when the inputs originate from double,
avoiding wasteful high-precision computation.
The effective_bits_ of the result is appropriately propagated according to the type of operation.
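The precision cap itself is a one-liner. This sketch uses an illustrative name and an assumed guard-bit count of 32, which may differ from the calx default:

```cpp
#include <cstdint>
#include <algorithm>

// Sketch: never compute at more precision than the inputs actually
// carry (plus guard bits), nor more than the requested target.
// The guard value of 32 is an assumption for illustration.
int64_t effective_compute_precision(int64_t target, int64_t lhs_eff,
                                    int64_t rhs_eff, int64_t guard = 32) {
    return std::min(target, std::max(lhs_eff, rhs_eff) + guard);
}
```

With a huge target (say 332,193 bits for 100,000 digits) but two 53-bit-effective inputs, the computation runs at only 53 + guard bits; when the target is small, the target wins.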