Float Arithmetic Algorithms
Overview
Float is a class that represents arbitrary-precision floating-point numbers. Internally, it consists of three components:
$$\text{value} = \text{mantissa} \times 2^{\text{exponent}}$$
- mantissa (Int): The significand, stored as an arbitrary-precision integer.
- exponent (int64_t): The exponent, representing the power-of-two scale factor.
- sign: The sign bit, managed independently of the mantissa.
Precision is specified in decimal digits. The default precision is default_precision_ = 53 bits (approximately 16 decimal digits, matching a double's significand),
managed as a thread_local variable.
Internally, the precisionToBits function converts decimal digits to bits:
$$\text{bits} = \lceil \text{precision} \times 3.32192809488736 \rceil = \lceil \text{precision} \times \log_2 10 \rceil$$
The result of each operation is rounded to the target precision by setResultPrecision().
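The digit-to-bit conversion above can be sketched in a few lines. This is an illustrative standalone version, not the calx `precisionToBits` itself; the function name and use of `std::ceil` are assumptions:

```cpp
#include <cstdint>
#include <cmath>

// Illustrative sketch: convert a precision in decimal digits to bits,
// rounding up so the binary mantissa can represent every value with
// that many decimal digits. log2(10) ≈ 3.32192809488736.
inline int64_t precision_to_bits(int64_t decimal_digits) {
    return static_cast<int64_t>(std::ceil(decimal_digits * 3.32192809488736));
}
```

For example, 16 decimal digits map to 54 bits, slightly more than a double's 53-bit significand.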
The rounding mode is specified via the RoundingMode enumeration, which provides four modes:
- ToNearest (default): Round-To-Nearest-Even
- Truncation: Truncation toward zero
- Up: Round toward positive infinity
- Down: Round toward negative infinity
The effective_bits_ field tracks the actual effective precision of each value.
If the input has low precision (e.g., a 53-bit value converted from double),
effectiveComputePrecision automatically reduces the computation precision,
avoiding unnecessarily high-precision computation.
For usage details, see the Float API Reference.
Addition and Subtraction
Floating-point addition and subtraction involve aligning the mantissas of two operands with different exponents
and performing integer-level addition or subtraction.
Same-sign addition is delegated to addUnsigned, and opposite-sign addition (effectively subtraction) to subtractUnsigned.
Exponent Alignment
Consider the addition of two operands $a = m_a \times 2^{e_a}$ and $b = m_b \times 2^{e_b}$ ($e_a \geq e_b$). The exponent difference $\Delta e = e_a - e_b$ is computed, and $m_a$ is left-shifted by $\Delta e$ bits so that both mantissas share the smaller exponent $e_b$:
$$a + b = (m_a \cdot 2^{\Delta e} + m_b) \times 2^{e_b}$$
In the implementation, rather than shifting the smaller mantissa, the larger mantissa is left-shifted to preserve precision. The computation precision is determined by $\max(\text{lhs\_bits}, \text{rhs\_bits}) + \text{guard bits}$.
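The alignment step can be sketched with 64-bit toy mantissas (the real code operates on arbitrary-precision Int mantissas, and the function name here is illustrative):

```cpp
#include <cstdint>
#include <utility>

// Toy sketch of exponent alignment: given a = ma * 2^ea and
// b = mb * 2^eb, left-shift the mantissa with the larger exponent by
// delta = ea - eb so both share the smaller exponent, then add the
// mantissas as integers. Assumes the shifted value fits in 64 bits.
std::pair<uint64_t, int64_t> add_aligned(uint64_t ma, int64_t ea,
                                         uint64_t mb, int64_t eb) {
    if (ea < eb) { std::swap(ma, mb); std::swap(ea, eb); }
    int64_t delta = ea - eb;            // exponent difference
    uint64_t sum = (ma << delta) + mb;  // align to 2^eb, then integer add
    return {sum, eb};                   // result = sum * 2^eb
}
```

For instance, adding $1.5 = 3 \times 2^{-1}$ and $0.25 = 1 \times 2^{-2}$ yields $(3 \cdot 2 + 1) \times 2^{-2} = 7 \times 2^{-2} = 1.75$.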
mpn-Level Addition and Subtraction
After alignment, the mantissas are processed by mpn-level add / sub operations.
Word-by-word carry/borrow propagation enables efficient addition and subtraction of arbitrary length.
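A minimal sketch of such word-by-word carry propagation, using a `std::vector` of 64-bit words (least-significant word first) rather than the raw limb arrays an mpn layer would use:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Sketch of mpn-style multi-word addition: add corresponding words and
// propagate the carry. Words are stored least-significant first.
std::vector<uint64_t> mpn_add(const std::vector<uint64_t>& a,
                              const std::vector<uint64_t>& b) {
    std::vector<uint64_t> r;
    uint64_t carry = 0;
    for (std::size_t i = 0; i < a.size() || i < b.size() || carry; ++i) {
        uint64_t x = i < a.size() ? a[i] : 0;
        uint64_t y = i < b.size() ? b[i] : 0;
        uint64_t s = x + y;    // may wrap modulo 2^64
        uint64_t c1 = s < x;   // carry out of x + y
        uint64_t t = s + carry;
        uint64_t c2 = t < s;   // carry out of adding the incoming carry
        r.push_back(t);
        carry = c1 | c2;
    }
    return r;
}
```

Note how adding 1 to a word holding $2^{64}-1$ wraps to 0 and pushes a carry into the next word.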
Optimization via rvalue References
operator+ and operator- provide rvalue reference overloads.
When a temporary object is passed, its internal buffer is reused,
reducing the cost of heap allocation.
Catastrophic Cancellation
When subtracting numbers of nearly equal magnitude, the upper bits cancel out,
causing a severe loss of significant digits.
subtractUnsigned detects the number of leading zeros in the result
and reduces effective_bits_ accordingly.
This information is referenced by subsequent operations to avoid computation beyond the lost precision.
Multiplication
Float multiplication consists of three steps: integer multiplication of the mantissas, addition of the exponents, and determination of the sign.
Mantissa Multiplication
Mantissa multiplication reduces directly to Int multiplication.
Int multiplication automatically selects
Basecase / Karatsuba / Toom-Cook-3 / Toom-Cook-4 / NTT based on the operand word count
(see Int Multiplication Algorithms).
Exponent and Sign
$$\text{result.exponent} = \text{lhs.exponent} + \text{rhs.exponent}$$
The sign is determined by XOR (positive if same sign, negative if different signs).
Precision Limiting
Multiplication of two $n$-bit mantissas produces a result of up to $2n$ bits,
but lower bits beyond the target precision are unnecessary.
setResultPrecision rounds the result to the target precision
and discards the excess bits.
Small Buffer Optimization (SBO)
When the resulting mantissa fits in 2 words (128 bits) or fewer,
Int's SBO keeps the mantissa on the stack,
eliminating heap allocation.
For double-precision (53-bit) multiplication,
the result is 106 bits $\leq$ 128 bits, benefiting from SBO.
Division
Float division obtains a quotient at the target precision by
left-shifting the dividend's mantissa sufficiently and then performing Int division.
Basic Procedure
To compute $a \div b$, the dividend $a$'s mantissa $m_a$ is left-shifted by the target precision plus guard bits:
$$q = \lfloor m_a \cdot 2^{\text{shift}} \mathbin{/} m_b \rfloor$$
The exponent is adjusted as follows:
$$\text{result.exponent} = e_a - e_b - \text{shift}$$
This technique reduces floating-point division to integer division.
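The reduction can be sketched with 64-bit toy mantissas (illustrative names; the real code works on Int mantissas and chooses the shift from the target precision plus guard bits):

```cpp
#include <cstdint>

struct QF { uint64_t q; int64_t e; };  // value = q * 2^e

// Sketch: left-shift the dividend mantissa, take the integer quotient,
// and subtract the shift from the exponent. Assumes ma << shift fits
// in 64 bits.
QF div_float(uint64_t ma, int64_t ea, uint64_t mb, int64_t eb, int shift) {
    uint64_t q = (ma << shift) / mb;   // q = floor(ma * 2^shift / mb)
    return { q, ea - eb - shift };     // result = q * 2^(ea - eb - shift)
}
```

Dividing 1 by 3 with an 8-bit shift, for instance, yields $\lfloor 256 / 3 \rfloor = 85$ at exponent $-8$, i.e. $85/256 \approx 0.332$.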
Int division automatically selects Schoolbook (div_basecase) for fewer than 64 words
and Burnikel-Ziegler for 64 words or more
(see Int Division Algorithms).
Single-Limb Divisor Fast Path
When the divisor's mantissa fits in a single limb (64 bits),
divmod_1 bypasses the general division routine.
divmod_1 simply iterates single-word divisions,
avoiding the recursive overhead of Burnikel-Ziegler and the allocation of temporary buffers.
This path is frequently taken for integer literals and small divisors originating from double.
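The shape of a divmod_1-style loop is easy to show in isolation. This sketch (not the calx routine) divides a multi-word number by one limb, propagating the remainder from the most significant word down; it uses the GCC/Clang `unsigned __int128` extension for the two-word-by-one-word divide:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Sketch of single-limb division: n holds words least-significant
// first; the quotient replaces n in place and the remainder is
// returned. No recursion, no temporary buffers.
uint64_t divmod_1(std::vector<uint64_t>& n, uint64_t d) {
    unsigned __int128 rem = 0;
    for (std::size_t i = n.size(); i-- > 0; ) {
        unsigned __int128 cur = (rem << 64) | n[i];
        n[i] = (uint64_t)(cur / d);   // quotient word in place
        rem  = cur % d;               // carry remainder downward
    }
    return (uint64_t)rem;
}
```

Dividing $2^{64}$ by 3, for example, produces quotient 6148914691236517205 with remainder 1.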
Relationship with Newton Reciprocal Iteration
The calx Float division does not directly use Newton reciprocal iteration.
Newton reciprocal iteration (mu_div_qr) is used indirectly within Burnikel-Ziegler
for large-precision Int division.
Specifically, for unbalanced division where the word-count ratio between dividend and divisor is large,
Newton's method computes an approximate reciprocal of the divisor and reduces the problem to multiplication
via a dynamic threshold scheme.
3-Argument FloatOps API
FloatOps is a class that provides the four basic arithmetic operations on Float in a 3-argument form.
While the usual operator+ and similar operators construct and return a new object,
the 3-argument versions write the result directly into an existing buffer,
eliminating the overhead of object construction and heap allocation.
```cpp
// Standard operator version
Float c = a * b;        // temporary object construction + copy/move

// FloatOps 3-argument version
FloatOps::mul(a, b, c); // writes directly into c's buffer
```
FloatOps::mul
The 3-argument multiplication writes the mpn multiplication result
directly into result.mantissa_'s word array.
With the standard operator*, the multiplication result is first stored in a temporary stack buffer
and then copied to the result object.
The 3-argument version completely eliminates this copy.
FloatOps::add
The 3-argument addition directly implements the general case including exponent differences ($\text{exp\_diff} \neq 0$).
It uses a stack buffer of BUF_LIMIT = 128 words,
completely avoiding heap allocation for mantissas of 128 words (approximately $2{,}400$ digits) or fewer.
FloatOps::div
The 3-argument division reuses the result buffer and also includes
the divmod_1 fast path for single-limb divisors.
The improvement from the 3-argument API is most pronounced for low-precision operations, because the cost of object construction and heap allocation is relatively large compared to the operation itself.
Rounding and Precision
setResultPrecision
setResultPrecision() is called at the end of every arithmetic operation
to round the mantissa to the target number of bits. Specifically:
- If the current mantissa bit count exceeds the target, the excess bits are right-shifted
- The discarded bit pattern is examined and the rounding direction is determined according to the rounding mode
- If necessary, 1 is added to the mantissa (round-up)
- The exponent is adjusted by adding the shift amount
Round-To-Nearest-Even
The default rounding mode is Round-To-Nearest-Even. When the discarded bits are exactly at the midpoint, the result is rounded in the direction that makes the least significant bit even. This is identical to the IEEE 754 default rounding mode and minimizes statistical rounding bias.
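The tie-breaking rule can be sketched on a 64-bit mantissa. This is an illustrative standalone function, not the calx rounding code:

```cpp
#include <cstdint>

// Sketch of round-to-nearest-even when dropping the low k bits
// (assumes 1 <= k < 64): round up if the dropped bits exceed the
// halfway point, or are exactly halfway and the kept LSB is odd.
uint64_t round_nearest_even(uint64_t m, int k) {
    if (k <= 0) return m;
    uint64_t kept    = m >> k;
    uint64_t dropped = m & ((uint64_t(1) << k) - 1);
    uint64_t half    = uint64_t(1) << (k - 1);
    if (dropped > half || (dropped == half && (kept & 1)))
        ++kept;  // round up; exact ties land on the even value
    return kept;
}
```

Dropping 2 bits, the tie 0b1010 rounds down to 0b10 (already even) while the tie 0b1110 rounds up to 0b100, so ties alternate direction instead of always rounding up.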
Round-Up Optimization
When rounding up is needed, writing mantissa_ + Int(1) incurs
the construction of a temporary Int(1) object and a multi-precision addition.
calx uses IntOps::addDelta(mantissa_, 1) instead,
performing only an immediate addition to the least significant word followed by carry propagation.
This completely eliminates the cost of temporary object construction.
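An addDelta-style increment amounts to adding an immediate to the lowest word and rippling the carry. This is a sketch of the idea, not the IntOps API, which may differ in signature and representation:

```cpp
#include <cstdint>
#include <vector>
#include <cstddef>

// Sketch: add a small immediate to the least significant word and
// propagate the carry in place; no temporary big-integer object.
// Words are stored least-significant first.
void add_delta(std::vector<uint64_t>& words, uint64_t delta) {
    uint64_t carry = delta;
    for (std::size_t i = 0; i < words.size() && carry; ++i) {
        words[i] += carry;
        carry = words[i] < carry;  // wrapped => carry out
    }
    if (carry) words.push_back(carry);
}
```

In the common case the loop touches exactly one word and exits immediately, which is why this beats constructing an `Int(1)` and running a full multi-precision add.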
normalize
normalize() removes leading zeros from the mantissa and adjusts the exponent accordingly.
It is called when upper bits cancel out during addition/subtraction or when leading zeros appear after rounding.
The clz (count leading zeros) instruction is used to quickly detect the number of leading zero words and bits.
Precision Tracking via effective_bits_
effective_bits_ represents the actual effective precision held by each Float value.
For example, a value converted from double has effective_bits_ = 53.
effectiveComputePrecision computes the required computation precision
from the effective_bits_ of the two operands:
$$\text{compute\_precision} = \min(\text{target\_precision},\; \max(\text{lhs.effective\_bits\_},\; \text{rhs.effective\_bits\_}) + \text{guard})$$
This means that even if 100,000-digit precision is set,
computation proceeds at 53-bit precision when the inputs originate from double,
avoiding wasteful high-precision computation.
The effective_bits_ of the result is appropriately propagated according to the type of operation.
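The precision cap itself is a one-liner. This sketch uses an illustrative name and an assumed guard-bit count of 32, which may differ from the calx default:

```cpp
#include <cstdint>
#include <algorithm>

// Sketch: never compute at more precision than the inputs actually
// carry (plus guard bits), nor more than the requested target.
// The guard value of 32 is an assumption for illustration.
int64_t effective_compute_precision(int64_t target, int64_t lhs_eff,
                                    int64_t rhs_eff, int64_t guard = 32) {
    return std::min(target, std::max(lhs_eff, rhs_eff) + guard);
}
```

With a huge target (say 332,193 bits for 100,000 digits) but two 53-bit-effective inputs, the computation runs at only 53 + guard bits; when the target is small, the target wins.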