How does calx Float sqrt achieve correct rounding?

It computes the root and remainder exactly using integer sqrtRem, and if the remainder is nonzero, sets the LSB of the root to 1 (the sticky bit method). This information guarantees correct rounding for all rounding modes, including IEEE 754 Round-To-Nearest-Even.

How does Float sqrt reduce to integer square root?

For a floating-point number m*2^e with mantissa m and exponent e, the mantissa is scaled to 2(p+1) bits and the exponent is adjusted to be even. Since sqrt(m*2^(2k)) = sqrt(m)*2^k, computing integer sqrtRem of m yields the floating-point square root.

What are the low-precision fast paths in Float sqrt?

For precision up to 64 bits, sqrt_newton_1 is used (std::sqrt initial approximation + one Newton iteration + _udiv128/_umul128 intrinsics). For precision up to 128 bits, sqrt_small_2 is used (direct dc_sqrtrem call with n=2). These paths bypass Float object construction as much as possible and compute entirely with CPU intrinsics.

How does calx Float sqrt compare to MPFR in performance?

Compared to MPFR 4.2.1, the calx/MPFR ratios are: 7.1x at 50 digits, 5.1x at 100 digits, and 3.3x at 500 digits (MPFR being faster). The larger gap at low precision is due to Float object construction overhead; at high precision the asymptotic performance of the Zimmermann algorithm narrows the ratio.

Float Square Root Algorithm

Overview

Float::sqrt computes arbitrary-precision floating-point square roots by reducing to integer square root (sqrtRem). A sticky bit is constructed from the remainder, achieving correct rounding compliant with IEEE 754 for all rounding modes.

$$\sqrt{m \cdot 2^e} = \sqrt{m'} \cdot 2^{e'/2}$$

Here $m'$ is the mantissa scaled to $2(p+1)$ bits, and $e'$ is the exponent adjusted to be even. From the root and remainder returned by integer sqrtRem, a correctly rounded floating-point square root is constructed.

Dispatch is performed in three stages depending on input precision:

Precision	Mantissa	Function	Method
$\leq 64$ bit	1 word	`sqrt_newton_1`	std::sqrt + 1 Newton iteration + intrinsics
$\leq 128$ bit	$\leq 2$ words	`sqrt_small_2`	Direct dc_sqrtrem (n=2) call
$> 128$ bit	Arbitrary	`sqrt_core`	Reduction to Int::sqrtRem

For usage, see the Float API Reference. For details on integer square root (sqrtrem1, sqrtrem2, Zimmermann's dc_sqrtrem), see Int Square Root Algorithm.

Reduction to Integer Square Root

The following describes the procedure for reducing a Float square root to integer operations. This principle is common to all dispatch paths.

Scaling

Given a floating-point number $x = m \cdot 2^e$ (mantissa $m$, exponent $e$), we compute the square root to a target precision of $p$ bits.

Determining root bit-length: $\text{target\_root\_bits} = p + 1$ (one extra bit for rounding)
Mantissa scaling: Scale $m$ to a $2(p+1)$-bit integer $I$. Left-shift $m$ by $\text{base\_shift} = 2(p+1) - \text{bitLength}(m)$ bits.
Making the exponent even: If the adjusted exponent $e' = e - \text{base\_shift}$ is odd, apply one additional bit of shift to make it even: $$e' \bmod 2 = 0$$
Integer square root: Compute $s = \lfloor \sqrt{I} \rfloor$ and $r = I - s^2$
Result exponent: $e_{\text{result}} = e' / 2$

$$\sqrt{m \cdot 2^{e'}} = \sqrt{I} \cdot 2^{e'/2} \approx s \cdot 2^{e'/2}$$

Why the Exponent Must Be Even

$\sqrt{2^{2k}} = 2^k$ is an integer, but $\sqrt{2^{2k+1}} = 2^k \sqrt{2}$ is irrational. By making the exponent even, the square root can be completed entirely in the integer domain. For an odd exponent, the mantissa is doubled ($+1$ bit shift) and the exponent is decremented by 1.

Correct Rounding

To perform exact rounding of a floating-point square root, one must know whether the true square root lies strictly on one side of $s$ or $s+1$. calx achieves this using the sticky bit method.

How the Sticky Bit Works

The remainder $r = I - s^2$ returned by integer sqrtRem is used:

$r = 0$: $\sqrt{I} = s$ (exact). The LSB of the root is left unchanged
$r \neq 0$: $s < \sqrt{I} < s + 1$. The LSB of the root is set to 1 (sticky bit)

$$s_{\text{sticky}} = \begin{cases} s & \text{if } r = 0 \\ s \mathbin{|} 1 & \text{if } r \neq 0 \end{cases}$$

This information propagates to the subsequent setResultPrecision() (rounding routine), enabling correct identification of midpoints (values exactly on the rounding boundary). When rounding the $(p+1)$-bit root to $p$ bits, a sticky bit of 1 indicates that the true value is above the midpoint, allowing Round-To-Nearest-Even to round in the correct direction.

Relationship with Rounding Modes

The sticky bit guarantees correct rounding for all of the following rounding modes:

Round-To-Nearest-Even (default): rounds toward the even value at midpoint. The sticky bit is needed to identify midpoints
Truncation: always rounds toward zero. The sticky bit has no effect
Round-Up / Round-Down: the rounding direction is determined by whether the remainder is nonzero

This method is the same approach adopted by MPFR and satisfies the requirements of IEEE 754.

sqrt_newton_1 ($\leq 64$ bit Precision)

Computes the square root of a 1-word mantissa, completely bypassing the Float abstraction. Avoids Int object construction and destruction, performing all computation using CPU intrinsics only.

Scaling to 128 bits: From the mantissa $m$ and exponent $e$, construct a 128-bit integer $M$ (M_hi:M_lo). Set desired_bits = 126 so that even after parity adjustment of $+1$, it fits within 128 bits.
Initial approximation: Obtain a $\sim$53-bit initial approximation $s_0$ via std::sqrt(double(M)). Since $M$ is 128 bits but the double mantissa is 53 bits, $s_0$ approximates only the leading bits.
One Newton iteration: $$s_1 = \frac{s_0 + \lfloor M / s_0 \rfloor}{2}$$ Execute a 128 ÷ 64 bit division using _udiv128(M_hi, M_lo, s0, &rem). Overflow-safe average: (s0 >> 1) + (q >> 1) + ((s0 & q) & 1). This single iteration doubles the precision from $\sim$53 to $\sim$64 bits.
±1 correction: Compute $s_1^2$ using _umul128(s1, s1, &sq_hi) and verify that $s_1^2 \leq M < (s_1 + 1)^2$. If the condition is not met, adjust $s_1$ by $\pm 1$.
Sticky bit: If $s_1^2 \neq M$, set the LSB of $s_1$ to 1.

The result is constructed as a Float directly from Int(s1), then rounded by setResultPrecision(). Thanks to SBO (Small Buffer Optimization), a 1-word Int is constructed without heap allocation.

sqrt_small_2 ($\leq 128$ bit Precision)

Computes the square root of a 1–2 word mantissa directly using mpn-level dc_sqrtrem. Avoids the overhead of Int-level division and multiplication, completing entirely in fixed-size stack buffers.

Scaling to a 4-word buffer: Place the mantissa into a 256-bit (4-word) stack buffer scaled[6]. Two extra words are allocated because dc_sqrtrem uses the region np[n..n+2l-1].
Normalization: dc_sqrtrem requires scaled[3] >= 2^{62}. Measure leading zeros with std::countl_zero() and left-shift by an even number of bits only (an odd shift would produce a half-word offset in the square root result).
dc_sqrtrem call: Call dc_sqrtrem(sp, scaled, n=2, scratch) directly. The resulting root is stored in sp[0..1] (2 words), and the remainder in scaled[0..1]. The carry flag is also returned as part of the remainder.
Denormalization: If normalization shifted left by $t$ bits, right-shift the root by $t/2$ bits to restore it ($\sqrt{x \cdot 2^t} = \sqrt{x} \cdot 2^{t/2}$).
Sticky bit: If the carry or remainder (scaled[0], scaled[1]) is nonzero, set sp[0] |= 1.

An Int is constructed from the 2 words using Int::fromRawWordsPreNormalized (SBO: no heap allocation), a Float is built from it, and setResultPrecision() performs the rounding.

sqrt_core (General Purpose)

The general routine for arbitrary precision beyond 128 bits. It faithfully follows the integer reduction procedure and computes the integer square root using IntSqrt::sqrtRem().

precision_bits = precisionToBits(precision), target_root_bits = precision_bits + 1
Left-shift the mantissa $m$ to desired_bits = 2 * target_root_bits bits
Adjust the exponent to be even
Compute $s = \lfloor\sqrt{I}\rfloor$ and the remainder using IntSqrt::sqrtRem(scaled, remainder)
If the remainder is nonzero, set s |= Int(1) (sticky bit)
Construct Float(s, result_exponent, false) and round with setResultPrecision

Internally, Zimmermann's dc_sqrtrem is called recursively depending on input size. The integer square root of $n$ words is computed in $O(M(n))$ ($M(n)$ is the cost of $n$-word multiplication), the same asymptotic complexity as division via Newton iteration.

Related article: Int Square Root Algorithm — Details of sqrtrem1, sqrtrem2, Zimmermann's dc_sqrtrem, and isSquare

References

Zimmermann, P. (1999). "Karatsuba Square Root". INRIA Research Report RR-3805.
Brent, R. P. and Zimmermann, P. (2010). Modern Computer Arithmetic. Cambridge University Press. Chapter 3.5: Square Root.
Muller, J.-M. et al. (2018). Handbook of Floating-Point Arithmetic. 2nd ed. Birkhäuser. Chapter 8: Square Root.
MPFR Documentation. https://www.mpfr.org/mpfr-current/mpfr.html