Chapter 6: Modular Arithmetic

Barrett, Montgomery, CRT, and Modular Exponentiation

Modular reduction $a \bmod m$ is a special case of division, yet it has its own deep theory and optimizations. RSA's $m^e \bmod n$, elliptic-curve point doubling, every NTT round, and Miller-Rabin primality testing all perform enormous numbers of modular multiplications. Whether a single modular multiplication takes $1\,\mu\text{s}$ or $0.1\,\mu\text{s}$ often decides whether a cryptosystem is practical at scale.

This chapter covers Barrett reduction, Montgomery multiplication, the Chinese Remainder Theorem, fast modular exponentiation, and constant-time implementations for cryptographic use. We follow the calx Montgomery implementation — in particular the n=8 (512-bit) assembly kernel — to see how far one can push a modern CPU. On 512-bit exponentiation, calx completes in roughly $32\,\mu\text{s}$.

6.1 Basics — Why Avoid Division?

Modular remainder $a \bmod m$ is by definition derived from the division $\lfloor a / m \rfloor$:

$$a \bmod m = a - m \cdot \lfloor a / m \rfloor.$$

As the previous chapter showed, multiprecision division is 2-5 times slower than multiplication. In cryptography and NTT, however, we often perform hundreds of millions of modular multiplications with the same modulus $m$. By pre-computing something about $m$, we can get the remainder with just 1-2 multiplications per call.

Goal: replace division with multiplication

Under the assumption of a fixed $m$, both Barrett and Montgomery pre-compute a reciprocal-like quantity derived from $m$, reducing each runtime reduction to roughly two multiplications. Which to use:

  • Barrett: single-shot remainder, or remainder of a non-product $a$. Works for any modulus $m$.
  • Montgomery: when multiplication and reduction chain together, as in $(a \cdot b) \bmod m$. Requires odd $m$ (must be coprime to $R = \beta^n$). Breaks even with Barrett once you chain $k \ge 3$ multiplications.

6.2 Barrett Reduction

Barrett's idea (1986) was to use a pre-computed "pseudo-reciprocal" $\mu = \lfloor 2^k / m \rfloor$ to replace division with two multiplications.

The algorithm

If $m$ has $n$ words, take $k = 2n \cdot 64$:

$$\mu = \left\lfloor \frac{2^k}{m} \right\rfloor, \quad \hat{q} = \left\lfloor \frac{a \cdot \mu}{2^k} \right\rfloor, \quad r = a - m \cdot \hat{q}.$$

$\hat{q}$ is an approximation of the true quotient $\lfloor a/m \rfloor$ with error $\le 2$. Hence $r$ is at most about $2m$ too large, correctable by at most two subtractions.

Why is it fast?

$a$ has up to $2n$ words and $\mu$ about $n{+}1$, so the full product $a \cdot \mu$ spans roughly $3n$ words, about twice the cost of a balanced $n \times n$ multiplication. But only the words above bit $k$ are needed for $\hat{q}$, and a truncated short product computes them in roughly half that cost. Either way, this is overwhelmingly cheaper than $O(n^2)$ division.

Numerical example (small modulus)

$m = 7$, $k = 8$: $\mu = \lfloor 256 / 7 \rfloor = 36$. Reduce $a = 97$ modulo 7:

$$\hat{q} = \lfloor 97 \cdot 36 / 256 \rfloor = \lfloor 3492 / 256 \rfloor = 13,$$ $$r = 97 - 7 \cdot 13 = 97 - 91 = 6.$$

True answer $97 \bmod 7 = 6$ matches. A lucky case where no correction was needed.
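The same recipe works at machine-word scale. Below is a minimal sketch with $k = 64$ and a modulus below $2^{32}$, assuming GCC/Clang's unsigned __int128 (Barrett32 is an illustrative name, not the calx multiprecision path):

```cpp
#include <cstdint>
#include <cassert>

// Word-sized Barrett sketch: k = 64, modulus m < 2^32, inputs a < m^2.
// Relies on unsigned __int128 (GCC/Clang); Barrett32 is an illustrative name.
struct Barrett32 {
    uint64_t m, mu;                                // mu ~= floor(2^64 / m)
    explicit Barrett32(uint64_t m_) : m(m_), mu(~uint64_t(0) / m_) {}
    uint64_t reduce(uint64_t a) const {
        uint64_t qhat = (uint64_t)(((unsigned __int128)a * mu) >> 64); // ~= floor(a/m)
        uint64_t r = a - qhat * m;                 // qhat undershoots by at most a few
        while (r >= m) r -= m;                     // a couple of corrective subtractions
        return r;
    }
};
```

Barrett32(7).reduce(97) reproduces the worked example above; the while loop absorbs the rare cases where $\hat{q}$ undershoots.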

Barrett and $\mu$-division

The $\mu$-div in the previous chapter applies Barrett's logic to unbalanced division. Barrett alone produces only the remainder; $\mu$-div returns both quotient and remainder.

6.3 Montgomery Multiplication

Montgomery's idea (1985) was to keep integers in a transformed Montgomery form $\bar{a} = a R \bmod m$ (with $R = 2^{64n}$) and perform multiplication + reduction in this form "without any division."

Montgomery form and REDC

Montgomery form

The Montgomery representation of $a$ is $\bar{a} = a R \bmod m$. Contrast with ordinary multiplication:

$$\bar{a} \cdot \bar{b} = a b R^2 \bmod m \neq (ab) R.$$

A naive multiplication leaves an extra factor of $R$. The operation that removes it is REDC:

$$\text{REDC}(T) = T R^{-1} \bmod m.$$

Then $\text{REDC}(\bar{a} \bar{b}) = (abR^2) R^{-1} = abR = \overline{ab}$, keeping the product in Montgomery form.

The REDC algorithm

This is the magic of Montgomery. Given $T$ (at most $2n$ words):

  1. Compute $u = T \cdot (-m^{-1}) \bmod R$ (only the bottom $n$ words are needed).
  2. Compute $T' = T + u \cdot m$. Then $T' \equiv T \pmod m$ and $T' \equiv 0 \pmod R$.
  3. $T'' = T' / R$ is just a right shift (drop the bottom $n$ words).
  4. If needed, subtract $m$ once from $T''$ to canonicalize.
$$T'' = \text{REDC}(T) = \frac{T + ((T \cdot (-m^{-1})) \bmod R) \cdot m}{R}.$$

The computation is two multiplications ($u$ and $u \cdot m$), an add, a shift, and at most one subtraction. Division hides in the pre-computed $m^{-1} \bmod R$. With $m$ fixed, $m^{-1}$ is computed only once.

Cost of input/output conversion

Converting an ordinary integer $a$ to Montgomery form costs one Montgomery multiplication by the pre-computed constant $R^2 \bmod m$, since $\text{REDC}(a \cdot R^2) = aR \bmod m = \bar{a}$; the inverse conversion is $\text{REDC}(\bar{a} \cdot 1) = a$. For $k$ multiplications in Montgomery space, the overhead is 2 conversions plus $k$ Montgomery multiplications. Beyond about $k \ge 3$ this beats ordinary reduction.
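The whole pipeline (pre-computing $-m^{-1} \bmod R$, REDC, and the conversions) fits in a few lines at word scale. A sketch with $R = 2^{64}$ and odd $m < 2^{63}$, assuming GCC/Clang's unsigned __int128 (mont_ninv, redc, to_mont are illustrative names, not calx API):

```cpp
#include <cstdint>
#include <cassert>

// Single-word REDC sketch: R = 2^64, odd modulus m < 2^63.
// Relies on unsigned __int128 (GCC/Clang). Names are illustrative, not calx API.
using u64 = uint64_t;
using u128 = unsigned __int128;

// ninv = -m^{-1} mod 2^64 via Newton/Hensel iteration; no division needed.
inline u64 mont_ninv(u64 m) {
    u64 x = m;                        // correct to 3 bits: m*m == 1 (mod 8) for odd m
    for (int i = 0; i < 5; ++i)
        x *= 2 - m * x;               // each step doubles the number of correct low bits
    return ~x + 1;                    // x = m^{-1} mod 2^64; return its negation
}

// REDC(T) = T * R^{-1} mod m, valid for T < m*R.
inline u64 redc(u128 T, u64 m, u64 ninv) {
    u64 u = (u64)T * ninv;                         // u = T * (-m^{-1}) mod R
    u64 t = (u64)((T + (u128)u * m) >> 64);        // (T + u*m) / R  -- exact division
    return t >= m ? t - m : t;                     // at most one subtraction
}

// Conversion to Montgomery form (here via 128-bit %, in general via REDC(a * R^2)).
inline u64 to_mont(u64 a, u64 m) { return (u64)(((u128)a << 64) % m); }
```

A Montgomery product is then redc((u128)abar * bbar, m, ninv), and converting back is redc(cbar, m, ninv), i.e. $\text{REDC}(\bar{c} \cdot 1)$.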

6.4 The CIOS Algorithm and calx Implementation

When implementing REDC, one can choose between SOS (Separated Operand Scanning — compute $T = a \cdot b$ fully, then reduce) and CIOS (Coarsely Integrated Operand Scanning — interleave multiplication and reduction). CIOS saves intermediate storage and is fastest at typical crypto sizes (n=4-64).

Structure of CIOS

The inner loop over each word $b_i$ of $b$ does:

  1. Multiply step: $t \leftarrow t + a \cdot b_i$ (fixes one more word of $T$).
  2. Reduce step: compute $q \leftarrow t_0 \cdot m_{\text{inv}} \bmod \beta$, then $t \leftarrow (t + q \cdot m) / \beta$ (zeroes out the bottom word).

After $n$ iterations, $t$ is the REDC result.

calx C++ baseline

// IntModular.cpp: mont_redc (generic path: reduces a pre-computed product T)
// Note: mpn::addmul_1 returns the single-word carry out of the n-word addmul.
//       When tn > i+n, that carry must be propagated up through the higher
//       words (the inner while loop).
inline void mont_redc(uint64_t* r, uint64_t* t, size_t tn,
                      const uint64_t* m, size_t n, uint64_t m_inv) {
    for (size_t i = 0; i < n; ++i) {
        uint64_t q = t[i] * m_inv;                     // q = T[i] * (-m^{-1}) mod beta
        uint64_t carry = mpn::addmul_1(t + i, m, n, q); // T += q*m*beta^i (local result)
        // Propagate the single carry out of addmul_1 upward (only while non-zero)
        for (size_t j = i + n; carry && j < tn; ++j) {
            uint64_t sum = t[j] + carry;
            carry = (sum < t[j]) ? 1 : 0;
            t[j] = sum;
        }
    }
    std::memcpy(r, t + n, n * sizeof(uint64_t));
    if (mpn::cmp(r, n, m, n) >= 0) {
        mpn::sub(r, r, n, m, n);                        // normalize (T < m)
    }
}

The generic path is independent of $n$. For specific sizes (n=4, 8, 16, 32) we dispatch to hand-written assembly kernels.

6.5 Montgomery n=8 Specialization — calx Assembly

Use cases such as RSA-1024 with CRT, whose two half-size exponentiations run modulo 512-bit primes, call for 8-word (512-bit) moduli. An assembly kernel specialized to this one size runs 30-50% faster than generic CIOS. The calx mpn_mont_mul_8 is organized into three phases.

Three-phase structure

mpn_mont_mul_8 structure

  1. Phase 1: 8×8 full multiplication (SOS-style, stack-allocated 17-word product buffer).
  2. Phase 2: 8 REDC iterations (compute $q_i = t_i \cdot m_{\text{inv}}$ and shrink $T$ by one word per iter).
  3. Phase 3: conditional subtraction ($T \ge m$ then subtract $m$).

Accumulator register residency

The 8-word accumulator resides entirely in the general-purpose registers r8-r15, significantly reducing memory traffic:

; Phase 1 skeleton from mpn_x64_mont.asm
xor r8d, r8d          ; Zero 8 accumulators
xor r9d, r9d
...                   ; through r15
mov rdx, [rcx]        ; b[0] into rdx (implicit MULX operand)
ADDMUL_FIRST_8        ; Seed: r8 = a[0]*b[0]_lo, carry into r9
mov QWORD PTR [rdi], r8  ; r8 is final, store to product buffer
ADDMUL_REST_8         ; Accumulate remaining products into r9-r15
; Repeat for b[1]..b[7]

ADDMUL macros

; ADDMUL_FIRST_8: seed multiplication for b[j] (rdx * a[0] into rbx:rbp, high:low)
ADDMUL_FIRST_8 MACRO
    xor eax, eax            ; clear CF and OF to start both carry chains
    mulx rbx, rbp, [rsi]    ; rbx:rbp = rdx (b[j]) * a[0], high:low
    adox r8, rbp            ; r8 += low half via the OF carry chain
ENDM

; ADDMUL_REST_8: accumulate remaining a[1]..a[7] into r9..r15 via two carry chains
ADDMUL_REST_8 MACRO
    mulx rbp, rax, [rsi+8]   ; rbp:rax = rdx * a[1], high:low
    adcx r8, rbx             ; prior rbx into r8 via CF chain
    adox r9, rax             ; low half into r9 via OF chain
    mulx rbx, rax, [rsi+16]  ; a[2]*rdx
    adcx r9, rbp
    adox r10, rax
    ; ... same pattern unrolled for a[3]..a[7] into r11..r15
    ; final: adcx r15, rbx; adox r15, 0 to merge both carries
ENDM

; REDC_ITER: one reduction round (iter_idx = 0..7)
REDC_ITER MACRO iter_idx
    mov rax, r8
    imul rax, QWORD PTR [rsp+168]   ; rax = r8 * m_inv
    mov rdx, rax                     ; q = rax (implicit MULX operand)
    ADDMUL_FIRST_8                   ; T += q * m (first word)
    ADDMUL_REST_8                    ; T += q * m (remaining 7 words)
    add r15, QWORD PTR [rdi + (8 + iter_idx)*8]  ; merge upper product buffer
    jnc ri_nc                                    ; carry-propagation tail elided in this excerpt
ENDM

MULX / ADCX / ADOX dual carry chains

BMI2's MULX does not affect CF/OF. ADX's ADCX / ADOX treat CF and OF as independent carry flags. This lets two carry chains run simultaneously, halving dependency chains. Standard technique for modern Intel/AMD Montgomery code.

No scratch allocation

Phase 1's 17-word product buffer (the 16-word 8×8 product plus one carry word) plus 48 bytes of spill area totals 184 bytes on the stack. No heap allocation, so the kernel keeps the sub-microsecond latency required on crypto hot paths.

Dispatch

// IntModular.cpp: n==8 branch
if (n == 8 && mpn::detail::has_bmi2_adx()) {
    mpn_mont_mul_8(r, a, b, m, m_inv);
    return;
}

Benchmarks (Zen 3, MSVC Release x64)

Computes $x \mapsto x^e \bmod m$ where $e$ is a random exponent of the same bit length as $m$ (sliding-window exponentiation in Montgomery form: one square per bit plus one multiply when needed).

Size        Operation   calx (μs)
512 bits    powerMod         31.9
1024 bits   powerMod        261.8

The n=8 specialization is most effective at 512 bits. At 1024 bits the n=16 specialization is comparatively weaker.

6.6 Chinese Remainder Theorem

Given pairwise coprime moduli $m_1, m_2, \ldots, m_k$, the Chinese Remainder Theorem gives a one-to-one correspondence:

$$(a \bmod m_1, \ldots, a \bmod m_k) \leftrightarrow a \bmod M, \quad M = \prod m_i.$$

This correspondence is a ring isomorphism (it commutes with addition, subtraction, and multiplication), so large-modulus computations can be split into $k$ smaller-modulus ones. calx uses CRT both for NTT prime combining and for stand-alone integer CRT in RSA key generation and polynomial interpolation.
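The component-wise property is easy to check concretely; a word-sized sketch (illustrative names, not calx code):

```cpp
#include <cstdint>
#include <cassert>

// CRT homomorphism check: a product mod M = 5*7*9 = 315 agrees with
// the component-wise products of the residues (moduli pairwise coprime).
inline bool crt_residues_agree(uint64_t a, uint64_t b) {
    const uint64_t m[3] = {5, 7, 9};
    const uint64_t M = 315;
    uint64_t prod = (a % M) * (b % M) % M;             // reference: mod-M product
    for (int i = 0; i < 3; ++i) {
        uint64_t ri = (a % m[i]) * (b % m[i]) % m[i];  // component-wise product
        if (ri != prod % m[i]) return false;           // residues must match
    }
    return true;
}
```

The same check passes for addition and subtraction; only division (by a residue not coprime to some $m_i$) falls outside the correspondence.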

Garner's algorithm

An efficient way to reconstruct $a$ from residues $r_i = a \bmod m_i$:

$$\begin{aligned} v_1 &= r_1, \\ v_2 &= (r_2 - v_1) \cdot m_1^{-1} \bmod m_2, \\ v_3 &= (r_3 - v_1 - v_2 m_1) \cdot (m_1 m_2)^{-1} \bmod m_3, \\ &\vdots \end{aligned}$$

Finally $a = v_1 + v_2 m_1 + v_3 m_1 m_2 + \cdots$. The key advantage of Garner is that it does not build the full product $M = \prod m_i$ explicitly; each step accumulates $m_1 m_2 \cdots m_{i-1}$ incrementally, so only modular inverses with respect to the small $m_i$ are needed.

// Garner's algorithm (incremental version, pseudocode; mod_inverse is assumed)
template <std::integral T>
T garner_crt(const std::vector<T>& remainders,
             const std::vector<T>& moduli) {
    std::vector<T> v(moduli.size());
    v[0] = remainders[0] % moduli[0];
    for (size_t i = 1; i < moduli.size(); ++i) {
        T t = remainders[i] % moduli[i];
        for (size_t j = 0; j < i; ++j) {
            T vj = v[j] % moduli[i];
            t = (t >= vj) ? t - vj : t + moduli[i] - vj;   // (t - v[j]) mod m_i, no negatives
            t = (t * mod_inverse(moduli[j], moduli[i])) % moduli[i];
        }
        v[i] = t;
    }
    // Finally assemble a = v[0] + v[1]*m_0 + v[2]*m_0*m_1 + ...
    // (a real implementation does this step in multiprecision; T would overflow)
    T a = v.back();
    for (size_t i = moduli.size() - 1; i-- > 0;) a = a * moduli[i] + v[i];
    return a;
}

A simpler Lagrange-style CRT that precomputes $M$ ($a = \sum r_i \cdot M_i \cdot (M_i^{-1} \bmod m_i) \bmod M$ with $M_i = M / m_i$) is also viable and performs comparably for small $k$. Choose based on whether building $M$ up front is cheaper than repeated small modular inverses.

Application: RSA-CRT

Computing RSA decryption $c^d \bmod n$ with $n = pq$ directly is slow. Instead compute $m_p = c^{d \bmod (p-1)} \bmod p$ and $m_q$ separately, then combine via CRT. Since moduli are half the size, Montgomery multiplication is ~4× faster. Standard optimization in all real RSA implementations.
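The idea in miniature, with word-sized toy numbers (pow_mod and rsa_crt_decrypt are illustrative names; a real implementation uses Montgomery exponentiation and multiprecision arithmetic):

```cpp
#include <cstdint>
#include <cassert>

// RSA-CRT in miniature: decrypt c^d mod n = pq via two half-size
// exponentiations, then a Garner combine. Toy word-sized numbers;
// names are illustrative, not calx API.
using u64 = uint64_t;
using u128 = unsigned __int128;

static u64 pow_mod(u64 b, u64 e, u64 m) {          // square-and-multiply
    u64 r = 1 % m;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = (u64)((u128)r * b % m);
        b = (u64)((u128)b * b % m);
    }
    return r;
}

static u64 rsa_crt_decrypt(u64 c, u64 d, u64 p, u64 q, u64 qinv /* q^{-1} mod p */) {
    u64 mp = pow_mod(c % p, d % (p - 1), p);       // exponents reduced via Fermat
    u64 mq = pow_mod(c % q, d % (q - 1), q);
    u64 h = (u64)((u128)((mp + p - mq % p) % p) * qinv % p);  // Garner: (mp-mq)*qinv mod p
    return mq + h * q;                              // m = mq + h*q, already < p*q
}
```

With the textbook toy key p = 61, q = 53, e = 17, d = 2753, decrypting pow_mod(65, 17, 3233) via rsa_crt_decrypt returns 65, matching the direct computation.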

6.7 Fast Modular Exponentiation — Sliding Window

The simplest way to compute $m^e \bmod n$ is to process the binary expansion of $e$ bit-by-bit (binary method): $O(\log e)$ multiplications.
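At word scale the binary method is a few lines; a left-to-right sketch assuming GCC/Clang's unsigned __int128 (pow_mod_binary is an illustrative name):

```cpp
#include <cstdint>
#include <cassert>

// Left-to-right binary method: one squaring per exponent bit, plus one
// multiply per set bit. Word-sized sketch using unsigned __int128 (GCC/Clang).
static uint64_t pow_mod_binary(uint64_t b, uint64_t e, uint64_t m) {
    using u128 = unsigned __int128;
    uint64_t r = 1 % m;
    b %= m;
    for (int i = 63; i >= 0; --i) {
        r = (uint64_t)((u128)r * r % m);            // always square
        if ((e >> i) & 1)
            r = (uint64_t)((u128)r * b % m);        // multiply only on a 1 bit
    }
    return r;
}
```

Every iteration squares once; the extra multiply fires only on set exponent bits, giving the $O(\log e)$ multiplication count that windowing then improves on.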

$k$-bit window method

Processing $k$ consecutive bits at a time shortens the addition chain further. Pre-compute $m^1, m^3, m^5, \ldots, m^{2^k - 1}$ (odd powers), then at each window $w$:

$$\text{result} \leftarrow \text{result}^{2^k} \cdot m^w.$$

The calx window width selection:

// IntModular.cpp: choose_window_width
int w = 1;                          // e <= 24 bits
if (expBits > 24)  w = 3;           // <= 256 bits
if (expBits > 256) w = 4;           // <= 1024 bits
if (expBits > 1024) w = 5;          // <= 4096 bits
if (expBits > 4096) w = 7;          // larger

Sliding window

A fixed window still multiplies when the window starts with 0. A sliding window skips runs of zeros (only squaring in that interval) and aligns each window to start with a 1. On average this saves 20-30% of multiplications.

// IntModular.cpp: powerMod (sliding window, sketch)
int oddTableSize = 1 << (w - 1);   // pre-compute only odd powers
std::vector<uint64_t> g(oddTableSize);  // table of odd powers (one word per entry in this sketch)
g[0] = baseR;                       // m^1 (Montgomery form)
if (oddTableSize > 1) {
    uint64_t base2R = mont_mul(baseR, baseR);    // m^2
    for (int i = 1; i < oddTableSize; ++i)
        g[i] = mont_mul(g[i-1], base2R);         // m^3, m^5, m^7, ...
}
// Main loop: scan exponent bits left-to-right, detect windows, multiply

Since everything happens in Montgomery form, each mont_mul is division-free. The result is converted back at the end.

6.8 Constant-Time Implementation — powerModSec

In cryptographic use, if compute time depends on secret keys, a timing attack can recover them. Since Kocher's 1996 attack, real-world timing attacks have been reported periodically (e.g., CVE-2018-0737).

What affects timing?

  • Branches: if branch prediction misses, cycles vary.
  • Cache access patterns: if a lookup table index is secret, L1 hit/miss can vary by 10×.
  • Data-dependent instructions: on some CPUs, division and multiplication cycles depend on operand values.

Constant-time modular exponentiation (calx powerModSec)

The calx powerModSec changes three things from the ordinary powerMod:

  1. Fixed window (no sliding): always performs the same number of multiplications. If the window is 0, still multiplies by $m^0 = 1$.
  2. Masked table lookup: reads every table entry and selects via bitmask, so the cache access pattern never depends on secrets.
  3. Branch-free: uses arithmetic masks rather than ifs.

// IntModular.cpp: powerModSec constant-time table select
int tableSize = 1 << w;  // all 0..2^w-1 (not only odd)
std::vector<uint64_t> g_buf(tableSize * n);
// No sliding: all entries always ready

auto ct_select = [&](uint64_t* dst, int idx) {
    std::memset(dst, 0, n * sizeof(uint64_t));
    for (int i = 0; i < tableSize; ++i) {
        // i==idx yields 1, otherwise 0. Cast to int64_t and negate to get
        // 0xFFFF...FFFF (all ones) or 0x0000...0000 via two's-complement semantics.
        uint64_t mask = -(static_cast<int64_t>(i == idx));
        for (size_t j = 0; j < n; ++j)
            dst[j] |= g_buf[i * n + j] & mask;              // branch-free masked OR
    }
};

Performance tradeoff

powerModSec is 20-40% slower than powerMod because it skips sliding and reads all table entries. For RSA/ECDSA signing this slowdown is gladly accepted — secret-key leakage is far worse than a slightly slower signature.

Further mitigations

A fully constant-time implementation also considers:

  • Blinding: replace the input $m$ with $m r^e \bmod n$ for a random $r$ before exponentiating; since $(m r^e)^d \equiv m^d \cdot r \pmod{n}$, multiply the result by $r^{-1}$ to unblind.
  • Cache-line aligned tables: align table entries so one line covers everything.
  • Power analysis (DPA) mitigations: for physical attacks, masked (secret-shared) computation is needed.

Since calx is not a cryptographic library, we do not implement blinding or DPA countermeasures. For real cryptographic use, rely on established libraries such as OpenSSL or libsodium.

6.9 Summary

  • Barrett: pre-compute $\mu = \lfloor 2^k / m \rfloor$ to replace division with two multiplications. Best for single-shot remainders.
  • Montgomery: use Montgomery form $\bar{a} = a R \bmod m$ to chain multiplications without division. REDC is the key. Dominates cryptographic/NTT hot paths.
  • CIOS: interleaved multiply-and-reduce pattern. calx's baseline. Dispatches to specialized assembly at n=4, 8, 16, 32.
  • mpn_mont_mul_8: 512-bit specialization. 8 accumulators resident in r8-r15. Dual carry chains with MULX/ADCX/ADOX.
  • CRT: parallel computation over coprime moduli, combined by Garner. RSA-CRT is standard optimization.
  • Sliding-window modular exponentiation: pre-compute odd powers, skip zero runs for 20-30% fewer multiplications.
  • Constant-time implementation: powerModSec uses fixed windows + masked selects. Standard for any crypto implementation.

The next chapter (Chapter 7: GCD) covers algorithms that repeatedly divide to compute the greatest common divisor — the Euclidean algorithm and its improvements (binary GCD, half-GCD).

References

  • Barrett, P. "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor", CRYPTO '86, LNCS 263, pp. 311-323, 1987.
  • Montgomery, P.L. "Modular Multiplication without Trial Division", Mathematics of Computation, 44(170), pp. 519-521, 1985.
  • Menezes, A.J., van Oorschot, P.C., Vanstone, S.A. Handbook of Applied Cryptography, CRC Press, 1996. Chapter 14.
  • Brent, R.P., Zimmermann, P. Modern Computer Arithmetic, Cambridge University Press, 2010. Chapter 2.
  • Koç, Ç.K., Acar, T., Kaliski, B.S. "Analyzing and Comparing Montgomery Multiplication Algorithms", IEEE Micro, 16(3), pp. 26-33, 1996. Comparison of CIOS/SOS/FIOS.
  • Kocher, P.C. "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems", CRYPTO '96, LNCS 1109, pp. 104-113, 1996.
  • Bernstein, D.J. "Curve25519: New Diffie-Hellman Speed Records", PKC '06, LNCS 3958, pp. 207-228, 2006. Constant-time implementation in practice.
