Chapter 6: Modular Arithmetic

Barrett, Montgomery, CRT, and Modular Exponentiation

Modular reduction $a \bmod m$ is a special case of division, yet it has its own deep theory and optimizations. RSA's $m^e \bmod n$, elliptic-curve point doubling, every NTT round, and Miller-Rabin primality testing all perform enormous numbers of modular multiplications. Whether a single modular multiplication takes $1\,\mu\text{s}$ or $0.1\,\mu\text{s}$ often decides whether a cryptosystem is practical at scale.

This chapter covers Barrett reduction, Montgomery multiplication, the Chinese Remainder Theorem, fast modular exponentiation, and constant-time implementations for cryptographic use. We follow the calx Montgomery implementation — in particular the n=8 (512-bit) assembly kernel — to see how far one can push a modern CPU. On 512-bit exponentiation, calx completes in roughly $32\,\mu\text{s}$.

6.1 Basics — Why Avoid Division?

Modular remainder $a \bmod m$ is by definition derived from the division $\lfloor a / m \rfloor$:

$$a \bmod m = a - m \cdot \lfloor a / m \rfloor.$$

As the previous chapter showed, multiprecision division is 2-5 times slower than multiplication. In cryptography and NTT, however, we often perform hundreds of millions of modular multiplications with the same modulus $m$. By pre-computing something about $m$, we can get the remainder with just 1-2 multiplications per call.

Goal: replace division with multiplication

Under the assumption of a fixed $m$, both Barrett and Montgomery pre-compute a reciprocal-like quantity derived from $m$, reducing each runtime reduction to roughly two multiplications. Which to use:

  • Barrett: single-shot remainder, or remainder of a non-product $a$. Works for any modulus $m$.
  • Montgomery: when multiplication and reduction chain together, as in $(a \cdot b) \bmod m$. Requires odd $m$ (must be coprime to $R = \beta^n$). Breaks even with Barrett once you chain $k \ge 3$ multiplications.

6.2 Barrett Reduction

Barrett's idea (1986) was to use a pre-computed "pseudo-reciprocal" $\mu = \lfloor 2^k / m \rfloor$ to replace division with two multiplications.

The algorithm

If $m$ has $n$ words, take $k = 2n \cdot 64$:

$$\mu = \left\lfloor \frac{2^k}{m} \right\rfloor, \quad \hat{q} = \left\lfloor \frac{a \cdot \mu}{2^k} \right\rfloor, \quad r = a - m \cdot \hat{q}.$$

$\hat{q}$ is an approximation of the true quotient $\lfloor a/m \rfloor$ with error $\le 2$. Hence $r$ is at most about $2m$ too large, correctable by at most two subtractions.

Why is it fast?

$a$ has up to $2n$ words and $\mu$ about $n{+}1$, so the full product $a \cdot \mu$ spans roughly $3n$ words, about twice the cost of a balanced $n \times n$ multiplication. But only the words above bit $k$ are needed for $\hat{q}$, and a truncated short product computes them in roughly half that cost. Either way, this is overwhelmingly cheaper than $O(n^2)$ division.

Numerical example (small modulus)

$m = 7$, $k = 8$: $\mu = \lfloor 256 / 7 \rfloor = 36$. Reduce $a = 97$ modulo 7:

$$\hat{q} = \lfloor 97 \cdot 36 / 256 \rfloor = \lfloor 3492 / 256 \rfloor = 13,$$ $$r = 97 - 7 \cdot 13 = 97 - 91 = 6.$$

True answer $97 \bmod 7 = 6$ matches. A lucky case where no correction was needed.
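The same recipe works at machine-word scale. Below is a minimal sketch with $k = 64$ and a modulus below $2^{32}$, assuming GCC/Clang's unsigned __int128 (Barrett32 is an illustrative name, not the calx multiprecision path):

```cpp
#include <cstdint>
#include <cassert>

// Word-sized Barrett sketch: k = 64, modulus m < 2^32, inputs a < m^2.
// Relies on unsigned __int128 (GCC/Clang); Barrett32 is an illustrative name.
struct Barrett32 {
    uint64_t m, mu;                                // mu ~= floor(2^64 / m)
    explicit Barrett32(uint64_t m_) : m(m_), mu(~uint64_t(0) / m_) {}
    uint64_t reduce(uint64_t a) const {
        uint64_t qhat = (uint64_t)(((unsigned __int128)a * mu) >> 64); // ~= floor(a/m)
        uint64_t r = a - qhat * m;                 // qhat undershoots by at most a few
        while (r >= m) r -= m;                     // a couple of corrective subtractions
        return r;
    }
};
```

Barrett32(7).reduce(97) reproduces the worked example above; the while loop absorbs the rare cases where $\hat{q}$ undershoots.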

Barrett and $\mu$-division

The $\mu$-div in the previous chapter applies Barrett's logic to unbalanced division. Barrett alone produces only the remainder; $\mu$-div returns both quotient and remainder.

6.3 Montgomery Multiplication

Montgomery's idea (1985) was to keep integers in a transformed Montgomery form $\bar{a} = a R \bmod m$ (with $R = 2^{64n}$) and perform multiplication + reduction in this form "without any division."

Montgomery form and REDC

Montgomery form

The Montgomery representation of $a$ is $\bar{a} = a R \bmod m$. Contrast with ordinary multiplication:

$$\bar{a} \cdot \bar{b} = a b R^2 \bmod m \neq (ab) R.$$

A naive multiplication leaves an extra factor of $R$. The operation that removes it is REDC:

$$\text{REDC}(T) = T R^{-1} \bmod m.$$

Then $\text{REDC}(\bar{a} \bar{b}) = (abR^2) R^{-1} = abR = \overline{ab}$, keeping the product in Montgomery form.

The REDC algorithm

This is the magic of Montgomery. Given $T$ (at most $2n$ words):

  1. Compute $u = T \cdot (-m^{-1}) \bmod R$ (only the bottom $n$ words are needed).
  2. Compute $T' = T + u \cdot m$. Then $T' \equiv T \pmod m$ and $T' \equiv 0 \pmod R$.
  3. $T'' = T' / R$ is just a right shift (drop the bottom $n$ words).
  4. If needed, subtract $m$ once from $T''$ to canonicalize.
$$T'' = \text{REDC}(T) = \frac{T + ((T \cdot (-m^{-1})) \bmod R) \cdot m}{R}.$$

The computation is two multiplications ($u$ and $u \cdot m$), an add, a shift, and at most one subtraction. Division hides in the pre-computed $m^{-1} \bmod R$. With $m$ fixed, $m^{-1}$ is computed only once.

Cost of input/output conversion

Converting an ordinary integer $a$ to Montgomery form costs one Montgomery multiplication by the pre-computed constant $R^2 \bmod m$, since $\text{REDC}(a \cdot R^2) = aR \bmod m = \bar{a}$; the inverse conversion is $\text{REDC}(\bar{a} \cdot 1) = a$. For $k$ multiplications in Montgomery space, the overhead is 2 conversions plus $k$ Montgomery multiplications. Beyond about $k \ge 3$ this beats ordinary reduction.
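The whole pipeline (pre-computing $-m^{-1} \bmod R$, REDC, and the conversions) fits in a few lines at word scale. A sketch with $R = 2^{64}$ and odd $m < 2^{63}$, assuming GCC/Clang's unsigned __int128 (mont_ninv, redc, to_mont are illustrative names, not calx API):

```cpp
#include <cstdint>
#include <cassert>

// Single-word REDC sketch: R = 2^64, odd modulus m < 2^63.
// Relies on unsigned __int128 (GCC/Clang). Names are illustrative, not calx API.
using u64 = uint64_t;
using u128 = unsigned __int128;

// ninv = -m^{-1} mod 2^64 via Newton/Hensel iteration; no division needed.
inline u64 mont_ninv(u64 m) {
    u64 x = m;                        // correct to 3 bits: m*m == 1 (mod 8) for odd m
    for (int i = 0; i < 5; ++i)
        x *= 2 - m * x;               // each step doubles the number of correct low bits
    return ~x + 1;                    // x = m^{-1} mod 2^64; return its negation
}

// REDC(T) = T * R^{-1} mod m, valid for T < m*R.
inline u64 redc(u128 T, u64 m, u64 ninv) {
    u64 u = (u64)T * ninv;                         // u = T * (-m^{-1}) mod R
    u64 t = (u64)((T + (u128)u * m) >> 64);        // (T + u*m) / R  -- exact division
    return t >= m ? t - m : t;                     // at most one subtraction
}

// Conversion to Montgomery form (here via 128-bit %, in general via REDC(a * R^2)).
inline u64 to_mont(u64 a, u64 m) { return (u64)(((u128)a << 64) % m); }
```

A Montgomery product is then redc((u128)abar * bbar, m, ninv), and converting back is redc(cbar, m, ninv), i.e. $\text{REDC}(\bar{c} \cdot 1)$.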

6.4 The CIOS Algorithm and calx Implementation

When implementing REDC, one can choose between SOS (Separated Operand Scanning — compute $T = a \cdot b$ fully, then reduce) and CIOS (Coarsely Integrated Operand Scanning — interleave multiplication and reduction). CIOS saves intermediate storage and is fastest at typical crypto sizes (n=4-64).

Structure of CIOS

The inner loop over each word $b_i$ of $b$ does:

  1. Multiply step: $t \leftarrow t + a \cdot b_i$ (fixes one more word of $T$).
  2. Reduce step: compute $q \leftarrow t_0 \cdot m_{\text{inv}} \bmod \beta$, then $t \leftarrow (t + q \cdot m) / \beta$ (zeroes out the bottom word).

After $n$ iterations, $t$ is the REDC result.

calx C++ baseline

// IntModular.cpp: mont_redc (generic path: reduces a pre-computed product T)
// Note: mpn::addmul_1 returns the single-word carry out of the n-word addmul.
//       When tn > i+n, that carry must be propagated up through the higher
//       words (the inner while loop).
inline void mont_redc(uint64_t* r, uint64_t* t, size_t tn,
                      const uint64_t* m, size_t n, uint64_t m_inv) {
    for (size_t i = 0; i < n; ++i) {
        uint64_t q = t[i] * m_inv;                     // q = T[i] * (-m^{-1}) mod beta
        uint64_t carry = mpn::addmul_1(t + i, m, n, q); // T += q*m*beta^i (local result)
        // Propagate the single carry out of addmul_1 upward (only while non-zero)
        for (size_t j = i + n; carry && j < tn; ++j) {
            uint64_t sum = t[j] + carry;
            carry = (sum < t[j]) ? 1 : 0;
            t[j] = sum;
        }
    }
    std::memcpy(r, t + n, n * sizeof(uint64_t));
    if (mpn::cmp(r, n, m, n) >= 0) {
        mpn::sub(r, r, n, m, n);                        // normalize (T < m)
    }
}

The generic path is independent of $n$. For specific sizes (n=4, 8, 16, 32) we dispatch to hand-written assembly kernels.

6.5 Montgomery n=8 Specialization — calx Assembly

Use cases such as RSA-1024 with CRT, whose two half-size exponentiations run modulo 512-bit primes, call for 8-word (512-bit) moduli. An assembly kernel specialized to this one size runs 30-50% faster than generic CIOS. The calx mpn_mont_mul_8 is organized into three phases.

Three-phase structure

mpn_mont_mul_8 structure

  1. Phase 1: 8×8 full multiplication (SOS-style, stack-allocated 17-word product buffer).
  2. Phase 2: 8 REDC iterations (compute $q_i = t_i \cdot m_{\text{inv}}$ and shrink $T$ by one word per iter).
  3. Phase 3: conditional subtraction ($T \ge m$ then subtract $m$).

Accumulator register residency

The 8-word accumulator resides entirely in the general-purpose registers r8-r15, significantly reducing memory traffic:

; Phase 1 skeleton from mpn_x64_mont.asm
xor r8d, r8d          ; Zero 8 accumulators
xor r9d, r9d
...                   ; through r15
mov rdx, [rcx]        ; b[0] into rdx (implicit MULX operand)
ADDMUL_FIRST_8        ; Seed: r8 = a[0]*b[0]_lo, carry into r9
mov QWORD PTR [rdi], r8  ; r8 is final, store to product buffer
ADDMUL_REST_8         ; Accumulate remaining products into r9-r15
; Repeat for b[1]..b[7]

ADDMUL macros

; ADDMUL_FIRST_8: seed multiplication for b[j] (rdx * a[0] into rbx:rbp, high:low)
ADDMUL_FIRST_8 MACRO
    xor eax, eax            ; clear CF and OF to start both carry chains
    mulx rbx, rbp, [rsi]    ; rbx:rbp = rdx (b[j]) * a[0], high:low
    adox r8, rbp            ; r8 += low half via the OF carry chain
ENDM

; ADDMUL_REST_8: accumulate remaining a[1]..a[7] into r9..r15 via two carry chains
ADDMUL_REST_8 MACRO
    mulx rbp, rax, [rsi+8]   ; rbp:rax = rdx * a[1], high:low
    adcx r8, rbx             ; prior rbx into r8 via CF chain
    adox r9, rax             ; low half into r9 via OF chain
    mulx rbx, rax, [rsi+16]  ; a[2]*rdx
    adcx r9, rbp
    adox r10, rax
    ; ... same pattern unrolled for a[3]..a[7] into r11..r15
    ; final: adcx r15, rbx; adox r15, 0 to merge both carries
ENDM

; REDC_ITER: one reduction round (iter_idx = 0..7)
REDC_ITER MACRO iter_idx
    mov rax, r8
    imul rax, QWORD PTR [rsp+168]   ; rax = r8 * m_inv
    mov rdx, rax                     ; q = rax (implicit MULX operand)
    ADDMUL_FIRST_8                   ; T += q * m (first word)
    ADDMUL_REST_8                    ; T += q * m (remaining 7 words)
    add r15, QWORD PTR [rdi + (8 + iter_idx)*8]  ; merge upper product buffer
    jnc ri_nc                                    ; carry-propagation tail elided in this excerpt
ENDM

MULX / ADCX / ADOX dual carry chains

BMI2's MULX does not affect CF/OF. ADX's ADCX / ADOX treat CF and OF as independent carry flags. This lets two carry chains run simultaneously, halving dependency chains. Standard technique for modern Intel/AMD Montgomery code.

No scratch allocation

Phase 1's 17-word product buffer (the 16-word 8×8 product plus one carry word) plus 48 bytes of spill area totals 184 bytes on the stack. No heap allocation, so the kernel keeps the sub-microsecond latency required on crypto hot paths.

Dispatch

// IntModular.cpp: n==8 branch
if (n == 8 && mpn::detail::has_bmi2_adx()) {
    mpn_mont_mul_8(r, a, b, m, m_inv);
    return;
}

Benchmarks (Zen 3, MSVC Release x64)

Computes $x \mapsto x^e \bmod m$ where $e$ is a random exponent of the same bit length as $m$ (sliding-window exponentiation in Montgomery form: one square per bit plus one multiply when needed).

Size        Operation   calx (μs)
512 bits    powerMod         31.9
1024 bits   powerMod        261.8

The n=8 specialization is most effective at 512 bits. At 1024 bits the n=16 specialization is comparatively weaker.

6.6 Chinese Remainder Theorem

Given pairwise coprime moduli $m_1, m_2, \ldots, m_k$, the Chinese Remainder Theorem gives a one-to-one correspondence:

$$(a \bmod m_1, \ldots, a \bmod m_k) \leftrightarrow a \bmod M, \quad M = \prod m_i.$$

This correspondence is a ring isomorphism (it commutes with addition, subtraction, and multiplication), so large-modulus computations can be split into $k$ smaller-modulus ones. calx uses CRT both for NTT prime combining and for stand-alone integer CRT in RSA key generation and polynomial interpolation.
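The component-wise property is easy to check concretely; a word-sized sketch (illustrative names, not calx code):

```cpp
#include <cstdint>
#include <cassert>

// CRT homomorphism check: a product mod M = 5*7*9 = 315 agrees with
// the component-wise products of the residues (moduli pairwise coprime).
inline bool crt_residues_agree(uint64_t a, uint64_t b) {
    const uint64_t m[3] = {5, 7, 9};
    const uint64_t M = 315;
    uint64_t prod = (a % M) * (b % M) % M;             // reference: mod-M product
    for (int i = 0; i < 3; ++i) {
        uint64_t ri = (a % m[i]) * (b % m[i]) % m[i];  // component-wise product
        if (ri != prod % m[i]) return false;           // residues must match
    }
    return true;
}
```

The same check passes for addition and subtraction; only division (by a residue not coprime to some $m_i$) falls outside the correspondence.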

Garner's algorithm

An efficient way to reconstruct $a$ from residues $r_i = a \bmod m_i$:

$$\begin{aligned} v_1 &= r_1, \\ v_2 &= (r_2 - v_1) \cdot m_1^{-1} \bmod m_2, \\ v_3 &= (r_3 - v_1 - v_2 m_1) \cdot (m_1 m_2)^{-1} \bmod m_3, \\ &\vdots \end{aligned}$$

Finally $a = v_1 + v_2 m_1 + v_3 m_1 m_2 + \cdots$. The key advantage of Garner is that it does not build the full product $M = \prod m_i$ explicitly; each step accumulates $m_1 m_2 \cdots m_{i-1}$ incrementally, so only modular inverses with respect to the small $m_i$ are needed.

// Garner's algorithm (incremental version, pseudocode; mod_inverse is assumed)
template <std::integral T>
T garner_crt(const std::vector<T>& remainders,
             const std::vector<T>& moduli) {
    std::vector<T> v(moduli.size());
    v[0] = remainders[0] % moduli[0];
    for (size_t i = 1; i < moduli.size(); ++i) {
        T t = remainders[i] % moduli[i];
        for (size_t j = 0; j < i; ++j) {
            T vj = v[j] % moduli[i];
            t = (t >= vj) ? t - vj : t + moduli[i] - vj;   // (t - v[j]) mod m_i, no negatives
            t = (t * mod_inverse(moduli[j], moduli[i])) % moduli[i];
        }
        v[i] = t;
    }
    // Finally assemble a = v[0] + v[1]*m_0 + v[2]*m_0*m_1 + ...
    // (a real implementation does this step in multiprecision; T would overflow)
    T a = v.back();
    for (size_t i = moduli.size() - 1; i-- > 0;) a = a * moduli[i] + v[i];
    return a;
}

A simpler Lagrange-style CRT that precomputes $M$ ($a = \sum r_i \cdot M_i \cdot (M_i^{-1} \bmod m_i) \bmod M$ with $M_i = M / m_i$) is also viable and performs comparably for small $k$. Choose based on whether building $M$ up front is cheaper than repeated small modular inverses.

Application: RSA-CRT

Computing RSA decryption $c^d \bmod n$ with $n = pq$ directly is slow. Instead compute $m_p = c^{d \bmod (p-1)} \bmod p$ and $m_q$ separately, then combine via CRT. Since moduli are half the size, Montgomery multiplication is ~4× faster. Standard optimization in all real RSA implementations.
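The idea in miniature, with word-sized toy numbers (pow_mod and rsa_crt_decrypt are illustrative names; a real implementation uses Montgomery exponentiation and multiprecision arithmetic):

```cpp
#include <cstdint>
#include <cassert>

// RSA-CRT in miniature: decrypt c^d mod n = pq via two half-size
// exponentiations, then a Garner combine. Toy word-sized numbers;
// names are illustrative, not calx API.
using u64 = uint64_t;
using u128 = unsigned __int128;

static u64 pow_mod(u64 b, u64 e, u64 m) {          // square-and-multiply
    u64 r = 1 % m;
    for (b %= m; e; e >>= 1) {
        if (e & 1) r = (u64)((u128)r * b % m);
        b = (u64)((u128)b * b % m);
    }
    return r;
}

static u64 rsa_crt_decrypt(u64 c, u64 d, u64 p, u64 q, u64 qinv /* q^{-1} mod p */) {
    u64 mp = pow_mod(c % p, d % (p - 1), p);       // exponents reduced via Fermat
    u64 mq = pow_mod(c % q, d % (q - 1), q);
    u64 h = (u64)((u128)((mp + p - mq % p) % p) * qinv % p);  // Garner: (mp-mq)*qinv mod p
    return mq + h * q;                              // m = mq + h*q, already < p*q
}
```

With the textbook toy key p = 61, q = 53, e = 17, d = 2753, decrypting pow_mod(65, 17, 3233) via rsa_crt_decrypt returns 65, matching the direct computation.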

6.7 Fast Modular Exponentiation — Sliding Window

The simplest way to compute $m^e \bmod n$ is to process the binary expansion of $e$ bit-by-bit (binary method): $O(\log e)$ multiplications.
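At word scale the binary method is a few lines; a left-to-right sketch assuming GCC/Clang's unsigned __int128 (pow_mod_binary is an illustrative name):

```cpp
#include <cstdint>
#include <cassert>

// Left-to-right binary method: one squaring per exponent bit, plus one
// multiply per set bit. Word-sized sketch using unsigned __int128 (GCC/Clang).
static uint64_t pow_mod_binary(uint64_t b, uint64_t e, uint64_t m) {
    using u128 = unsigned __int128;
    uint64_t r = 1 % m;
    b %= m;
    for (int i = 63; i >= 0; --i) {
        r = (uint64_t)((u128)r * r % m);            // always square
        if ((e >> i) & 1)
            r = (uint64_t)((u128)r * b % m);        // multiply only on a 1 bit
    }
    return r;
}
```

Every iteration squares once; the extra multiply fires only on set exponent bits, giving the $O(\log e)$ multiplication count that windowing then improves on.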

$k$-bit window method

Processing $k$ consecutive bits at a time shortens the addition chain further. Pre-compute $m^1, m^3, m^5, \ldots, m^{2^k - 1}$ (odd powers), then at each window $w$:

$$\text{result} \leftarrow \text{result}^{2^k} \cdot m^w.$$

The calx window width selection:

// IntModular.cpp: choose_window_width
int w = 1;                          // e <= 24 bits
if (expBits > 24)  w = 3;           // <= 256 bits
if (expBits > 256) w = 4;           // <= 1024 bits
if (expBits > 1024) w = 5;          // <= 4096 bits
if (expBits > 4096) w = 7;          // larger

Sliding window

A fixed window still multiplies when the window starts with 0. A sliding window skips runs of zeros (only squaring in that interval) and aligns each window to start with a 1. On average this saves 20-30% of multiplications.

// IntModular.cpp: powerMod (sliding window, sketch)
int oddTableSize = 1 << (w - 1);   // pre-compute only odd powers
std::vector<uint64_t> g(oddTableSize);  // table of odd powers (one word per entry in this sketch)
g[0] = baseR;                       // m^1 (Montgomery form)
if (oddTableSize > 1) {
    uint64_t base2R = mont_mul(baseR, baseR);    // m^2
    for (int i = 1; i < oddTableSize; ++i)
        g[i] = mont_mul(g[i-1], base2R);         // m^3, m^5, m^7, ...
}
// Main loop: scan exponent bits left-to-right, detect windows, multiply

Since everything happens in Montgomery form, each mont_mul is division-free. The result is converted back at the end.

6.8 Constant-Time Implementation — powerModSec

In cryptographic use, if compute time depends on secret keys, a timing attack can recover them. Since Kocher's 1996 attack, real-world timing attacks have been reported periodically (e.g., CVE-2018-0737).

What affects timing?

  • Branches: if branch prediction misses, cycles vary.
  • Cache access patterns: if a lookup table index is secret, L1 hit/miss can vary by 10×.
  • Data-dependent instructions: on some CPUs, division and multiplication cycles depend on operand values.

Constant-time modular exponentiation (calx powerModSec)

The calx powerModSec changes three things from the ordinary powerMod:

  1. Fixed window (no sliding): always performs the same number of multiplications. If the window is 0, still multiplies by $m^0 = 1$.
  2. Masked table lookup: reads every table entry and selects via bitmask, so the cache access pattern never depends on secrets.
  3. Branch-free: uses arithmetic masks rather than ifs.

// IntModular.cpp: powerModSec constant-time table select
int tableSize = 1 << w;  // all 0..2^w-1 (not only odd)
std::vector<uint64_t> g_buf(tableSize * n);
// No sliding: all entries always ready

auto ct_select = [&](uint64_t* dst, int idx) {
    std::memset(dst, 0, n * sizeof(uint64_t));
    for (int i = 0; i < tableSize; ++i) {
        // i==idx yields 1, otherwise 0. Cast to int64_t and negate to get
        // 0xFFFF...FFFF (all ones) or 0x0000...0000 via two's-complement semantics.
        uint64_t mask = -(static_cast<int64_t>(i == idx));
        for (size_t j = 0; j < n; ++j)
            dst[j] |= g_buf[i * n + j] & mask;              // branch-free masked OR
    }
};

Performance tradeoff

powerModSec is 20-40% slower than powerMod because it skips sliding and reads all table entries. For RSA/ECDSA signing this slowdown is gladly accepted — secret-key leakage is far worse than a slightly slower signature.

Further mitigations

A fully constant-time implementation also considers:

  • Blinding: replace the input $m$ with $m r^e \bmod n$ for a random $r$ before exponentiating; since $(m r^e)^d \equiv m^d \cdot r \pmod{n}$, multiply the result by $r^{-1}$ to unblind.
  • Cache-line aligned tables: align table entries so one line covers everything.
  • Power analysis (DPA) mitigations: for physical attacks, masked (secret-shared) computation is needed.

Since calx is not a cryptographic library, we do not implement blinding or DPA countermeasures. For real cryptographic use, rely on established libraries such as OpenSSL or libsodium.

6.9 Summary

  • Barrett: pre-compute $\mu = \lfloor 2^k / m \rfloor$ to replace division with two multiplications. Best for single-shot remainders.
  • Montgomery: use Montgomery form $\bar{a} = a R \bmod m$ to chain multiplications without division. REDC is the key. Dominates cryptographic/NTT hot paths.
  • CIOS: interleaved multiply-and-reduce pattern. calx's baseline. Dispatches to specialized assembly at n=4, 8, 16, 32.
  • mpn_mont_mul_8: 512-bit specialization. 8 accumulators resident in r8-r15. Dual carry chains with MULX/ADCX/ADOX.
  • CRT: parallel computation over coprime moduli, combined by Garner. RSA-CRT is standard optimization.
  • Sliding-window modular exponentiation: pre-compute odd powers, skip zero runs for 20-30% fewer multiplications.
  • Constant-time implementation: powerModSec uses fixed windows + masked selects. Standard for any crypto implementation.

The next chapter (Chapter 7: GCD) covers algorithms that repeatedly divide to compute the greatest common divisor — the Euclidean algorithm and its improvements (binary GCD, half-GCD).

References

  • Barrett, P. "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor", CRYPTO '86, LNCS 263, pp. 311-323, 1987.
  • Montgomery, P.L. "Modular Multiplication without Trial Division", Mathematics of Computation, 44(170), pp. 519-521, 1985.
  • Menezes, A.J., van Oorschot, P.C., Vanstone, S.A. Handbook of Applied Cryptography, CRC Press, 1996. Chapter 14.
  • Brent, R.P., Zimmermann, P. Modern Computer Arithmetic, Cambridge University Press, 2010. Chapter 2.
  • Koç, Ç.K., Acar, T., Kaliski, B.S. "Analyzing and Comparing Montgomery Multiplication Algorithms", IEEE Micro, 16(3), pp. 26-33, 1996. Comparison of CIOS/SOS/FIOS.
  • Kocher, P.C. "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems", CRYPTO '96, LNCS 1109, pp. 104-113, 1996.
  • Bernstein, D.J. "Curve25519: New Diffie-Hellman Speed Records", PKC '06, LNCS 3958, pp. 207-228, 2006. Constant-time implementation in practice.
