Chapter 6: Modular Arithmetic
Barrett, Montgomery, CRT, and Modular Exponentiation
Modular reduction $a \bmod m$ is a special case of division, yet it has its own deep theory and optimization. RSA's $m^e \bmod n$, elliptic-curve point doubling, every NTT round, and Miller-Rabin primality testing all perform enormous numbers of modular multiplications. Whether a single modular multiplication takes $1\,\mu\text{s}$ or $0.1\,\mu\text{s}$ is what decides whether a cryptosystem is practical.
This chapter covers Barrett reduction, Montgomery multiplication, the Chinese Remainder Theorem, fast modular exponentiation, and constant-time implementations for cryptographic use. We follow the calx Montgomery implementation — in particular the n=8 (512-bit) assembly kernel — to see how far one can push a modern CPU. On 512-bit exponentiation, calx completes in roughly $32\,\mu\text{s}$.
6.1 Basics — Why Avoid Division?
Modular remainder $a \bmod m$ is by definition derived from the division $\lfloor a / m \rfloor$:
$$a \bmod m = a - m \left\lfloor \frac{a}{m} \right\rfloor$$
As the previous chapter showed, multiprecision division is 2-5 times slower than multiplication. In cryptography and NTT, however, we often perform hundreds of millions of modular multiplications with the same modulus $m$. By pre-computing something about $m$, we can get the remainder with just 1-2 multiplications per call.
Goal: replace division with multiplication
Under the assumption of a fixed $m$, both Barrett and Montgomery pre-compute a reciprocal-like quantity of $m$, so that runtime reduces to two multiplications. Which to use:
- Barrett: single-shot remainder, or remainder of a non-product $a$. Works for any modulus $m$.
- Montgomery: when multiplication and reduction chain together, as in $(a \cdot b) \bmod m$. Requires odd $m$ (must be coprime to $R = \beta^n$). Breaks even with Barrett once you chain $k \ge 3$ multiplications.
6.2 Barrett Reduction
Barrett's idea (1986) was to use a pre-computed "pseudo-reciprocal" $\mu = \lfloor 2^k / m \rfloor$ to replace division with two multiplications.
The algorithm
If $m$ has $n$ words, take $k = 2n \cdot 64$ and pre-compute $\mu = \lfloor 2^k / m \rfloor$. Then for $a < m^2$:
$$\hat{q} = \left\lfloor \frac{a \mu}{2^k} \right\rfloor, \qquad r = a - \hat{q} m$$
$\hat{q}$ is an approximation of the true quotient $\lfloor a/m \rfloor$ with error $\le 2$. Hence $r$ is at most about $2m$ too large, correctable by at most two subtractions.
Why is it fast?
$a \cdot \mu$ is $2n$-word-by-$n$-word so the full product is $3n$ words — roughly $2\times$ multiplication cost. But we only need the top $2n$ words, cutting that in half (Karatsuba-style truncated multiplication). Overwhelmingly lighter than $O(n^2)$ division.
Numerical example (small modulus)
$m = 7$, $k = 8$: $\mu = \lfloor 256 / 7 \rfloor = 36$. Reduce $a = 97$ modulo 7:
$$\hat{q} = \lfloor 97 \cdot 36 / 256 \rfloor = \lfloor 3492 / 256 \rfloor = 13, \qquad r = 97 - 13 \cdot 7 = 6$$
True answer $97 \bmod 7 = 6$ matches. A lucky case where no correction was needed.
Barrett and $\mu$-division
The $\mu$-div in the previous chapter applies Barrett's logic to unbalanced division. Barrett alone produces only the remainder; $\mu$-div returns both quotient and remainder.
6.3 Montgomery Multiplication
Montgomery's idea (1985) was to keep integers in a transformed Montgomery form $\bar{a} = a R \bmod m$ (with $R = 2^{64n}$) and perform multiplication + reduction in this form "without any division."
Montgomery form and REDC
Montgomery form
The Montgomery representation of $a$ is $\bar{a} = a R \bmod m$. Contrast with ordinary multiplication:
$$\bar{a} \cdot \bar{b} = (aR)(bR) = abR^2 \pmod m$$
A naive multiplication leaves an extra factor of $R$. The operation that removes it is REDC:
$$\text{REDC}(T) = T R^{-1} \bmod m$$
Then $\text{REDC}(\bar{a} \bar{b}) = (abR^2) R^{-1} = abR = \overline{ab}$, keeping the product in Montgomery form.
The REDC algorithm
This is the magic of Montgomery. Given $T$ (at most $2n$ words):
- Compute $u = T \cdot (-m^{-1}) \bmod R$ (only the bottom $n$ words are needed).
- Compute $T' = T + u \cdot m$. Then $T' \equiv T \pmod m$ and $T' \equiv 0 \pmod R$.
- $T'' = T' / R$ is just a right shift (drop the bottom $n$ words).
- If needed, subtract $m$ once from $T''$ to canonicalize.
The computation is two multiplications ($u$ and $u \cdot m$), an add, a shift, and at most one subtraction. Division hides in the pre-computed $m^{-1} \bmod R$. With $m$ fixed, $m^{-1}$ is computed only once.
Cost of input/output conversion
Converting an ordinary integer $a$ to Montgomery form costs a single Montgomery multiplication ($\bar{a} = aR \bmod m$), and the inverse is REDC($\bar{a} \cdot 1$). For $k$ multiplications in Montgomery space, the overhead is 2 conversions plus $k$ Montgomery multiplications. Beyond about $k \ge 3$ this beats ordinary reduction.
6.4 The CIOS Algorithm and calx Implementation
When implementing REDC, one can choose between SOS (Separated Operand Scanning — compute $T = a \cdot b$ fully, then reduce) and CIOS (Coarsely Integrated Operand Scanning — interleave multiplication and reduction). CIOS saves intermediate storage and is fastest at typical crypto sizes (n=4-64).
Structure of CIOS
The inner loop over each word $b_i$ of $b$ does:
- Multiply step: $t \leftarrow t + a \cdot b_i$ (fixes one more word of $T$).
- Reduce step: compute $q \leftarrow t_0 \cdot m_{\text{inv}} \bmod \beta$, then $t \leftarrow (t + q \cdot m) / \beta$ (zeroes out the bottom word).
After $n$ iterations, $t$ is the REDC result.
calx C++ baseline
// IntModular.cpp: mont_redc (generic CIOS)
// Note: mpn::addmul_1 returns the single-word carry out of the n-word addmul.
// When tn > i+n, that carry must be propagated up through the higher
// words (the inner while loop).
inline void mont_redc(uint64_t* r, uint64_t* t, size_t tn,
const uint64_t* m, size_t n, uint64_t m_inv) {
for (size_t i = 0; i < n; ++i) {
uint64_t q = t[i] * m_inv; // q = T[i] * (-m^{-1}) mod beta
uint64_t carry = mpn::addmul_1(t + i, m, n, q); // T += q*m*beta^i (local result)
// Propagate the single carry out of addmul_1 upward (only while non-zero)
for (size_t j = i + n; carry && j < tn; ++j) {
uint64_t sum = t[j] + carry;
carry = (sum < t[j]) ? 1 : 0;
t[j] = sum;
}
}
std::memcpy(r, t + n, n * sizeof(uint64_t));
if (mpn::cmp(r, n, m, n) >= 0) {
mpn::sub(r, r, n, m, n); // normalize (T < m)
}
}
The generic path is independent of $n$. For specific sizes (n=4, 8, 16, 32) we dispatch to hand-written assembly kernels.
6.5 Montgomery n=8 Specialization — calx Assembly
Use cases such as RSA-1024 with CRT (512-bit prime halves) and 512-bit prime-field arithmetic lead to 512-bit moduli (8 words).
An assembly kernel specialized to this one size runs 30-50% faster than generic CIOS.
The calx mpn_mont_mul_8 is organized into three phases.
Three-phase structure
mpn_mont_mul_8 structure
- Phase 1: 8×8 full multiplication (SOS-style, stack-allocated 17-word product buffer).
- Phase 2: 8 REDC iterations (compute $q_i = t_i \cdot m_{\text{inv}}$ and shrink $T$ by one word per iter).
- Phase 3: conditional subtraction ($T \ge m$ then subtract $m$).
Accumulator register residency
The 8-word accumulator $r_8, r_9, \ldots, r_{15}$ sits in GPRs (r8-r15), significantly reducing memory traffic:
; Phase 1 skeleton from mpn_x64_mont.asm
xor r8d, r8d ; Zero 8 accumulators
xor r9d, r9d
... ; through r15
mov rdx, [rcx] ; b[0] into rdx (implicit MULX operand)
ADDMUL_FIRST_8 ; Seed: r8 = a[0]*b[0]_lo, carry into r9
mov QWORD PTR [rdi], r8 ; r8 is final, store to product buffer
ADDMUL_REST_8 ; Accumulate remaining products into r9-r15
; Repeat for b[1]..b[7]
ADDMUL macros
; ADDMUL_FIRST_8: seed multiplication for b[j] (a[0]*rdx into rbp:rbx)
ADDMUL_FIRST_8 MACRO
xor eax, eax
mulx rbx, rbp, [rsi] ; rbp:rbx = a[0] * rdx (b[j])
adox r8, rbp ; r8 += rbp (preserves OF)
ENDM
; ADDMUL_REST_8: accumulate remaining a[1]..a[7] into r9..r15 via two carry chains
ADDMUL_REST_8 MACRO
mulx rbp, rax, [rsi+8] ; a[1]*rdx low half
adcx r8, rbx ; prior rbx into r8 via CF chain
adox r9, rax ; low half into r9 via OF chain
mulx rbx, rax, [rsi+16] ; a[2]*rdx
adcx r9, rbp
adox r10, rax
; ... same pattern unrolled for a[3]..a[7] into r11..r15
    ; final: adcx r15, rbx, then adox r15 with a zeroed register to merge both carries
ENDM
; REDC_ITER: one reduction round (iter_idx = 0..7)
REDC_ITER MACRO iter_idx
mov rax, r8
imul rax, QWORD PTR [rsp+168] ; rax = r8 * m_inv
mov rdx, rax ; q = rax (implicit MULX operand)
ADDMUL_FIRST_8 ; T += q * m (first word)
ADDMUL_REST_8 ; T += q * m (remaining 7 words)
add r15, QWORD PTR [rdi + (8 + iter_idx)*8] ; merge upper product buffer
jnc ri_nc
ENDM
MULX / ADCX / ADOX dual carry chains
BMI2's MULX does not affect CF/OF, and ADX's ADCX / ADOX treat CF and OF as independent carry flags. This lets two carry chains run simultaneously, halving the dependency chains. It is the standard technique for Montgomery code on modern Intel/AMD CPUs.
No scratch allocation
Phase 1's 17-word product buffer (8 words for the multiply + 8 for REDC + 1 top) plus 48 bytes of spill area totals 184 bytes on the stack. No heap allocation, so the kernel keeps sub-microsecond latency required for crypto hot paths.
Dispatch
// IntModular.cpp: n==8 branch
if (n == 8 && mpn::detail::has_bmi2_adx()) {
mpn_mont_mul_8(r, a, b, m, m_inv);
return;
}
Benchmarks (Zen 3, MSVC Release x64)
Computes $x \mapsto x^e \bmod m$ where $e$ is a random exponent of the same bit length as $m$ (sliding-window exponentiation in Montgomery form: one square per bit plus one multiply when needed).
| Size | Operation | calx (μs) |
|---|---|---|
| 512 bits | powerMod | 31.9 |
| 1024 bits | powerMod | 261.8 |
The n=8 specialization is most effective at 512 bits. At 1024 bits the n=16 specialization is comparatively weaker.
6.6 Chinese Remainder Theorem
Given pairwise coprime moduli $m_1, m_2, \ldots, m_k$ with $M = m_1 m_2 \cdots m_k$, the Chinese Remainder Theorem gives a one-to-one correspondence:
$$\mathbb{Z}/M\mathbb{Z} \;\cong\; \mathbb{Z}/m_1\mathbb{Z} \times \cdots \times \mathbb{Z}/m_k\mathbb{Z}, \qquad a \mapsto (a \bmod m_1, \ldots, a \bmod m_k)$$
This map is a ring isomorphism (it commutes with add/sub/mul), so large-modulus computations can be split into $k$ smaller-modulus ones. calx uses CRT both for NTT prime combining and for stand-alone integer CRT used in RSA key generation and polynomial interpolation.
Garner's algorithm
An efficient way to reconstruct $a$ from residues $r_i = a \bmod m_i$ is to compute mixed-radix digits $v_i$:
$$v_1 = r_1, \qquad v_i = \Bigl(r_i - \bigl(v_1 + v_2 m_1 + \cdots + v_{i-1} m_1 \cdots m_{i-2}\bigr)\Bigr)\,(m_1 m_2 \cdots m_{i-1})^{-1} \bmod m_i$$
Finally $a = v_1 + v_2 m_1 + v_3 m_1 m_2 + \cdots$. The key advantage of Garner is that it does not build the full product $M = \prod m_i$ explicitly; each step accumulates $m_1 m_2 \cdots m_{i-1}$ incrementally, so only modular inverses with respect to the small $m_i$ are needed.
// Garner's algorithm (incremental version, pseudocode over machine integers;
// the products below must fit in T -- real multiprecision code widens instead)
template <std::signed_integral T>
T garner_crt(const std::vector<T>& remainders,
             const std::vector<T>& moduli) {
    std::vector<T> v(moduli.size());
    v[0] = remainders[0];
    for (size_t i = 1; i < moduli.size(); ++i) {
        T t = remainders[i];
        for (size_t j = 0; j < i; ++j) {
            t = ((t - v[j]) * mod_inverse(moduli[j], moduli[i])) % moduli[i];
            if (t < 0) t += moduli[i];  // C++ % keeps the dividend's sign
        }
        v[i] = t;
    }
    // Finally assemble a = v[0] + v[1]*m_0 + v[2]*m_0*m_1 + ... (Horner form,
    // in multiprecision for real moduli)
    T a = v.back();
    for (size_t i = moduli.size() - 1; i-- > 0;) a = a * moduli[i] + v[i];
    return a;
}
A simpler Lagrange-style CRT that precomputes $M$ ($a = \sum r_i \cdot M_i \cdot (M_i^{-1} \bmod m_i) \bmod M$ with $M_i = M / m_i$) is also viable and performs comparably for small $k$. Choose based on whether building $M$ up front is cheaper than repeated small modular inverses.
Application: RSA-CRT
Computing RSA decryption $c^d \bmod n$ with $n = pq$ directly is slow. Instead compute $m_p = c^{d \bmod (p-1)} \bmod p$ and $m_q = c^{d \bmod (q-1)} \bmod q$ separately, then combine via CRT. Each half-size Montgomery multiplication is ~4× cheaper and the reduced exponents are half as long, so even with two exponentiations the total is roughly 4× faster. A standard optimization in all real RSA implementations.
6.7 Fast Modular Exponentiation — Sliding Window
The simplest way to compute $m^e \bmod n$ is to process the binary expansion of $e$ bit-by-bit (binary method): $O(\log e)$ multiplications.
$k$-bit window method
Processing $k$ consecutive bits at a time shortens the addition chain further. Pre-compute $m^1, m^3, m^5, \ldots, m^{2^k - 1}$ (odd powers); then for each window value $w$, square $k$ times and multiply once by the pre-computed $m^w$ (trailing zero bits of the window are deferred to extra squarings so that $w$ stays odd).
The calx window width selection:
// IntModular.cpp: choose_window_width
int w = 1; // e <= 24 bits
if (expBits > 24) w = 3; // <= 256 bits
if (expBits > 256) w = 4; // <= 1024 bits
if (expBits > 1024) w = 5; // <= 4096 bits
if (expBits > 4096) w = 7; // larger
Sliding window
A fixed window still multiplies when the window starts with 0. A sliding window skips runs of zeros (only squaring in that interval) and aligns each window to start with a 1. On average this saves 20-30% of multiplications.
// IntModular.cpp: powerMod (sliding window, sketch)
int oddTableSize = 1 << (w - 1); // pre-compute only odd powers
std::vector<uint64_t> g(oddTableSize); // (sketch: one word stands in for an n-word value)
g[0] = baseR; // m^1 (Montgomery form)
if (oddTableSize > 1) {
uint64_t base2R = mont_mul(baseR, baseR); // m^2
for (int i = 1; i < oddTableSize; ++i)
g[i] = mont_mul(g[i-1], base2R); // m^3, m^5, m^7, ...
}
// Main loop: scan exponent bits left-to-right, detect windows, multiply
Since everything happens in Montgomery form, each mont_mul is division-free. The result is converted back at the end.
6.8 Constant-Time Implementation — powerModSec
In cryptographic use, if compute time depends on secret keys, a timing attack can recover them. Since Kocher's 1996 attack, real-world timing attacks have been reported periodically (e.g., CVE-2018-0737).
What affects timing?
- Branches: if branch prediction misses, cycles vary.
- Cache access patterns: if a lookup table index is secret, L1 hit/miss can vary by 10×.
- Data-dependent instructions: on some CPUs, division and multiplication cycles depend on operand values.
Constant-time modular exponentiation (calx powerModSec)
The calx powerModSec changes three things from the ordinary powerMod:
- Fixed window (no sliding): always performs the same number of multiplications. If the window is 0, still multiplies by $m^0 = 1$.
- Masked table lookup: reads every table entry and selects via bitmask, so the cache access pattern never depends on secrets.
- Branch-free: uses arithmetic masks rather than ifs.
// IntModular.cpp: powerModSec constant-time table select
int tableSize = 1 << w; // all 0..2^w-1 (not only odd)
std::vector<uint64_t> g_buf(tableSize * n);
// No sliding: all entries always ready
auto ct_select = [&](uint64_t* dst, int idx) {
std::memset(dst, 0, n * sizeof(uint64_t));
for (int i = 0; i < tableSize; ++i) {
// i==idx yields 1, otherwise 0. Cast to int64_t and negate to get
// 0xFFFF...FFFF (all ones) or 0x0000...0000 via two's-complement semantics.
uint64_t mask = -(static_cast<int64_t>(i == idx));
for (size_t j = 0; j < n; ++j)
dst[j] |= g_buf[i * n + j] & mask; // branch-free masked OR
}
};
Performance tradeoff
powerModSec is 20-40% slower than powerMod because it skips sliding and reads all table entries. For RSA/ECDSA signing this slowdown is gladly accepted: secret-key leakage is far worse than a slightly slower signature.
Further mitigations
A fully constant-time implementation also considers:
- Blinding: replace $m$ with $m r^e \bmod n$ for a random $r$, then multiply the result by $r^{-1}$ at the end.
- Cache-line aligned tables: align table entries so one line covers everything.
- Power analysis (DPA) mitigations: for physical attacks, masked (secret-shared) computation is needed.
Since calx is not a cryptographic library, we do not implement blinding or DPA countermeasures. For real cryptographic use, rely on established libraries such as OpenSSL or libsodium.
6.9 Summary
- Barrett: pre-compute $\mu = \lfloor 2^k / m \rfloor$ to replace division with two multiplications. Best for single-shot remainders.
- Montgomery: use Montgomery form $\bar{a} = a R \bmod m$ to chain multiplications without division. REDC is the key. Dominates cryptographic/NTT hot paths.
- CIOS: interleaved multiply-and-reduce pattern. calx's baseline. Dispatches to specialized assembly at n=4, 8, 16, 32.
- mpn_mont_mul_8: 512-bit specialization. 8 accumulators resident in r8-r15. Dual carry chains with MULX/ADCX/ADOX.
- CRT: parallel computation over coprime moduli, combined by Garner. RSA-CRT is standard optimization.
- Sliding-window modular exponentiation: pre-compute odd powers, skip zero runs for 20-30% fewer multiplications.
- Constant-time implementation: powerModSec uses fixed windows + masked selects. Standard for any crypto implementation.
The next chapter (Chapter 7: GCD) covers algorithms that repeatedly divide to compute the greatest common divisor — the Euclidean algorithm and its improvements (binary GCD, half-GCD).
References
- Barrett, P. "Implementing the Rivest Shamir and Adleman Public Key Encryption Algorithm on a Standard Digital Signal Processor", CRYPTO '86, LNCS 263, pp. 311-323, 1987.
- Montgomery, P.L. "Modular Multiplication without Trial Division", Mathematics of Computation, 44(170), pp. 519-521, 1985.
- Menezes, A.J., van Oorschot, P.C., Vanstone, S.A. Handbook of Applied Cryptography, CRC Press, 1996. Chapter 14.
- Brent, R.P., Zimmermann, P. Modern Computer Arithmetic, Cambridge University Press, 2010. Chapter 2.
- Koç, Ç.K., Acar, T., Kaliski, B.S. "Analyzing and Comparing Montgomery Multiplication Algorithms", IEEE Micro, 16(3), pp. 26-33, 1996. Comparison of CIOS/SOS/FIOS.
- Kocher, P.C. "Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems", CRYPTO '96, LNCS 1109, pp. 104-113.
- Bernstein, D.J. "Curve25519: New Diffie-Hellman Speed Records", PKC '06, LNCS 3958, pp. 207-228. Constant-time implementation in practice.
Related Chapters
- Chapter 5: Fast Division Algorithms — the naive implementation of modular reduction
- Chapter 7: Integer GCD Algorithms — extended Euclidean and the computation of $m^{-1}$
- Chapter 9: NTT Implementation Comparison — Montgomery multiplication in NTT
- Chapter 10: Primality Testing — Miller-Rabin exponentiation
- Chapter 11: Integer Factorization — modular arithmetic in Pollard rho and ECM