Cooley-Tukey FFT — Mixed-Radix Algorithms
Overview
Algorithms that evaluate a discrete Fourier transform (DFT) in $O(N \log N)$ are collectively called FFTs.
The dominant member of the family is the divide-and-conquer scheme rediscovered by Cooley and Tukey in 1965.
The sangi FFT class uses this scheme as its core and automatically switches between a SIMD radix-2 butterfly,
a mixed-radix butterfly, and a Bluestein fallback depending on the factorisation of $N$.
For API usage, see the FFT API reference.
DFT definition and complexity
The discrete Fourier transform of a length-$N$ complex sequence $x[0], \dots, x[N-1]$ is defined as
$$X[k] = \sum_{n=0}^{N-1} x[n] \, \omega_N^{kn}, \qquad \omega_N = e^{-2\pi i / N}, \qquad k = 0, \dots, N-1.$$
Computed by definition the sum costs $N$ complex multiplications and $N-1$ complex additions per output index, for a total of $\Theta(N^2)$ operations. That is about $10^6$ operations at $N = 1024$ and $10^{12}$ at $N = 2^{20}$, which quickly becomes impractical.
Cooley-Tukey factorises $N = N_1 N_2$ and replaces the length-$N$ DFT with $N_2$ DFTs of length $N_1$ together with $N_1$ DFTs of length $N_2$. Recursive application produces $\log N$ levels of $O(N)$ work each, cutting the total cost down to $O(N \log N)$.
Radix-2 Cooley-Tukey
When $N = 2^m$ we split the input into even and odd indices:
$$X[k] = \sum_{n=0}^{N/2-1} x[2n] \, \omega_N^{2nk} + \omega_N^{k} \sum_{n=0}^{N/2-1} x[2n+1] \, \omega_N^{2nk}.$$
Because $\omega_N^{2} = \omega_{N/2}$ this is a linear combination of two DFTs of length $N/2$,
$$X[k] = E[k] + \omega_N^{k} \, O[k], \qquad X[k+N/2] = E[k] - \omega_N^{k} \, O[k],$$
where $E$ and $O$ are the DFTs of the even and odd subsequences. The recurrence is expressed as a symmetric complex add-subtract pair known as the butterfly.
$$\begin{pmatrix} X[k] \\ X[k+N/2] \end{pmatrix} = \begin{pmatrix} 1 & \omega_N^k \\ 1 & -\omega_N^k \end{pmatrix} \begin{pmatrix} E[k] \\ O[k] \end{pmatrix}$$
Each level performs $N$ complex add/subtracts and $N/2$ complex multiplications. With $\log_2 N$ levels the total cost is about $\frac{N}{2} \log_2 N$ complex multiplications.
See also: Cooley-Tukey FFT
DIT and DIF
The radix-2 scheme admits two dual formulations.
- DIT (Decimation-In-Time): split the time-domain index $n$ by parity. The input is bit-reversed first and the output emerges in natural order.
- DIF (Decimation-In-Frequency): split the frequency-domain index $k$ into $k$ and $k+N/2$. Butterflies run on a naturally ordered input and the output is bit-reversed afterwards.
The arithmetic counts coincide, but the memory access patterns, twiddle ordering, and SIMD fit differ. A common trick is to pair DIT on the forward pass with DIF on the inverse so that the twiddle table is read in a consistent direction.
Bit-reversal permutation
DIT requires the input to be reordered in bit-reversed index order before the main loop. With $N = 2^m$, index $n = b_{m-1} b_{m-2} \cdots b_1 b_0$ in binary maps to $\tilde{n} = b_0 b_1 \cdots b_{m-1}$.
$$N = 8: \quad 0,1,2,3,4,5,6,7 \;\;\to\;\; 0,4,2,6,1,5,3,7$$
A naive bit-by-bit loop costs $O(N \log N)$, but Gold-Rader style schemes and incremental bit exchanges reduce this to $O(N)$. Pairing DIT (forward) with DIF (inverse) cancels the permutations at both ends, so they can be skipped when only the round-trip matters (e.g. convolution).
Radix-4 and radix-8
For $N = 4^m$ a 4-way split yields butterflies that are 4-point complex transforms:
$$\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{pmatrix} \begin{pmatrix} E_0 \\ \omega^{k} E_1 \\ \omega^{2k} E_2 \\ \omega^{3k} E_3 \end{pmatrix}$$
The factors $\pm 1, \pm j$ collapse to bit swaps and sign flips, removing roughly 25% of the real multiplications compared with radix-2. Radix-8 introduces $\omega_8 = e^{-i\pi/4}$, giving constants $\pm 1, \pm j, \pm(1\pm j)/\sqrt{2}$ that mix multiplication and addition more deeply.
Higher radixes also cut the level count to $\log_4 N$ or $\log_8 N$, reducing memory traffic and loop overhead.
Split-radix
Discovered by Yavne (1968) and rediscovered by Duhamel and Hollmann (1984). It applies radix-2 to even indices and radix-4 to odd indices, giving an asymmetric split:
$$X[k] = E[k] + \omega_N^k Z_1[k] + \omega_N^{3k} Z_3[k].$$
Here $E$ is a DFT of length $N/2$, and $Z_1, Z_3$ are DFTs of length $N/4$. The real-multiplication count is $\frac{4}{3} N \log_2 N - \tfrac{38}{9} N + \cdots$, about 33% lower than pure radix-2 ($2 N \log_2 N$). Johnson and Frigo (2007) published a more compact form that preserves the same constants.
The arithmetic savings come at the cost of more complex loop structure that fits SIMD radix-2 less neatly. sangi uses an AVX2 radix-2 butterfly that operates on four complex numbers per vector for float (8-wide) and two complex numbers for double (4-wide), prioritising measured throughput over theoretical multiplication count.
Mixed radix (2, 3, 5)
Cooley-Tukey extends to non-power-of-two $N$ as long as $N$ factors into small primes. sangi handles $N = 2^a \cdot 3^b \cdot 5^c$ (5-smooth) directly through a mixed-radix path.
- Radix-3 butterfly: uses $\omega_3 = e^{-2\pi i / 3} = -\tfrac{1}{2} - \tfrac{\sqrt{3}}{2} i$. Four real multiplications and twelve real additions per 3-point transform.
- Radix-5 butterfly: uses $\omega_5$. Ten real multiplications and 34 real additions in the Winograd minimum-multiplication form.
The order in which the prime factors are processed matters: handling the largest factors first lets the innermost levels remain power-of-two and SIMD friendly (Singleton 1967, Bergland 1968). sangi clusters the radix-2 levels into AVX2 batches and processes the radix-3 and radix-5 levels in scalar code.
FFT::is_valid_size(n) reports whether $n$ is 5-smooth, and
FFT::next_valid_size(n) returns the smallest 5-smooth integer $\geq n$.
Twiddle-factor accuracy
Each butterfly multiplies by $\omega_N^k = \cos(2\pi k / N) - i \sin(2\pi k / N)$, called the twiddle factor. Three implementation strategies exist.
- Direct evaluation: call
std::cosandstd::sinat every use. The most accurate, but the slowest. - Precomputed table: compute $\{\omega_N^0, \dots, \omega_N^{N-1}\}$ once and cache them. $O(N)$ memory and $O(1)$ access.
- Recurrence: keep $w_k$ and update $w_{k+1} = w_k \cdot w_1$. Constant memory but accumulating rounding error.
The relative error of the recurrence grows like $O(m \varepsilon)$ over a length $2^m$ transform, where
$\varepsilon$ is the unit roundoff. Long transforms therefore require Tang's half-angle formula (1989) or
Buneman's recurrence to keep the error bounded. sangi defaults to a precomputed table and refreshes the boundary
entries via direct cos/sin evaluation.
In-place and strided transforms
A radix-2 butterfly reads two elements and writes two elements, so the whole FFT can proceed in place.
The sangi FFT::fft(std::span<T>) overwrites its argument directly.
Multi-dimensional FFTs (FFT-multidim) apply 1D transforms along each axis, but axes other than the innermost run with stride $N_x$ and lose unit-stride access. Two strategies are common:
- Strided butterflies: bake the stride into pointer arithmetic. Simpler code, weaker cache behaviour.
- Transpose + contiguous FFT: transpose the data so each axis becomes innermost, then run a unit-stride 1D FFT. Block transposition keeps the operation cache-friendly.
sangi fft2d and fft3d use the transpose path and run each 1D FFT through the
SIMD radix-2 kernel on contiguous memory.
Dispatch in sangi
FFT<T, Real>::fft(data) dispatches as follows based on $N = $ data.size().
| $N$ | Path | Notes |
|---|---|---|
| $2^k$ | SIMD radix-2 (AVX2 / SSE) | Optimised float / double path |
| $2^a \cdot 3^b \cdot 5^c$ (5-smooth) | Mixed radix-{2,3,5} | is_valid_size returns true |
| Otherwise | Bluestein (chirp z) | Pads internally to a power-of-two $\geq 2N - 1$ |
ModularInt<P> | NTT path | Radix-2 butterflies over a residue ring |
The inverse transform applies the same butterflies with conjugated twiddles plus $1/N$ normalisation.
Use the FFTEngine class to choose a normalisation convention through
FFTDivMode (None / Forward / Inverse / Both).
Bluestein and arbitrary-length transforms are covered in FFT-bluestein-real, which also discusses real-input packing. NTT details live in FFT-ntt-bigint, and multi-dimensional FFTs and convolution in FFT-multidim.
References
- Cooley, J. W. and Tukey, J. W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series". Mathematics of Computation, 19, 297–301.
- Singleton, R. C. (1969). "An Algorithm for Computing the Mixed Radix Fast Fourier Transform". IEEE Trans. Audio Electroacoust., 17, 93–103.
- Duhamel, P. and Hollmann, H. (1984). "Split-radix FFT algorithm". Electronics Letters, 20, 14–16.
- Johnson, S. G. and Frigo, M. (2007). "A Modified Split-Radix FFT With Fewer Arithmetic Operations". IEEE Trans. Signal Process., 55, 111–119.
- Van Loan, C. (1992). Computational Frameworks for the Fast Fourier Transform. SIAM.