Why does the Cooley-Tukey FFT achieve O(N log N) complexity?

It decomposes a length-N DFT into smaller DFTs of length N₁ and N₂ with N = N₁·N₂, applied recursively. Each level performs O(N) work, the recursion has depth log N, so the total cost is O(N log N). For N=1024 this is roughly 100x faster than the direct O(N²) sum, and the gap widens to about 50,000x at N=2²⁰.

What is the difference between DIT and DIF?

DIT (Decimation-In-Time) bit-reverses the input first and produces a naturally ordered output, while DIF (Decimation-In-Frequency) applies butterflies to naturally ordered input and bit-reverses the output. The arithmetic cost is identical, but the memory access patterns differ and influence how easily the algorithm vectorizes and fits cache.

Split-radix uses radix-2 for the even-indexed sub-DFT and radix-4 for the odd-indexed ones. The real-multiplication count is roughly 33 percent lower than pure radix-2 and for many years held the best known asymptotic constant for power-of-two sizes. sangi prioritises SIMD throughput and uses an AVX2 radix-2 butterfly instead.

How does sangi FFT pick an algorithm at runtime?

Power-of-two N uses SIMD radix-2 (AVX2/SSE). 5-smooth N (N = 2ᵃ·3ᵇ·5ᶜ) uses mixed-radix butterflies. Any other length falls back to Bluestein's chirp z-transform. FFT::is_valid_size and FFT::next_valid_size expose the size predicate.

Is recurrence more accurate than direct evaluation for twiddle factors?

Direct evaluation with cos and sin is more accurate. The recurrence wₖ₊₁ = wₖ·w₁ runs in O(1) per step but accumulates rounding error, with relative error growing roughly linearly in log N. sangi precomputes a twiddle table and uses direct evaluation only at boundary points to bound the error.

Cooley-Tukey FFT — Mixed-Radix Algorithms

Overview

Algorithms that evaluate a discrete Fourier transform (DFT) in $O(N \log N)$ are collectively called FFTs. The dominant member of the family is the divide-and-conquer scheme rediscovered by Cooley and Tukey in 1965. The sangi FFT class uses this scheme as its core and automatically switches between a SIMD radix-2 butterfly, a mixed-radix butterfly, and a Bluestein fallback depending on the factorisation of $N$.

For API usage, see the FFT API reference.

DFT definition and complexity

The discrete Fourier transform of a length-$N$ complex sequence $x[0], \dots, x[N-1]$ is defined as

$$X[k] = \sum_{n=0}^{N-1} x[n] \, \omega_N^{kn}, \qquad \omega_N = e^{-2\pi i / N}, \qquad k = 0, \dots, N-1.$$

Computed by definition the sum costs $N$ complex multiplications and $N-1$ complex additions per output index, for a total of $\Theta(N^2)$ operations. That is about $10^6$ operations at $N = 1024$ and $10^{12}$ at $N = 2^{20}$, which quickly becomes impractical.

Cooley-Tukey factorises $N = N_1 N_2$ and replaces the length-$N$ DFT with $N_2$ DFTs of length $N_1$ together with $N_1$ DFTs of length $N_2$. Recursive application produces $\log N$ levels of $O(N)$ work each, cutting the total cost down to $O(N \log N)$.

Radix-2 Cooley-Tukey

When $N = 2^m$ we split the input into even and odd indices:

$$X[k] = \sum_{n=0}^{N/2-1} x[2n] \, \omega_N^{2nk} + \omega_N^{k} \sum_{n=0}^{N/2-1} x[2n+1] \, \omega_N^{2nk}.$$

Because $\omega_N^{2} = \omega_{N/2}$ this is a linear combination of two DFTs of length $N/2$,

$$X[k] = E[k] + \omega_N^{k} \, O[k], \qquad X[k+N/2] = E[k] - \omega_N^{k} \, O[k],$$

where $E$ and $O$ are the DFTs of the even and odd subsequences. The recurrence is expressed as a symmetric complex add-subtract pair known as the butterfly.

$$\begin{pmatrix} X[k] \\ X[k+N/2] \end{pmatrix} = \begin{pmatrix} 1 & \omega_N^k \\ 1 & -\omega_N^k \end{pmatrix} \begin{pmatrix} E[k] \\ O[k] \end{pmatrix}$$

Each level performs $N$ complex add/subtracts and $N/2$ complex multiplications. With $\log_2 N$ levels the total cost is about $\frac{N}{2} \log_2 N$ complex multiplications.

DIT and DIF

The radix-2 scheme admits two dual formulations.

DIT (Decimation-In-Time): split the time-domain index $n$ by parity. The input is bit-reversed first and the output emerges in natural order.
DIF (Decimation-In-Frequency): split the frequency-domain index $k$ into $k$ and $k+N/2$. Butterflies run on a naturally ordered input and the output is bit-reversed afterwards.

The arithmetic counts coincide, but the memory access patterns, twiddle ordering, and SIMD fit differ. A common trick is to pair DIT on the forward pass with DIF on the inverse so that the twiddle table is read in a consistent direction.

Bit-reversal permutation

DIT requires the input to be reordered in bit-reversed index order before the main loop. With $N = 2^m$, index $n = b_{m-1} b_{m-2} \cdots b_1 b_0$ in binary maps to $\tilde{n} = b_0 b_1 \cdots b_{m-1}$.

$$N = 8: \quad 0,1,2,3,4,5,6,7 \;\;\to\;\; 0,4,2,6,1,5,3,7$$

A naive bit-by-bit loop costs $O(N \log N)$, but Gold-Rader style schemes and incremental bit exchanges reduce this to $O(N)$. Pairing DIT (forward) with DIF (inverse) cancels the permutations at both ends, so they can be skipped when only the round-trip matters (e.g. convolution).

Radix-4 and radix-8

For $N = 4^m$ a 4-way split yields butterflies that are 4-point complex transforms:

$$\begin{pmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 1 & -j & -1 & j \\ 1 & -1 & 1 & -1 \\ 1 & j & -1 & -j \end{pmatrix} \begin{pmatrix} E_0 \\ \omega^{k} E_1 \\ \omega^{2k} E_2 \\ \omega^{3k} E_3 \end{pmatrix}$$

The factors $\pm 1, \pm j$ collapse to bit swaps and sign flips, removing roughly 25% of the real multiplications compared with radix-2. Radix-8 introduces $\omega_8 = e^{-i\pi/4}$, giving constants $\pm 1, \pm j, \pm(1\pm j)/\sqrt{2}$ that mix multiplication and addition more deeply.

Higher radixes also cut the level count to $\log_4 N$ or $\log_8 N$, reducing memory traffic and loop overhead.

Split-radix

Discovered by Yavne (1968) and rediscovered by Duhamel and Hollmann (1984). It applies radix-2 to even indices and radix-4 to odd indices, giving an asymmetric split:

$$X[k] = E[k] + \omega_N^k Z_1[k] + \omega_N^{3k} Z_3[k].$$

Here $E$ is a DFT of length $N/2$, and $Z_1, Z_3$ are DFTs of length $N/4$. The real-multiplication count is $\frac{4}{3} N \log_2 N - \tfrac{38}{9} N + \cdots$, about 33% lower than pure radix-2 ($2 N \log_2 N$). Johnson and Frigo (2007) published a more compact form that preserves the same constants.

The arithmetic savings come at the cost of more complex loop structure that fits SIMD radix-2 less neatly. sangi uses an AVX2 radix-2 butterfly that operates on four complex numbers per vector for float (8-wide) and two complex numbers for double (4-wide), prioritising measured throughput over theoretical multiplication count.

Mixed radix (2, 3, 5)

Cooley-Tukey extends to non-power-of-two $N$ as long as $N$ factors into small primes. sangi handles $N = 2^a \cdot 3^b \cdot 5^c$ (5-smooth) directly through a mixed-radix path.

Radix-3 butterfly: uses $\omega_3 = e^{-2\pi i / 3} = -\tfrac{1}{2} - \tfrac{\sqrt{3}}{2} i$. Four real multiplications and twelve real additions per 3-point transform.
Radix-5 butterfly: uses $\omega_5$. Ten real multiplications and 34 real additions in the Winograd minimum-multiplication form.

The order in which the prime factors are processed matters: handling the largest factors first lets the innermost levels remain power-of-two and SIMD friendly (Singleton 1967, Bergland 1968). sangi clusters the radix-2 levels into AVX2 batches and processes the radix-3 and radix-5 levels in scalar code.

FFT::is_valid_size(n) reports whether $n$ is 5-smooth, and FFT::next_valid_size(n) returns the smallest 5-smooth integer $\geq n$.

Twiddle-factor accuracy

Each butterfly multiplies by $\omega_N^k = \cos(2\pi k / N) - i \sin(2\pi k / N)$, called the twiddle factor. Three implementation strategies exist.

Direct evaluation: call std::cos and std::sin at every use. The most accurate, but the slowest.
Precomputed table: compute $\{\omega_N^0, \dots, \omega_N^{N-1}\}$ once and cache them. $O(N)$ memory and $O(1)$ access.
Recurrence: keep $w_k$ and update $w_{k+1} = w_k \cdot w_1$. Constant memory but accumulating rounding error.

The relative error of the recurrence grows like $O(m \varepsilon)$ over a length $2^m$ transform, where $\varepsilon$ is the unit roundoff. Long transforms therefore require Tang's half-angle formula (1989) or Buneman's recurrence to keep the error bounded. sangi defaults to a precomputed table and refreshes the boundary entries via direct cos/sin evaluation.

In-place and strided transforms

A radix-2 butterfly reads two elements and writes two elements, so the whole FFT can proceed in place. The sangi FFT::fft(std::span<T>) overwrites its argument directly.

Multi-dimensional FFTs (FFT-multidim) apply 1D transforms along each axis, but axes other than the innermost run with stride $N_x$ and lose unit-stride access. Two strategies are common:

Strided butterflies: bake the stride into pointer arithmetic. Simpler code, weaker cache behaviour.
Transpose + contiguous FFT: transpose the data so each axis becomes innermost, then run a unit-stride 1D FFT. Block transposition keeps the operation cache-friendly.

sangi fft2d and fft3d use the transpose path and run each 1D FFT through the SIMD radix-2 kernel on contiguous memory.

Dispatch in sangi

FFT<T, Real>::fft(data) dispatches as follows based on $N = $ data.size().

$N$	Path	Notes
$2^k$	SIMD radix-2 (AVX2 / SSE)	Optimised `float` / `double` path
$2^a \cdot 3^b \cdot 5^c$ (5-smooth)	Mixed radix-{2,3,5}	`is_valid_size` returns `true`
Otherwise	Bluestein (chirp z)	Pads internally to a power-of-two $\geq 2N - 1$
`ModularInt<P>`	NTT path	Radix-2 butterflies over a residue ring

The inverse transform applies the same butterflies with conjugated twiddles plus $1/N$ normalisation. Use the FFTEngine class to choose a normalisation convention through FFTDivMode (None / Forward / Inverse / Both).

Bluestein and arbitrary-length transforms are covered in FFT-bluestein-real, which also discusses real-input packing. NTT details live in FFT-ntt-bigint, and multi-dimensional FFTs and convolution in FFT-multidim.

References

Cooley, J. W. and Tukey, J. W. (1965). "An Algorithm for the Machine Calculation of Complex Fourier Series". Mathematics of Computation, 19, 297–301.
Singleton, R. C. (1969). "An Algorithm for Computing the Mixed Radix Fast Fourier Transform". IEEE Trans. Audio Electroacoust., 17, 93–103.
Duhamel, P. and Hollmann, H. (1984). "Split-radix FFT algorithm". Electronics Letters, 20, 14–16.
Johnson, S. G. and Frigo, M. (2007). "A Modified Split-Radix FFT With Fewer Arithmetic Operations". IEEE Trans. Signal Process., 55, 111–119.
Van Loan, C. (1992). Computational Frameworks for the Fast Fourier Transform. SIAM.