Bluestein Arbitrary-Length FFT and Real FFT
Overview
The Cooley-Tukey FFT covers any $N$ that factors into small primes, but $N$ that contains a large prime factor falls outside the mixed-radix framework. This article describes Bluestein, Rader, and the prime factor algorithm (PFA), which all handle arbitrary $N$ in $O(N \log N)$, and the RealFFT path that halves the compute and memory cost for real-valued input.
Bluestein (chirp z transform)
Bluestein (1968) decomposes the DFT exponent as
$$kn = \frac{k^2 + n^2 - (k - n)^2}{2}.$$
Substituting this identity rewrites $X[k]$ as a chirp multiplication followed by a one-dimensional convolution:
$$X[k] = \omega_N^{-k^2/2} \sum_{n=0}^{N-1} \bigl( x[n] \, \omega_N^{-n^2/2} \bigr) \, \omega_N^{(k-n)^2/2}.$$
Setting $a[n] = x[n] \, \omega_N^{-n^2/2}$ and $b[m] = \omega_N^{m^2/2}$, the inner sum is the linear convolution $a \ast b$. Padding both sequences to length $M \geq 2N - 1$ turns the linear convolution into a circular one, which can be evaluated with an $M$-point FFT. Total cost: $O(N \log N)$ for any integer $N$.
The factor $\omega_N^{m^2/2} = e^{-i\pi m^2 / N}$ has a phase that varies quadratically in $m$, like a frequency-swept chirp signal. For this reason the method is often called the chirp z transform.
See also: Bluestein FFT (chirp z)
Rader's algorithm
Rader (1968) handles prime $N$ by exploiting the fact that the multiplicative group $\mathbb{Z}/N\mathbb{Z}^*$ is cyclic of order $N - 1$. Reindexing by a primitive root $g$ rearranges the length-$N$ DFT into a cyclic convolution of length $N - 1$.
If $N - 1$ itself has small prime factors (often 5-smooth), the convolution evaluates efficiently with a mixed-radix FFT. Rader avoids the padding required by Bluestein and uses a smaller internal FFT size, so it wins when the factorisation of $N - 1$ is favourable.
Prime factor algorithm (PFA / Good-Thomas)
Good (1958) and Thomas (1963) noted that if $N = N_1 N_2$ with $\gcd(N_1, N_2) = 1$, the Chinese remainder theorem reindexes the DFT into an independent two-dimensional transform without any twiddle factors between stages:
$$X[k_1, k_2] = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x[n_1, n_2] \, \omega_{N_1}^{k_1 n_1} \, \omega_{N_2}^{k_2 n_2}.$$
Without the inter-stage twiddles the multiplication count drops, but the coprimality constraint restricts which sizes are eligible, and the input/output index remapping is more involved than in Cooley-Tukey. On modern SIMD hardware twiddle multiplications are cheap, so PFA is used selectively.
Conjugate symmetry of real inputs
For a real-valued input $x[n] \in \mathbb{R}$, the DFT satisfies
$$X[N - k] = \overline{X[k]}, \qquad k = 1, \dots, N - 1.$$
The negative-frequency bins are conjugates of the positive ones, so the spectrum carries only half as much independent information. $X[0]$ and (when $N$ is even) $X[N/2]$ are real, while the remaining $N/2 - 1$ bins contribute two real degrees of freedom each. The total is $N$ real numbers, matching the input.
RealFFT — N to N/2 + 1 packing
Exploiting the conjugate symmetry, a real-input FFT can store its output as a $N/2 + 1$ complex array,
halving both compute and memory. sangi RealFFT returns this packed form directly.
Routing real input through a complex FFT
The standard technique runs a single length-$N/2$ complex FFT followed by an $O(N)$ post-processing step.
- Pair samples as $z[k] = x[2k] + i \, x[2k+1]$ for $k = 0, \dots, N/2 - 1$.
- Compute the length-$N/2$ complex FFT $Z = \mathrm{FFT}(z)$.
- Recover $X[k]$ as a linear combination of $Z[k]$ and $Z[N/2 - k]$: $$X[k] = \tfrac{1}{2}(Z[k] + \overline{Z[N/2-k]}) - \tfrac{i}{2} \omega_N^k (Z[k] - \overline{Z[N/2-k]}).$$
$X[0]$ and $X[N/2]$ are handled separately as real values. The whole pipeline costs about half what a same-length complex FFT would.
Inverse transform
The inverse retraces the packing in reverse. If the input does not satisfy conjugate symmetry the result
will not be real, so RealFFT::ifft checks that the argument size equals
output_size / 2 + 1 and raises MathError otherwise.
Packing trick (two signals at once)
Combine two real signals as $z[n] = x[n] + i \, y[n]$ and run one complex FFT $Z[k] = \mathrm{FFT}(z)$. Conjugate symmetry recovers both spectra:
$$X[k] = \tfrac{1}{2}\bigl(Z[k] + \overline{Z[N-k]}\bigr), \qquad Y[k] = \tfrac{1}{2i}\bigl(Z[k] - \overline{Z[N-k]}\bigr).$$
The cost is slightly higher than two RealFFTs, but only one complex kernel needs to live in cache,
which is attractive for two-channel synchronous workloads. The sangi
simd_fft::stereoFFT applies this separation to interleaved
stereo PCM in a single FFT call.
Hermitian FFT and DCT/DST
The output of a real FFT is Hermitian. The discrete cosine and sine transforms are DFTs of even or odd extensions of the input, so they reduce to real FFTs of length $2N$ or $4N$, or via Makhoul's construction to a real FFT of length $N$ plus $O(N)$ post-processing.
- DCT-II: $X[k] = \sum_{n=0}^{N-1} x[n] \cos\!\bigl(\frac{\pi(2n+1)k}{2N}\bigr)$. Makhoul (1980) evaluates it with one length-$N$ real FFT plus $O(N)$ corrections.
- DST-II: the sine-basis sibling; reduces analogously.
- DCT-III / DST-III: inverse transforms with parallel structure.
sangi dct / idct takes the Makhoul path
in $O(N \log N)$ when $N$ is 5-smooth and falls back to the direct $O(N^2)$ sum otherwise.
dst / idst currently uses the direct formula.
Numerical accuracy of Bluestein
Bluestein leans heavily on the chirp factor $\omega_N^{n^2/2} = e^{-i\pi n^2 / N}$.
Evaluating $n^2$ directly in double precision wastes significant bits and can leave the relative
error one or two digits worse than Cooley-Tukey. Mitigations:
- Modular integer argument: compute $n^2 \bmod 2N$ in integer arithmetic before invoking the trigonometric functions. The phase stays in $[0, 2\pi)$ and benefits from accurate range reduction.
- Precomputed chirp table: tabulate $b[m] = \omega_N^{m^2/2}$ for the full range to avoid repeated evaluation.
- Internal FFT padding: choose $M \geq 2N - 1$ rounded to a 5-smooth size so that the circular convolution mirrors the linear convolution.
With those fixes Bluestein attains the same order of relative error as Cooley-Tukey (Rabiner et al. 1969, Smith 1995).
Behaviour in sangi
sangi FFT::fft(data) branches as follows:
FFT::is_valid_size(N)istrue($N$ is 5-smooth) — Cooley-Tukey mixed radix.- Otherwise — Bluestein, which internally executes three FFTs of length
$M = $
next_valid_size(2N - 1).
The Bluestein path allocates a working buffer roughly twice the size of the input, so peak memory exceeds
the Cooley-Tukey path. If the same size is reused often, build an FFTEngine so the twiddle and
chirp tables are cached across calls.
Use RealFFT<Real> for real-valued input. It runs a length-$N/2$ complex FFT plus
$O(N)$ post-processing, roughly twice as fast as a same-length complex FFT.
References
- Bluestein, L. I. (1970). "A linear filtering approach to the computation of discrete Fourier transform". IEEE Trans. Audio Electroacoust., 18, 451–455.
- Rader, C. M. (1968). "Discrete Fourier transforms when the number of data samples is prime". Proc. IEEE, 56, 1107–1108.
- Good, I. J. (1958). "The Interaction Algorithm and Practical Fourier Analysis". J. Royal Stat. Soc. B, 20, 361–372.
- Makhoul, J. (1980). "A fast cosine transform in one and two dimensions". IEEE Trans. ASSP, 28, 27–34.
- Sorensen, H. V., Jones, D. L., Heideman, M. T., and Burrus, C. S. (1987). "Real-valued fast Fourier transform algorithms". IEEE Trans. ASSP, 35, 849–863.