Bluestein Arbitrary-Length FFT and Real FFT

Overview

The Cooley-Tukey FFT covers any $N$ that factors into small primes, but $N$ that contains a large prime factor falls outside the mixed-radix framework. This article describes Bluestein, Rader, and the prime factor algorithm (PFA), which all handle arbitrary $N$ in $O(N \log N)$, and the RealFFT path that halves the compute and memory cost for real-valued input.

The API is exposed as RealFFT and FFT.

Bluestein (chirp z transform)

Bluestein (1968) decomposes the DFT exponent as

$$kn = \frac{k^2 + n^2 - (k - n)^2}{2}.$$

Substituting this identity rewrites $X[k]$ as a chirp multiplication followed by a one-dimensional convolution:

$$X[k] = \omega_N^{-k^2/2} \sum_{n=0}^{N-1} \bigl( x[n] \, \omega_N^{-n^2/2} \bigr) \, \omega_N^{(k-n)^2/2}.$$

Setting $a[n] = x[n] \, \omega_N^{-n^2/2}$ and $b[m] = \omega_N^{m^2/2}$, the inner sum is the linear convolution $a \ast b$. Padding both sequences to length $M \geq 2N - 1$ turns the linear convolution into a circular one, which can be evaluated with an $M$-point FFT. Total cost: $O(N \log N)$ for any integer $N$.

The factor $\omega_N^{m^2/2} = e^{-i\pi m^2 / N}$ has a phase that varies quadratically in $m$, like a frequency-swept chirp signal. For this reason the method is often called the chirp z transform.

Rader's algorithm

Rader (1968) handles prime $N$ by exploiting the fact that the multiplicative group $\mathbb{Z}/N\mathbb{Z}^*$ is cyclic of order $N - 1$. Reindexing by a primitive root $g$ rearranges the length-$N$ DFT into a cyclic convolution of length $N - 1$.

If $N - 1$ itself has small prime factors (often 5-smooth), the convolution evaluates efficiently with a mixed-radix FFT. Rader avoids the padding required by Bluestein and uses a smaller internal FFT size, so it wins when the factorisation of $N - 1$ is favourable.

Prime factor algorithm (PFA / Good-Thomas)

Good (1958) and Thomas (1963) noted that if $N = N_1 N_2$ with $\gcd(N_1, N_2) = 1$, the Chinese remainder theorem reindexes the DFT into an independent two-dimensional transform without any twiddle factors between stages:

$$X[k_1, k_2] = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x[n_1, n_2] \, \omega_{N_1}^{k_1 n_1} \, \omega_{N_2}^{k_2 n_2}.$$

Without the inter-stage twiddles the multiplication count drops, but the coprimality constraint restricts which sizes are eligible, and the input/output index remapping is more involved than in Cooley-Tukey. On modern SIMD hardware twiddle multiplications are cheap, so PFA is used selectively.

Conjugate symmetry of real inputs

For a real-valued input $x[n] \in \mathbb{R}$, the DFT satisfies

$$X[N - k] = \overline{X[k]}, \qquad k = 1, \dots, N - 1.$$

The negative-frequency bins are conjugates of the positive ones, so the spectrum carries only half as much independent information. $X[0]$ and (when $N$ is even) $X[N/2]$ are real, while the remaining $N/2 - 1$ bins contribute two real degrees of freedom each. The total is $N$ real numbers, matching the input.

RealFFT — N to N/2 + 1 packing

Exploiting the conjugate symmetry, a real-input FFT can store its output as a $N/2 + 1$ complex array, halving both compute and memory. sangi RealFFT returns this packed form directly.

Routing real input through a complex FFT

The standard technique runs a single length-$N/2$ complex FFT followed by an $O(N)$ post-processing step.

  1. Pair samples as $z[k] = x[2k] + i \, x[2k+1]$ for $k = 0, \dots, N/2 - 1$.
  2. Compute the length-$N/2$ complex FFT $Z = \mathrm{FFT}(z)$.
  3. Recover $X[k]$ as a linear combination of $Z[k]$ and $Z[N/2 - k]$: $$X[k] = \tfrac{1}{2}(Z[k] + \overline{Z[N/2-k]}) - \tfrac{i}{2} \omega_N^k (Z[k] - \overline{Z[N/2-k]}).$$

$X[0]$ and $X[N/2]$ are handled separately as real values. The whole pipeline costs about half what a same-length complex FFT would.

Inverse transform

The inverse retraces the packing in reverse. If the input does not satisfy conjugate symmetry the result will not be real, so RealFFT::ifft checks that the argument size equals output_size / 2 + 1 and raises MathError otherwise.

Packing trick (two signals at once)

Combine two real signals as $z[n] = x[n] + i \, y[n]$ and run one complex FFT $Z[k] = \mathrm{FFT}(z)$. Conjugate symmetry recovers both spectra:

$$X[k] = \tfrac{1}{2}\bigl(Z[k] + \overline{Z[N-k]}\bigr), \qquad Y[k] = \tfrac{1}{2i}\bigl(Z[k] - \overline{Z[N-k]}\bigr).$$

The cost is slightly higher than two RealFFTs, but only one complex kernel needs to live in cache, which is attractive for two-channel synchronous workloads. The sangi simd_fft::stereoFFT applies this separation to interleaved stereo PCM in a single FFT call.

Hermitian FFT and DCT/DST

The output of a real FFT is Hermitian. The discrete cosine and sine transforms are DFTs of even or odd extensions of the input, so they reduce to real FFTs of length $2N$ or $4N$, or via Makhoul's construction to a real FFT of length $N$ plus $O(N)$ post-processing.

  • DCT-II: $X[k] = \sum_{n=0}^{N-1} x[n] \cos\!\bigl(\frac{\pi(2n+1)k}{2N}\bigr)$. Makhoul (1980) evaluates it with one length-$N$ real FFT plus $O(N)$ corrections.
  • DST-II: the sine-basis sibling; reduces analogously.
  • DCT-III / DST-III: inverse transforms with parallel structure.

sangi dct / idct takes the Makhoul path in $O(N \log N)$ when $N$ is 5-smooth and falls back to the direct $O(N^2)$ sum otherwise. dst / idst currently uses the direct formula.

Numerical accuracy of Bluestein

Bluestein leans heavily on the chirp factor $\omega_N^{n^2/2} = e^{-i\pi n^2 / N}$. Evaluating $n^2$ directly in double precision wastes significant bits and can leave the relative error one or two digits worse than Cooley-Tukey. Mitigations:

  • Modular integer argument: compute $n^2 \bmod 2N$ in integer arithmetic before invoking the trigonometric functions. The phase stays in $[0, 2\pi)$ and benefits from accurate range reduction.
  • Precomputed chirp table: tabulate $b[m] = \omega_N^{m^2/2}$ for the full range to avoid repeated evaluation.
  • Internal FFT padding: choose $M \geq 2N - 1$ rounded to a 5-smooth size so that the circular convolution mirrors the linear convolution.

With those fixes Bluestein attains the same order of relative error as Cooley-Tukey (Rabiner et al. 1969, Smith 1995).

Behaviour in sangi

sangi FFT::fft(data) branches as follows:

  • FFT::is_valid_size(N) is true ($N$ is 5-smooth) — Cooley-Tukey mixed radix.
  • Otherwise — Bluestein, which internally executes three FFTs of length $M = $ next_valid_size(2N - 1).

The Bluestein path allocates a working buffer roughly twice the size of the input, so peak memory exceeds the Cooley-Tukey path. If the same size is reused often, build an FFTEngine so the twiddle and chirp tables are cached across calls.

Use RealFFT<Real> for real-valued input. It runs a length-$N/2$ complex FFT plus $O(N)$ post-processing, roughly twice as fast as a same-length complex FFT.

References

  • Bluestein, L. I. (1970). "A linear filtering approach to the computation of discrete Fourier transform". IEEE Trans. Audio Electroacoust., 18, 451–455.
  • Rader, C. M. (1968). "Discrete Fourier transforms when the number of data samples is prime". Proc. IEEE, 56, 1107–1108.
  • Good, I. J. (1958). "The Interaction Algorithm and Practical Fourier Analysis". J. Royal Stat. Soc. B, 20, 361–372.
  • Makhoul, J. (1980). "A fast cosine transform in one and two dimensions". IEEE Trans. ASSP, 28, 27–34.
  • Sorensen, H. V., Jones, D. L., Heideman, M. T., and Burrus, C. S. (1987). "Real-valued fast Fourier transform algorithms". IEEE Trans. ASSP, 35, 849–863.