What problem does Bluestein's algorithm solve?

It reduces an arbitrary-length DFT to a power-of-two FFT. The identity kn = (k² + n² − (k − n)²) / 2 rewrites the DFT as a chirp multiplication followed by an ordinary linear convolution, which can be evaluated with a power-of-two FFT of length ≥ 2N − 1. The total cost stays O(N log N) even when N is prime or has large prime factors.

What is Rader's algorithm?

When N is prime the multiplicative group Z/NZ* is cyclic, and reindexing by a primitive root turns the length-N DFT into a cyclic convolution of length N − 1. Rader is competitive with Bluestein when N − 1 is itself highly composite, because the convolution length is smaller and no padding is required.

Why does a real FFT only need N/2 + 1 bins?

The DFT X[k] of a real input satisfies X[N − k] = conj(X[k]), so the upper half is redundant. The independent values are X[0] through X[N/2] (the endpoints are real), giving N/2 + 1 complex bins or N real numbers total — exactly matching the degrees of freedom of the input.

What is the packing trick?

Pack two real signals as one complex sequence z[n] = x[n] + j y[n], run a single complex FFT, and recover X[k] and Y[k] from conjugate symmetry. The trick lets you transform a pair of real signals at the cost of one complex FFT, useful in batch and stereo processing.

Is Bluestein less accurate than a regular FFT?

Naive implementations are slightly less accurate because they evaluate the chirp factor e^(iπn²/N) directly. Computing n² modulo 2N in integer arithmetic before calling the trigonometric functions, and padding the internal FFT to ≥ 2N rounded up to a 5-smooth size to avoid circular wraparound, brings the relative error within the same order as Cooley-Tukey. sangi applies both fixes.

Bluestein Arbitrary-Length FFT and Real FFT

Overview

The Cooley-Tukey FFT covers any $N$ that factors into small primes, but $N$ that contains a large prime factor falls outside the mixed-radix framework. This article describes Bluestein, Rader, and the prime factor algorithm (PFA), which all handle arbitrary $N$ in $O(N \log N)$, and the RealFFT path that halves the compute and memory cost for real-valued input.

The API is exposed as RealFFT and FFT.

Bluestein (chirp z transform)

Bluestein (1968) decomposes the DFT exponent as

$$kn = \frac{k^2 + n^2 - (k - n)^2}{2}.$$

Substituting this identity rewrites $X[k]$ as a chirp multiplication followed by a one-dimensional convolution:

$$X[k] = \omega_N^{-k^2/2} \sum_{n=0}^{N-1} \bigl( x[n] \, \omega_N^{-n^2/2} \bigr) \, \omega_N^{(k-n)^2/2}.$$

Setting $a[n] = x[n] \, \omega_N^{-n^2/2}$ and $b[m] = \omega_N^{m^2/2}$, the inner sum is the linear convolution $a \ast b$. Padding both sequences to length $M \geq 2N - 1$ turns the linear convolution into a circular one, which can be evaluated with an $M$-point FFT. Total cost: $O(N \log N)$ for any integer $N$.

The factor $\omega_N^{m^2/2} = e^{-i\pi m^2 / N}$ has a phase that varies quadratically in $m$, like a frequency-swept chirp signal. For this reason the method is often called the chirp z transform.

Rader's algorithm

Rader (1968) handles prime $N$ by exploiting the fact that the multiplicative group $\mathbb{Z}/N\mathbb{Z}^*$ is cyclic of order $N - 1$. Reindexing by a primitive root $g$ rearranges the length-$N$ DFT into a cyclic convolution of length $N - 1$.

If $N - 1$ itself has small prime factors (often 5-smooth), the convolution evaluates efficiently with a mixed-radix FFT. Rader avoids the padding required by Bluestein and uses a smaller internal FFT size, so it wins when the factorisation of $N - 1$ is favourable.

Prime factor algorithm (PFA / Good-Thomas)

Good (1958) and Thomas (1963) noted that if $N = N_1 N_2$ with $\gcd(N_1, N_2) = 1$, the Chinese remainder theorem reindexes the DFT into an independent two-dimensional transform without any twiddle factors between stages:

$$X[k_1, k_2] = \sum_{n_1=0}^{N_1-1} \sum_{n_2=0}^{N_2-1} x[n_1, n_2] \, \omega_{N_1}^{k_1 n_1} \, \omega_{N_2}^{k_2 n_2}.$$

Without the inter-stage twiddles the multiplication count drops, but the coprimality constraint restricts which sizes are eligible, and the input/output index remapping is more involved than in Cooley-Tukey. On modern SIMD hardware twiddle multiplications are cheap, so PFA is used selectively.

Conjugate symmetry of real inputs

For a real-valued input $x[n] \in \mathbb{R}$, the DFT satisfies

$$X[N - k] = \overline{X[k]}, \qquad k = 1, \dots, N - 1.$$

The negative-frequency bins are conjugates of the positive ones, so the spectrum carries only half as much independent information. $X[0]$ and (when $N$ is even) $X[N/2]$ are real, while the remaining $N/2 - 1$ bins contribute two real degrees of freedom each. The total is $N$ real numbers, matching the input.

RealFFT — N to N/2 + 1 packing

Exploiting the conjugate symmetry, a real-input FFT can store its output as a $N/2 + 1$ complex array, halving both compute and memory. sangi RealFFT returns this packed form directly.

Routing real input through a complex FFT

The standard technique runs a single length-$N/2$ complex FFT followed by an $O(N)$ post-processing step.

Pair samples as $z[k] = x[2k] + i \, x[2k+1]$ for $k = 0, \dots, N/2 - 1$.
Compute the length-$N/2$ complex FFT $Z = \mathrm{FFT}(z)$.
Recover $X[k]$ as a linear combination of $Z[k]$ and $Z[N/2 - k]$: $$X[k] = \tfrac{1}{2}(Z[k] + \overline{Z[N/2-k]}) - \tfrac{i}{2} \omega_N^k (Z[k] - \overline{Z[N/2-k]}).$$

$X[0]$ and $X[N/2]$ are handled separately as real values. The whole pipeline costs about half what a same-length complex FFT would.

Inverse transform

The inverse retraces the packing in reverse. If the input does not satisfy conjugate symmetry the result will not be real, so RealFFT::ifft checks that the argument size equals output_size / 2 + 1 and raises MathError otherwise.

Packing trick (two signals at once)

Combine two real signals as $z[n] = x[n] + i \, y[n]$ and run one complex FFT $Z[k] = \mathrm{FFT}(z)$. Conjugate symmetry recovers both spectra:

$$X[k] = \tfrac{1}{2}\bigl(Z[k] + \overline{Z[N-k]}\bigr), \qquad Y[k] = \tfrac{1}{2i}\bigl(Z[k] - \overline{Z[N-k]}\bigr).$$

The cost is slightly higher than two RealFFTs, but only one complex kernel needs to live in cache, which is attractive for two-channel synchronous workloads. The sangi simd_fft::stereoFFT applies this separation to interleaved stereo PCM in a single FFT call.

Hermitian FFT and DCT/DST

The output of a real FFT is Hermitian. The discrete cosine and sine transforms are DFTs of even or odd extensions of the input, so they reduce to real FFTs of length $2N$ or $4N$, or via Makhoul's construction to a real FFT of length $N$ plus $O(N)$ post-processing.

DCT-II: $X[k] = \sum_{n=0}^{N-1} x[n] \cos\!\bigl(\frac{\pi(2n+1)k}{2N}\bigr)$. Makhoul (1980) evaluates it with one length-$N$ real FFT plus $O(N)$ corrections.
DST-II: the sine-basis sibling; reduces analogously.
DCT-III / DST-III: inverse transforms with parallel structure.

sangi dct / idct takes the Makhoul path in $O(N \log N)$ when $N$ is 5-smooth and falls back to the direct $O(N^2)$ sum otherwise. dst / idst currently uses the direct formula.

Numerical accuracy of Bluestein

Bluestein leans heavily on the chirp factor $\omega_N^{n^2/2} = e^{-i\pi n^2 / N}$. Evaluating $n^2$ directly in double precision wastes significant bits and can leave the relative error one or two digits worse than Cooley-Tukey. Mitigations:

Modular integer argument: compute $n^2 \bmod 2N$ in integer arithmetic before invoking the trigonometric functions. The phase stays in $[0, 2\pi)$ and benefits from accurate range reduction.
Precomputed chirp table: tabulate $b[m] = \omega_N^{m^2/2}$ for the full range to avoid repeated evaluation.
Internal FFT padding: choose $M \geq 2N - 1$ rounded to a 5-smooth size so that the circular convolution mirrors the linear convolution.

With those fixes Bluestein attains the same order of relative error as Cooley-Tukey (Rabiner et al. 1969, Smith 1995).

Behaviour in sangi

sangi FFT::fft(data) branches as follows:

FFT::is_valid_size(N) is true ($N$ is 5-smooth) — Cooley-Tukey mixed radix.
Otherwise — Bluestein, which internally executes three FFTs of length $M = $ next_valid_size(2N - 1).

The Bluestein path allocates a working buffer roughly twice the size of the input, so peak memory exceeds the Cooley-Tukey path. If the same size is reused often, build an FFTEngine so the twiddle and chirp tables are cached across calls.

Use RealFFT<Real> for real-valued input. It runs a length-$N/2$ complex FFT plus $O(N)$ post-processing, roughly twice as fast as a same-length complex FFT.

References

Bluestein, L. I. (1970). "A linear filtering approach to the computation of discrete Fourier transform". IEEE Trans. Audio Electroacoust., 18, 451–455.
Rader, C. M. (1968). "Discrete Fourier transforms when the number of data samples is prime". Proc. IEEE, 56, 1107–1108.
Good, I. J. (1958). "The Interaction Algorithm and Practical Fourier Analysis". J. Royal Stat. Soc. B, 20, 361–372.
Makhoul, J. (1980). "A fast cosine transform in one and two dimensions". IEEE Trans. ASSP, 28, 27–34.
Sorensen, H. V., Jones, D. L., Heideman, M. T., and Burrus, C. S. (1987). "Real-valued fast Fourier transform algorithms". IEEE Trans. ASSP, 35, 849–863.