Multi-Dimensional FFT and Convolution
Overview
Multi-dimensional DFTs underpin image processing, spectral methods for partial differential equations, correlation analysis, and quantum chemistry. This article covers the separability-driven row-column decomposition, cache-friendly block transposition, linear vs circular convolution and zero padding, and batch FFTs.
The API surface is fft2d / fft3d for multi-dimensional
transforms and simd_fft::batchFFT for multi-signal batches.
2D FFT row-column decomposition
The DFT of an $M \times N$ array $x[m, n]$ is
$$X[k_y, k_x] = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x[m, n] \, \omega_M^{k_y m} \, \omega_N^{k_x n}.$$
The exponentials separate as $\omega_M^{k_y m} \cdot \omega_N^{k_x n}$, so summing $n$ first yields a per-row 1D FFT, and summing $m$ next yields a per-column 1D FFT.
$$X[k_y, k_x] = \sum_{m=0}^{M-1} \underbrace{\Bigl[ \sum_{n=0}^{N-1} x[m, n] \, \omega_N^{k_x n} \Bigr]}_{\text{row FFT } Y[m, k_x]} \, \omega_M^{k_y m}.$$
The total cost is $M$ length-$N$ FFTs plus $N$ length-$M$ FFTs, i.e. $O(MN \log(MN))$. Every concern that arises from odd, prime, or strided sizes lives inside the 1D FFT.
Cache-aware transposition
Row FFTs on row-major data run unit stride. Column FFTs, however, stride by $N_x$ and lose cache-line reuse. With $N_x = 2048$ and 16-byte complex doubles a column step skips 32 KB, exceeding a typical 32–48 KB L1.
The fix is to transpose between the row and column passes. A naive transpose still suffers from large strides, but tiling into $B \times B$ blocks ($B = 32$ or $64$) keeps each block resident in L1.
$$\text{Block transpose:}\;\; \forall \text{block } (I, J), \; \forall i, j \in [0, B): \;\; A^T[J B + j, I B + i] = A[I B + i, J B + j].$$
Transpose + contiguous 1D FFT generally runs 2–4x faster than a strided 1D FFT.
sangi fft2d sandwiches two block transposes (transpose, contiguous FFT, transpose back, contiguous FFT).
3D FFT and the depth axis
In 3D we apply 1D FFTs along $x$, then $y$, then $z$. With row-major layout
data[z * ny * nx + y * nx + x], $x$ is unit stride, $y$ strides by $N_x$, and $z$ by $N_x N_y$.
The depth stride is one order larger than the 2D case and may not fit in L1 even after block transposition. Common remedies:
- Slabbing along $z$ — limit the working $z$-range to keep the active set small.
- 3D cubelet transpose — transpose $B \times B \times B$ cubes.
- Axis rotation — cycle the layout $(x, y, z) \to (y, z, x) \to (z, x, y)$ between passes.
sangi fft3d uses the Tensor3D<T> container and applies a 1D FFT along each axis
in sequence. For moderate sizes (up to $256^3$ or so) the straightforward axis loop is already competitive;
large-scale GPU parallelisation is future work.
fftshift3d moves the DC component ($k = 0$) to the centre of the tensor, useful for visualisation
and aligning spectra during convolution.
Convolution theorem
The Fourier transform of the continuous convolution $(x \ast y)(t) = \int x(\tau) y(t - \tau) d\tau$ factorises as
$$\mathcal{F}\{x \ast y\}(\xi) = \mathcal{F}\{x\}(\xi) \cdot \mathcal{F}\{y\}(\xi).$$
The discrete analogue holds in the circular sense. A length-$N$ convolution therefore evaluates as FFT → pointwise product → IFFT in $O(N \log N)$, beating the direct $O(N^2)$ sum dramatically.
The same identity extends to higher dimensions:
$$\mathcal{F}_{2D}\{X \ast K\}(\xi, \eta) = \mathcal{F}_{2D}\{X\}(\xi, \eta) \cdot \mathcal{F}_{2D}\{K\}(\xi, \eta).$$
A 2D convolution costs two forward 2D FFTs, one pointwise product, and one inverse 2D FFT, cutting the naive $O(M^2 N^2)$ spatial-domain cost down to $O(MN \log(MN))$.
Linear vs circular convolution
FFT implicitly computes a length-$N$ circular convolution on inputs assumed to repeat with period $N$. The linear convolution that signal processing usually wants behaves differently:
| Linear | Circular | |
|---|---|---|
| Output length | $M + N - 1$ | $\max(M, N)$ |
| Boundary | Zero extension | Periodic wrap-around |
| Used by FIR filters? | Yes | No (would contaminate output) |
To recover a linear convolution from FFTs, zero-pad both inputs to length $L \geq M + N - 1$. Without this padding the wrap-around contaminates the tail and head of the output.
Overlap-add and overlap-save
For long signal $x$ convolved with a short kernel $h$ ($N \ll \mathrm{len}(x)$), split $x$ into blocks, filter each block via FFT, and combine the results by addition (overlap-add) or by discarding overlapping samples (overlap-save). The full $x$ never needs to enter an FFT, which is the standard pattern for steady-state FIR filtering.
Using these in sangi
FFT::convolve(a, b)— internally pads to a power-of-two of length $\geq a.size() + b.size() - 1$, returning the linear convolution.RealFFT::convolve(a, b)— real-input variant, roughly 2x faster viaRealFFT::fft.- 2D linear convolution: assemble it from
fft2d_real, pointwise multiplication, andifft2d(there is no bundledconvolve2dyet).
Batch FFT
Transforming several independent signals of the same length $N$ shows up in stereo audio, spectrograms, and MIMO communication. Processing them simultaneously, interleaved across SIMD lanes, beats the serial loop.
- Twiddle reuse: $K$ signals share the same twiddle factors per butterfly, raising L1 hit rate.
- SIMD lane packing: AVX2 fits four
floatcomplex numbers (eight floats) into one 256-bit register, running four butterflies per instruction. - Less branch overhead: signals share a single control-flow path, eliminating mispredicts.
The sangi simd_fft::batchFFT processes four signals as one
AVX2 batch and falls back to ordinary simd_fft::fft() for the leftover. The twiddle table is held
per-thread in thread_local storage.
simd_fft::stereoFFT is a separate optimisation that
treats interleaved stereo PCM (L0, R0, L1, R1, …) as a complex signal whose real part is L and imaginary part
is R, then runs one complex FFT and recovers L and R spectra from conjugate symmetry — twice as fast as two
real FFTs.
Applications
Image processing
2D FFTs are central to filtering, feature extraction, and texture analysis.
- Large-kernel convolution: a Gaussian blur with $\sigma > 10$ has a kernel width of roughly 60+ pixels; FFT-domain filtering beats spatial domain at that point.
- Phase correlation: detect translation between two images via the normalised cross-spectrum $X \bar{Y} / |X \bar{Y}|$ followed by an inverse FFT and a peak search.
- Frequency-domain filters: bandpass, highpass, notch, etc. become a single pointwise multiplication in the FFT domain.
PDE spectral methods
On periodic domains FFTs turn spatial derivatives into algebraic operations:
$$\frac{\partial}{\partial x} \;\xrightarrow{\;\mathcal{F}\;}\; i k_x, \qquad \nabla^2 \;\xrightarrow{\;\mathcal{F}\;}\; -(k_x^2 + k_y^2 + k_z^2).$$
2D Navier-Stokes, the Schrödinger equation, and Cahn-Hilliard solvers all build on this identity.
sangi fft2d and fft3d are usable directly for prototyping periodic-domain PDE solvers.
Cross-correlation
Dual to convolution, the cross-correlation $r_{xy}[k] = \sum_n x[n] \, \overline{y[n - k]}$ evaluates to $R_{xy}[\xi] = X[\xi] \, \overline{Y[\xi]}$ in the FFT domain. Template matching, time-delay estimation, and autocorrelation routinely go through an FFT.
See also: Wiener-Khinchin theorem / Window functions / PDE applications
sangi API
| Function | Purpose |
|---|---|
fft2d(matrix) | Forward 2D FFT on Matrix<complex> |
ifft2d(matrix) | Inverse 2D FFT (1/(MN) scaling) |
fft2d_real(matrix) | Real input promoted to complex then 2D FFT |
fft3d(tensor) | Forward 3D FFT on Tensor3D<complex> |
ifft3d(tensor) | Inverse 3D FFT |
fft3d_real(tensor) | Real 3D tensor FFT |
fftshift3d(tensor) | Move DC to the centre of the tensor |
simd_fft::batchFFT(signals, count, N) | AVX2 batched FFT across multiple signals |
simd_fft::stereoFFT(...) | Stereo PCM split into L and R via a single complex FFT |
References
- Eklundh, J. O. (1972). "A Fast Computer Method for Matrix Transposing". IEEE Trans. Computers, C-21, 801–803.
- Frigo, M. and Johnson, S. G. (2005). "The Design and Implementation of FFTW3". Proc. IEEE, 93, 216–231.
- Boyd, J. P. (2001). Chebyshev and Fourier Spectral Methods, 2nd ed. Dover.
- Oppenheim, A. V. and Schafer, R. W. (2010). Discrete-Time Signal Processing, 3rd ed. Pearson.