Multi-Dimensional FFT and Convolution

Overview

Multi-dimensional DFTs underpin image processing, spectral methods for partial differential equations, correlation analysis, and quantum chemistry. This article covers the separability-driven row-column decomposition, cache-friendly block transposition, linear vs circular convolution and zero padding, and batch FFTs.

The API surface is fft2d / fft3d for multi-dimensional transforms and simd_fft::batchFFT for multi-signal batches.

2D FFT row-column decomposition

The DFT of an $M \times N$ array $x[m, n]$ is

$$X[k_y, k_x] = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x[m, n] \, \omega_M^{k_y m} \, \omega_N^{k_x n}.$$

The exponentials separate as $\omega_M^{k_y m} \cdot \omega_N^{k_x n}$, so summing $n$ first yields a per-row 1D FFT, and summing $m$ next yields a per-column 1D FFT.

$$X[k_y, k_x] = \sum_{m=0}^{M-1} \underbrace{\Bigl[ \sum_{n=0}^{N-1} x[m, n] \, \omega_N^{k_x n} \Bigr]}_{\text{row FFT } Y[m, k_x]} \, \omega_M^{k_y m}.$$

The total cost is $M$ length-$N$ FFTs plus $N$ length-$M$ FFTs, i.e. $O(MN \log(MN))$. Every concern that arises from odd, prime, or strided sizes lives inside the 1D FFT.

Cache-aware transposition

Row FFTs on row-major data run unit stride. Column FFTs, however, stride by $N_x$ and lose cache-line reuse. With $N_x = 2048$ and 16-byte complex doubles a column step skips 32 KB, exceeding a typical 32–48 KB L1.

The fix is to transpose between the row and column passes. A naive transpose still suffers from large strides, but tiling into $B \times B$ blocks ($B = 32$ or $64$) keeps each block resident in L1.

$$\text{Block transpose:}\;\; \forall \text{block } (I, J), \; \forall i, j \in [0, B): \;\; A^T[J B + j, I B + i] = A[I B + i, J B + j].$$

Transpose + contiguous 1D FFT generally runs 2–4x faster than a strided 1D FFT. sangi fft2d sandwiches two block transposes (transpose, contiguous FFT, transpose back, contiguous FFT).

3D FFT and the depth axis

In 3D we apply 1D FFTs along $x$, then $y$, then $z$. With row-major layout data[z * ny * nx + y * nx + x], $x$ is unit stride, $y$ strides by $N_x$, and $z$ by $N_x N_y$.

The depth stride is one order larger than the 2D case and may not fit in L1 even after block transposition. Common remedies:

  • Slabbing along $z$ — limit the working $z$-range to keep the active set small.
  • 3D cubelet transpose — transpose $B \times B \times B$ cubes.
  • Axis rotation — cycle the layout $(x, y, z) \to (y, z, x) \to (z, x, y)$ between passes.

sangi fft3d uses the Tensor3D<T> container and applies a 1D FFT along each axis in sequence. For moderate sizes (up to $256^3$ or so) the straightforward axis loop is already competitive; large-scale GPU parallelisation is future work.

fftshift3d moves the DC component ($k = 0$) to the centre of the tensor, useful for visualisation and aligning spectra during convolution.

Convolution theorem

The Fourier transform of the continuous convolution $(x \ast y)(t) = \int x(\tau) y(t - \tau) d\tau$ factorises as

$$\mathcal{F}\{x \ast y\}(\xi) = \mathcal{F}\{x\}(\xi) \cdot \mathcal{F}\{y\}(\xi).$$

The discrete analogue holds in the circular sense. A length-$N$ convolution therefore evaluates as FFT → pointwise product → IFFT in $O(N \log N)$, beating the direct $O(N^2)$ sum dramatically.

The same identity extends to higher dimensions:

$$\mathcal{F}_{2D}\{X \ast K\}(\xi, \eta) = \mathcal{F}_{2D}\{X\}(\xi, \eta) \cdot \mathcal{F}_{2D}\{K\}(\xi, \eta).$$

A 2D convolution costs two forward 2D FFTs, one pointwise product, and one inverse 2D FFT, cutting the naive $O(M^2 N^2)$ spatial-domain cost down to $O(MN \log(MN))$.

Linear vs circular convolution

FFT implicitly computes a length-$N$ circular convolution on inputs assumed to repeat with period $N$. The linear convolution that signal processing usually wants behaves differently:

LinearCircular
Output length$M + N - 1$$\max(M, N)$
BoundaryZero extensionPeriodic wrap-around
Used by FIR filters?YesNo (would contaminate output)

To recover a linear convolution from FFTs, zero-pad both inputs to length $L \geq M + N - 1$. Without this padding the wrap-around contaminates the tail and head of the output.

Overlap-add and overlap-save

For long signal $x$ convolved with a short kernel $h$ ($N \ll \mathrm{len}(x)$), split $x$ into blocks, filter each block via FFT, and combine the results by addition (overlap-add) or by discarding overlapping samples (overlap-save). The full $x$ never needs to enter an FFT, which is the standard pattern for steady-state FIR filtering.

Using these in sangi

  • FFT::convolve(a, b) — internally pads to a power-of-two of length $\geq a.size() + b.size() - 1$, returning the linear convolution.
  • RealFFT::convolve(a, b) — real-input variant, roughly 2x faster via RealFFT::fft.
  • 2D linear convolution: assemble it from fft2d_real, pointwise multiplication, and ifft2d (there is no bundled convolve2d yet).

Batch FFT

Transforming several independent signals of the same length $N$ shows up in stereo audio, spectrograms, and MIMO communication. Processing them simultaneously, interleaved across SIMD lanes, beats the serial loop.

  • Twiddle reuse: $K$ signals share the same twiddle factors per butterfly, raising L1 hit rate.
  • SIMD lane packing: AVX2 fits four float complex numbers (eight floats) into one 256-bit register, running four butterflies per instruction.
  • Less branch overhead: signals share a single control-flow path, eliminating mispredicts.

The sangi simd_fft::batchFFT processes four signals as one AVX2 batch and falls back to ordinary simd_fft::fft() for the leftover. The twiddle table is held per-thread in thread_local storage.

simd_fft::stereoFFT is a separate optimisation that treats interleaved stereo PCM (L0, R0, L1, R1, …) as a complex signal whose real part is L and imaginary part is R, then runs one complex FFT and recovers L and R spectra from conjugate symmetry — twice as fast as two real FFTs.

Applications

Image processing

2D FFTs are central to filtering, feature extraction, and texture analysis.

  • Large-kernel convolution: a Gaussian blur with $\sigma > 10$ has a kernel width of roughly 60+ pixels; FFT-domain filtering beats spatial domain at that point.
  • Phase correlation: detect translation between two images via the normalised cross-spectrum $X \bar{Y} / |X \bar{Y}|$ followed by an inverse FFT and a peak search.
  • Frequency-domain filters: bandpass, highpass, notch, etc. become a single pointwise multiplication in the FFT domain.

PDE spectral methods

On periodic domains FFTs turn spatial derivatives into algebraic operations:

$$\frac{\partial}{\partial x} \;\xrightarrow{\;\mathcal{F}\;}\; i k_x, \qquad \nabla^2 \;\xrightarrow{\;\mathcal{F}\;}\; -(k_x^2 + k_y^2 + k_z^2).$$

2D Navier-Stokes, the Schrödinger equation, and Cahn-Hilliard solvers all build on this identity. sangi fft2d and fft3d are usable directly for prototyping periodic-domain PDE solvers.

Cross-correlation

Dual to convolution, the cross-correlation $r_{xy}[k] = \sum_n x[n] \, \overline{y[n - k]}$ evaluates to $R_{xy}[\xi] = X[\xi] \, \overline{Y[\xi]}$ in the FFT domain. Template matching, time-delay estimation, and autocorrelation routinely go through an FFT.

sangi API

FunctionPurpose
fft2d(matrix)Forward 2D FFT on Matrix<complex>
ifft2d(matrix)Inverse 2D FFT (1/(MN) scaling)
fft2d_real(matrix)Real input promoted to complex then 2D FFT
fft3d(tensor)Forward 3D FFT on Tensor3D<complex>
ifft3d(tensor)Inverse 3D FFT
fft3d_real(tensor)Real 3D tensor FFT
fftshift3d(tensor)Move DC to the centre of the tensor
simd_fft::batchFFT(signals, count, N)AVX2 batched FFT across multiple signals
simd_fft::stereoFFT(...)Stereo PCM split into L and R via a single complex FFT

References

  • Eklundh, J. O. (1972). "A Fast Computer Method for Matrix Transposing". IEEE Trans. Computers, C-21, 801–803.
  • Frigo, M. and Johnson, S. G. (2005). "The Design and Implementation of FFTW3". Proc. IEEE, 93, 216–231.
  • Boyd, J. P. (2001). Chebyshev and Fourier Spectral Methods, 2nd ed. Dover.
  • Oppenheim, A. V. and Schafer, R. W. (2010). Discrete-Time Signal Processing, 3rd ed. Pearson.