How is a multi-dimensional FFT computed?

By row-column decomposition: take a 1D FFT along each row, then along each column (and depth, in 3D). An M×N 2D FFT costs M length-N FFTs plus N length-M FFTs, or O(MN log(MN)) overall.

What does the convolution theorem say?

The Fourier transform of a convolution equals the pointwise product of the Fourier transforms. Implemented as FFT then pointwise multiplication then IFFT, the discrete convolution of length-N signals costs O(N log N) instead of O(N²).

How do linear and circular convolution differ?

Linear convolution produces an output of length M + N − 1 with zero extension at the boundary. Circular convolution, which is what FFT implicitly computes, has length max(M, N) and wraps around. The two coincide only when both inputs are zero-padded to length ≥ M + N − 1, which is mandatory for FIR filtering.

Why is batch FFT faster?

Processing several FFTs of the same size at once amortises the twiddle table across signals and lets AVX2 interleave four real signals into a single 256-bit lane so that each butterfly runs four-wide. sangi simd_fft::batchFFT processes four signals as one unit.

Why is cache-aware transposition necessary?

A column-major access on a row-major array strides by the row width, defeating L1/L2 reuse. Block transposing 32×32 or 64×64 tiles keeps each block within cache and turns the second-axis FFT into unit-stride accesses.

Multi-Dimensional FFT and Convolution

Overview

Multi-dimensional DFTs underpin image processing, spectral methods for partial differential equations, correlation analysis, and quantum chemistry. This article covers the separability-driven row-column decomposition, cache-friendly block transposition, linear vs circular convolution and zero padding, and batch FFTs.

The API surface is fft2d / fft3d for multi-dimensional transforms and simd_fft::batchFFT for multi-signal batches.

2D FFT row-column decomposition

The DFT of an $M \times N$ array $x[m, n]$ is

$$X[k_y, k_x] = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} x[m, n] \, \omega_M^{k_y m} \, \omega_N^{k_x n}.$$

The exponentials separate as $\omega_M^{k_y m} \cdot \omega_N^{k_x n}$, so summing $n$ first yields a per-row 1D FFT, and summing $m$ next yields a per-column 1D FFT.

$$X[k_y, k_x] = \sum_{m=0}^{M-1} \underbrace{\Bigl[ \sum_{n=0}^{N-1} x[m, n] \, \omega_N^{k_x n} \Bigr]}_{\text{row FFT } Y[m, k_x]} \, \omega_M^{k_y m}.$$

The total cost is $M$ length-$N$ FFTs plus $N$ length-$M$ FFTs, i.e. $O(MN \log(MN))$. Every concern that arises from odd, prime, or strided sizes lives inside the 1D FFT.

Cache-aware transposition

Row FFTs on row-major data run unit stride. Column FFTs, however, stride by $N_x$ and lose cache-line reuse. With $N_x = 2048$ and 16-byte complex doubles a column step skips 32 KB, exceeding a typical 32–48 KB L1.

The fix is to transpose between the row and column passes. A naive transpose still suffers from large strides, but tiling into $B \times B$ blocks ($B = 32$ or $64$) keeps each block resident in L1.

$$\text{Block transpose:}\;\; \forall \text{block } (I, J), \; \forall i, j \in [0, B): \;\; A^T[J B + j, I B + i] = A[I B + i, J B + j].$$

Transpose + contiguous 1D FFT generally runs 2–4x faster than a strided 1D FFT. sangi fft2d sandwiches two block transposes (transpose, contiguous FFT, transpose back, contiguous FFT).

3D FFT and the depth axis

In 3D we apply 1D FFTs along $x$, then $y$, then $z$. With row-major layout data[z * ny * nx + y * nx + x], $x$ is unit stride, $y$ strides by $N_x$, and $z$ by $N_x N_y$.

The depth stride is one order larger than the 2D case and may not fit in L1 even after block transposition. Common remedies:

Slabbing along $z$ — limit the working $z$-range to keep the active set small.
3D cubelet transpose — transpose $B \times B \times B$ cubes.
Axis rotation — cycle the layout $(x, y, z) \to (y, z, x) \to (z, x, y)$ between passes.

sangi fft3d uses the Tensor3D<T> container and applies a 1D FFT along each axis in sequence. For moderate sizes (up to $256^3$ or so) the straightforward axis loop is already competitive; large-scale GPU parallelisation is future work.

fftshift3d moves the DC component ($k = 0$) to the centre of the tensor, useful for visualisation and aligning spectra during convolution.

Convolution theorem

The Fourier transform of the continuous convolution $(x \ast y)(t) = \int x(\tau) y(t - \tau) d\tau$ factorises as

$$\mathcal{F}\{x \ast y\}(\xi) = \mathcal{F}\{x\}(\xi) \cdot \mathcal{F}\{y\}(\xi).$$

The discrete analogue holds in the circular sense. A length-$N$ convolution therefore evaluates as FFT → pointwise product → IFFT in $O(N \log N)$, beating the direct $O(N^2)$ sum dramatically.

The same identity extends to higher dimensions:

$$\mathcal{F}_{2D}\{X \ast K\}(\xi, \eta) = \mathcal{F}_{2D}\{X\}(\xi, \eta) \cdot \mathcal{F}_{2D}\{K\}(\xi, \eta).$$

A 2D convolution costs two forward 2D FFTs, one pointwise product, and one inverse 2D FFT, cutting the naive $O(M^2 N^2)$ spatial-domain cost down to $O(MN \log(MN))$.

Linear vs circular convolution

FFT implicitly computes a length-$N$ circular convolution on inputs assumed to repeat with period $N$. The linear convolution that signal processing usually wants behaves differently:

	Linear	Circular
Output length	$M + N - 1$	$\max(M, N)$
Boundary	Zero extension	Periodic wrap-around
Used by FIR filters?	Yes	No (would contaminate output)

To recover a linear convolution from FFTs, zero-pad both inputs to length $L \geq M + N - 1$. Without this padding the wrap-around contaminates the tail and head of the output.

Overlap-add and overlap-save

For long signal $x$ convolved with a short kernel $h$ ($N \ll \mathrm{len}(x)$), split $x$ into blocks, filter each block via FFT, and combine the results by addition (overlap-add) or by discarding overlapping samples (overlap-save). The full $x$ never needs to enter an FFT, which is the standard pattern for steady-state FIR filtering.

Using these in sangi

FFT::convolve(a, b) — internally pads to a power-of-two of length $\geq a.size() + b.size() - 1$, returning the linear convolution.
RealFFT::convolve(a, b) — real-input variant, roughly 2x faster via RealFFT::fft.
2D linear convolution: assemble it from fft2d_real, pointwise multiplication, and ifft2d (there is no bundled convolve2d yet).

Batch FFT

Transforming several independent signals of the same length $N$ shows up in stereo audio, spectrograms, and MIMO communication. Processing them simultaneously, interleaved across SIMD lanes, beats the serial loop.

Twiddle reuse: $K$ signals share the same twiddle factors per butterfly, raising L1 hit rate.
SIMD lane packing: AVX2 fits four float complex numbers (eight floats) into one 256-bit register, running four butterflies per instruction.
Less branch overhead: signals share a single control-flow path, eliminating mispredicts.

The sangi simd_fft::batchFFT processes four signals as one AVX2 batch and falls back to ordinary simd_fft::fft() for the leftover. The twiddle table is held per-thread in thread_local storage.

simd_fft::stereoFFT is a separate optimisation that treats interleaved stereo PCM (L0, R0, L1, R1, …) as a complex signal whose real part is L and imaginary part is R, then runs one complex FFT and recovers L and R spectra from conjugate symmetry — twice as fast as two real FFTs.

Applications

Image processing

2D FFTs are central to filtering, feature extraction, and texture analysis.

Large-kernel convolution: a Gaussian blur with $\sigma > 10$ has a kernel width of roughly 60+ pixels; FFT-domain filtering beats spatial domain at that point.
Phase correlation: detect translation between two images via the normalised cross-spectrum $X \bar{Y} / |X \bar{Y}|$ followed by an inverse FFT and a peak search.
Frequency-domain filters: bandpass, highpass, notch, etc. become a single pointwise multiplication in the FFT domain.

PDE spectral methods

On periodic domains FFTs turn spatial derivatives into algebraic operations:

$$\frac{\partial}{\partial x} \;\xrightarrow{\;\mathcal{F}\;}\; i k_x, \qquad \nabla^2 \;\xrightarrow{\;\mathcal{F}\;}\; -(k_x^2 + k_y^2 + k_z^2).$$

2D Navier-Stokes, the Schrödinger equation, and Cahn-Hilliard solvers all build on this identity. sangi fft2d and fft3d are usable directly for prototyping periodic-domain PDE solvers.

Cross-correlation

Dual to convolution, the cross-correlation $r_{xy}[k] = \sum_n x[n] \, \overline{y[n - k]}$ evaluates to $R_{xy}[\xi] = X[\xi] \, \overline{Y[\xi]}$ in the FFT domain. Template matching, time-delay estimation, and autocorrelation routinely go through an FFT.

sangi API

Function	Purpose
`fft2d(matrix)`	Forward 2D FFT on `Matrix<complex>`
`ifft2d(matrix)`	Inverse 2D FFT (1/(MN) scaling)
`fft2d_real(matrix)`	Real input promoted to complex then 2D FFT
`fft3d(tensor)`	Forward 3D FFT on `Tensor3D<complex>`
`ifft3d(tensor)`	Inverse 3D FFT
`fft3d_real(tensor)`	Real 3D tensor FFT
`fftshift3d(tensor)`	Move DC to the centre of the tensor
`simd_fft::batchFFT(signals, count, N)`	AVX2 batched FFT across multiple signals
`simd_fft::stereoFFT(...)`	Stereo PCM split into L and R via a single complex FFT

References

Eklundh, J. O. (1972). "A Fast Computer Method for Matrix Transposing". IEEE Trans. Computers, C-21, 801–803.
Frigo, M. and Johnson, S. G. (2005). "The Design and Implementation of FFTW3". Proc. IEEE, 93, 216–231.
Boyd, J. P. (2001). Chebyshev and Fourier Spectral Methods, 2nd ed. Dover.
Oppenheim, A. V. and Schafer, R. W. (2010). Discrete-Time Signal Processing, 3rd ed. Pearson.