Mathematical Principles of Stable Diffusion
A rigorous derivation of SDEs, score functions, and latent diffusion
Building on the overview of diffusion models, this article develops the mathematics behind Stable Diffusion. It covers the score function and its relationship to Langevin dynamics, the unified formulation via stochastic differential equations (SDEs), the ODE interpretation of DDIM, diffusion in the VAE latent space, and the derivation of Classifier-Free Guidance, followed by the later architectures built on these ideas.
1. Score Function and Score Matching
Definition of the score function
The score function of a probability density $p(\bx)$ is defined as the gradient of the log density:
$$\boldsymbol{s}(\bx) = \nabla_{\bx} \log p(\bx)$$
It is a vector field pointing in the direction of increasing data density. Because the gradient of the log removes any constant factor, it can be evaluated without knowing the normalizing constant of $p(\bx)$.
Once the score function is known, one can sample from $p(\bx)$ via Langevin dynamics:
$$\bx_{k+1} = \bx_k + \dfrac{\eta}{2} \nabla_{\bx} \log p(\bx_k) + \sqrt{\eta}\, \bz_k, \quad \bz_k \sim \N(0, I)$$
In the limit of step size $\eta \to 0$ and number of iterations $K \to \infty$, the distribution of $\bx_K$ converges to $p(\bx)$ (under mild regularity conditions on $p$).
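As a minimal sketch of the Langevin update above, the following runs the iteration on a one-dimensional Gaussian whose score is known in closed form. The target $\N(2, 0.5^2)$, the step size, the chain length, and the function names are illustrative assumptions, not values from the text.

```python
import math
import random

def gaussian_score(x, mu=2.0, sigma=0.5):
    """Score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def langevin_sample(score, x0, eta=1e-3, n_steps=4000, rng=random):
    """Unadjusted Langevin dynamics: x <- x + (eta/2) * score(x) + sqrt(eta) * z."""
    x = x0
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * eta * score(x) + math.sqrt(eta) * z
    return x

rng = random.Random(0)
# Run independent chains from arbitrary starting points (assumed range [-3, 3]).
samples = [langevin_sample(gaussian_score, rng.uniform(-3.0, 3.0), rng=rng)
           for _ in range(200)]
mean = sum(samples) / len(samples)
std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
print(f"sample mean ~ {mean:.2f}, sample std ~ {std:.2f}")
```

With a finite step size the chain targets a slightly biased distribution, which is why the statement holds only in the limit $\eta \to 0$.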
Score matching
We approximate the score function with a neural network $\boldsymbol{s}_\theta(\bx)$. The direct score-matching loss is:
$$\L_{\text{SM}} = \E_{p(\bx)}\left[\dfrac{1}{2}\left\|\boldsymbol{s}_\theta(\bx) - \nabla_{\bx}\log p(\bx)\right\|^2\right]$$
Because the true score is unknown, we use denoising score matching (Vincent, 2011). For the perturbation kernel $p_\sigma(\tilde{\bx}|\bx) = \N(\tilde{\bx}; \bx, \sigma^2 I)$, which adds Gaussian noise of standard deviation $\sigma$ to the data:
$$\L_{\text{DSM}} = \E_{\bx \sim p(\bx),\, \tilde{\bx} \sim p_\sigma(\tilde{\bx}|\bx)}\left[\dfrac{1}{2}\left\|\boldsymbol{s}_\theta(\tilde{\bx}, \sigma) - \nabla_{\tilde{\bx}}\log p_\sigma(\tilde{\bx}|\bx)\right\|^2\right]$$For Gaussian noise the conditional score can be computed explicitly:
$$\nabla_{\tilde{\bx}} \log p_\sigma(\tilde{\bx}|\bx) = -\dfrac{\tilde{\bx} - \bx}{\sigma^2} = -\dfrac{\bepsilon}{\sigma}$$
where $\tilde{\bx} = \bx + \sigma\bepsilon$ with $\bepsilon \sim \N(0, I)$.
Equivalence of the score function and noise prediction
The DDPM noise-prediction network $\bepsilon_\theta(\bx_t, t)$ and the score function are related by (hereafter $\beta_t$ denotes the noise schedule, $\alpha_t = 1 - \beta_t$, and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$):
$$\boldsymbol{s}_\theta(\bx_t, t) = -\dfrac{\bepsilon_\theta(\bx_t, t)}{\sqrt{1 - \bar{\alpha}_t}}$$
This is the Gaussian conditional score with noise level $\sigma = \sqrt{1 - \bar{\alpha}_t}$. In other words, training a diffusion model is essentially equivalent to score matching.
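The relation can be checked numerically. The sketch below assumes the DDPM forward relation $\bx_t = \sqrt{\bar{\alpha}_t}\bx_0 + \sqrt{1-\bar{\alpha}_t}\bepsilon$ (used in Section 3) and compares the score of $q(\bx_t|\bx_0)$ with $-\bepsilon/\sqrt{1-\bar{\alpha}_t}$. The value of $\bar{\alpha}_t$ and the helper names are illustrative.

```python
import math
import random

def conditional_score(x_t, x0, alpha_bar):
    """Score of q(x_t | x0) = N(sqrt(alpha_bar) * x0, (1 - alpha_bar) I), per coordinate."""
    return [-(xt - math.sqrt(alpha_bar) * x) / (1.0 - alpha_bar)
            for xt, x in zip(x_t, x0)]

def score_from_eps(eps, alpha_bar):
    """The same quantity expressed through the noise: -eps / sqrt(1 - alpha_bar)."""
    return [-e / math.sqrt(1.0 - alpha_bar) for e in eps]

rng = random.Random(0)
alpha_bar = 0.3                                  # illustrative value of \bar{alpha}_t
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]
x_t = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
       for x, e in zip(x0, eps)]

s_direct = conditional_score(x_t, x0, alpha_bar)
s_via_eps = score_from_eps(eps, alpha_bar)
max_gap = max(abs(a - b) for a, b in zip(s_direct, s_via_eps))
print(max_gap)   # agreement up to floating-point rounding
```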
2. A Unified Formulation via SDEs
Song et al. (2021) gave a unified formulation of diffusion models as continuous-time stochastic differential equations (SDEs).
VP-SDE (variance preserving)
The SDE corresponding to DDPM takes the following form:
$$d\bx = -\dfrac{1}{2}\beta(t)\bx\,dt + \sqrt{\beta(t)}\,d\boldsymbol{w}$$Here $\beta(t)$ is the noise schedule, and $\beta(t) = \beta_{\min} + (\beta_{\max} - \beta_{\min})t$ corresponds to a linear schedule.
VE-SDE (variance exploding)
This corresponds to SMLD (Score Matching with Langevin Dynamics):
$$d\bx = \sqrt{\dfrac{d[\sigma^2(t)]}{dt}}\,d\boldsymbol{w}$$
Probability-flow ODE
For any SDE of the form $d\bx = f(\bx, t)\,dt + g(t)\,d\boldsymbol{w}$ (drift $f$, diffusion coefficient $g$), there exists an ordinary differential equation (ODE) that produces the same marginal distributions $p_t(\bx)$:
Probability-flow ODE
$$\dfrac{d\bx}{dt} = f(\bx, t) - \dfrac{1}{2}g(t)^2 \nabla_{\bx}\log p_t(\bx)$$The difference from the SDE is the absence of the noise term $g(t)\,d\boldsymbol{w}$. Because it is deterministic, the same initial value yields the same trajectory. DDIM can be interpreted as a discretization of this ODE.
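To make the probability-flow ODE concrete, here is a sketch for the VP-SDE with one-dimensional Gaussian data, where the marginal score is available exactly and the ODE solution can be compared with a closed form. The schedule endpoints `BETA_MIN`, `BETA_MAX` and the data standard deviation `S0` are assumed values chosen for illustration.

```python
import math

BETA_MIN, BETA_MAX = 0.1, 20.0   # assumed endpoints of the linear schedule beta(t)
S0 = 0.5                         # assumed std of the toy data distribution N(0, S0^2)

def beta(t):
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t

def marginal_var(t):
    """Variance of p_t under the VP-SDE when p_0 = N(0, S0^2)."""
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t * t   # int_0^t beta(s) ds
    a2 = math.exp(-integral)                                        # squared signal coefficient
    return a2 * S0**2 + (1.0 - a2)

def score(x, t):
    """Exact score of the Gaussian marginal p_t = N(0, marginal_var(t))."""
    return -x / marginal_var(t)

def pf_ode_drift(x, t):
    """f(x, t) - 0.5 * g(t)^2 * score, with f = -0.5 * beta * x and g^2 = beta."""
    return -0.5 * beta(t) * x - 0.5 * beta(t) * score(x, t)

def integrate_backward(x1, n_steps=2000):
    """Euler integration of the probability-flow ODE from t = 1 down to t = 0."""
    x, dt = x1, 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        x = x - dt * pf_ode_drift(x, t)
    return x

x1 = 1.3
x0_numeric = integrate_backward(x1)
# For Gaussian marginals the flow keeps x / sqrt(var_t) constant along a trajectory.
x0_exact = x1 * S0 / math.sqrt(marginal_var(1.0))
print(x0_numeric, x0_exact)
```

Running `integrate_backward` twice on the same `x1` returns the same value, which is the determinism noted above. In a real model the exact `score` would be replaced by the learned $\boldsymbol{s}_\theta$.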
3. Derivation of the DDPM Variational Bound
The DDPM training objective is derived from the variational lower bound (ELBO) on the data log-likelihood.
Lower bound on the log-likelihood
$$\log p_\theta(\bx_0) \geq \E_q\left[\log \dfrac{p_\theta(\bx_{0:T})}{q(\bx_{1:T}|\bx_0)}\right] = -\L_{\text{VLB}}$$The variational bound $\L_{\text{VLB}}$ decomposes into three types of terms:
$$\L_{\text{VLB}} = \underbrace{D_{\KL}(q(\bx_T|\bx_0) \| p(\bx_T))}_{\L_T} + \displaystyle\sum_{t=2}^{T} \underbrace{D_{\KL}(q(\bx_{t-1}|\bx_t, \bx_0) \| p_\theta(\bx_{t-1}|\bx_t))}_{\L_{t-1}} - \underbrace{\log p_\theta(\bx_0|\bx_1)}_{\L_0}$$
Computing the posterior
By Bayes' theorem, $q(\bx_{t-1}|\bx_t, \bx_0)$ can be computed in closed form as a Gaussian:
$$q(\bx_{t-1}|\bx_t, \bx_0) = \N(\bx_{t-1};\, \tilde{\bmu}_t(\bx_t, \bx_0),\, \tilde{\beta}_t I)$$where:
$$\tilde{\bmu}_t(\bx_t, \bx_0) = \dfrac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1 - \bar{\alpha}_t}\bx_0 + \dfrac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\bx_t, \quad \tilde{\beta}_t = \dfrac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\beta_t$$
Reparameterization and the simplified loss
Using $\bx_t = \sqrt{\bar{\alpha}_t}\bx_0 + \sqrt{1-\bar{\alpha}_t}\bepsilon$, we eliminate $\bx_0$ and express the mean in terms of $\bepsilon$:
$$\tilde{\bmu}_t(\bx_t, \bepsilon) = \dfrac{1}{\sqrt{\alpha_t}}\left(\bx_t - \dfrac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\bepsilon\right)$$Designing the mean of $p_\theta$ as $\bmu_\theta(\bx_t, t) = \dfrac{1}{\sqrt{\alpha_t}}\left(\bx_t - \dfrac{\beta_t}{\sqrt{1-\bar{\alpha}_t}}\bepsilon_\theta(\bx_t, t)\right)$ gives:
The simplified DDPM loss
$$\L_{\text{simple}} = \E_{t \sim U[1,T],\, \bx_0,\, \bepsilon \sim \N(0,I)}\left[\left\|\bepsilon - \bepsilon_\theta(\sqrt{\bar{\alpha}_t}\bx_0 + \sqrt{1-\bar{\alpha}_t}\bepsilon,\, t)\right\|^2\right]$$This removes the time-dependent weighting factor $\dfrac{\beta_t^2}{2\sigma_t^2 \alpha_t(1-\bar{\alpha}_t)}$ from each term $\L_{t-1}$ of $\L_{\text{VLB}}$ (i.e., it weights them uniformly), which experimentally yields more stable training.
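The reparameterization and the simplified loss can be sketched for scalar data as follows. The code first confirms that the two expressions for $\tilde{\bmu}_t$ agree, then forms a Monte Carlo estimate of $\L_{\text{simple}}$. The linear schedule values are those quoted in Section 8; `zero_model` is a hypothetical stand-in for a trained $\bepsilon_\theta$.

```python
import math
import random

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]   # beta_1 .. beta_T
alphas = [1.0 - b for b in betas]
alpha_bars, prod = [], 1.0
for a in alphas:
    prod *= a
    alpha_bars.append(prod)                                      # index t-1 holds \bar{alpha}_t

def posterior_mean_x0(x_t, x0, t):
    """tilde{mu}_t written in terms of x_0 (t is 1-indexed, t >= 2)."""
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    c0 = math.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    ct = math.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    return c0 * x0 + ct * x_t

def posterior_mean_eps(x_t, eps, t):
    """tilde{mu}_t written in terms of the noise eps."""
    ab_t = alpha_bars[t - 1]
    return (x_t - betas[t - 1] / math.sqrt(1.0 - ab_t) * eps) / math.sqrt(alphas[t - 1])

def simple_loss_sample(x0, eps_model, rng):
    """One Monte Carlo sample of L_simple: draw t and eps, noise x0, compare predictions."""
    t = rng.randint(1, T)
    eps = rng.gauss(0.0, 1.0)
    ab_t = alpha_bars[t - 1]
    x_t = math.sqrt(ab_t) * x0 + math.sqrt(1.0 - ab_t) * eps
    return (eps - eps_model(x_t, t)) ** 2

rng = random.Random(0)
x0, eps, t = 0.7, -1.2, 500
x_t = math.sqrt(alpha_bars[t - 1]) * x0 + math.sqrt(1.0 - alpha_bars[t - 1]) * eps
gap = abs(posterior_mean_x0(x_t, x0, t) - posterior_mean_eps(x_t, eps, t))

zero_model = lambda x_t, t: 0.0      # hypothetical placeholder, always predicts zero noise
loss = sum(simple_loss_sample(x0, zero_model, rng) for _ in range(20000)) / 20000
print(gap, loss)
```

A model that always predicts zero gives a loss close to $\E[\bepsilon^2] = 1$ per dimension, which is a useful baseline when monitoring training.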
4. The ODE Interpretation of DDIM
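Song et al. (2020) proposed DDIM, which reuses the same trained model $\bepsilon_\theta$ as DDPM but generalizes the forward process to a non-Markovian one with the same marginals $q(\bx_t|\bx_0)$, yielding a family of reverse (sampling) processes.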
The DDIM update
$$\bx_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\underbrace{\left(\dfrac{\bx_t - \sqrt{1-\bar{\alpha}_t}\,\bepsilon_\theta(\bx_t, t)}{\sqrt{\bar{\alpha}_t}}\right)}_{\text{"predicted } \bx_0\text{"}} + \underbrace{\sqrt{1-\bar{\alpha}_{t-1} - \sigma_t^2}\,\bepsilon_\theta(\bx_t, t)}_{\text{direction pointing to } \bx_t} + \underbrace{\sigma_t\,\bz_t}_{\text{random noise}}$$Setting $\sigma_t = 0$ gives a fully deterministic update, which is DDIM. Setting $\sigma_t = \sqrt{\dfrac{(1-\bar{\alpha}_{t-1})}{(1-\bar{\alpha}_t)}\beta_t}$ recovers DDPM.
Because DDIM ($\sigma_t = 0$) corresponds to a discretization of the probability-flow ODE, computing it over a subset of timesteps $\{\tau_1, \tau_2, \ldots, \tau_S\} \subset \{1, \ldots, T\}$ ($S \ll T$) yields good samples even with a small number of steps.
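The DDIM update on a timestep subsequence can be sketched as below for scalar data. To keep the example checkable, the true noise is used in place of $\bepsilon_\theta$ (an oracle assumption), so the deterministic update should land exactly on $\bx_0$. The 50-step grid and the helper names are illustrative.

```python
import math

def ddim_step(x_t, eps_pred, ab_t, ab_prev, sigma_t=0.0, z=0.0):
    """One DDIM update from \bar{alpha}_t to \bar{alpha}_{t-1} (scalar version)."""
    x0_pred = (x_t - math.sqrt(1.0 - ab_t) * eps_pred) / math.sqrt(ab_t)
    direction = math.sqrt(1.0 - ab_prev - sigma_t**2) * eps_pred
    return math.sqrt(ab_prev) * x0_pred + direction + sigma_t * z

def make_alpha_bars(T=1000, beta_1=1e-4, beta_T=0.02):
    """Cumulative products for the linear schedule; index t-1 holds \bar{alpha}_t."""
    out, prod = [], 1.0
    for i in range(T):
        prod *= 1.0 - (beta_1 + (beta_T - beta_1) * i / (T - 1))
        out.append(prod)
    return out

alpha_bars = make_alpha_bars()
taus = list(range(1000, 0, -20))          # S = 50 timesteps: 1000, 980, ..., 20

x0_true, eps_true = 0.7, -1.2
ab_start = alpha_bars[taus[0] - 1]
x = math.sqrt(ab_start) * x0_true + math.sqrt(1.0 - ab_start) * eps_true
for i, tau in enumerate(taus):
    ab_t = alpha_bars[tau - 1]
    # After the last timestep we step to t = 0, where \bar{alpha}_0 = 1.
    ab_prev = alpha_bars[taus[i + 1] - 1] if i + 1 < len(taus) else 1.0
    x = ddim_step(x, eps_true, ab_t, ab_prev)   # oracle noise instead of eps_theta
print(x)   # recovers x0_true up to rounding
```

With a learned $\bepsilon_\theta$ the prediction changes from step to step, so the result depends on the grid; the oracle case only verifies the algebra of the update.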
5. VAE Latent Space and Latent Diffusion
The core idea of Stable Diffusion is to perform diffusion not in pixel space but in the VAE latent space (Rombach et al., 2022).
The math of the VAE
An image $\bx \in \R^{H \times W \times 3}$ is compressed by the encoder $\mathcal{E}$ into a latent $\bz = \mathcal{E}(\bx) \in \R^{h \times w \times c}$ and reconstructed by the decoder $\mathcal{D}$. The loss is:
$$\L_{\text{VAE}} = \underbrace{\|\bx - \mathcal{D}(\mathcal{E}(\bx))\|^2}_{\text{reconstruction loss}} + \lambda_{\KL} \cdot \underbrace{D_{\KL}(q(\bz|\bx) \| \N(0, I))}_{\text{KL regularization}}$$Stable Diffusion uses $f = H/h = W/w = 8$ (8× downsampling) and $c = 4$. The VAE is pretrained independently of the diffusion model.
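The shape arithmetic implied by $f = 8$ and $c = 4$ is easy to verify. This small sketch computes the latent shape and the ratio of pixel-space to latent-space dimensions; the function names are illustrative.

```python
def latent_shape(height, width, f=8, c=4):
    """Latent shape (h, w, c) for an H x W x 3 image with downsampling factor f."""
    return (height // f, width // f, c)

def compression_ratio(height, width, f=8, c=4):
    """Number of pixel-space values divided by number of latent values."""
    h, w, c_lat = latent_shape(height, width, f, c)
    return (height * width * 3) / (h * w * c_lat)

shape_512 = latent_shape(512, 512)
ratio_512 = compression_ratio(512, 512)
print(shape_512, ratio_512)   # (64, 64, 4) 48.0
```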
Diffusion in latent space
Rather than the pixel-space $\bx$, the usual diffusion (noising/denoising) is applied to the latent $\bz_0 = \mathcal{E}(\bx_0)$:
$$\bz_t = \sqrt{\bar{\alpha}_t}\,\bz_0 + \sqrt{1 - \bar{\alpha}_t}\,\bepsilon, \quad \bepsilon \sim \N(0, I)$$ $$\L_{\text{LDM}} = \E_{t, \bz_0, \bepsilon}\left[\|\bepsilon - \bepsilon_\theta(\bz_t, t, c)\|^2\right]$$At generation time, after obtaining $\bz_0$ from $\bz_T \sim \N(0, I)$ by reverse diffusion, we map back to pixel space via $\hat{\bx} = \mathcal{D}(\bz_0)$.
6. The Math of Text Conditioning
The text prompt is converted by the CLIP text encoder into a sequence of feature vectors $\boldsymbol{c} = [\boldsymbol{c}_1, \ldots, \boldsymbol{c}_L] \in \R^{L \times d_c}$. This condition is injected via Cross-Attention layers inside the U-Net:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\dfrac{QK^\top}{\sqrt{d}}\right)V$$ $$Q = W_Q^{(i)} \cdot \varphi_i(\bz_t), \quad K = W_K^{(i)} \cdot \boldsymbol{c}, \quad V = W_V^{(i)} \cdot \boldsymbol{c}$$Here $\varphi_i(\bz_t)$ is an intermediate representation of the U-Net (the flattened feature map of layer $i$). Queries come from the image features, while Keys and Values come from the text features. In this way the semantic information of the text spatially guides image generation.
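A plain-Python sketch of this cross-attention step is shown below. It uses the row-vector convention (features times weight matrix) rather than the left-multiplication written above, and all sizes are toy values, not those of Stable Diffusion. The random matrices stand in for learned projections.

```python
import math
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(img_feats, txt_feats, w_q, w_k, w_v):
    """Queries from image features; keys and values from text features."""
    q = matmul(img_feats, w_q)                    # (N, d)
    k = matmul(txt_feats, w_k)                    # (L, d)
    v = matmul(txt_feats, w_v)                    # (L, d)
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]          # (d, L)
    scores = matmul(q, k_t)                       # (N, L)
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v), weights            # (N, d) and (N, L)

rng = random.Random(0)
rand = lambda r, c: [[rng.gauss(0.0, 1.0) for _ in range(c)] for _ in range(r)]
N, L, d_img, d_c, d = 6, 3, 5, 4, 8               # toy sizes chosen for illustration
out, weights = cross_attention(rand(N, d_img), rand(L, d_c),
                               rand(d_img, d), rand(d_c, d), rand(d_c, d))
print(len(out), len(out[0]), [round(sum(row), 6) for row in weights])
```

Each row of `weights` is a distribution over the $L$ text tokens for one spatial position, which is the sense in which the text guides generation spatially.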
7. Derivation of Classifier-Free Guidance
We give the mathematical basis of Classifier-Free Guidance (CFG; Ho & Salimans, 2022), which controls the quality of conditional generation.
Derivation from Bayes' theorem
The score of the conditional distribution is:
$$\nabla_{\bz} \log p(\bz|c) = \nabla_{\bz} \log p(\bz) + \nabla_{\bz} \log p(c|\bz)$$We amplify the score with a guidance strength $w$:
$$\nabla_{\bz} \log p_w(\bz|c) = \nabla_{\bz} \log p(\bz) + w \cdot \nabla_{\bz} \log p(c|\bz)$$Substituting $\nabla_{\bz}\log p(c|\bz) = \nabla_{\bz}\log p(\bz|c) - \nabla_{\bz}\log p(\bz)$:
The Classifier-Free Guidance formula
$$\nabla_{\bz}\log p_w(\bz|c) = (1-w)\,\nabla_{\bz}\log p(\bz) + w\,\nabla_{\bz}\log p(\bz|c)$$Written in terms of noise prediction:
$$\tilde{\bepsilon}_\theta(\bz_t, t, c) = (1-w)\,\bepsilon_\theta(\bz_t, t, \varnothing) + w\,\bepsilon_\theta(\bz_t, t, c)$$ $$= \bepsilon_\theta(\bz_t, t, \varnothing) + w\,\bigl(\bepsilon_\theta(\bz_t, t, c) - \bepsilon_\theta(\bz_t, t, \varnothing)\bigr)$$
$w = 1$: ordinary conditional generation. $w > 1$: emphasizes fidelity to the text. $w = 7.5$ is the default setting in Stable Diffusion.
During training, with a fixed probability (typically 10%) the text condition $c$ is replaced by the empty condition $\varnothing$ via a random drop, so that a single model learns both conditional and unconditional generation.
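At sampling time the two predictions are combined per element. A minimal sketch, with short lists standing in for the outputs of $\bepsilon_\theta(\bz_t, t, \varnothing)$ and $\bepsilon_\theta(\bz_t, t, c)$ (the numbers are made up for illustration):

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.10, -0.30, 0.50]   # stand-in for eps_theta(z_t, t, empty condition)
eps_cond = [0.20, -0.10, 0.40]     # stand-in for eps_theta(z_t, t, c)

plain = cfg_combine(eps_uncond, eps_cond, w=1.0)    # reduces to the conditional prediction
guided = cfg_combine(eps_uncond, eps_cond, w=7.5)   # extrapolates past it
print(plain, guided)
```

In practice both predictions are usually obtained in one batched forward pass, so guidance roughly doubles the cost of each sampling step.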
8. The Math of the Noise Scheduler
The choice of noise schedule $\{\beta_t\}_{t=1}^T$ greatly affects generation quality.
Linear schedule (DDPM)
$$\beta_t = \beta_1 + \dfrac{t-1}{T-1}(\beta_T - \beta_1), \quad \beta_1 = 10^{-4},\; \beta_T = 0.02$$
Cosine schedule (Nichol & Dhariwal, 2021)
Designed so that $\bar{\alpha}_t$ falls off smoothly, close to linearly in the middle of the process and only gradually near $t = 0$ and $t = T$:
$$\bar{\alpha}_t = \dfrac{f(t)}{f(0)}, \quad f(t) = \cos\left(\dfrac{t/T + s}{1 + s} \cdot \dfrac{\pi}{2}\right)^2$$
$s = 0.008$ is a small offset (it prevents $\beta_t$ from becoming too small near $t = 0$).
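Compared with the linear schedule, the cosine schedule spreads the decline in signal-to-noise ratio (SNR) more evenly across the steps instead of destroying most of the signal early. Nichol & Dhariwal (2021) report the improvement mainly for low-resolution images (64×64 and 32×32), where the linear schedule is suboptimal.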
Signal-to-noise ratio (SNR)
$$\text{SNR}(t) = \dfrac{\bar{\alpha}_t}{1 - \bar{\alpha}_t}$$
$\text{SNR} \to \infty$ at $t = 0$ (pure signal) and $\text{SNR} \to 0$ at $t = T$ (pure noise). Because $\L_{\text{simple}}$ samples $t$ uniformly, the weight it places on each SNR level is determined by the noise schedule.
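The two schedules and their SNR can be compared directly. This sketch implements the formulas of this section with $T = 1000$. The clipping of $\beta_t$ that Nichol & Dhariwal apply near $t = T$ is omitted for brevity, and the choice of the midpoint for comparison is illustrative.

```python
import math

def linear_alpha_bars(T=1000, beta_1=1e-4, beta_T=0.02):
    """\bar{alpha}_t for the linear beta schedule; index t-1 holds \bar{alpha}_t."""
    out, prod = [], 1.0
    for t in range(1, T + 1):
        beta_t = beta_1 + (t - 1) / (T - 1) * (beta_T - beta_1)
        prod *= 1.0 - beta_t
        out.append(prod)
    return out

def cosine_alpha_bars(T=1000, s=0.008):
    """\bar{alpha}_t = f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi / 2)."""
    f = lambda t: math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    return [f(t) / f(0) for t in range(1, T + 1)]

def snr(alpha_bar):
    return alpha_bar / (1.0 - alpha_bar)

lin, cos_ = linear_alpha_bars(), cosine_alpha_bars()
mid = 500
log_snr_lin_mid = math.log(snr(lin[mid - 1]))
log_snr_cos_mid = math.log(snr(cos_[mid - 1]))
print(f"alpha_bar at t=500: linear {lin[mid - 1]:.3f}, cosine {cos_[mid - 1]:.3f}")
print(f"log-SNR at t=500:   linear {log_snr_lin_mid:.2f}, cosine {log_snr_cos_mid:.2f}")
```

At the midpoint the linear schedule has already removed most of the signal, while the cosine schedule retains about half of it.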
9. v-prediction (SD 2.x)
SD 1.x predicts $\bepsilon$ (the noise), whereas the 768×768 SD 2.x models adopted v-prediction (Salimans & Ho, 2022); the 512×512 SD 2.x base models still predict $\bepsilon$. The prediction target $\boldsymbol{v}$ is defined as a linear combination of $\bx_0$ and $\bepsilon$:
Definition of v-prediction
$$\boldsymbol{v}_t = \sqrt{\bar{\alpha}_t}\,\bepsilon - \sqrt{1-\bar{\alpha}_t}\,\bx_0$$Solving inversely:
$$\hat{\bx}_0 = \sqrt{\bar{\alpha}_t}\,\bx_t - \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{v}_\theta(\bx_t, t)$$ $$\hat{\bepsilon} = \sqrt{1-\bar{\alpha}_t}\,\bx_t + \sqrt{\bar{\alpha}_t}\,\boldsymbol{v}_\theta(\bx_t, t)$$
Advantages of v-prediction
- SNR-independent stability: with $\bepsilon$-prediction, recovering $\hat{\bx}_0 = (\bx_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\bepsilon})/\sqrt{\bar{\alpha}_t}$ divides by $\sqrt{\bar{\alpha}_t}$, so as $t \to T$ (low SNR) small errors in $\hat{\bepsilon}$ are strongly amplified. $\boldsymbol{v}$ corresponds to an angular parameterization between signal and noise, and its conversion formulas involve no such division.
- Numerical stability: in the limit $\text{SNR}(t) \to 0$, the $\hat{\bx}_0$ implied by an $\bepsilon$-prediction becomes ill-conditioned, while the coefficients in the $\boldsymbol{v}$-prediction formulas above stay bounded by 1 for every $t$.
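The conversion formulas for $\boldsymbol{v}$ can be checked with a round trip, and the same sketch shows how a fixed prediction error propagates to $\hat{\bx}_0$ under each parameterization at low SNR. The scalar values, the choice $\bar{\alpha}_t = 10^{-4}$, and the error size are illustrative assumptions.

```python
import math

def v_target(x0, eps, alpha_bar):
    """v_t = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0."""
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0

def x0_from_v(x_t, v, alpha_bar):
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v

def eps_from_v(x_t, v, alpha_bar):
    return math.sqrt(1.0 - alpha_bar) * x_t + math.sqrt(alpha_bar) * v

def x0_from_eps(x_t, eps, alpha_bar):
    """x0 estimate implied by an eps-prediction; note the division by sqrt(alpha_bar)."""
    return (x_t - math.sqrt(1.0 - alpha_bar) * eps) / math.sqrt(alpha_bar)

x0, eps, alpha_bar = 0.7, -1.2, 1e-4          # very low SNR, close to t = T
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
v = v_target(x0, eps, alpha_bar)

err = 1e-3                                    # the same small error on either prediction
x0_err_v = abs(x0_from_v(x_t, v + err, alpha_bar) - x0)
x0_err_eps = abs(x0_from_eps(x_t, eps + err, alpha_bar) - x0)
print(x0_err_v, x0_err_eps)
```

The error on $\hat{\bx}_0$ is scaled by $\sqrt{1-\bar{\alpha}_t}$ in the $\boldsymbol{v}$ case and by $\sqrt{1-\bar{\alpha}_t}/\sqrt{\bar{\alpha}_t}$ in the $\bepsilon$ case, a factor of 100 apart at this noise level.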
The loss function simply replaces the prediction target:
$$\L_{\text{v}} = \E_{t, \bx_0, \bepsilon}\left[\|\boldsymbol{v}_t - \boldsymbol{v}_\theta(\bx_t, t)\|^2\right]$$
OpenCLIP ViT-H/14 text encoder
SD 1.x used OpenAI's CLIP ViT-L/14 (768 dimensions, about 123M parameters), but SD 2.x switched to OpenCLIP ViT-H/14 (1024 dimensions, about 354M parameters).
- Increased output dimension: the Key/Value of Cross-Attention go from 768d to 1024d, giving a richer semantic text representation. The sizes of the U-Net's $W_K, W_V$ projection matrices also change to $d_{\text{model}} \times 1024$.
- Trained on LAION-2B: OpenCLIP is trained on the open dataset LAION-2B, removing the dependence on OpenAI CLIP's private dataset.
- Penultimate layer: SD 2.x uses the output of the second-to-last (penultimate) layer of the text encoder rather than the final layer. The final layer is too specialized for the contrastive learning objective, whereas the intermediate layer has richer features for the generation task.
768×768 default resolution
In the 768×768 SD 2.x models the training resolution was raised from 512×512 to 768×768. In latent space this becomes $64 \times 64 \to 96 \times 96$ (the downsampling factor $f = 8$ is unchanged), and the number of tokens the U-Net processes at full latent resolution increases from $64^2 = 4{,}096$ to $96^2 = 9{,}216$. Since the cost of Self-Attention is $O(n^2)$, its compute grows by $(9216/4096)^2 \approx 5\times$.
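The token counts and the quadratic cost ratio follow from a few lines of arithmetic. A small sketch, with illustrative function names:

```python
def latent_tokens(resolution, f=8):
    """Number of spatial positions in the latent of a square image."""
    side = resolution // f
    return side * side

n_512 = latent_tokens(512)
n_768 = latent_tokens(768)
cost_ratio = (n_768 / n_512) ** 2      # Self-Attention cost scales as n^2
print(n_512, n_768, cost_ratio)        # 4096 9216 5.0625
```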
Negative prompt and CFG
In the CFG of Section 7, the empty string $\varnothing$ was used for the unconditional prediction $\bepsilon_\theta(\bz_t, t, \varnothing)$. In practice, especially from SD 2.x onward, a negative prompt $c_{\text{neg}}$ is often specified instead of the empty string; this is an inference-time substitution that requires no retraining:
CFG with a negative prompt
$$\tilde{\bepsilon}_\theta(\bz_t, t, c, c_{\text{neg}}) = \bepsilon_\theta(\bz_t, t, c_{\text{neg}}) + w\,\bigl(\bepsilon_\theta(\bz_t, t, c) - \bepsilon_\theta(\bz_t, t, c_{\text{neg}})\bigr)$$When $c_{\text{neg}} = \varnothing$, this coincides with ordinary CFG. Specifying $c_{\text{neg}} =$ "blurry, low quality" makes the guidance direction "from $c_{\text{neg}}$ toward $c$," allowing unwanted features to be removed more aggressively.
Geometrically, the CFG guidance vector is the difference between the conditional and unconditional scores, and the negative prompt moves this "reference point" from the empty string to an arbitrary text. The larger the guidance strength $w$, the greater the displacement in the direction from $c_{\text{neg}}$ to $c$.
10. The SDXL Architecture
SDXL (Podell et al., 2023) is an architecture substantially expanded from SD 1.x/2.x.
Dual text encoders
SDXL concatenates the outputs of two text encoders:
$$\boldsymbol{c}_{\text{cross}} = [\boldsymbol{c}_{\text{CLIP-L}};\, \boldsymbol{c}_{\text{OpenCLIP-G}}] \in \R^{77 \times 2048}$$In addition, the pooled output of OpenCLIP (1280 dimensions) is added to the time embedding to inject global text information into the model.
Micro-conditioning (size and crop conditioning)
The original size and crop coordinates of the training data are given as additional conditions, and the target size is specified at generation time. They are embedded with Fourier features and added to the time embedding:
$$\boldsymbol{e}_{\text{micro}} = \text{MLP}\bigl(\text{FourierEmbed}(h_{\text{orig}}, w_{\text{orig}}, \text{top}, \text{left}, h_{\text{tgt}}, w_{\text{tgt}})\bigr)$$This lets the model avoid, at generation time, artifacts such as the "blur" and "unnatural crops" learned from low-resolution training data.
Refiner
The Base model performs the initial large-scale denoising ($t = T \to t_{\text{switch}}$), and the Refiner handles the remaining refinement ($t = t_{\text{switch}} \to 0$), in an ensemble-of-expert-denoisers scheme. $t_{\text{switch}} \approx 200/1000$ is a typical switch point.
11. Diffusion Transformer (DiT / MM-DiT)
SD 3.0 (Esser et al., 2024) abandoned the U-Net entirely and adopted MM-DiT (Multimodal DiT), based on the Diffusion Transformer (DiT; Peebles & Xie, 2023).
Basic structure of DiT
The latent $\bz \in \R^{h \times w \times c}$ is split into $p \times p$ patches and converted by a linear projection into a token sequence $\boldsymbol{T}_{\text{img}} \in \R^{N \times D}$ ($N = hw/p^2$). Positional embeddings are added, and Transformer blocks are repeated.
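The patch-splitting step can be sketched with nested lists. The sizes below are toy values chosen for illustration, and the linear projection to dimension $D$ and the positional embeddings are left out.

```python
def patchify(latent, p):
    """Split an h x w x c latent (nested lists) into flattened p x p patches."""
    h, w = len(latent), len(latent[0])
    assert h % p == 0 and w % p == 0, "h and w must be divisible by p"
    tokens = []
    for i in range(0, h, p):
        for j in range(0, w, p):
            # Flatten the p x p x c block in row-major order.
            patch = [ch for di in range(p) for dj in range(p)
                     for ch in latent[i + di][j + dj]]
            tokens.append(patch)
    return tokens            # N = h * w / p^2 tokens, each of length p * p * c

h, w, c, p = 8, 8, 4, 2      # toy sizes chosen for illustration
latent = [[[float(i * w + j)] * c for j in range(w)] for i in range(h)]
tokens = patchify(latent, p)
print(len(tokens), len(tokens[0]))   # 16 16
```

Each token would then be mapped to $\R^D$ by a shared linear layer before entering the Transformer blocks.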
MM-DiT (the Joint Attention of SD 3.0)
Self-Attention is computed over the sequence formed by concatenating the image and text tokens:
$$\boldsymbol{T} = [\boldsymbol{T}_{\text{img}};\, \boldsymbol{T}_{\text{txt}}] \in \R^{(N_{\text{img}} + N_{\text{txt}}) \times D}$$ $$\text{JointAttn}(\boldsymbol{T}) = \text{softmax}\left(\dfrac{Q_{\text{all}} K_{\text{all}}^\top}{\sqrt{d}}\right) V_{\text{all}}$$
The image and text tokens nevertheless use independent QKV projection matrices (modality-specific feature extraction). Because the Attention computation itself runs over the joint sequence, text and image interact symmetrically.
AdaLN-Zero (adaptive normalization)
In DiT, the timestep $t$ and the text pooled embedding are used to dynamically generate the LayerNorm scale $\gamma$ and shift $\beta$ of each block, as well as the gate $\alpha$ of the residual connection:
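$$[\gamma_1, \beta_1, \alpha_1, \gamma_2, \beta_2, \alpha_2] = \text{MLP}(\boldsymbol{e}_t + \boldsymbol{e}_{\text{pool}})$$ $$\hat{\boldsymbol{h}} = \alpha_1 \odot \text{Attention}\bigl(\gamma_1 \odot \text{LN}(\boldsymbol{h}) + \beta_1\bigr) + \boldsymbol{h}$$
Here $\gamma_i$, $\beta_i$, and $\alpha_i$ are per-block modulation parameters, unrelated to the noise-schedule quantities $\beta_t$ and $\alpha_t$ of earlier sections.
Text encoders (triple configuration)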
SD 3.x uses three text encoders:
- CLIP ViT-L/14 (768d): basic text understanding
- OpenCLIP ViT-bigG/14 (1280d): high-quality text–image correspondence
- T5-XXL (4096d): understanding of long, complex instructions (language-model based)
The addition of T5 greatly improves fidelity to long natural-language instructions ("a red car parked next to a blue bicycle on a rainy street").
12. Flow Matching and Rectified Flow (SD 3.x / FLUX)
SD 3.x and FLUX leave behind the DDPM framework of "gradually adding/removing noise" and are based on Flow Matching (Lipman et al., 2023) / Rectified Flow (Liu et al., 2023).
Continuous normalizing flow (CNF)
We learn a time-dependent vector field $\boldsymbol{v}_\theta(\bx, t)$ connecting the data distribution $p_0$ and the noise distribution $p_1 = \N(0, I)$. The ODE is:
$$\dfrac{d\bx_t}{dt} = \boldsymbol{v}_\theta(\bx_t, t), \quad t \in [0, 1]$$Data $\bx_0$ corresponds to $t = 0$ and noise $\bx_1$ to $t = 1$. At generation time we solve the ODE from $\bx_1 \sim \N(0, I)$ in the direction $t = 1 \to 0$.
The Flow Matching loss
We define the intermediate state $\bx_t$ by linear interpolation between data $\bx_0$ and noise $\bx_1 \sim \N(0, I)$:
$$\bx_t = (1 - t)\,\bx_0 + t\,\bx_1$$The conditional vector field along this interpolation is:
$$\boldsymbol{u}_t(\bx_t | \bx_0, \bx_1) = \bx_1 - \bx_0$$
That is, each pair $(\bx_0, \bx_1)$ is joined by a straight line traversed at constant speed. The Flow Matching loss is:
Conditional Flow Matching loss
$$\L_{\text{CFM}} = \E_{t \sim U[0,1],\, \bx_0 \sim p_0,\, \bx_1 \sim \N(0,I)}\left[\|\boldsymbol{v}_\theta(\bx_t, t) - (\bx_1 - \bx_0)\|^2\right]$$The network $\boldsymbol{v}_\theta$ only needs to predict the difference direction between data and noise. It is formally similar to DDPM's $\bepsilon$-prediction, but because the trajectory is straight, the discretization error is small even with few steps.
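The interpolation, the regression target, and Euler sampling can be sketched in a few functions. Here `oracle` is a hypothetical stand-in for a trained $\boldsymbol{v}_\theta$ that knows the pair $(\bx_0, \bx_1)$; it is used only to show that a straight trajectory is integrated exactly by a single Euler step.

```python
import random

def interpolate(x0, x1, t):
    """x_t = (1 - t) * x0 + t * x1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def cfm_target(x0, x1):
    """Conditional vector field u_t = x1 - x0 (independent of t)."""
    return [b - a for a, b in zip(x0, x1)]

def cfm_loss_sample(v_model, x0, rng):
    """One Monte Carlo sample of L_CFM for a single data point x0."""
    t = rng.random()
    x1 = [rng.gauss(0.0, 1.0) for _ in x0]
    pred = v_model(interpolate(x0, x1, t), t)
    return sum((p - u) ** 2 for p, u in zip(pred, cfm_target(x0, x1)))

def euler_sample(v_model, x1, n_steps):
    """Integrate dx/dt = v from t = 1 (noise) down to t = 0 (data)."""
    x, dt = list(x1), 1.0 / n_steps
    for i in range(n_steps):
        t = 1.0 - i * dt
        v = v_model(x, t)
        x = [xi - dt * vi for xi, vi in zip(x, v)]
    return x

rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]
x1 = [rng.gauss(0.0, 1.0) for _ in x0]
oracle = lambda x_t, t: cfm_target(x0, x1)    # hypothetical: knows the pair exactly
x_rec = euler_sample(oracle, x1, n_steps=1)
print(x_rec)   # matches x0 up to rounding
```

A trained network only sees $\bx_t$ and $t$, so its field is an average over all pairs passing through $\bx_t$ and the resulting trajectories are generally curved; this is what the reflow procedure aims to reduce.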
Rectified Flow (ReFlow)
After one round of Flow Matching training, one resamples pairs $(\hat{\bx}_0, \hat{\bx}_1)$ from the trained model's trajectories and reconnects them with straight lines; repeating this reflow procedure makes the trajectory even closer to a straight line and dramatically improves generation quality at 1–4 steps.
Concrete settings of SD 3.x
Flow Matching settings of SD 3.0 / 3.5
- Time $t \in [0, 1]$: $t = 0$ is data, $t = 1$ is noise
- SNR-based time sampling: sample $t$ from a logit-normal distribution (emphasizing intermediate steps)
- Sampler: solve the ODE with Euler's method (first order) or DPM-Solver (higher order)
- Number of steps: 28–50 steps is standard
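The logit-normal time sampling in the list above amounts to passing a Gaussian draw through a sigmoid. A minimal sketch; the location 0 and scale 1 are assumed defaults for illustration, not confirmed training settings.

```python
import math
import random

def sample_logit_normal(rng, mean=0.0, std=1.0):
    """Draw u ~ N(mean, std^2) and map it to (0, 1) with the logistic sigmoid."""
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))

rng = random.Random(0)
ts = [sample_logit_normal(rng) for _ in range(20000)]
frac_middle = sum(0.25 <= t <= 0.75 for t in ts) / len(ts)
print(f"fraction of samples in [0.25, 0.75]: {frac_middle:.2f}")
```

Uniform sampling would place half of the draws in $[0.25, 0.75]$; the logit-normal places noticeably more there, which is the emphasis on intermediate steps.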
The FLUX architecture
Black Forest Labs' FLUX (2024) further develops Rectified Flow + DiT:
- Single/Double stream blocks: the earlier blocks (Double) keep separate weights for image and text tokens and couple them through Joint Attention; the later blocks (Single) process the concatenated image and text sequence with one shared set of weights
- Rotary Position Embedding: applies RoPE to 2D image patches, supporting arbitrary aspect ratios
- Guidance distillation: FLUX.1-dev is distilled so that it generates high-quality images without running CFG at inference time; FLUX.1-schnell is additionally distilled for few-step sampling and runs in 1–4 steps
- CLIP-L + T5-XXL: drops OpenCLIP-bigG from SD 3.x's three encoders, using T5-XXL as the main encoder together with CLIP-L's pooled embedding (the exact configuration can vary by checkpoint and implementation)
Summary
- The score function $\nabla_\bx \log p(\bx)$ is the theoretical foundation of diffusion models and corresponds directly to noise prediction
- The SDE framework treats DDPM and SMLD uniformly and lets us derive the probability-flow ODE
- The DDPM loss is derived from the variational bound (ELBO) and simplifies to $\|\bepsilon - \bepsilon_\theta\|^2$
- DDIM is a discretization of the probability-flow ODE and generates deterministically in few steps
- Latent Diffusion (SD 1.x) performs diffusion in the VAE latent space, achieving a 48× reduction in dimensionality
- v-prediction (SD 2.x) predicts the angular parameter between noise and signal, improving numerical stability
- SDXL achieves high-resolution generation with dual text encoders, micro-conditioning, and a Refiner
- DiT / MM-DiT (SD 3.x) replaces the U-Net with a Transformer and introduces Joint Attention of image and text
- Flow Matching / Rectified Flow (SD 3.x, FLUX) learns an ODE connecting data and noise by straight lines, enabling few-step generation
- Classifier-Free Guidance is a linear combination of the conditional and unconditional scores, derived from Bayes' theorem
References
- Ho et al., "Denoising Diffusion Probabilistic Models" (DDPM, 2020)
- Song et al., "Denoising Diffusion Implicit Models" (DDIM, 2020)
- Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations" (2021)
- Rombach et al., "High-Resolution Image Synthesis with Latent Diffusion Models" (Stable Diffusion, 2022)
- Ho & Salimans, "Classifier-Free Diffusion Guidance" (2022)
- Nichol & Dhariwal, "Improved Denoising Diffusion Probabilistic Models" (2021)
- Vincent, "A Connection Between Score Matching and Denoising Autoencoders" (2011)
- Salimans & Ho, "Progressive Distillation for Fast Sampling of Diffusion Models" (v-prediction, 2022)
- Podell et al., "SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis" (2023)
- Peebles & Xie, "Scalable Diffusion Models with Transformers" (DiT, 2023)
- Esser et al., "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis" (SD 3.0, 2024)
- Lipman et al., "Flow Matching for Generative Modeling" (2023)
- Liu et al., "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow" (2023)
- Diffusion model - Wikipedia
- Stochastic differential equation - Wikipedia
Frequently Asked Questions
Q: How are diffusion models and score functions related?
A: The noise-prediction network $\bepsilon_\theta(\bx_t, t)$ is related to the score function by $\boldsymbol{s}_\theta = -\bepsilon_\theta / \sqrt{1-\bar{\alpha}_t}$. In other words, training a diffusion model is equivalent to score matching.
Q: Why does Stable Diffusion run diffusion in latent space?
A: Diffusing in the 512×512×3 pixel space (786,432 dimensions) is computationally enormous. Compressing to a 64×64×4 latent space (16,384 dimensions) with a VAE and then diffusing yields roughly a 48× reduction in dimensionality and a dramatic improvement in efficiency.
Q: Why can DDIM generate images in fewer steps?
A: DDIM corresponds to a discretization of the probability-flow ODE. Because this ODE is deterministic, it can be integrated on a much coarser time grid with only modest discretization error, so good images can be generated in 50–100 steps rather than 1000.