SDE vs ODE: Mathematical Foundations of Score-based Diffusion
Stochastic vs Deterministic. Same distribution, different paths.
TL;DR
- SDE (Stochastic DE): Probabilistic paths with noise, theoretical basis of DDPM
- ODE (Ordinary DE): Deterministic paths, basis of DDIM and Flow Matching
- Probability Flow ODE: An ODE with the same marginal distribution as SDE
- Key Difference: SDE = more diversity but slower sampling; ODE = deterministic and faster, but less diverse
1. Why Differential Equations?
The Essence of Diffusion
Diffusion models are transformations between two distributions:
- Forward: Data $p_{\text{data}}$ → Noise $\mathcal{N}(0, I)$
- Reverse: Noise $\mathcal{N}(0, I)$ → Data $p_{\text{data}}$
Modeling this transformation in continuous time gives us differential equations.
Discrete vs Continuous
DDPM (discrete):
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sigma_t z$$
Continuous-time SDE:
$$dx = f(x, t)dt + g(t)dw$$
The continuous-time view is more flexible and enables various sampler designs.
2. Forward SDE: From Data to Noise
Variance Preserving SDE (VP-SDE)
The continuous SDE corresponding to DDPM:
$$dx = -\frac{1}{2}\beta(t)x \, dt + \sqrt{\beta(t)} \, dw$$
Where:
- $\beta(t)$: noise schedule (noise intensity over time)
- $dw$: Wiener process (Brownian motion)
Variance Exploding SDE (VE-SDE)
The SDE corresponding to SMLD/NCSN:
$$dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} \, dw$$
Where $\sigma(t)$ is the noise scale increasing over time.
Solution of the Forward Process
For the VP-SDE, the perturbed sample at time $t$ has the closed form:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
where $\bar{\alpha}_t = e^{-\int_0^t \beta(s)\,ds}$.
This exactly matches DDPM's forward process!
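As a sanity check, here is a minimal PyTorch sketch of sampling $x_t$ directly from this closed form, assuming a linear schedule $\beta(t) = \beta_{\min} + t(\beta_{\max} - \beta_{\min})$ on $t \in [0, 1]$ (the schedule choice and constants are illustrative, not prescribed by the theory):

```python
import torch

def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    # Closed form of exp(-integral of beta(s) ds from 0 to t) for the
    # linear schedule beta(s) = beta_min + s * (beta_max - beta_min).
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    return torch.exp(-integral)

def forward_sample(x0, t):
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps  (VP-SDE marginal)
    a = alpha_bar(t)
    eps = torch.randn_like(x0)
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps, eps

x0 = torch.randn(8, 3, 32, 32)           # stand-in for a data batch
t = torch.rand(8).view(-1, 1, 1, 1)      # uniform times in [0, 1]
xt, eps = forward_sample(x0, t)
```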
3. Reverse SDE: From Noise to Data
Anderson's Theorem
A remarkable fact (Anderson, 1982): the time reversal of the forward SDE is itself an SDE!
Forward:
$$dx = f(x, t)dt + g(t)dw$$
Reverse:
$$dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)]dt + g(t)d\bar{w}$$
Where:
- $\nabla_x \log p_t(x)$: Score function (the key!)
- $d\bar{w}$: Reverse-time Wiener process
What is the Score Function?
For a noised sample $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, the conditional score is:
$$\nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$
The score is the gradient of the log-density: from the current position it points toward regions of higher probability, i.e., toward the data.
Relationship between DDPM's noise prediction $\epsilon_\theta$ and score:
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
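In code the conversion is a one-liner; a sketch assuming a noise-prediction network `eps_model(x_t, t)` (a hypothetical callable) and a tensor-valued $\bar{\alpha}_t$:

```python
import torch

def score_from_eps(eps_model, x_t, t, alpha_bar_t):
    # s_theta(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - abar_t)
    return -eps_model(x_t, t) / torch.sqrt(1.0 - alpha_bar_t)
```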
4. Probability Flow ODE
The Key Discovery
A crucial finding by Song et al. (2021):
There exists a **deterministic ODE** with the **same marginal distribution** $p_t(x)$ as the SDE!
$$dx = \left[f(x, t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)\right]dt$$
The noise term $g(t)dw$ disappears; only the drift is modified.
Probability Flow ODE for VP-SDE
$$dx = \left[-\frac{1}{2}\beta(t)x - \frac{1}{2}\beta(t)\nabla_x \log p_t(x)\right]dt$$
Substituting score with $\epsilon_\theta$:
$$dx = \left[-\frac{1}{2}\beta(t)x + \frac{\beta(t)}{2\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x, t)\right]dt$$
This is identical to DDIM with $\eta=0$!
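A sketch of this drift in the $\epsilon_\theta$ parameterization, again assuming a hypothetical `eps_model` and tensor-valued schedule quantities:

```python
import torch

def pf_ode_drift(eps_model, x, t, beta_t, alpha_bar_t):
    # dx/dt = -1/2 beta(t) x + beta(t) / (2 sqrt(1 - abar_t)) * eps_theta(x, t)
    eps = eps_model(x, t)
    return -0.5 * beta_t * x + 0.5 * beta_t * eps / torch.sqrt(1.0 - alpha_bar_t)
```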
5. SDE vs ODE: Characteristic Comparison
Sampling Paths
SDE sampling follows jagged, noisy trajectories; Probability Flow ODE sampling follows smooth, deterministic ones. Both arrive at the same marginal distribution.
Mathematical Relationship
      SDE                         ODE
 z ~ N(0,I)                  z ~ N(0,I)
      │                           │
      ▼                           ▼
┌───────────┐              ┌─────────────┐
│  Reverse  │              │ Probability │
│    SDE    │              │  Flow ODE   │
└─────┬─────┘              └──────┬──────┘
      │                           │
      ▼                           ▼
 x ~ p_data                  x ~ p_data
Same marginal distribution, different paths!
DDPM vs DDIM
DDIM's $\eta$ parameter:
- $\eta = 0$: Pure ODE (deterministic)
- $\eta = 1$: Pure SDE (same as DDPM)
- $0 < \eta < 1$: Interpolation between the two (see the step sketch below)
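A minimal sketch of one sampling step exposing the $\eta$ knob, assuming discrete-time cumulative products `abar_t` and `abar_prev` (at the current and previous timestep) and a precomputed noise prediction `eps`:

```python
import torch

def ddim_step(x_t, eps, abar_t, abar_prev, eta=0.0):
    # Predicted clean sample x0_hat from the eps-parameterization.
    x0_pred = (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
    # sigma interpolates ODE <-> SDE: eta=0 gives DDIM (ODE), eta=1 gives DDPM.
    sigma = eta * torch.sqrt((1 - abar_prev) / (1 - abar_t)) \
                * torch.sqrt(1 - abar_t / abar_prev)
    z = torch.randn_like(x_t) if eta > 0 else torch.zeros_like(x_t)
    return (torch.sqrt(abar_prev) * x0_pred
            + torch.sqrt(1 - abar_prev - sigma ** 2) * eps
            + sigma * z)
```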
6. Score Matching: Learning the Score
Denoising Score Matching
The true score $\nabla_x \log p_t(x)$ is unavailable because $p_t$ is unknown. Denoising Score Matching instead regresses against the tractable conditional score, which (up to a time-dependent weighting) reduces to noise prediction:
$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
This is identical to DDPM's training objective!
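A sketch of this objective, assuming a hypothetical `eps_model(x_t, t)` and an `alpha_bar_fn` like the one defined earlier; shapes assume image batches:

```python
import torch

def dsm_loss(eps_model, x0, alpha_bar_fn):
    # Draw a time, perturb x0 to x_t, regress eps_theta(x_t, t) onto the true eps.
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    a = alpha_bar_fn(t)
    eps = torch.randn_like(x0)
    xt = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * eps
    return ((eps - eps_model(xt, t)) ** 2).mean()
```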
Equivalence of Score and Noise Prediction
$$\text{Score: } s_\theta(x, t) \approx \nabla_x \log p_t(x)$$
$$\text{Noise: } \epsilon_\theta(x, t) \approx \epsilon$$
Relationship (VP case, $\sigma_t = \sqrt{1-\bar{\alpha}_t}$):
$$s_\theta = -\frac{\epsilon_\theta}{\sigma_t}$$
Thus noise prediction and score prediction are the same task; they differ only by a known time-dependent scale.
7. Numerical Solvers
SDE Solvers
Euler-Maruyama (most basic). Integrating the reverse SDE backward in time with step size $\Delta t > 0$:
$$x_{t-\Delta t} = x_t - \left[f(x_t, t) - g(t)^2 s_\theta(x_t, t)\right]\Delta t + g(t)\sqrt{\Delta t} \cdot z, \quad z \sim \mathcal{N}(0, I)$$
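A sketch of one such step for the VP-SDE, assuming hypothetical callables `score_fn(x, t)` and `beta_fn(t)`:

```python
import torch

def em_reverse_step(x, t, dt, score_fn, beta_fn):
    # One Euler-Maruyama step of the VP reverse SDE, moving from t to t - dt.
    beta = beta_fn(t)                      # beta(t)
    f = -0.5 * beta * x                    # forward drift f(x, t)
    drift = f - beta * score_fn(x, t)      # reverse drift: f - g^2 * s_theta
    z = torch.randn_like(x)
    return x - drift * dt + (beta * dt) ** 0.5 * z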
Predictor-Corrector (Song et al.):
- Predictor: Euler step
- Corrector: Refine with Langevin dynamics (sketched below)
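The corrector is a few steps of Langevin dynamics at fixed $t$; a sketch (the fixed step size is illustrative; Song et al. tune it via a signal-to-noise heuristic):

```python
import torch

def langevin_corrector(x, t, score_fn, step_size=1e-4, n_steps=2):
    # Refine x at fixed t: x <- x + delta * s_theta(x, t) + sqrt(2 delta) * z
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + step_size * score_fn(x, t) + (2 * step_size) ** 0.5 * z
    return x
```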
ODE Solvers
Euler (1st order), again stepping backward in time:
$$x_{t-\Delta t} = x_t - f(x_t, t)\Delta t$$
Heun (2nd order):
$$\tilde{x} = x_t - f(x_t, t)\Delta t$$
$$x_{t-\Delta t} = x_t - \frac{\Delta t}{2}\left[f(x_t, t) + f(\tilde{x}, t-\Delta t)\right]$$
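The same scheme in code, assuming a hypothetical `ode_drift(x, t)` (e.g., `pf_ode_drift` from earlier with the model and schedule bound in):

```python
def heun_reverse_step(x, t, dt, ode_drift):
    # Heun's method for dx/dt = f(x, t), stepping from t to t - dt.
    f1 = ode_drift(x, t)
    x_euler = x - dt * f1                  # Euler predictor
    f2 = ode_drift(x_euler, t - dt)        # slope at the predicted point
    return x - 0.5 * dt * (f1 + f2)        # trapezoidal (2nd-order) update
```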
DPM-Solver (specialized higher-order solver):
- Exploits structure of diffusion ODE
- High quality with 10-20 steps
Solver Comparison
In practice: Euler-Maruyama needs many small steps because each one injects fresh noise; plain Euler on the ODE is cheaper but only first-order accurate; Heun doubles the model evaluations per step in exchange for second-order accuracy; DPM-Solver exploits the semi-linear structure of the diffusion ODE to reach high quality in 10-20 steps.
8. Connection to Flow Matching
Conditional Flow Matching
Flow Matching is also ODE-based:
$$dx = v_\theta(x, t)dt$$
Differences:
- Diffusion ODE: Drift derived from score
- Flow Matching: Directly learn a velocity field (a training-loss sketch follows below)
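A sketch of the Conditional Flow Matching objective with linear (optimal-transport) conditional paths, assuming a hypothetical velocity network `v_model(x_t, t)`:

```python
import torch

def cfm_loss(v_model, data):
    # Conditional Flow Matching with linear (OT) paths:
    # x_t = (1 - t) * eps + t * x1, target velocity u = x1 - eps.
    eps = torch.randn_like(data)
    t = torch.rand(data.shape[0], device=data.device).view(-1, 1, 1, 1)
    xt = (1 - t) * eps + t * data
    u = data - eps
    return ((v_model(xt, t) - u) ** 2).mean()
```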
Same Result, Different Paths
Both transform $p_{\text{noise}} \to p_{\text{data}}$, but along different trajectories: the diffusion Probability Flow ODE follows curved paths dictated by the noise schedule, while Flow Matching with optimal-transport conditional paths learns nearly straight trajectories, which are easier to integrate in few steps.
9. Practical Selection Guide
When to Use SDE?
- When diversity is important
- When sufficient compute is available
- When stochastic refinement is needed (e.g., inpainting)
When to Use ODE?
- When speed is important
- When deterministic results are needed (reproducibility)
- When latent interpolation is needed
Choices of Modern Models
Most modern large-scale models default to ODE samplers (DDIM, DPM-Solver) for speed and reproducibility, and fall back to SDE-style stochastic sampling when extra diversity or stochastic refinement is worth the additional steps.
10. Advanced Topics
Continuous Normalizing Flows (CNF)
From the ODE perspective, diffusion is a type of Normalizing Flow:
$$\log p_0(x_0) = \log p_T(x_T) + \int_0^T \text{div}(f(x_t, t))\, dt$$
This enables likelihood computation as well.
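Computing $\text{div}(f)$ exactly costs one backward pass per dimension, so in practice it is estimated with Hutchinson's trick; a sketch, assuming `f(x, t)` is a differentiable drift returning a tensor shaped like `x`:

```python
import torch

def hutchinson_divergence(f, x, t):
    # Unbiased estimate of div f = tr(df/dx) via E_v[v^T (df/dx) v],
    # with v a Rademacher vector; one backward pass instead of d.
    x = x.detach().requires_grad_(True)
    v = torch.randint_like(x, 2) * 2 - 1          # entries in {-1, +1}
    fx = f(x, t)
    (vjp,) = torch.autograd.grad(fx, x, grad_outputs=v)
    return (vjp * v).flatten(1).sum(dim=1)        # per-sample estimate
```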
Optimal Transport Perspective
Probability Flow ODE connects to Optimal Transport:
- "Shortest path" between two distributions
- Related to Wasserstein distance
Guidance in SDE vs ODE
Classifier-Free Guidance applies to both SDE and ODE sampling. With condition $c$ and guidance scale $w$:
$$\tilde{s}(x, t) = s(x, t) + w \cdot (s(x, t \mid c) - s(x, t))$$
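A sketch, assuming a hypothetical `score_fn(x, t, c)` that accepts `None` for the unconditional branch:

```python
def guided_score(score_fn, x, t, c, w):
    # s_tilde = s(x, t) + w * (s(x, t | c) - s(x, t)); w = 0 is unconditional.
    s_uncond = score_fn(x, t, None)   # condition dropped
    s_cond = score_fn(x, t, c)        # conditioned on c
    return s_uncond + w * (s_cond - s_uncond)
```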
Conclusion
Key Insight: SDE and ODE solve the same problem in different ways. Thanks to Probability Flow ODE, we can maintain the theoretical advantages of SDE while gaining the practical benefits of ODE.
References
- Song, Y., et al. "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021)
- Ho, J., et al. "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
- Song, J., et al. "Denoising Diffusion Implicit Models" (ICLR 2021)
- Lipman, Y., et al. "Flow Matching for Generative Modeling" (ICLR 2023)
- Lu, C., et al. "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022)