SDE vs ODE: Mathematical Foundations of Score-based Diffusion
Stochastic vs Deterministic. Same distribution, different paths.
TL;DR
- SDE (Stochastic DE): Probabilistic paths with noise, theoretical basis of DDPM
- ODE (Ordinary DE): Deterministic paths, basis of DDIM and Flow Matching
- Probability Flow ODE: An ODE with the same marginal distribution as SDE
- Key Difference: SDE = more diversity but slower sampling; ODE = deterministic and faster, but less diverse
1. Why Differential Equations?
The Essence of Diffusion
Diffusion models are transformations between two distributions:
- Forward: Data $p_{\text{data}}$ → Noise $\mathcal{N}(0, I)$
- Reverse: Noise $\mathcal{N}(0, I)$ → Data $p_{\text{data}}$
Modeling this transformation in continuous time gives us differential equations.
Discrete vs Continuous
DDPM (discrete):
$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x_t, t)\right) + \sigma_t z$$
Continuous-time SDE:
$$dx = f(x, t)dt + g(t)dw$$
The continuous-time view is more flexible and enables various sampler designs.
2. Forward SDE: From Data to Noise
Variance Preserving SDE (VP-SDE)
The continuous SDE corresponding to DDPM:
$$dx = -\frac{1}{2}\beta(t)x \, dt + \sqrt{\beta(t)} \, dw$$
Where:
- $\beta(t)$: noise schedule (noise intensity over time)
- $dw$: Wiener process (Brownian motion)
Variance Exploding SDE (VE-SDE)
The SDE corresponding to SMLD/NCSN:
$$dx = \sqrt{\frac{d[\sigma^2(t)]}{dt}} \, dw$$
Where $\sigma(t)$ is the noise scale increasing over time.
Solution of the Forward Process
For the VP-SDE, the perturbed sample at time $t$ has the closed form:
$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
where $\bar{\alpha}_t = e^{-\int_0^t \beta(s)\,ds}$.
This exactly matches DDPM's forward process!
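As a sanity check, here is a minimal PyTorch sketch of sampling $x_t$ directly from this closed form, assuming a linear schedule $\beta(t) = \beta_{\min} + t(\beta_{\max} - \beta_{\min})$ on $t \in [0, 1]$ (the schedule choice and constants are illustrative, not prescribed by the theory):

```python
import torch

def alpha_bar(t, beta_min=0.1, beta_max=20.0):
    # Closed form of exp(-integral of beta(s) ds from 0 to t) for the
    # linear schedule beta(s) = beta_min + s * (beta_max - beta_min).
    integral = beta_min * t + 0.5 * (beta_max - beta_min) * t ** 2
    return torch.exp(-integral)

def forward_sample(x0, t):
    # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps  (VP-SDE marginal)
    a = alpha_bar(t)
    eps = torch.randn_like(x0)
    return torch.sqrt(a) * x0 + torch.sqrt(1.0 - a) * eps, eps

x0 = torch.randn(8, 3, 32, 32)           # stand-in for a data batch
t = torch.rand(8).view(-1, 1, 1, 1)      # uniform times in [0, 1]
xt, eps = forward_sample(x0, t)
```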
3. Reverse SDE: From Noise to Data
Anderson's Theorem
A remarkable fact (Anderson, 1982): the time reversal of the forward SDE is itself an SDE!
Forward:
$$dx = f(x, t)dt + g(t)dw$$
Reverse:
$$dx = [f(x, t) - g(t)^2 \nabla_x \log p_t(x)]dt + g(t)d\bar{w}$$
Where:
- $\nabla_x \log p_t(x)$: Score function (the key!)
- $d\bar{w}$: Reverse-time Wiener process
What is the Score Function?
For a noised sample $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, the conditional score is:
$$\nabla_{x_t} \log p_t(x_t \mid x_0) = -\frac{\epsilon}{\sqrt{1-\bar{\alpha}_t}}$$
The score is the gradient of the log-density: from the current position it points toward regions of higher probability, i.e., toward the data.
Relationship between DDPM's noise prediction $\epsilon_\theta$ and score:
$$s_\theta(x_t, t) = -\frac{\epsilon_\theta(x_t, t)}{\sqrt{1-\bar{\alpha}_t}}$$
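In code the conversion is a one-liner; a sketch assuming a noise-prediction network `eps_model(x_t, t)` (a hypothetical callable) and a tensor-valued $\bar{\alpha}_t$:

```python
import torch

def score_from_eps(eps_model, x_t, t, alpha_bar_t):
    # s_theta(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - abar_t)
    return -eps_model(x_t, t) / torch.sqrt(1.0 - alpha_bar_t)
```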
4. Probability Flow ODE
The Key Discovery
A crucial finding by Song et al. (2021):
There exists a **deterministic ODE** with the **same marginal distribution** $p_t(x)$ as the SDE!
$$dx = \left[f(x, t) - \frac{1}{2}g(t)^2 \nabla_x \log p_t(x)\right]dt$$
The noise term $g(t)dw$ disappears; only the drift is modified.
Probability Flow ODE for VP-SDE
$$dx = \left[-\frac{1}{2}\beta(t)x - \frac{1}{2}\beta(t)\nabla_x \log p_t(x)\right]dt$$
Substituting score with $\epsilon_\theta$:
$$dx = \left[-\frac{1}{2}\beta(t)x + \frac{\beta(t)}{2\sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(x, t)\right]dt$$
This is identical to DDIM with $\eta=0$!
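A sketch of this drift in the $\epsilon_\theta$ parameterization, again assuming a hypothetical `eps_model` and tensor-valued schedule quantities:

```python
import torch

def pf_ode_drift(eps_model, x, t, beta_t, alpha_bar_t):
    # dx/dt = -1/2 beta(t) x + beta(t) / (2 sqrt(1 - abar_t)) * eps_theta(x, t)
    eps = eps_model(x, t)
    return -0.5 * beta_t * x + 0.5 * beta_t * eps / torch.sqrt(1.0 - alpha_bar_t)
```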
5. SDE vs ODE: Characteristic Comparison
Sampling Paths
SDE sampling follows jagged, noisy trajectories; Probability Flow ODE sampling follows smooth, deterministic ones. Both arrive at the same marginal distribution.
Mathematical Relationship
      SDE                         ODE
 z ~ N(0,I)                  z ~ N(0,I)
      │                           │
      ▼                           ▼
┌───────────┐              ┌─────────────┐
│  Reverse  │              │ Probability │
│    SDE    │              │  Flow ODE   │
└─────┬─────┘              └──────┬──────┘
      │                           │
      ▼                           ▼
 x ~ p_data                  x ~ p_data
Same marginal distribution, different paths!
DDPM vs DDIM
DDIM's $\eta$ parameter:
- $\eta = 0$: Pure ODE (deterministic)
- $\eta = 1$: Pure SDE (same as DDPM)
- $0 < \eta < 1$: Interpolation between the two (see the step sketch below)
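A minimal sketch of one sampling step exposing the $\eta$ knob, assuming discrete-time cumulative products `abar_t` and `abar_prev` (at the current and previous timestep) and a precomputed noise prediction `eps`:

```python
import torch

def ddim_step(x_t, eps, abar_t, abar_prev, eta=0.0):
    # Predicted clean sample x0_hat from the eps-parameterization.
    x0_pred = (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
    # sigma interpolates ODE <-> SDE: eta=0 gives DDIM (ODE), eta=1 gives DDPM.
    sigma = eta * torch.sqrt((1 - abar_prev) / (1 - abar_t)) \
                * torch.sqrt(1 - abar_t / abar_prev)
    z = torch.randn_like(x_t) if eta > 0 else torch.zeros_like(x_t)
    return (torch.sqrt(abar_prev) * x0_pred
            + torch.sqrt(1 - abar_prev - sigma ** 2) * eps
            + sigma * z)
```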
6. Score Matching: Learning the Score
Denoising Score Matching
The true score $\nabla_x \log p_t(x)$ is unavailable because $p_t$ is unknown. Denoising Score Matching instead regresses against the tractable conditional score, which (up to a time-dependent weighting) reduces to noise prediction:
$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon}\left[\|\epsilon - \epsilon_\theta(x_t, t)\|^2\right]$$
This is identical to DDPM's training objective!
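A sketch of this objective, assuming a hypothetical `eps_model(x_t, t)` and an `alpha_bar_fn` like the one defined earlier; shapes assume image batches:

```python
import torch

def dsm_loss(eps_model, x0, alpha_bar_fn):
    # Draw a time, perturb x0 to x_t, regress eps_theta(x_t, t) onto the true eps.
    t = torch.rand(x0.shape[0], device=x0.device).view(-1, 1, 1, 1)
    a = alpha_bar_fn(t)
    eps = torch.randn_like(x0)
    xt = torch.sqrt(a) * x0 + torch.sqrt(1 - a) * eps
    return ((eps - eps_model(xt, t)) ** 2).mean()
```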
Equivalence of Score and Noise Prediction
$$\text{Score: } s_\theta(x, t) \approx \nabla_x \log p_t(x)$$
$$\text{Noise: } \epsilon_\theta(x, t) \approx \epsilon$$
Relationship (VP case, $\sigma_t = \sqrt{1-\bar{\alpha}_t}$):
$$s_\theta = -\frac{\epsilon_\theta}{\sigma_t}$$
Thus noise prediction and score prediction are the same task; they differ only by a known time-dependent scale.
7. Numerical Solvers
SDE Solvers
Euler-Maruyama (most basic). Integrating the reverse SDE backward in time with step size $\Delta t > 0$:
$$x_{t-\Delta t} = x_t - \left[f(x_t, t) - g(t)^2 s_\theta(x_t, t)\right]\Delta t + g(t)\sqrt{\Delta t} \cdot z, \quad z \sim \mathcal{N}(0, I)$$
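A sketch of one such step for the VP-SDE, assuming hypothetical callables `score_fn(x, t)` and `beta_fn(t)`:

```python
import torch

def em_reverse_step(x, t, dt, score_fn, beta_fn):
    # One Euler-Maruyama step of the VP reverse SDE, moving from t to t - dt.
    beta = beta_fn(t)                      # beta(t)
    f = -0.5 * beta * x                    # forward drift f(x, t)
    drift = f - beta * score_fn(x, t)      # reverse drift: f - g^2 * s_theta
    z = torch.randn_like(x)
    return x - drift * dt + (beta * dt) ** 0.5 * z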
Predictor-Corrector (Song et al.):
- Predictor: Euler step
- Corrector: Refine with Langevin dynamics (sketched below)
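The corrector is a few steps of Langevin dynamics at fixed $t$; a sketch (the fixed step size is illustrative; Song et al. tune it via a signal-to-noise heuristic):

```python
import torch

def langevin_corrector(x, t, score_fn, step_size=1e-4, n_steps=2):
    # Refine x at fixed t: x <- x + delta * s_theta(x, t) + sqrt(2 delta) * z
    for _ in range(n_steps):
        z = torch.randn_like(x)
        x = x + step_size * score_fn(x, t) + (2 * step_size) ** 0.5 * z
    return x
```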
ODE Solvers
Euler (1st order), again stepping backward in time:
$$x_{t-\Delta t} = x_t - f(x_t, t)\Delta t$$
Heun (2nd order):
$$\tilde{x} = x_t - f(x_t, t)\Delta t$$
$$x_{t-\Delta t} = x_t - \frac{\Delta t}{2}\left[f(x_t, t) + f(\tilde{x}, t-\Delta t)\right]$$
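The same scheme in code, assuming a hypothetical `ode_drift(x, t)` (e.g., `pf_ode_drift` from earlier with the model and schedule bound in):

```python
def heun_reverse_step(x, t, dt, ode_drift):
    # Heun's method for dx/dt = f(x, t), stepping from t to t - dt.
    f1 = ode_drift(x, t)
    x_euler = x - dt * f1                  # Euler predictor
    f2 = ode_drift(x_euler, t - dt)        # slope at the predicted point
    return x - 0.5 * dt * (f1 + f2)        # trapezoidal (2nd-order) update
```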
DPM-Solver (specialized higher-order solver):
- Exploits structure of diffusion ODE
- High quality with 10-20 steps
Solver Comparison
In practice: Euler-Maruyama needs many small steps because each one injects fresh noise; plain Euler on the ODE is cheaper but only first-order accurate; Heun doubles the model evaluations per step in exchange for second-order accuracy; DPM-Solver exploits the semi-linear structure of the diffusion ODE to reach high quality in 10-20 steps.
8. Connection to Flow Matching
Conditional Flow Matching
Flow Matching is also ODE-based:
$$dx = v_\theta(x, t)dt$$
Differences:
- Diffusion ODE: Drift derived from score
- Flow Matching: Directly learn a velocity field (a training-loss sketch follows below)
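A sketch of the Conditional Flow Matching objective with linear (optimal-transport) conditional paths, assuming a hypothetical velocity network `v_model(x_t, t)`:

```python
import torch

def cfm_loss(v_model, data):
    # Conditional Flow Matching with linear (OT) paths:
    # x_t = (1 - t) * eps + t * x1, target velocity u = x1 - eps.
    eps = torch.randn_like(data)
    t = torch.rand(data.shape[0], device=data.device).view(-1, 1, 1, 1)
    xt = (1 - t) * eps + t * data
    u = data - eps
    return ((v_model(xt, t) - u) ** 2).mean()
```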
Same Result, Different Paths
Both transform $p_{\text{noise}} \to p_{\text{data}}$, but along different trajectories: the diffusion Probability Flow ODE follows curved paths dictated by the noise schedule, while Flow Matching with optimal-transport conditional paths learns nearly straight trajectories, which are easier to integrate in few steps.
9. Practical Selection Guide
When to Use SDE?
- When diversity is important
- When sufficient compute is available
- When stochastic refinement is needed (e.g., inpainting)
When to Use ODE?
- When speed is important
- When deterministic results are needed (reproducibility)
- When latent interpolation is needed
Choices of Modern Models
Most modern large-scale models default to ODE samplers (DDIM, DPM-Solver) for speed and reproducibility, and fall back to SDE-style stochastic sampling when extra diversity or stochastic refinement is worth the additional steps.
10. Advanced Topics
Continuous Normalizing Flows (CNF)
From the ODE perspective, diffusion is a type of Normalizing Flow:
$$\log p_0(x_0) = \log p_T(x_T) + \int_0^T \text{div}(f(x_t, t))\, dt$$
This enables likelihood computation as well.
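Computing $\text{div}(f)$ exactly costs one backward pass per dimension, so in practice it is estimated with Hutchinson's trick; a sketch, assuming `f(x, t)` is a differentiable drift returning a tensor shaped like `x`:

```python
import torch

def hutchinson_divergence(f, x, t):
    # Unbiased estimate of div f = tr(df/dx) via E_v[v^T (df/dx) v],
    # with v a Rademacher vector; one backward pass instead of d.
    x = x.detach().requires_grad_(True)
    v = torch.randint_like(x, 2) * 2 - 1          # entries in {-1, +1}
    fx = f(x, t)
    (vjp,) = torch.autograd.grad(fx, x, grad_outputs=v)
    return (vjp * v).flatten(1).sum(dim=1)        # per-sample estimate
```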
Optimal Transport Perspective
Probability Flow ODE connects to Optimal Transport:
- "Shortest path" between two distributions
- Related to Wasserstein distance
Guidance in SDE vs ODE
Classifier-Free Guidance applies to both SDE and ODE sampling. With condition $c$ and guidance scale $w$:
$$\tilde{s}(x, t) = s(x, t) + w \cdot (s(x, t \mid c) - s(x, t))$$
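A sketch, assuming a hypothetical `score_fn(x, t, c)` that accepts `None` for the unconditional branch:

```python
def guided_score(score_fn, x, t, c, w):
    # s_tilde = s(x, t) + w * (s(x, t | c) - s(x, t)); w = 0 is unconditional.
    s_uncond = score_fn(x, t, None)   # condition dropped
    s_cond = score_fn(x, t, c)        # conditioned on c
    return s_uncond + w * (s_cond - s_uncond)
```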
Conclusion
Key Insight: SDE and ODE solve the same problem in different ways. Thanks to Probability Flow ODE, we can maintain the theoretical advantages of SDE while gaining the practical benefits of ODE.
References
- Song, Y., et al. "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021)
- Ho, J., et al. "Denoising Diffusion Probabilistic Models" (NeurIPS 2020)
- Song, J., et al. "Denoising Diffusion Implicit Models" (ICLR 2021)
- Lipman, Y., et al. "Flow Matching for Generative Modeling" (ICLR 2023)
- Lu, C., et al. "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling" (NeurIPS 2022)