DDPM Math Walkthrough: Deriving Forward/Reverse Process Step by Step

DDPM: The Beginning of Diffusion Models - Images Born from Noise
TL;DR: DDPM (Denoising Diffusion Probabilistic Model) progressively adds noise to images, then learns to reverse this process to generate images from noise. It's a revolutionary methodology that is mathematically rigorous while producing high-quality images.
1. What is a Diffusion Model?
1.1 History of Generative Models
DDPM sits in a longer lineage of deep generative models: it builds on the diffusion framework of Sohl-Dickstein et al. (2015), is closely related to score-based models (Song & Ermon, 2019), and competes with GANs and VAEs on sample quality.
1.2 Core Idea
The intuition behind DDPM:
"If we learn the process of turning an image into pure noise, we can reverse that process to generate images from noise"
Forward Process (Adding Noise):
x₀ (original image) → x₁ → x₂ → ... → x_T (pure noise)

Reverse Process (Removing Noise):
x_T (pure noise) → x_{T-1} → ... → x₁ → x₀ (generated image)
1.3 Why "Diffusion"?
Physical Analogy: Think of ink diffusing in water.
- Forward: Ink spreads in water becoming uniform → Image becomes noise
- Reverse: Reversing this process → Image coalesces from noise
Probabilistic Interpretation: Connecting complex data distribution $p_{data}(x)$ to simple distribution $\mathcal{N}(0, I)$
2. Mathematical Foundations
2.1 Forward Process (Diffusion)
Starting from data $x_0$, progressively add noise:
$$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t} x_{t-1}, \beta_t I)$$
Where:
- $\beta_t$: variance schedule (typically $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$)
- $T$: total number of diffusion steps (typically 1000)
Full forward process:
$$q(x_{1:T} | x_0) = \prod_{t=1}^{T} q(x_t | x_{t-1})$$
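To get a feel for the schedule, here is a small numerical sketch (using the typical values quoted above: $T = 1000$, $\beta_1 = 10^{-4}$, $\beta_T = 0.02$). It tracks the cumulative product $\prod_s (1-\beta_s)$ (written $\bar{\alpha}_t$ in the next subsection), which controls how much of the original signal survives at step $t$:
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear schedule beta_1 ... beta_T
alphas = 1 - betas
alpha_bars = torch.cumprod(alphas, dim=0)       # alpha_bar_t = prod_{s<=t} alpha_s

# Signal and noise coefficients at a few timesteps
for t in [1, 100, 500, 1000]:
    ab = alpha_bars[t - 1].item()
    print(f"t={t:4d}  alpha_bar={ab:.5f}  signal={ab**0.5:.4f}  noise_std={(1 - ab)**0.5:.4f}")
# By t=1000, alpha_bar is roughly 4e-5, so x_T is essentially pure N(0, I) noise.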
2.2 Key Insight: Direct Jump to Any Timestep
Defining $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$:
$$q(x_t | x_0) = \mathcal{N}(x_t; \sqrt{\bar{\alpha}_t} x_0, (1 - \bar{\alpha}_t) I)$$
Meaning: We can go directly from $x_0$ to $x_t$!
$$x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)$$
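This follows by unrolling the forward update one step and merging the two independent Gaussian noise terms; repeating the argument down to $x_0$ gives the closed form:
$$\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
    &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\epsilon_{t-2}\right) + \sqrt{1-\alpha_t}\,\epsilon_{t-1} \\
    &= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t \alpha_{t-1}}\,\bar{\epsilon} \\
    &\;\;\vdots \\
    &= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon
\end{aligned}$$
The third line uses the fact that the sum of two independent zero-mean Gaussians with variances $\alpha_t(1-\alpha_{t-1})$ and $1-\alpha_t$ is again Gaussian, with variance $1-\alpha_t\alpha_{t-1}$.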
2.3 Reverse Process (Denoising)
Reversing the forward process:
$$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$
Key Question: How do we learn $\mu_\theta$ and $\Sigma_\theta$? (In DDPM, $\Sigma_\theta$ is fixed to $\sigma_t^2 I$, so only the mean needs to be learned.)
2.4 ELBO (Evidence Lower Bound)
Variational bound on log likelihood:
$$\log p_\theta(x_0) \geq \mathbb{E}_{q} \left[ \log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)} \right]$$
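This bound follows directly from Jensen's inequality applied to the marginal likelihood:
$$\log p_\theta(x_0) = \log \int p_\theta(x_{0:T})\, dx_{1:T} = \log \mathbb{E}_{q(x_{1:T}|x_0)}\!\left[\frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right] \geq \mathbb{E}_{q(x_{1:T}|x_0)}\!\left[\log \frac{p_\theta(x_{0:T})}{q(x_{1:T}|x_0)}\right]$$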
Expanding this bound over the Markov structure of both processes gives:
$$\mathcal{L} = \underbrace{\mathbb{E}_q[\log p_\theta(x_0|x_1)]}_{\text{reconstruction}} - \underbrace{D_{KL}(q(x_T|x_0) || p(x_T))}_{\text{prior matching}} - \underbrace{\sum_{t=2}^{T} \mathbb{E}_q[D_{KL}(q(x_{t-1}|x_t,x_0) || p_\theta(x_{t-1}|x_t))]}_{\text{denoising matching}}$$
2.5 Posterior Computation
Using Bayes' rule:
$$q(x_{t-1} | x_t, x_0) = \mathcal{N}(x_{t-1}; \tilde{\mu}_t(x_t, x_0), \tilde{\beta}_t I)$$
Where:
$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}} \beta_t}{1 - \bar{\alpha}_t} x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} x_t$$
$$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t$$
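These closed-form expressions can be sanity-checked numerically. Below is a small Monte Carlo sketch (the scalar $x_0$, the timestep $t = 10$, and the bin width are arbitrary illustrative choices): it simulates the forward chain for a scalar "image", conditions on $x_t$ landing in a narrow bin, and compares the empirical mean and variance of $x_{t-1}$ with $\tilde{\mu}_t$ and $\tilde{\beta}_t$; the two should agree closely.
import torch

# Linear schedule as in Section 2.1
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

t = 10                                      # arbitrary small timestep (1-indexed)
x0, xt_target, halfwidth = 0.7, 0.72, 0.002
alpha_t, beta_t = alphas[t - 1], betas[t - 1]
ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]

# Simulate the forward chain x_0 -> x_{t-1} -> x_t many times for a scalar "image"
N = 2_000_000
x_prev = ab_prev.sqrt() * x0 + (1 - ab_prev).sqrt() * torch.randn(N)
x_t = alpha_t.sqrt() * x_prev + beta_t.sqrt() * torch.randn(N)

# Condition on x_t falling in a narrow bin around xt_target
mask = (x_t - xt_target).abs() < halfwidth
emp_mean, emp_var = x_prev[mask].mean().item(), x_prev[mask].var().item()

# Closed-form posterior parameters q(x_{t-1} | x_t, x_0)
mu_tilde = (ab_prev.sqrt() * beta_t / (1 - ab_t)) * x0 \
         + (alpha_t.sqrt() * (1 - ab_prev) / (1 - ab_t)) * xt_target
beta_tilde = (1 - ab_prev) / (1 - ab_t) * beta_t

print(f"mean: empirical {emp_mean:.4f}  vs  formula {mu_tilde.item():.4f}")
print(f"var:  empirical {emp_var:.6f}  vs  formula {beta_tilde.item():.6f}")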
3. Reparameterization to Noise Prediction
3.1 Key Insight
Since $x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1 - \bar{\alpha}_t} \epsilon$:
$$x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t} \epsilon}{\sqrt{\bar{\alpha}_t}}$$
Therefore, instead of predicting $x_0$, predict $\epsilon$:
$$\tilde{\mu}_t = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon_\theta(x_t, t) \right)$$
3.2 Simplified Loss
The simple loss proposed by Ho et al.:
$$\mathcal{L}_{simple} = \mathbb{E}_{t, x_0, \epsilon} \left[ || \epsilon - \epsilon_\theta(x_t, t) ||^2 \right]$$
Interpretation: Train the network to predict the added noise
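For completeness, here is the step connecting the ELBO's denoising terms to this objective (assuming, as in DDPM, that $\Sigma_\theta$ is fixed to $\sigma_t^2 I$). The KL between two Gaussians with shared variance is a scaled squared distance between their means, and substituting the $\epsilon$-parameterization of $\tilde{\mu}_t$ and $\mu_\theta$ gives a weighted noise-prediction loss:
$$\mathbb{E}_q\!\left[D_{KL}\!\left(q(x_{t-1}|x_t,x_0)\,\|\,p_\theta(x_{t-1}|x_t)\right)\right] = \mathbb{E}_{x_0,\epsilon}\!\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t,x_0)-\mu_\theta(x_t,t)\right\|^2\right] + C = \mathbb{E}_{x_0,\epsilon}\!\left[\frac{\beta_t^2}{2\sigma_t^2\,\alpha_t(1-\bar{\alpha}_t)}\left\|\epsilon-\epsilon_\theta(x_t,t)\right\|^2\right] + C$$
where $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ and $C$ is independent of $\theta$. $\mathcal{L}_{simple}$ keeps only the $\|\epsilon - \epsilon_\theta\|^2$ term with unit weight, which Ho et al. report also improves sample quality.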
3.3 Training Algorithm
def train_step(model, x_0):
    # 1. Sample a random timestep for each example in the batch
    t = torch.randint(1, T + 1, (x_0.shape[0],), device=x_0.device)
    # 2. Sample Gaussian noise
    epsilon = torch.randn_like(x_0)
    # 3. Compute x_t in closed form (add noise directly at level t)
    alpha_bar_t = get_alpha_bar(t).view(-1, 1, 1, 1)
    x_t = torch.sqrt(alpha_bar_t) * x_0 + torch.sqrt(1 - alpha_bar_t) * epsilon
    # 4. Predict the noise that was added
    epsilon_pred = model(x_t, t)
    # 5. Simple MSE loss between true and predicted noise
    loss = F.mse_loss(epsilon_pred, epsilon)
    return loss
4. Sampling Algorithm
4.1 Basic Sampling
@torch.no_grad()
def sample(model, shape):
    # Start from pure noise: x_T ~ N(0, I)
    x = torch.randn(shape)
    for t in reversed(range(1, T + 1)):
        # Predict the noise component of x_t
        epsilon_pred = model(x, t)
        # Compute the mean mu_theta under the epsilon-parameterization
        alpha_t = get_alpha(t)
        alpha_bar_t = get_alpha_bar(t)
        beta_t = get_beta(t)
        mu = (1 / sqrt(alpha_t)) * (x - (beta_t / sqrt(1 - alpha_bar_t)) * epsilon_pred)
        # Add noise (only when t > 1)
        if t > 1:
            # sigma_t^2 = beta_t is used here; Ho et al. note sigma_t^2 = beta_tilde_t works similarly
            sigma = sqrt(beta_t)
            x = mu + sigma * torch.randn_like(x)
        else:
            x = mu
    return x
4.2 Variance Schedule
Linear Schedule (Original DDPM):
$$\beta_t = \beta_1 + \frac{t-1}{T-1}(\beta_T - \beta_1)$$
Cosine Schedule (Improved DDPM):
$$\bar{\alpha}_t = \frac{f(t)}{f(0)}, \quad f(t) = \cos\left(\frac{t/T + s}{1 + s} \cdot \frac{\pi}{2}\right)^2$$
def cosine_beta_schedule(timesteps, s=0.008):
    steps = timesteps + 1
    x = torch.linspace(0, timesteps, steps)
    alphas_cumprod = torch.cos(((x / timesteps) + s) / (1 + s) * torch.pi * 0.5) ** 2
    alphas_cumprod = alphas_cumprod / alphas_cumprod[0]
    betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
    return torch.clip(betas, 0.0001, 0.9999)
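A quick comparison of how fast the two schedules destroy signal (a small sketch; it reuses cosine_beta_schedule above and the linear schedule from Section 2.1):
import torch

T = 1000
betas_linear = torch.linspace(1e-4, 0.02, T)
ab_linear = torch.cumprod(1 - betas_linear, dim=0)
ab_cosine = torch.cumprod(1 - cosine_beta_schedule(T), dim=0)

# alpha_bar_t controls how much of x_0 survives at step t
for t in [100, 250, 500, 750, 1000]:
    print(f"t={t:4d}  linear alpha_bar={ab_linear[t-1].item():.4f}  "
          f"cosine alpha_bar={ab_cosine[t-1].item():.4f}")
# The cosine schedule keeps noticeably more signal at mid timesteps and avoids the
# long stretch of near-zero alpha_bar that the linear schedule produces toward t = T.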
5. U-Net Architecture
5.1 Overall Structure
The noise prediction network $\epsilon_\theta$ in DDPM uses a U-Net structure: a convolutional encoder-decoder with skip connections between matching resolutions, conditioned on the timestep $t$ through a sinusoidal time embedding, with self-attention at low resolutions.
5.2 Time Embedding
Injecting timestep $t$ into the network:
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim

    def forward(self, t):
        # Standard transformer-style sinusoidal embedding of the timestep
        half_dim = self.dim // 2
        embeddings = math.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=t.device) * -embeddings)
        embeddings = t[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings
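A quick shape check (illustrative; the dimension 64 matches model_channels used later):
emb = SinusoidalPositionEmbeddings(dim=64)
t = torch.randint(1, 1001, (8,))   # a batch of 8 random timesteps
print(emb(t).shape)                # torch.Size([8, 64])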
5.3 ResNet Block with Time Conditioning
class ResBlock(nn.Module):
    def __init__(self, in_channels, out_channels, time_emb_dim):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.norm1 = nn.GroupNorm(8, out_channels)
        self.norm2 = nn.GroupNorm(8, out_channels)
        if in_channels != out_channels:
            self.shortcut = nn.Conv2d(in_channels, out_channels, 1)
        else:
            self.shortcut = nn.Identity()

    def forward(self, x, t_emb):
        h = self.conv1(x)
        h = self.norm1(h)
        h = h + self.time_mlp(t_emb)[:, :, None, None]  # Time conditioning
        h = F.silu(h)
        h = self.conv2(h)
        h = self.norm2(h)
        h = F.silu(h)
        return h + self.shortcut(x)
5.4 Self-Attention in U-Net
class SelfAttention(nn.Module):
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.mha = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ln = nn.LayerNorm(channels)

    def forward(self, x):
        b, c, h, w = x.shape
        x = x.view(b, c, h * w).transpose(1, 2)  # (B, H*W, C)
        x_ln = self.ln(x)
        attn_out, _ = self.mha(x_ln, x_ln, x_ln)
        x = x + attn_out
        return x.transpose(1, 2).view(b, c, h, w)
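The U-Net below also references Downsample and Upsample modules that are not defined in the article. A minimal sketch of common choices (strided convolution for downsampling, nearest-neighbor interpolation plus convolution for upsampling); these helper definitions are assumptions, not the article's original code:
class Downsample(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Halve spatial resolution with a strided convolution
        self.conv = nn.Conv2d(channels, channels, 3, stride=2, padding=1)

    def forward(self, x):
        return self.conv(x)


class Upsample(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        # Double spatial resolution, then smooth with a convolution
        x = F.interpolate(x, scale_factor=2, mode="nearest")
        return self.conv(x)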
6. Complete Implementation
6.1 Full U-Net Code
class UNet(nn.Module):
    def __init__(self, in_channels=3, model_channels=64, out_channels=3,
                 channel_mult=(1, 2, 4, 8)):
        super().__init__()
        time_emb_dim = model_channels * 4
        self.time_embed = nn.Sequential(
            SinusoidalPositionEmbeddings(model_channels),
            nn.Linear(model_channels, time_emb_dim),
            nn.SiLU(),
            nn.Linear(time_emb_dim, time_emb_dim),
        )
        # Initial convolution: image -> model_channels feature map
        self.init_conv = nn.Conv2d(in_channels, model_channels, 3, padding=1)

        # Encoder: two ResBlocks per resolution, then downsample
        self.encoder = nn.ModuleList()
        ch = model_channels
        for level, mult in enumerate(channel_mult):
            for _ in range(2):
                self.encoder.append(ResBlock(ch, model_channels * mult, time_emb_dim))
                ch = model_channels * mult
            if level != len(channel_mult) - 1:
                self.encoder.append(Downsample(ch))

        # Bottleneck (self-attention is applied only here in this simplified version)
        self.bottleneck = nn.ModuleList([
            ResBlock(ch, ch, time_emb_dim),
            SelfAttention(ch),
            ResBlock(ch, ch, time_emb_dim),
        ])

        # Decoder: mirror of the encoder, consuming one skip connection per ResBlock
        self.decoder = nn.ModuleList()
        for level, mult in reversed(list(enumerate(channel_mult))):
            for _ in range(2):
                self.decoder.append(ResBlock(ch + model_channels * mult, model_channels * mult, time_emb_dim))
                ch = model_channels * mult
            if level != 0:
                self.decoder.append(Upsample(ch))

        self.out = nn.Sequential(
            nn.GroupNorm(8, ch),
            nn.SiLU(),
            nn.Conv2d(ch, out_channels, 3, padding=1),
        )

    def forward(self, x, t):
        t_emb = self.time_embed(t)
        x = self.init_conv(x)
        # Encoder: store each ResBlock output as a skip connection
        skips = []
        for module in self.encoder:
            if isinstance(module, ResBlock):
                x = module(x, t_emb)
                skips.append(x)
            else:  # Downsample
                x = module(x)
        # Bottleneck
        x = self.bottleneck[0](x, t_emb)
        x = self.bottleneck[1](x)
        x = self.bottleneck[2](x, t_emb)
        # Decoder: concatenate the matching skip before each ResBlock
        for module in self.decoder:
            if isinstance(module, ResBlock):
                x = torch.cat([x, skips.pop()], dim=1)
                x = module(x, t_emb)
            else:  # Upsample
                x = module(x)
        return self.out(x)
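A quick smoke test of the network (illustrative; it assumes the Downsample/Upsample sketches above, and a 32×32 input matching the CIFAR-10 samples generated in Section 9):
model = UNet()
x = torch.randn(2, 3, 32, 32)      # a dummy batch of 2 images
t = torch.randint(1, 1001, (2,))   # one timestep per image
print(model(x, t).shape)           # torch.Size([2, 3, 32, 32])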
6.2 DDPM Class
class DDPM:
    def __init__(self, model, T=1000, beta_start=1e-4, beta_end=0.02):
        self.model = model
        self.T = T
        # Linear variance schedule and its cumulative products
        self.betas = torch.linspace(beta_start, beta_end, T)
        self.alphas = 1 - self.betas
        self.alpha_bars = torch.cumprod(self.alphas, dim=0)

    def get_loss(self, x_0):
        batch_size = x_0.shape[0]
        # Sample timesteps uniformly from {1, ..., T}
        t = torch.randint(1, self.T + 1, (batch_size,), device=x_0.device)
        epsilon = torch.randn_like(x_0)
        # alpha_bar_t, broadcast over (C, H, W)
        alpha_bar = self.alpha_bars.to(x_0.device)[t - 1].view(-1, 1, 1, 1)
        x_t = torch.sqrt(alpha_bar) * x_0 + torch.sqrt(1 - alpha_bar) * epsilon
        epsilon_pred = self.model(x_t, t)
        return F.mse_loss(epsilon_pred, epsilon)

    @torch.no_grad()
    def sample(self, shape, device):
        # Start from pure Gaussian noise
        x = torch.randn(shape, device=device)
        for t in tqdm(reversed(range(1, self.T + 1))):
            t_batch = torch.full((shape[0],), t, device=device)
            epsilon_pred = self.model(x, t_batch)
            alpha = self.alphas[t - 1].to(device)
            alpha_bar = self.alpha_bars[t - 1].to(device)
            beta = self.betas[t - 1].to(device)
            # Mean of p_theta(x_{t-1} | x_t) under the epsilon-parameterization
            mu = (1 / torch.sqrt(alpha)) * (x - (beta / torch.sqrt(1 - alpha_bar)) * epsilon_pred)
            if t > 1:
                sigma = torch.sqrt(beta)
                x = mu + sigma * torch.randn_like(x)
            else:
                x = mu
        return x
7. Experimental Results
7.1 CIFAR-10 Benchmark
On unconditional CIFAR-10, DDPM reports an FID of 3.17 and an Inception Score of 9.46, better than or competitive with the strongest GANs of the time.
7.2 Sample Quality
Characteristics of DDPM-generated images:
- Diversity: Diverse samples without mode collapse
- Detail: Sharp details even at high resolution
- Stability: Stable training
7.3 Drawback: Sampling Speed
Sampling requires $T = 1000$ sequential network evaluations, so generation is very slow → largely addressed by DDIM (covered in the next article)
8. Significance and Limitations of DDPM
8.1 Revolutionary Contributions
- Theoretical Foundation: Probabilistically rigorous framework
- Training Stability: No adversarial training like GAN
- Sample Quality: Achieved SOTA FID
- Diversity: No mode collapse
8.2 Limitations
- Slow Sampling: Requires 1000 steps
- High Resolution Difficult: Operates directly in pixel space
- Conditional Generation Difficult: Base model is unconditional
8.3 Follow-up Research Directions
Subsequent work addresses these limitations, most notably faster sampling with DDIM (covered in the next article) and improved noise schedules and learned variances in Improved DDPM (Nichol & Dhariwal, 2021).
9. Code Execution Examples
9.1 Training
# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = UNet().to(device)
ddpm = DDPM(model)
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4)

# Training loop
for epoch in range(100):
    for batch in dataloader:
        x_0 = batch[0].to(device)
        loss = ddpm.get_loss(x_0)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f"Epoch {epoch}: Loss = {loss.item():.4f}")
9.2 Sampling
# Generate samples
samples = ddpm.sample(shape=(16, 3, 32, 32), device=device)

# Save images as a 4x4 grid (save_image is torchvision.utils.save_image)
save_image(samples, 'samples.png', nrow=4, normalize=True)
10. Conclusion
DDPM changed the paradigm of generative models:
- Simple idea of adding/removing noise
- Probabilistically rigorous framework
- Image quality surpassing GAN
- Stable training
However, the 1000-step sampling process remains a critical drawback. In the next article, we'll cover DDIM, which reduces this to around 50 steps.
References
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising Diffusion Probabilistic Models. NeurIPS 2020
- Sohl-Dickstein, J., et al. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. ICML 2015
- Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019
- Nichol, A., & Dhariwal, P. (2021). Improved Denoising Diffusion Probabilistic Models. ICML 2021
Tags: #DDPM #Diffusion #Generative-Models #Deep-Learning #Image-Generation #U-Net #Denoising
The complete code for this article is available in the attached Jupyter Notebook.