Diffusion LLM Part 2: Discrete Diffusion -- How to Add Noise to Text
D3PM, Transition Matrices, Absorbing States, MDLM -- how to bring diffusion from continuous space to discrete tokens.

In Part 1, we explored the principles of Diffusion operating in continuous space. Adding Gaussian noise to image pixels is natural, but text tokens are discrete. What would it even mean to add noise of 0.3 to the token "hello"?
In this post, we cover how to bring Diffusion into discrete space. Starting from D3PM's Transition Matrix and arriving at MDLM's Masked Diffusion -- the direct ancestors of LLaDA.
D3PM: Diffusion in Discrete Space
Austin et al. (2021) raise a fundamental question in D3PM (Discrete Denoising Diffusion Probabilistic Models): how do you define a forward process for discrete data where you can't add Gaussian noise?
The answer: use a Transition Matrix.
In continuous Diffusion, Gaussian noise plays a central role. In discrete Diffusion, a transition matrix Q_t takes its place. At each step t, the probability of token x_{t-1} changing to x_t is defined by a matrix:
q(x_t | x_{t-1}) = Cat(x_t; p = x_{t-1} * Q_t)
Here, Cat is the Categorical distribution, x_{t-1} is a one-hot row vector, and Q_t is a K x K matrix (where K is the vocabulary size). Q_t[i][j] is the probability of token i transitioning to token j, so each row of Q_t sums to 1, and the product x_{t-1} * Q_t simply selects the row of Q_t corresponding to the current token.
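To make this concrete, here is a minimal sketch of one forward step. It assumes the D3PM-uniform choice of Q_t (stay put with probability 1 - beta_t, otherwise jump to a uniformly random token); the function names and the value of beta_t are illustrative, not from the paper's code.

```python
import numpy as np

def uniform_transition_matrix(K: int, beta_t: float) -> np.ndarray:
    """D3PM-uniform Q_t: keep the token with prob 1 - beta_t,
    otherwise resample it uniformly. Every row sums to 1."""
    return (1.0 - beta_t) * np.eye(K) + (beta_t / K) * np.ones((K, K))

def forward_step(x_prev: int, Q_t: np.ndarray, rng: np.random.Generator) -> int:
    """Sample x_t ~ Cat(p = onehot(x_prev) @ Q_t).
    The one-hot product just picks out row x_prev of Q_t."""
    p = Q_t[x_prev]
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
K = 8                                   # toy vocabulary size
Q = uniform_transition_matrix(K, beta_t=0.2)
assert np.allclose(Q.sum(axis=1), 1.0)  # rows are valid distributions

x0 = 3
x1 = forward_step(x0, Q, rng)           # one noising step: x0 -> x1
```

With beta_t = 0.2 the token survives a single step with probability 1 - beta_t + beta_t/K, so corruption accumulates gradually over many steps, mirroring the variance schedule in continuous diffusion.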
Correspondence with continuous Diffusion: