Diffusion LLM Part 2: Discrete Diffusion -- How to Add Noise to Text

In Part 1, we explored the principles of Diffusion operating in continuous space. Adding Gaussian noise to image pixels is natural, but text tokens are discrete data. What happens if you add noise of 0.3 to "hello"?
In this post, we cover how to bring Diffusion into discrete space. Starting from D3PM's Transition Matrix and arriving at MDLM's Masked Diffusion -- the direct ancestors of LLaDA.
D3PM: Diffusion in Discrete Space
Austin et al. (2021) raise a fundamental question in D3PM (Discrete Denoising Diffusion Probabilistic Models): how do you define a forward process for discrete data where you can't add Gaussian noise?
The answer: use a Transition Matrix.
In continuous Diffusion, Gaussian noise plays a central role. In discrete Diffusion, a transition matrix Q_t takes its place. At each step t, the probability of token x_{t-1} changing to x_t is defined by a matrix:
q(x_t | x_{t-1}) = Cat(x_t; p = x_{t-1} * Q_t)
Here, Cat is the Categorical distribution, and Q_t is a K x K matrix (where K is the vocabulary size). Q_t[i][j] represents the probability of token i changing to token j.
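To make the notation concrete, here is a minimal NumPy sketch of a single forward step on a toy vocabulary; the vocabulary size, beta_t, and the keep-or-jump structure of Q_t are made up for illustration (this particular structure is the uniform transition discussed below).

import numpy as np

rng = np.random.default_rng(0)

K = 4          # toy vocabulary size
beta_t = 0.2   # corruption probability at this step

# Row-stochastic K x K matrix: keep the current token with probability 1 - beta_t,
# otherwise jump uniformly to one of the other K - 1 tokens.
Q_t = np.full((K, K), beta_t / (K - 1))
np.fill_diagonal(Q_t, 1.0 - beta_t)
assert np.allclose(Q_t.sum(axis=1), 1.0)   # every row is a valid distribution

x_prev = 2                 # token id for x_{t-1}
p = Q_t[x_prev]            # row x_{t-1} of Q_t, i.e. Cat(x_t; x_{t-1} * Q_t)
x_t = rng.choice(K, p=p)   # sample the corrupted token x_t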
The correspondence with continuous Diffusion: the Gaussian kernel q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I) is replaced by this categorical kernel, and beta_t still controls how much corruption each step applies.
Three Choices for the Transition Matrix
D3PM proposes several ways to design Q_t.
Uniform Transition: every token changes to any other token with equal probability.
Q_t[i][j] = (1 - beta_t) if i == j, beta_t / (K-1) otherwise
This is the closest analogue to the isotropic Gaussian in continuous Diffusion. However, it is inefficient for text with tens of thousands of vocabulary entries. Having "cat" randomly turn into "the" or "quantum" provides no useful learning signal.
Gaussian-like Transition: tokens that are close in embedding space have a higher probability of transitioning to each other. This mimics the local nature of noise in continuous space within the discrete domain.
Absorbing State Transition: every token transitions to a single special "absorbing state." That absorbing state is the [MASK] token.
Q_t[i][MASK] = beta_t, Q_t[i][i] = 1 - beta_t, Q_t[i][j] = 0 otherwise (for every i != MASK; the [MASK] row itself is Q_t[MASK][MASK] = 1, which is what makes the state absorbing)
At each step, a token either keeps its original value (with probability 1-beta_t) or becomes [MASK] (with probability beta_t). Once a token becomes [MASK], it never comes back. After enough steps, every token is [MASK].
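Here is a minimal sketch of this forward process, assuming each still-unmasked token flips to [MASK] independently with probability beta_t per step (which is what applying this Q_t row by row amounts to); the token ids and beta value are made up.

import numpy as np

rng = np.random.default_rng(0)

MASK = -1                      # stand-in id for the [MASK] token
x = np.array([5, 17, 3, 42])   # toy token ids for a 4-token sentence
beta_t = 0.15                  # per-step masking probability

for t in range(30):
    flip = (x != MASK) & (rng.random(x.shape) < beta_t)  # only unmasked tokens can flip
    x[flip] = MASK                                        # [MASK] is absorbing: it never reverts
    if (x == MASK).all():
        print(f"fully masked after {t + 1} steps")        # the sequence has reached "pure noise"
        break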
This third approach works best for text. And it bears a striking resemblance to BERT's Masked Language Modeling.
The Connection Between [MASK] and BERT
Let's revisit the absorbing state transition:
- Forward: tokens progressively become [MASK]
- Reverse: predict and restore original tokens at [MASK] positions
This is nearly identical to how BERT is trained. BERT also selects 15% of its input tokens, replaces most of them with [MASK], and trains the model to predict the originals.
But there is a crucial difference:
BERT is an encoder that learns representations with a fixed 15% masking rate. Absorbing state Diffusion is a generative model that continuously varies the masking ratio while optimizing the likelihood.
Why this difference matters will be covered in detail in Part 3 with LLaDA.
MDLM: Simplifying Masked Diffusion
Sahoo et al. (2024)'s MDLM (Simple and Effective Masked Diffusion Language Models) optimized D3PM's absorbing state approach for text. The key contribution is simplification.
Continuous-time formulation: instead of D3PM's T discrete steps, MDLM defines the forward process in continuous time t (between 0 and 1). The probability that a token becomes [MASK] at time t is:
q(x_t = MASK | x_0) = 1 - alpha(t)
alpha(t) is a noise schedule that decreases from 1 to 0. At t=0, all tokens are original; at t=1, all tokens are [MASK].
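A sketch of jumping straight from x_0 to x_t at an arbitrary time, assuming a linear schedule alpha(t) = 1 - t (any schedule that falls from 1 to 0 works the same way; only the masking probability changes):

import numpy as np

rng = np.random.default_rng(0)
MASK = -1

def alpha(t):
    return 1.0 - t                  # linear schedule: alpha(0) = 1, alpha(1) = 0

def forward_sample(x0, t):
    # each token is masked independently with probability 1 - alpha(t)
    masked = rng.random(x0.shape) < (1.0 - alpha(t))
    return np.where(masked, MASK, x0)

x0 = np.array([5, 17, 3, 42, 8])
print(forward_sample(x0, 0.3))      # roughly 30% of positions come back as MASK
print(forward_sample(x0, 0.9))      # roughly 90% of positions come back as MASK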
Simple training objective: MDLM's loss is surprisingly straightforward.
L = E_{t, x_0, x_t}[ alpha'(t) / (1 - alpha(t)) * sum_{i: x_t^i = MASK} log p_theta(x_0^i | x_t) ]
Since alpha'(t) is negative and the log-probabilities are non-positive, this is a non-negative weighted cross-entropy.
In plain terms: "a cross-entropy loss for predicting the original token at masked positions, weighted by the noise schedule."
Compared to BERT's MLM loss, it is essentially the same form with an added time-dependent weight, and that weight is what turns the objective into a valid variational bound (ELBO) on the data log-likelihood.
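To see how simple this is in practice, here is a sketch of the objective for one batch in PyTorch, assuming the linear schedule alpha(t) = 1 - t (so the positive weight -alpha'(t) / (1 - alpha(t)) reduces to 1/t); `model` is assumed to be any network that maps token ids to per-position logits over the vocabulary, and the normalization over masked positions is one of several reasonable choices.

import torch
import torch.nn.functional as F

def mdlm_loss(model, x0, mask_id, eps=1e-3):
    # x0: clean token ids of shape (batch, seq_len)
    b, L = x0.shape
    t = torch.rand(b, 1).clamp(min=eps)                 # one time per sequence, t ~ U(0, 1)
    is_masked = torch.rand(b, L) < t                    # masking prob is 1 - alpha(t) = t
    x_t = torch.where(is_masked, torch.full_like(x0, mask_id), x0)

    logits = model(x_t)                                 # (batch, seq_len, vocab)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")  # (batch, seq_len)

    weight = 1.0 / t                                    # -alpha'(t) / (1 - alpha(t)) for alpha(t) = 1 - t
    masked_ce = ce * is_masked                          # only masked positions contribute
    return (weight * masked_ce).sum() / is_masked.sum().clamp(min=1)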
The Core Insight: Why Masking Is Diffusion
Let's lay out why Masked Diffusion qualifies as "real" Diffusion.
Forward process exists: starting from the original text, [MASK] tokens gradually increase. As t grows, information is lost. The final state (t=1) is a fully [MASK]ed sequence, equivalent to pure noise.
Reverse process is learned: a neural network predicts original tokens at [MASK] positions, restoring information. This is performed gradually over multiple steps.
Theoretical guarantee: the training objective is a variational lower bound (ELBO) on the data log-likelihood.
Just as "clean image -> Gaussian noise" is the forward process in continuous Diffusion, "intact text -> all tokens are [MASK]" is the forward process in Masked Diffusion.
The Text Generation Process
Generating text with Masked Diffusion is intuitive:
- Start with a sequence where every position is [MASK]: [MASK] [MASK] [MASK] [MASK] [MASK]
- The model predicts a probability distribution over original tokens for each [MASK] position
- Fill in tokens starting from the positions with the highest confidence
- Re-predict for positions that are still [MASK]
- Repeat until every [MASK] has been replaced with an actual token
Step 1: [MASK] [MASK] [MASK] [MASK] [MASK]
Step 2: [MASK] [MASK] is [MASK] [MASK]
Step 3: The [MASK] is [MASK] [MASK]
Step 4: The cat is on [MASK]
Step 5: The cat is on mat
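A sketch of this sampling loop, assuming `model` returns per-position logits and that a fixed number of the most confident positions is revealed per step (actual samplers usually tie this count to the noise schedule):

import torch

@torch.no_grad()
def generate(model, seq_len, mask_id, steps=8):
    x = torch.full((1, seq_len), mask_id)                  # start from an all-[MASK] sequence
    per_step = -(-seq_len // steps)                        # ceil division: positions revealed per step
    while (x == mask_id).any():
        still_masked = (x == mask_id)
        probs = torch.softmax(model(x), dim=-1)            # (1, seq_len, vocab)
        conf, pred = probs.max(dim=-1)                      # best token and its probability per position
        conf = conf.masked_fill(~still_masked, -1.0)        # never overwrite already-revealed tokens
        k = min(per_step, int(still_masked.sum()))
        top = conf.topk(k, dim=-1).indices                  # most confident still-masked positions
        x[0, top[0]] = pred[0, top[0]]                      # commit their predictions
    return x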
During this process, the model leverages bidirectional context. At Step 3, since "is" is already known, it can reference both "The" on the left and "is" on the right to fill in the blank. AR models cannot do this.
The Decisive Difference from Autoregressive
- An AR model generates strictly left to right, one token per step, and each prediction can only attend to the tokens before it.
- Masked Diffusion fills positions in any order, can commit several tokens per step, and every prediction sees the partially revealed sequence on both sides.
- The price is that still-masked positions are re-predicted across steps, so a single generation requires multiple forward passes over the whole sequence.
The Road to LLaDA
Summarizing the progression from D3PM to MDLM:
D3PM (2021): "We can define discrete Diffusion using Transition Matrices. In particular, the Absorbing State ([MASK]) approach is effective for text."
MDLM (2024): "Simplifying with continuous-time makes training more stable, and the loss takes a form similar to BERT's MLM."
LLaDA (2025): "Scaling this principle to 8B parameters achieves real LLM-level performance. The variable masking ratio is the key."
In Part 3, we will cover how LLaDA scaled Masked Diffusion to 8B, along with direct comparison results against AR models.
Key Takeaways
- Discrete Diffusion replaces Gaussian noise with a transition matrix Q_t over the vocabulary.
- Of D3PM's designs, the absorbing state ([MASK]) transition is the one that works for text, and it closely mirrors BERT's masked language modeling.
- MDLM moves to continuous time and reduces the loss to a noise-schedule-weighted cross-entropy at masked positions that is still a valid ELBO.
- Generation starts from an all-[MASK] sequence and reveals tokens step by step using bidirectional context, which is the recipe LLaDA scales up in Part 3.
References
- Austin et al. "Structured Denoising Diffusion Models in Discrete State-Spaces." NeurIPS 2021.
- Sahoo et al. "Simple and Effective Masked Diffusion Language Models." NeurIPS 2024.
- Lou et al. "Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution." ICML 2024.
- Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.