Diffusion LLM Part 3: LLaDA -- Building an 8B LLM with Masked Diffusion

In Part 2, we explored how D3PM and MDLM define Diffusion in discrete spaces. We also confirmed that Absorbing State Diffusion using [MASK] tokens is the most effective approach for text.
However, prior work remained at relatively small scales. The question "Can we actually build a real LLM with Diffusion?" was answered by LLaDA (Large Language Diffusion with mAsking).
Nie et al. (2025) scaled Masked Diffusion to 8B parameters, directly compared it against LLaMA3 8B, and demonstrated that Diffusion LLMs can possess the core capabilities of AR models -- In-Context Learning and Instruction Following.
Core Idea: Variable Masking Ratio
The most important design decision in LLaDA is the variable masking ratio.
BERT masks a fixed 15% of the input during training. Once set, this ratio never changes.
LLaDA samples the masking ratio t uniformly from [0, 1] for every training sequence and then masks each token independently with probability t. In some batches only 5% of the tokens end up masked; in others, 95%.
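To make this concrete, here is a minimal PyTorch-style sketch of one training step (the model interface and names are illustrative, not taken from the official codebase): sample t per sequence, mask each token independently with probability t, and compute cross-entropy only on the masked positions, weighted by 1/t as in the paper's loss.

```python
import torch
import torch.nn.functional as F

MASK_ID = 126336  # LLaDA's [MASK] token id

def llada_training_step(model, x0):
    """One masked-diffusion training step on a clean token batch x0 of shape (B, L)."""
    B, L = x0.shape
    # Sample one masking ratio t per sequence, uniformly in (0, 1].
    t = torch.rand(B, 1, device=x0.device).clamp(min=1e-3)  # clamp only for numerical safety
    # Mask each token independently with probability t.
    is_masked = torch.rand(B, L, device=x0.device) < t
    xt = torch.where(is_masked, torch.full_like(x0, MASK_ID), x0)
    # Bidirectional Transformer predicts the original token at every position.
    # Note: t is never fed to the model -- there is no time embedding.
    logits = model(xt)                                       # assumed shape: (B, L, vocab)
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         x0.reshape(-1), reduction="none").reshape(B, L)
    # Loss on masked positions only, weighted by 1/t (an upper bound on the NLL).
    loss = (ce * is_masked.float() / t).sum() / (B * L)
    return loss
```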
Here is why this is critically important:
In-Context Learning: When the masking ratio is very low (e.g., 5%), the model predicts the remaining tokens while most tokens are already visible. This is essentially a "read the given context and fill in the blanks" task, which naturally connects to In-Context Learning.
Fisher Consistency: Variable masking ratio satisfies Fisher consistency with respect to the data distribution. Theoretically, given sufficient data and model capacity, it is guaranteed to recover the true data distribution. BERT's fixed ratio provides no such guarantee.
Scaling Effect: In scaling experiments, LLaDA exhibits nearly identical scaling laws to AR models (ARM). As you scale up the model, performance improves predictably.
Architecture
The architecture of LLaDA 8B is intentionally simple. It uses the Transformer structure almost as-is.
Notable differences:
No GQA: LLaMA3 uses Grouped Query Attention to shrink the KV-cache, but LLaDA uses standard multi-head attention. Because LLaDA's bidirectional attention recomputes the whole sequence at every denoising step, there is no KV-cache to shrink, so GQA brings no benefit. This is not just a simplification -- it has a structural upside. In AR models the KV-cache consumes GPU memory proportional to context length and becomes the memory bottleneck for long-context inference; LLaDA carries no such cache, so memory does not balloon with context length (though each denoising step still pays the full attention compute). This is one of the potential advantages Diffusion models hold for long-context scenarios.
FFN Size Difference: Because LLaDA uses full multi-head attention instead of GQA, it carries more attention parameters per layer. To keep the total parameter count comparable, the FFN hidden dimension was reduced (14336 -> 12288); a quick back-of-the-envelope check follows below.
Mask Token: A [MASK] token (ID 126336) is added to the vocabulary. This is used in the Diffusion forward process.
No Time Embedding: Image Diffusion models (like U-Net) typically inject timestep t as a separate embedding. LLaDA does not use this. The masking ratio itself implicitly encodes the time information -- many [MASK] tokens indicate an early step, few indicate a late step.
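To see that the FFN reduction roughly offsets the extra attention parameters, here is a quick back-of-the-envelope check. It assumes LLaMA3 8B's published configuration (hidden size 4096, 32 query heads, 8 KV heads, FFN 14336) and a SwiGLU FFN with three weight matrices in both models; this is for intuition only and ignores embeddings, norms, and biases.

```python
d_model = 4096
n_heads, head_dim = 32, 128

# Attention projection parameters per layer (Wq, Wk, Wv, Wo), no biases.
def attn_params(kv_heads):
    q = d_model * n_heads * head_dim         # Wq
    kv = 2 * d_model * kv_heads * head_dim   # Wk + Wv
    o = n_heads * head_dim * d_model         # Wo
    return q + kv + o

# SwiGLU FFN parameters per layer (gate, up, down projections).
def ffn_params(d_ff):
    return 3 * d_model * d_ff

llama3_layer = attn_params(kv_heads=8)  + ffn_params(14336)  # GQA + larger FFN
llada_layer  = attn_params(kv_heads=32) + ffn_params(12288)  # full MHA + smaller FFN

print(f"LLaMA3-style layer: {llama3_layer / 1e6:.1f}M params")  # ~218.1M
print(f"LLaDA-style layer:  {llada_layer / 1e6:.1f}M params")   # ~218.1M
```

Under these assumptions the per-layer parameter counts come out essentially identical, which is exactly what "comparable total parameter count" means here.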
Training Pipeline
Pre-training: 2.3 trillion tokens, consuming 0.13M H800 GPU hours. Training crashed once at the 1.2T-token mark; the run was resumed after lowering the learning rate from 4e-4 to 1e-4.
SFT (Supervised Fine-Tuning): After pre-training, the model is fine-tuned on instruction-following data. At inference time, the resulting instruct model is paired with the Semi-Autoregressive Remasking sampling strategy.
Semi-Autoregressive Remasking: The sequence is divided into multiple blocks. Blocks are generated left-to-right sequentially, but within each block, tokens are generated in parallel via the Diffusion reverse process.
[Block 1: Diffusion] -> [Block 2: Diffusion] -> [Block 3: Diffusion]
This hybrid approach sits between fully AR and fully Diffusion, offering a practical trade-off.
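A minimal sketch of this block-wise loop with low-confidence remasking; the interface, greedy decoding, and the per-step unmasking budget are simplifications of the paper's description, not the official sampler.

```python
import torch

MASK_ID = 126336  # LLaDA's [MASK] token id

@torch.no_grad()
def generate(model, prompt_ids, gen_len=128, block_len=32, steps_per_block=32):
    """Blocks left-to-right; parallel denoising with low-confidence remasking inside each block."""
    x = torch.cat([prompt_ids,
                   torch.full((1, gen_len), MASK_ID,
                              dtype=torch.long, device=prompt_ids.device)], dim=1)
    start = prompt_ids.size(1)
    for b0 in range(start, start + gen_len, block_len):
        block = slice(b0, b0 + block_len)
        for step in range(steps_per_block):
            still_masked = x[:, block] == MASK_ID
            if not still_masked.any():
                break
            logits = model(x)                        # assumed shape: (1, L, vocab)
            conf, pred = logits.softmax(-1).max(-1)  # per-position confidence and argmax
            # Commit roughly an equal share of the block each step,
            # keeping the positions where the model is most confident.
            k = max(1, int(still_masked.sum()) // (steps_per_block - step))
            conf_block = conf[:, block].masked_fill(~still_masked, -1.0)
            commit = conf_block.topk(k, dim=-1).indices
            x[:, block].scatter_(1, commit, pred[:, block].gather(1, commit))
    return x
```

Low-confidence predictions are effectively remasked and revisited in later steps, while previous blocks stay fixed as context for the next one.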
Variable Length Handling: 1% of the pre-training data uses randomly sampled lengths in the [1, 4096] range. This helps the model learn to handle sequences of varying lengths.
Can It Avoid the Reversal Curse?
The Reversal Curse mentioned in Part 1 -- where a model trained on "A is B" fails to infer "B is A" -- is a structural limitation of AR models.
Why LLaDA is free from this problem:
AR Models: Because P(x) = P(x_1) * P(x_2|x_1) * ..., they only learn conditional probabilities in the "A -> B" direction. Without separate training, the "B -> A" direction remains unknown.
LLaDA: During training, any position in the sequence can become [MASK]. The model might predict the middle from "A [MASK] B", predict A from "[MASK] is B", or predict B from "A is [MASK]". Bidirectional relationships are learned naturally.
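A toy illustration of the difference in training signals, using a simple fact and word-level "tokens" purely for illustration:

```python
# One fact, two training regimes (word-level tokens for readability).
fact = "Valentina Tereshkova was the first woman in space".split()

# AR training: every example predicts the next token from its left context only.
ar_examples = [(fact[:i], fact[i]) for i in range(1, len(fact))]
# e.g. (['Valentina'], 'Tereshkova') -- the subject is never predicted from what follows it.

# Masked-diffusion training: any subset of positions can be hidden, so the
# "reverse" direction shows up as an ordinary training example.
reverse_example = (["[MASK]", "[MASK]", "was", "the", "first", "woman", "in", "space"],
                   ["Valentina", "Tereshkova"])
forward_example = (["Valentina", "Tereshkova", "was", "the", "first", "[MASK]", "in", "[MASK]"],
                   ["woman", "space"])
```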
In fact, the LLaDA project page showcases a Reversal Curse demo, presenting cases where LLaDA correctly answers reverse-direction questions that AR models fail on.
That said, this works because masking provides bidirectional context, and it would be premature to claim the Reversal Curse is fully solved. Bidirectional context and bidirectional knowledge inference may be different problems.
In-Context Learning
One of the most remarkable capabilities of LLMs is In-Context Learning (ICL) -- the ability to perform new tasks simply by including a few examples in the prompt, without any additional fine-tuning.
Is this possible with Diffusion models? LLaDA answers "yes."
Variable masking ratio is the key. When the masking ratio is very low during training (most tokens are visible), the model is essentially performing the task of "understanding the given context and predicting the rest." This is the learning mechanism behind ICL.
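As a sketch, here is how a few-shot prompt looks from LLaDA's point of view at generation time; the translation task and word-level tokenization are invented for illustration.

```python
MASK = "[MASK]"

# Few-shot demonstrations plus the query: all of this stays visible.
prompt_tokens = ("English: cat -> French: chat\n"
                 "English: dog -> French: chien\n"
                 "English: horse -> French:").split()

# Only the answer region starts out masked.
answer_tokens = [MASK] * 4
sequence = prompt_tokens + answer_tokens

# 4 masked tokens out of 18: a masking ratio of roughly 0.22 -- the low-ratio
# "read the context, fill in the blanks" regime that variable-ratio training
# repeatedly exposes the model to.
print(len(answer_tokens) / len(sequence))   # 0.2222...
```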
In practice, LLaDA 8B shows comparable performance to LLaMA3 8B across several ICL benchmarks. It achieves particularly competitive results in few-shot settings.
This is an important finding. It suggests that ICL is not an ability inherent to next-token prediction, but rather a capability that any language model at sufficient scale can acquire regardless of the training paradigm.
Benchmark Results
Key results from the paper's comparison of LLaDA 8B with LLaMA3 8B (and the authors' own AR baseline, ARM):
Scaling Comparison: In the 10^18 to 10^23 FLOPs range, the scaling curves of LLaDA and ARM are remarkably similar. Diffusion LLMs follow the same scaling laws as AR models.
ICL Capability: ICL, previously thought to be an inherent ability of AR models, also emerges in Diffusion models.
SFT Effectiveness: Instruction following is also effectively learned through SFT. Being a Diffusion model does not prevent SFT from working.
Limitations and Challenges
Limitations acknowledged by the LLaDA paper:
Inference Speed: Without a KV-cache, every denoising step re-runs attention over the full, fixed-length context, and many steps are needed per generation, so inference is slower than in AR models. This is a core challenge that LLaDA 2.0/2.1 aims to address.
Inference Hyperparameter Sensitivity: Performance varies with the number of sampling steps, the block length, and the remasking strategy. AR models mostly tune temperature and top-p; Diffusion sampling adds several extra knobs.
No RL Applied: At the time of the paper, alignment training such as RLHF had not been applied. The RL framework for Diffusion models was still in the research stage (this is also addressed in LLaDA 2.1).
FLOPs Limitation: The direct comparison with ARM was limited to below 10^23 FLOPs. Comparison at larger scales remains a future research topic.
Significance of LLaDA
What LLaDA proved:
- Diffusion LLMs follow the same scaling laws as AR models
- Core AR model capabilities like ICL and Instruction Following also emerge in Diffusion
- Bidirectional context utilization can mitigate structural issues like the Reversal Curse
- Diffusion LLMs can be built with minimal modifications to the Transformer
This answers "no" to the fundamental question: "Are the core capabilities of LLMs inherent to the AR paradigm?" With sufficient scale and the right training strategy, powerful language models can be built regardless of the generation paradigm.
In Part 4, we will cover the journey of turning this possibility into reality -- LLaDA 2.0's 100B scaling and LLaDA 2.1's Token Editing innovation.
Key Takeaways
- The variable masking ratio (one t sampled uniformly per training sequence) is LLaDA's central design choice; it yields Fisher consistency and naturally produces the low-mask regime behind In-Context Learning.
- The architecture stays close to a vanilla Transformer: bidirectional attention, a [MASK] token in the vocabulary, no KV-cache, no time embedding.
- Semi-Autoregressive Remasking generates blocks left-to-right while denoising tokens in parallel inside each block.
- LLaDA 8B tracks AR scaling laws up to 10^23 FLOPs; inference speed and sampling hyperparameter sensitivity are the main open problems, which LLaDA 2.0 and 2.1 target.
References
- Nie et al. "Large Language Diffusion Models." arXiv:2502.09992, 2025.
- Touvron et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023.
- Sahoo et al. "Simple and Effective Masked Diffusion Language Models." NeurIPS 2024.
- Berglund et al. "The Reversal Curse: LLMs trained on 'A is B' fail to learn 'B is A'." ICLR 2024.